Working with big SAS datasets using R and sparklyr


In general, R loads all data into memory, while SAS processes data from disk and allocates memory dynamically. This makes SAS better suited to handling very large datasets.

I often need to work with large SAS data files that are prepared in the information system of my department. However, I always try to fit everything into my R workflow, because I like to manipulate data with dplyr and perform statistical analysis with all the packages available in R.

For this purpose, I found the perfect solution in sparklyr.

First of all, we need to install and load the required packages. Note that the spark_read_sas function used below comes from the spark.sas7bdat package, not from sparklyr itself.

install.packages("sparklyr")
install.packages("spark.sas7bdat")
library(sparklyr)
library(spark.sas7bdat)
spark_install(version = "2.0.1", hadoop_version = "2.7")

Then I connect to a local instance of the installed Spark:

sc <- spark_connect(master = "local")
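By default, a local connection starts with fairly modest driver memory, which can be limiting when reading big SAS files. It can be raised through spark_config() before connecting; the 8G value below is just an illustrative choice, to be tuned to your machine:

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "8G"   # illustrative value
sc <- spark_connect(master = "local", config = config)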

Finally, it is possible to read the SAS files, manipulate them via dplyr, and store the result in R memory via the collect command.

df <- spark_read_sas(sc, path = "path/to/file.sas7bdat", table = "df")

df_manipulated <- df %>%
  filter(...) %>%
  select(...)

df_manipulated_r <- collect(df_manipulated)

The command spark_read_sas returns an object of class tbl_spark, which is a reference to a Spark data frame on which dplyr functions can be executed.
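Because dplyr verbs on a tbl_spark are lazily translated to Spark SQL, show_query() lets you inspect the query that will run on Spark before anything is computed. A minimal sketch, assuming df is a tbl_spark with a hypothetical column called age:

library(dplyr)
df_filtered <- df %>% filter(age > 30)   # age is a hypothetical column
show_query(df_filtered)                  # prints the Spark SQL generated by dplyr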

The collect function returns a local data frame from the remote Spark source, so the manipulated data can be stored in local memory.

This is the data frame on which to perform the data analysis and visualization steps.
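Once the local data frame has been collected, it is good practice to free the Spark resources by closing the connection:

spark_disconnect(sc)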

Here are some resources:

Big data in R

Importing 30GB of data into R with sparklyr

sparklyr: R interface for Apache Spark