Working with big SAS datasets using R and sparklyr

In general, R loads all data into memory while SAS allocates memory dynamically to keep data on disk. This makes SAS a better solution for handling very large datasets.

I often need to work with large SAS data files that are prepared in the information system of my department. However, I always try to fit everything into my R workflow, because I like to manipulate data with dplyr and perform statistical analysis with all the packages available in R.

For this purpose, sparklyr turned out to be the perfect solution.

First of all we need to install and load the packages.

library(sparklyr)
library(spark.sas7bdat)
library(dplyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")

Then I connect to a local instance of the installed Spark:

sc <- spark_connect(master = "local")

Finally, it is possible to read the SAS files, manipulate them via dplyr, and store the result in R memory via the collect command.

df <- spark_read_sas(sc, path = "path/to/file.sas7bdat", table = "sas_data")

df_manipulated <- df %>%
  select(var1, var2)   # placeholder path, table and variable names

df_manipulated_r <- collect(df_manipulated)

The spark_read_sas command returns an object of class tbl_spark, which is a reference to a Spark DataFrame on which dplyr functions can be executed.
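
Since the dplyr verbs on a tbl_spark are translated to Spark SQL and evaluated lazily, the heavy work happens inside Spark rather than in R. As a minimal sketch (the column names month and arr_delay are hypothetical placeholders), a typical pipeline might look like this:

# evaluated lazily by Spark; nothing is loaded into R at this point
df_summary <- df %>%
  filter(!is.na(arr_delay)) %>%
  group_by(month) %>%
  summarise(mean_delay = mean(arr_delay))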

The collect function returns a local data frame from the remote Spark source, bringing the manipulated Spark tbl into local memory.

This is the data frame on which to perform the data analysis and visualization steps.
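
When the analysis is done, it is good practice to close the connection to the local Spark instance:

spark_disconnect(sc)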

Here are some resources:

Big data in R

Importing 30GB of data into R with sparklyr

github.com/bnosac/spark.sas7bdat

sparklyr: R interface for Apache Spark

ELK+R Stack

Elasticsearch is a search engine based on the Lucene library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.

Elasticsearch is becoming the biggest player in document-search technology in the NoSQL space and is currently in a phase of intense development (six major versions in a few years and exponential growth of the community).

I installed Elasticsearch version 6.5.4 and Kibana (matching version 6.5.4) on my Mac, which runs R version 3.5.

First impression: I spent a few hours trying to get everything properly installed on my machine. Working with these technologies requires a certain amount of hacking skills.

To install both Elasticsearch and Kibana, I followed the instructions on the Elastic website.

I won't spend much time here on installation issues, since they depend strongly on the operating system and on personal skills. Documentation and online forums will assist you in case of problems.

After installation, Kibana was alive and kicking at http://localhost:5601/app/kibana.

So, the time had come to feed Elasticsearch with some data.

I immediately thought of the NYC flights dataset available in the nycflights13 package, which includes on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

library(nycflights13)
data(flights)
flights

I installed the elastic package from CRAN:
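
install.packages("elastic")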

The connect command establishes the connection with my local Elasticsearch:

library(elastic)
connect()

Then I sent the data frame to elasticsearch with the simple bulk command

docs_bulk(flights, index = "flights_nyc_2013_idx")

The index argument provides the index name to use; it is strictly required for data.frame input (optional for file inputs).
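
As a quick sanity check from R, before moving to Kibana, it is possible to query the new index directly. This is just a sketch using the elastic package's Search function; the exact structure of the returned list may vary across package versions:

# retrieve one document to confirm the bulk load worked
res <- Search(index = "flights_nyc_2013_idx", size = 1)
res$hits$total   # total number of indexed documents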

Opening Kibana, everything was OK and ready to play with.

[Screenshot, 2019-01-24 16.09.31]

Then I started working in Kibana to create a dashboard and get useful insights from the data. Not surprisingly, June, July and December were the months at greatest risk of delayed arrivals. Visualizations and dashboards are ready to be included in websites through specific iframes.

[Screenshot, 2019-01-24 15.50.16]

My fi(R)st day with Jupyter Lab

Today was my first day with JupyterLab. I learned about JupyterLab listening to episode 44 of DataFramed, DataCamp's official podcast presented by Hugo Bowne-Anderson. In this episode, Project Jupyter was described in the context of interactive computing by Brian Granger, professor of physics and data science at Cal Poly State University in San Luis Obispo, CA.

Brian is also co-founder of Project Jupyter and of the Altair project for data visualization. The episode is available at DataCamp website and other apps/platforms such as SoundCloud or CastBox.

Like other statisticians and data scientists, I have had the opportunity to work with Python and Python notebooks, but I really like R for data analysis, in particular because it is possible to work in RStudio or similar platforms. So, when I learned about a platform able to run Python and R notebooks (as well as Julia and many others), I couldn't wait to try it!

JupyterLab is defined as a web-based user interface for Project Jupyter. It is possible to try it on the web, without installation, on Binder.

Installation is very simple. My choice was to use only the terminal of my Mac, starting with pip:

pip install jupyterlab

JupyterLab requires Jupyter Notebook version 4.3 or later. To check the version of the notebook package installed:

jupyter notebook --version

Then I started JupyterLab:

jupyter lab

JupyterLab opened automatically in the browser with the default workspace being http(s)://<server:port>/<lab-location>/lab.

I very soon started to appreciate the beauty and simplicity of the style, the file system view on the left side of the screen, the availability of a text editor, and the mixing of code and markdown cells in the notebooks.

It is really nice to be able to have a look at the csv file in which you have stored your project data, and opening it is really user friendly, with a point-and-click option for the delimiter (tab, comma or semicolon).

So far so good!

But my main reason for starting with JupyterLab was to work with different programming languages in one web-based open-source platform.

I started to install an R notebook by means of the R kernel for the Jupyter environment. Unfortunately, it was not easy at all! After several attempts, I succeeded using Anaconda:

conda install -c r r-irkernel

Then I started R from the terminal and registered the R kernel from within that R session:

R                                     # start R from the terminal
IRkernel::installspec(user = FALSE)   # register the R kernel system-wide
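
To double-check that the kernel was registered, you can list the kernels Jupyter knows about from the terminal; the R kernel shows up under the name ir:

jupyter kernelspec list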

Reloading JupyterLab, I could see the Python and R notebooks side by side!

I chose the dark theme because I find it more relaxing on my eyes!

[Screenshot, 2018-10-26 00.36.52]

Just another data science blog! But it’s mine!

“When you’ve written the same code 3 times, write a function,
When you’ve given the same in-person advice 3 times, write a blog post”

David Robinson strongly advises data scientists to start their own blog in order to share code, examples and thoughts.

My main purpose is to receive feedback, connect with other people, and improve my skills.

Hope you enjoy!