# Causal inference with DAGs in R

Directed acyclic graphs (DAGs) are a powerful tool to understand and deal with causal inference. The book “Causal inference in statistics: a primer” is a useful reference to start.

A DAG is a visual encoding of a joint distribution of a set of variables. In a DAG all the variables are depicted as vertices and connected by arrows or directed paths, sequences of arrows in which every arrow points to some direction. DAGs are acyclic because no directed path can form a closed loop.

The dagitty package is an effective tool for drawing and analyzing DAGs. Available functions include identification of minimal sufficient adjustment sets for estimating causal effects.

Let’s now focus on the following example. We are interesting in draw causal inference of the treatment (T) effect on a certain outcome (Y). The analysis can be biased due to the presence of several confounders (X1, X2, X3).

Let’s presume some relationships and code them with dagitty functions.

``````library(dagitty)
dag <- dagitty("dag {   X1 -> X2
X1 -> Y
X3 -> X2
X2 -> Y
X2 -> T -> Y
X3 -> T
}")
plot( graphLayout( dag ) )``````

X3 is a parent of X2 and T, X2 is an ancestor of Y, Y is a child of X2 and Y is a descendant of X2.

Let’s now make things clearer providing relative coordinates.

``````coordinates( dag ) <-  list(
x=c(X1=3, X2=3, X3=1, T=2, Y=4),
y=c(X1=1, X2=2, X3=2, T=3, Y=3) )
plot( dag )``````

We can now ask with a function if we are adjusting with the correct set of variables.

``````exposures(dag) <- c("T")
outcomes(dag) <- c("Y")
[1] FALSE
[1] FALSE
[1] TRUE``````

In order to draw unbiased causal inference we could adjust or match on both X1 and X2. Adjusting only for X1 or X2 will not remove the potential source of bias. Adjusting for X1, X2 and X3 is not required.

We can do the same with ggdag.

``````library(ggdag)
tidy_dag <- tidy_dagitty(dag)
ggdag(tidy_dag) +
theme_dag()``````

We can now ask for the children or parents of some variables.

``````ggdag_parents(tidy_ggdag, "T", text_col = "black")
ggdag_children(tidy_ggdag, "T", text_col = "black")``````

But, very nice feature, we can ask for the minimal adjustment sets of covariates.

``````ggdag_adjustment_set(tidy_ggdag, node_size = 14, text_col = "black") +
theme(legend.position = "bottom")``````

And we have two suggestions from ggdag: X1 and X2 or X2 and X3.

# Beautiful iNZights!

Wow! I was listening a recent episode of the “Storytelling with data” podcast by Cole Nussbaumer Knaflic. It was an interview with Alberto Cairo who came up recommending an awesome tool for data visualization: iNZight!

Interestingly iNZight is entirely based on R code and in its latest releases it speaks tidyverse language and some features are much improved! ggplot2 is the basis for the plotting environment.

iNZight was initially designed for New Zealand high schools, aiming at educating students to effectively explore data and understand some statistical concepts.

The development team should be very familiar to R users!

As I said, iNZight it’s now really powerful and has lot of features and can handle multivariate, time series and complex survey data.

Of course, iNZight is free and it’s easy downloadable and usable in many platforms and devices. In fact, there is a Desktop version that can be installed on Mac, Windows or Linux but also a Tablet version.

I have started with the classical gapminder dataset, which is readily available as example in the app), because I wanted to play a little with the software using very well-know data. In few steps I succeeded in creating a nice plot saved in a static (tiff, png, pdf, jpeg) and interactive version (html). I plotted mobile subscribers versus income in 2008. In the interactive version the user can navigate the visualization clicking each point and labels highlithing graph portions and getting detailed information.

The interactive version of the graph lets query it in various ways like hovering over points or clicking them. The plot could be downloaded as Interactive HTML files which can be given to others without any living connection with iNZight to work.

# Using R in the pharmaceutical industry

Although SAS is the main player in the pharmaceutical industry, R is gaining popularity. In this post I will just mention some initiatives, documents and podcast episodes dealing with using R in pharmaceutical research and development programs.

Recently, it was a pleasure to listen the episode of the Effective Statistician podcast about the rise of R and the role played in it by the Application and Implementation of Methodologies in Statistics (AIMS) group within the asociation of
Statisticians in the Pharmaceutical Industry (PSI)
.

I think that it is worth to mention once again that FDA has clearly affirmed in the Statistical Software Clarifying Statement (6 May 2015) that it is not required the “use of any specific software for statistical analyses, and statistical”. However, the software used for statistical analyses “should be fully documented in the submission, including version and build identification”.

The main objectives of the AIMS group about R in pharma are educating statisticians and statistical programmers and validating and documenting R. The R validation project documentation is stored on a dedicated GitHub repository.

The second annual R/Pharma conference will take place this year in August at Harvard University. The call for papers give an idea of the opportunities and challenges of the use of R in pharma: reproducible research, regulatory compliance and validation, safety monitoring, clinical trials, drug discovery, research & development, PK/PD/pharmacometrics, genomics, diagnostics, immunogenicity.

A useful resource, dense of suggestions and ideas, is a document issued by Adrian Olszewski.

Cytel provided in a blog post a list of clear pros of R compared to SAS: ability to create effective visualisations; flexibility to combine with other tools; quick release of cutting edge methods; naturality of being a suport for collaboration and therefore innovation, as open source tool.

In conclusion, the rise of R in pharma is a fact and a lot of initiatives are in place. However, this rise is slow and mostly limited to visualization and machine learning applications, due to the position gained by SAS as validated standard in the industry.

# Working with big SAS datasets using R and sparklyr

In general, R loads all data into memory while SAS allocates memory dynamically to keep data on disk. This makes SAS a better solution for handling very large datasets.

I often need to work with large SAS data files that are prepared in the information system of my department. However, I always try to fit everything to my R workflow. This is because I like to manipulate data with dplyr and perform statistical analysis with all the available packages in R.

To this purpose I found the perfect solution with sparklyr.

First of all we need to install and load the packages.

```library(sparklyr)
library(spark.sas7bdat)
library(dplyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
```

Then I connect to a local instance of the installed Spark

```sc <- spark_connect(master = "local")
```

Finally it is possible to read the SAS files, manipulate them via dplyr and store in the R memory via collect command.

```df <- read_sas(file.path(my_directory, "myfile.sas7bdat"))
df_manipulated <- df %>%
filter()
df_manipulated <- collect(df_manipulated)
```

The command spark_read_sas return an object of class tbl_spark, which is a reference to a Spark data frame based on which dplyr functions can be executed.

The collect function returns a local data frame from the remote source of manipulated spark nibbles allowing for storage in the local memory.

This should be the file on which perform the data analysis and visualization steps.

Here some resources:

Big data in R

Importing 30GB of data into R with sparklyr

github.com/bnosac/spark.sas7bdat

sparklyr: R interface for Apache Spark

# ELK+R Stack

Elasticsearch is a search engine based on the Lucene library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.

Elasticsearch is becoming the bigger player in the technology for documents search in the noSQL space and is actually experiencing a great development phase (6 versions in few years and an exponentially growth of the community).

I have installed elasticsearch version 6.5.4 and kibana (aligned version 6.5.4) on my Mac where I have installed R software version 3.5.

First impact: I have worked few hours in trying to have everything fine installed on my machine. In order to work with such technologies a limited set of hacking skills are required.

For installing both elasticsearch and kibana I have followed the instructions on the elastic website.

I don’t spend here much time on installation issue due to the fact that they are all strongly dependent on operating systems and personal skills. Documentation and online forums will assist you in case of any problem.

After installation kibana was alive and kicking at http://localhost:5601/app/kibana

So, time is come to feed elasticsearch with some data.

I have suddenly thought to the NYC flight dataset available in the nycflights13 package, including on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

```library(nycflights13)
data(flights)
flights
```

I have installed the elastic package from CRAN

The connect command established the connection with my local elasticsearch

```library(elastic)
connect()
```

Then I sent the data frame to elasticsearch with the simple bulk command

```docs_bulk(flights, index = "flights_nyc_2013_idx")
```

The index argument provide the index name to use and is strictly required for data.frame input (optional for file inputs).

Opening kibana everything was ok and ready to play with

Then I have started to work on kibana for creating a dashboard for having useful insights from data. Not surprisingly june, july and December were the months at greater risk of delayed arrivals. Visualization and dashboard are ready for being included in websites trought specific iframe.

# My fi(R)st day with Jupyter Lab

Today was my first day with Jupyter Lab . I knew about Jupyter Lab listening episode 44 of  DataFramed, the DataCamp‘s official podcast presented by Hugo Bowne-Anderson. In this episode, Project Jupiter was described in the context of interactive computing by Brian Granger, professor of physics and data science at Cal Poly State University in San Luis Obispo, CA.

Brian is also co-founder of Project Jupyter and of the Altair project for data visualization. The episode is available at DataCamp website and other apps/platforms such as SoundCloud or CastBox.

As other statisticians and data scientists I had the opportunity to work with Python and Python notebooks but I really like R for data analysis, in particular because it is possible to do it in R Studio or other similar platforms. So, when I knew about the opportunity to work with a platform able to run Python and R notebooks(as well as Julia and many others) I started to be impatient!

JupyterLab is defined is a web-based user interface for Project Jupiter. It is possible to try it on the web without installation on Binder.

Installation is very simple. My option was to use only the terminal of my Mac, starting with pip:

```pip instal jupyterlab
```

JupyterLab requires the Jupyter Notebook version 4.3 or later. To check the version of  the notebook package installed:

```jupyter notebook --versionThen
```

Then I started JupyterLab:

```jupyter lab
```

JupyterLab opened automatically in the browser with the default workspace being http(s)://<server:port>/<lab-location>/lab.

I started really soon to appreciate the beauty and simplicity of the style, the opportunity of having a file system view on the left side of the screen, the availability of a text editor and the mixing of code and markdown cells in the notebooks.

It is really nice to have the opportunity to have a look at the cvs file in which you have stored the data of your project and the opening is really user friendly with the point and click option available for the delimiter (tab, comma or semicolon).

So far so good!

But my main reason for starting with Jupiter Lab was working with different programming languages in one web-based open-source platform.

I started to install an R notebook by means of the R kernel for the Jupiter environment. Unfortunately, it was not easy at all! After several tentatives I succeed using anaconda:

```conda install -c r r-irkernel
```

Then I started R from the terminal and installed the R kernel in the R session opened in the terminal:

```R
IRkernel::installspec(user=FALSE)
```

Reloading Jupiter Lab I could see the Python and R notebooks side by side!

I have chosen the dark theme because I found more relaxing for my view!

# Just another data science blog! But it’s mine!

“When you’ve written the same code 3 times, write a function,
When you’ve given the same in-person advice 3 times, write a blog post”

David Robson strongly advice data scientists to start their own blog in order to share codes, examples, thoughts.

My main purpose is to receive feedbacks, connect with other people and improve my skills.

Hope you enjoy!