R, PKI and certificate issues

I was looking for a way to encrypt data in R. Hadley Wickham’s secure package looked like it’d do the trick, using my SSH keys. So I set up my vault, and added my key. So far, so good. Then I tried to store some data:

Hmmmm. It failed to load my private key. No real surprise – it hadn’t asked for my password. Looking at the code, it’s using the PKI  package under the hood. The PKI.load.key() function has a password parameter, but the secure package isn’t passing it. And I can’t specify another cert (one without a password) even if I wanted to – the certificate path is hardcoded. Damn.

I don’t want to remove the password from my cert. I’m using it in too many places. Places I value password protection, like GitHub.

So much for using secure.

But inspired by secure, I took a look at using PKI  directly. First, I tried loading my public key:

Great. PKI doesn’t understand the format of my public key. Not for the first time, Stack Exchange came to the rescue. The following commands fixed the problem:

Now for the private key (note that I have RStudio prompt for the password rather than hardcoding it; that would need a more elegant solution if I were to productionise this code):
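The attempt looked roughly like this — rstudioapi supplies the interactive prompt, and the key path is wherever yours lives:

```r
library(PKI)

# Prompt for the passphrase interactively rather than hardcoding it
# (assumes an RStudio session, via the rstudioapi package)
priv.key <- PKI.load.key(
  file     = "~/.ssh/id_rsa",
  password = rstudioapi::askForPassword("Private key passphrase")
)
```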

Dammit. This took a bit more hunting around to fix. PKI strips out and ignores what looks like some pretty vital information at the top of the cert:

So I had to convert my private key to a format PKI  understands. I eventually tracked down what I needed here:
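I haven’t reproduced the linked page, but one commonly-cited way to rewrite an SSH private key into the traditional PEM format PKI can parse is `ssh-keygen -p -m PEM`. Demonstrated here on a throwaway key — in practice, operate on a *copy* of `~/.ssh/id_rsa`:

```shell
# Throwaway demo key, generated with a passphrase
ssh-keygen -t rsa -b 2048 -N "oldpass" -f /tmp/demo_key -q

# Rewrite the key in place in traditional PEM format, keeping a
# passphrase (-P is the old passphrase, -N the new one)
ssh-keygen -p -P "oldpass" -N "oldpass" -m PEM -f /tmp/demo_key

head -1 /tmp/demo_key
```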

The old and new keyfiles seem to be compatible, thankfully – replacing the old one doesn’t seem to have caused me any issues (since I used my old password for the new cert).

I’m finally in a position to use the package.
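To prove the keys work end to end, a round trip looks something like this — I’ve used a freshly generated RSA key here as a stand-in for the converted SSH key pair:

```r
library(PKI)

# Throwaway RSA key pair standing in for the converted SSH keys
key <- PKI.genRSAkey(bits = 2048)

# Encrypt a small payload with the key, then decrypt it again
cipher <- PKI.encrypt(charToRaw("a secret"), key)
rawToChar(PKI.decrypt(cipher, key))   # "a secret"
```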

What Type of Distribution Fits My Data?

As part of my capstone project at uni (my iLab), I’m trying to work out the length of stay for hospital patients in Australia, based on historical data. I was looking at the data the other day, trying to work out the statistical distribution of the target variable. Certain models make assumptions about the data (for example, linear models assume the residuals are normally distributed), and I wanted to be confident I wasn’t violating any such assumptions.

I realised I didn’t know how to do that.

It’s probably worth noting at this stage that if you just want to know whether your data fits a normal distribution, you can use the Shapiro-Wilk test of normality.
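For example, on simulated data:

```r
set.seed(42)

# Normal data: a large p-value, so no evidence against normality
shapiro.test(rnorm(200))

# Heavily skewed data: a tiny p-value, normality firmly rejected
shapiro.test(rexp(200))
```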

I found this image at the end of this blog post, which gives a nice decision tree to point you in the right direction. But I wanted something a bit more rigorous.

I had a vague recollection one of my colleagues had done something similar last semester, so I reached out to her and she sent me her team’s code. Based on what they’d done, I took a look at the fitdistrplus package, which seems to do what I’m after. An article in the Journal of Statistical Software goes through the package in a bit more detail than the R documentation. I’d recommend it as a starting point.

Code

If you want to follow along, my code is here.

Data Source and Preparation

I’ve signed a non-disclosure agreement with my client, so I can’t use their data here. Fortunately, I found a similar dataset online – the Centers for Disease Control and Prevention’s National Hospital Discharge Survey. I used the most recent (2010) data, which you can get here.

I’ll admit here that the data has some patients with very long hospital stays (of the order of months), which made many of the charts unreadable. So I trimmed the data, keeping a somewhat arbitrary 95% of it:
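The trim itself is a one-liner; `los` below is a simulated stand-in for the length-of-stay column, since I can’t share the real data:

```r
set.seed(1)
los <- rexp(1000, rate = 1/5)   # stand-in for length of stay, in days

# Keep the bottom 95% of stays — an arbitrary cutoff, but it makes the
# charts readable
los <- los[los <= quantile(los, probs = 0.95)]
```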

Choosing Possible Distributions

The first step is to narrow down the candidate distributions to consider. My data looks like this (created with the fitdistrplus::plotdist()  function):
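The call behind that chart is simply (with a simulated stand-in for my data):

```r
library(fitdistrplus)

set.seed(1)
los <- rexp(1000, rate = 1/5)   # stand-in for the trimmed data

# Histogram plus empirical density, alongside the empirical CDF
plotdist(los, histo = TRUE, demp = TRUE)
```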


One way to work out possible distributions is to use the decision tree I mentioned above, but an easier way is with the fitdistrplus::descdist() function. Running it on my data gives this plot:
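The call is along these lines (the plot is also known as a Cullen and Frey graph):

```r
library(fitdistrplus)

set.seed(1)
los <- rexp(1000, rate = 1/5)   # stand-in for the trimmed data

# Skewness-kurtosis plot of the data (plus bootstrap samples), overlaid
# on the regions covered by common theoretical distributions
descdist(los, boot = 1000)
```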


The distributions to consider are those closest to the blue Observation. So my data’s unlikely to be uniform, normal, or exponential. It lies in the beta range (but as my data isn’t all in the 0-1 range, I disregarded that). But it’s also fairly close to the gamma and lognormal lines, so I’ll check those out. I’ll also check out the Weibull distribution (as the note says, it’s close to gamma and lognormal), and just for contrast, I’ll leave the normal distribution in there too.

It’s worth noting that the descdist() output only mentions a handful of distributions, but fitdistrplus can handle any distribution that has the d&lt;name&gt;, p&lt;name&gt;, and q&lt;name&gt; functions defined in R (such as dnorm, pnorm and qnorm for the normal distribution). That includes distributions in third-party packages. See the CRAN Distributions task view for a list, noting that fitdistrplus is for fitting univariate distributions.

Fitting Distributions

The fitting itself is done with the fitdistrplus::fitdist() function. Running it for the Weibull distribution gives something like this:
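Roughly (simulated stand-in data again):

```r
library(fitdistrplus)

set.seed(1)
los <- rexp(1000, rate = 1/5)   # stand-in for the trimmed data

# Maximum-likelihood fit of a Weibull distribution
fit.weibull <- fitdist(los, "weibull")
summary(fit.weibull)

# The four diagnostic plots: density, CDF, Q-Q and P-P
plot(fit.weibull)
```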


Ideally, the “Empirical and theoretical dens” and “Empirical and theoretical CDFs” will be closely aligned, and the Q-Q and P-P plots lie close to the diagonal lines. According to the Journal of Statistical Software article, “the Q-Q plot emphasises the lack-of-fit at the distribution tails while the P-P plot emphasises the lack-of-fit at the distribution centre”.

The equivalent charts with all four distributions for comparison:
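They’re drawn with fitdistrplus’s comparison functions, each of which takes a list of fitted objects — something like:

```r
library(fitdistrplus)

set.seed(1)
los <- rexp(1000, rate = 1/5)   # stand-in for the trimmed data

fits <- list(
  fitdist(los, "weibull"),
  fitdist(los, "gamma"),
  fitdist(los, "lnorm"),
  fitdist(los, "norm")
)
leg <- c("Weibull", "Gamma", "Lognormal", "Normal")

par(mfrow = c(2, 2))
denscomp(fits, legendtext = leg)   # empirical and theoretical densities
cdfcomp(fits, legendtext = leg)    # empirical and theoretical CDFs
qqcomp(fits, legendtext = leg)     # Q-Q plot
ppcomp(fits, legendtext = leg)     # P-P plot
```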

We can see that while the lognormal distribution seems to fit the data well in the centre of the distribution (as evidenced by the P-P plot), it fits badly at the extremes (see the Q-Q plot).

It’s not obvious from the plots which, if any, of the distributions fits the data. Time to look at the figures.

Goodness of Fit

The fitdistrplus::gofstat()  function gives us numerical goodness of fit statistics, with the best-fitting distribution listed first.

So the Weibull distribution is closest to our data. But is our data truly described by a Weibull distribution? The easiest way to tell is to look at the kstest property of the object returned by gofstat(), which performs a Kolmogorov-Smirnov test. Kolmogorov-Smirnov tests the hypothesis that the empirical data could be drawn from the fitted distribution.

Note that for discrete, rather than continuous, distributions, a Chi-squared test should be used instead.
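Putting that together (same simulated stand-in data):

```r
library(fitdistrplus)

set.seed(1)
los <- rexp(1000, rate = 1/5)   # stand-in for the trimmed data

fits <- list(
  fitdist(los, "weibull"),
  fitdist(los, "gamma"),
  fitdist(los, "lnorm"),
  fitdist(los, "norm")
)

gof <- gofstat(fits, fitnames = c("Weibull", "Gamma", "Lognormal", "Normal"))
gof          # goodness-of-fit statistics and criteria, per distribution
gof$kstest   # "rejected" / "not rejected" for each Kolmogorov-Smirnov test
```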


So sadly, even the Weibull distribution fails to describe our data. Happily, I had more success with my client’s data.

R, Tidyverse and Databases

This semester at uni I’ve been doing a capstone project (an iLab). Due to the volume of data, I’ve slapped it into a local MySQL instance (as I already had MySQL installed). To get the dataset down to a manageable size before loading it into R, I’ve been toying around with tidyverse’s dbplyr. It uses the same standard approach as the rest of the tidyverse family, something I’m already comfortable with.

This post does assume some familiarity with SQL. Rather than risk violating my client’s non-disclosure agreement, I’ve created a small database of the early days of NASA, with the following structure:

If you want to follow along, the code to create and populate the MySQL database is here, and the R code is here. In both cases, change the password from xxxxxxxx to something more secure.

You’ll need to install the dbplyr package too.

Connecting to the database
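The connection and tbl definitions look something like this — the driver, host, user, and the `missions` table name reflect my local setup and schema, so treat them as assumptions:

```r
library(dplyr)
library(dbplyr)

# Connect via DBI using the RMySQL driver; swap in your own credentials
con <- DBI::dbConnect(
  RMySQL::MySQL(),
  dbname   = "nasa",
  host     = "localhost",
  user     = "nasa",
  password = "xxxxxxxx"   # change this, as with the setup scripts
)

# One tbl per database table — no data is fetched at this point
astronauts <- tbl(con, "astronauts")
missions   <- tbl(con, "missions")
```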

Lazy loading

Once defined, our tbls can be used much like any other R dataframe.

There are, however, a few things to be aware of. First, the tbls don’t contain the complete dataset at this point – it’s easier to think of them as promises to fetch the data from the database when needed. Looking more closely at the tbl makes this clearer:


We see it’s not a standard R dataframe, but an implementation of tbl called tbl_lazy. It’s the “lazy” part of the name that signifies the data won’t be loaded until required.
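You can see this from the class attribute too (the exact set of classes varies with the dbplyr version and database driver):

```r
# Assuming `astronauts` was created with tbl(con, "astronauts")
class(astronauts)
# e.g. "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
```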

Generated SQL

The show_query()  function illustrates exactly what form the SQL will take when it’s run. For example:

or


As the second example shows, dbplyr  will convert chained commands into a single piece of SQL where possible, having the database do the work, rather than retrieving data into memory and manipulating it there.
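A chained example of that kind might look like this — `astronauts` is assumed to be a tbl over the astronauts table, and the column names are illustrative:

```r
# A select followed by a filter is still translated into one statement
astronauts %>%
  select(astronaut_id, surname) %>%
  filter(surname != "Armstrong") %>%
  show_query()
```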

Examining the data

As mentioned, dbplyr won’t fetch the data from the database until it’s needed. But it will pull in a little so we can examine the data. For example, looking at the astronauts tbl:

Note that the first line says [?? x 3], and the last line says # ... with more rows, without specifying how many more rows. These are both indications that the full query hasn’t been run. In fact, looking at the database logs, the query that was run is SELECT * FROM `astronauts` LIMIT 10. The use of the LIMIT clause is dbplyr’s way of getting enough data to display, without needing to pull in all the data.

Note that the glimpse()  function in the dplyr  package isn’t quite as helpful:

Observations: 25 gives the impression that the size of the dataset is known, when in fact it is not. This time, the database logs show SELECT * FROM `astronauts` LIMIT 25, and

shows there are in fact 35 astronauts in the table.

Forcing data to be loaded

The previous command shows how to force R to load the data into memory: the collect()  function. It converts the promise of data into a concrete data frame. Compare

with
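A minimal sketch of the comparison:

```r
# Lazy: printing runs only a LIMITed query, and the row count shows as ??
astronauts

# collect() runs the full query and returns an ordinary in-memory tibble
local_astronauts <- collect(astronauts)
nrow(local_astronauts)   # a real number now — 35 in this dataset
```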

Complex queries

As mentioned above, dbplyr  will try to do as much processing as possible in the database. For example, in order to determine the astronauts who flew multiple missions in Gemini or Apollo, you’d do something like this with dplyr:
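A sketch of such a pipeline; since I can’t reproduce the schema diagram here, the missions and crew table and column names are assumed:

```r
# Hypothetical schema: missions(mission_id, program),
# mission_crew(mission_id, astronaut_id), astronauts(astronaut_id, surname)
multi_mission <- missions %>%
  filter(program %in% c("Gemini", "Apollo")) %>%
  inner_join(mission_crew, by = "mission_id") %>%
  count(astronaut_id, name = "n_missions") %>%
  filter(n_missions > 1) %>%
  inner_join(astronauts, by = "astronaut_id")
```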


Looking at how dbplyr  handles this, we find it’s all pushed into the database, and run as a single query:


A big, ugly query, to be sure. And while I could write it more succinctly by hand, I’ve found the query plans are identical, and it means I don’t need to write SQL when I already know dplyr.

Gotchas

dplyr  users will find it’s not a completely painless transition to dbplyr  however. Take, for example, finding all the astronauts whose surnames start with “C”. In dplyr , you might use something like:
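Something along these lines (the surname column is from the example database):

```r
# Fine on an in-memory data frame, but against the database dbplyr
# passes grepl() through untranslated, and MySQL has no GREPL function
astronauts %>%
  filter(grepl('^C', surname))
```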

Unfortunately, this will return an error reporting that FUNCTION nasa.GREPL does not exist .

One alternative is to collect()  the data before applying the filter:

That may not be efficient if we’re doing a lot of processing after the filter()  command, and it flies in the face of our goal to do as much processing as possible in the database.

In this case, we’d like to use the SQL LIKE  operator. We could guess that something like filter(surname %like% 'C%')  might do the trick. Rather than running the full (possibly expensive) query, we can use dbplyr ’s translate_sql()  function, which will show how an R command will be converted to SQL. For example:

If translate_sql()  doesn’t know what to do with a piece of code, it’ll pass the code to the database more or less as is. This was the cause of the FUNCTION nasa.GREPL does not exist  error we got above:

But it does mean that even if translate_sql()  doesn’t know about the SQL LIKE  operator, our guess of filter(surname %like% 'C%')  will be passed through to the database as we need:
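For instance — simulate_mysql() lets translate_sql() target MySQL without a live connection; the output shown is approximate:

```r
library(dbplyr)

# %like% has no built-in translation, so it's passed through verbatim —
# which here is exactly what we want: it becomes a SQL LIKE
translate_sql(surname %like% 'C%', con = simulate_mysql())
# <SQL> `surname` LIKE 'C%'
```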


Query plans

As well as show_query() , dbplyr  provides the explain()  function, which gives detail about how the database optimiser will run the query. This is useful to check that (for example) the database will make efficient use of indices.
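For example:

```r
# EXPLAIN output for the translated query — handy for checking whether
# MySQL can use an index on `surname`, say
astronauts %>%
  filter(surname %like% 'C%') %>%
  explain()
```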