lapply, models, and the call property

Note: I feel like I’ve read a similar article recently, but couldn’t track it down, so I figured I’d dig around and see what I could work out.

Some time last year I wanted to generate a bunch of different linear models against a dataset. No problem – I created a list of the formulae, and used lapply to create the models. Something like this:

model_formulae <- list(
  'Y ~ X1',
  'Y ~ I(X2*X2)',
  'Y ~ X1 + I(X2*X2)'
)

trained_models <- lapply(model_formulae,
                         function(formula, data) { lm(formula, data) },
                         data)

So far, so good. But when I looked at the results:

trained_models[[1]]
## 
## Call:
## lm(formula = formula, data = data)
## 
## Coefficients:
## (Intercept)           X1  
##     -108.48        89.19

The call property didn't give me the details of which formula was used for that model. Sure, I could work it out from the indices of model_formulae and trained_models. I could even name the elements of model_formulae to make it easier. But naming things is hard, and frankly I feel like I shouldn't have to.
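For what it's worth, lapply() preserves list names, so if I did bite the bullet and name the formulae, the fitted models would at least be easy to pick out. A sketch (the element names here are purely illustrative):

model_formulae <- list(
  x1_only    = 'Y ~ X1',
  x2_squared = 'Y ~ I(X2*X2)',
  both_terms = 'Y ~ X1 + I(X2*X2)'
)
# After the same lapply() call, trained_models$x2_squared picks out a model,
# but its call property is still the unhelpful lm(formula = formula, data = data).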

I tried a few things, such as the answers to this Stack Overflow question. While the question wasn't about my specific need, it seemed close enough that one of the answers might help. Sadly, none of them did.

I was starting to worry that I'd have to fall back on eval(parse(...)), a construct that's a pet hate of mine (and I'm not alone): it's harder to debug than normal code, and an IDE can't do much to help find errors. Every time I've seen it (thankfully not often), there's been a better way to code the same logic without it. So I wanted to avoid it if possible.
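For completeness, this is roughly what that would have looked like – a sketch of the construct I was trying to avoid, building each call as a string:

trained_models <- lapply(model_formulae,
                         function(model_formula, data) {
                           # builds e.g. "lm(Y ~ X1, data = data)" and evaluates it
                           eval(parse(text = paste0("lm(", model_formula, ", data = data)")))
                         },
                         data)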

A bit more digging led to this discussion, which held the solution – using do.call():

trained_models <- lapply(model_formulae,
                         function(model_formula, data) {
                           do.call("lm", list(formula = as.formula(model_formula), data = quote(data)))
                         },
                         data)

It's a bit more verbose*, and, I concede, less readable than the original version. But the results are what I'm after:

trained_models[[1]]
## 
## Call:
## lm(formula = Y ~ X1, data = data)
## 
## Coefficients:
## (Intercept)           X1  
##     -108.48        89.19

so I'll use it next time I hit this problem.

 

* The as.formula() call isn't strictly necessary, but I prefer the output to show the formula as a formula, not a string.

R, PKI and certificate issues

I was looking for a way to encrypt data in R. Hadley Wickham’s secure package looked like it’d do the trick, using my SSH keys. So I set up my vault, and added my key. So far, so good. Then I tried to store some data:

> secure::encrypt("google", key = "abcdasdf", secret = "asdfsad")
Error in PKI::PKI.load.key(file = "~/.ssh/id_rsa") :
  error:0907B00D:PEM routines:PEM_READ_BIO_PRIVATEKEY:ASN1 lib

Hmmmm. It failed to load my private key. No real surprise – it hadn’t asked for my password. Looking at the code, it’s using the PKI  package under the hood. The PKI.load.key() function has a password parameter, but the secure package isn’t passing it. And I can’t specify another cert (one without a password) even if I wanted to – the certificate path is hardcoded. Damn.

I don’t want to remove the password from my cert. I’m using it in too many places. Places I value password protection, like GitHub.

So much for using secure.

But inspired by secure, I took a look at using PKI  directly. First, I tried loading my public key:

> key_pub <- PKI.load.key(format = "PEM", private = FALSE, file = "~/.ssh/id_rsa.pub")
Error in PKI.load.key(format = "PEM", private = FALSE, file = "~/.ssh/id_rsa.pub") :
cannot find public RSA key in PEM format

Great. PKI doesn’t understand the format of my public key. Not for the first time, Stack Exchange came to the rescue. The following commands fixed the problem:

cp ~/.ssh/id_rsa.pub ~/.ssh/id_rsa.pub.old
ssh-keygen -f ~/.ssh/id_rsa.pub.old -e -m pem > ~/.ssh/id_rsa.pub

Now for the private key (note that I have RStudio prompt for the password rather than hardcoding it; that would need a more elegant solution if I were to productionise this code):

> key_prv <- PKI.load.key(format = "PEM", private = TRUE, file = "~/.ssh/id_rsa", password = rstudioapi::askForPassword("Private key password"))
Error in PKI.load.key(format = "PEM", private = TRUE, file = "~/.ssh/id_rsa", :
error:0D0680A8:asn1 encoding routines:ASN1_CHECK_TLEN:wrong tag

Dammit. This took a bit more hunting around to fix. PKI strips out and ignores what looks like some pretty vital information at the top of the cert:

$ head id_rsa
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: DES-EDE3-CBC,....

So I had to convert my private key to a format PKI  understands. I eventually tracked down what I needed here:

cp ~/.ssh/id_rsa ~/.ssh/id_rsa.old
openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa

The old and new keyfiles seem to be compatible, thankfully – replacing the old one doesn’t seem to have caused me any issues (since I used my old password for the new cert).

I’m finally in a position to use the package.
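With both keys loaded, a minimal round trip looks something like this (a sketch – PKI.encrypt() and PKI.decrypt() work on raw vectors, encrypting with the public key and decrypting with the private one):

secret <- charToRaw("something worth protecting")
encrypted <- PKI.encrypt(secret, key_pub)    # encrypt with the public key
rawToChar(PKI.decrypt(encrypted, key_prv))   # decrypt with the private key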

What Type of Distribution Fits My Data?

Update 4 June 2018: I recently came across this article by John D. Cook, which may give some food for thought before jumping in and trying to find a distribution that fits your data.


As part of my capstone project at uni (my iLab), I’m trying to work out the length of stay for hospital patients in Australia, based on historical data. I was looking at the data the other day, trying to work out the statistical distribution of the target variable. Certain models make assumptions about the data (for example, linear models assume the residuals are normally distributed), and I wanted to be confident I wasn’t violating any such assumptions.

I realised I didn’t know how to do that.

It’s probably worth noting at this stage that if you just want to know whether your data fits a normal distribution, you can use the Shapiro-Wilk test of normality.
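For example, in base R (a quick sketch on simulated data):

shapiro.test(rnorm(100))   # should give a large p-value: no evidence against normality
shapiro.test(rexp(100))    # should give a tiny p-value: normality rejected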

I found this image at the end of this blog post, which gives a nice decision tree to point you in the right direction. But I wanted something a bit more rigorous.

I had a vague recollection one of my colleagues had done something similar last semester, so I reached out to her and she sent me her team’s code. Based on what they’d done, I took a look at the fitdistrplus package, which seems to do what I’m after. An article in the Journal of Statistical Software goes through the package in a bit more detail than the R documentation. I’d recommend it as a starting point.

Code

If you want to follow along, my code is here.

Data Source and Preparation

I’ve signed a non-disclosure agreement with my client, so I can’t use their data here. Fortunately, I found a similar dataset online – the Centers for Disease Control and Prevention’s National Hospital Discharge Survey. I used the most recent (2010) data, which you can get here.

I’ll admit here that the data has some patients with very long hospital stays (of the order of months), which made many of the charts unreadable. So I trimmed the data, keeping a somewhat arbitrary 95% of it:

data <- data %>% filter(DaysOfCare <= quantile(DaysOfCare, 0.95))

Choosing Possible Distributions

The first step is to narrow down the candidate distributions to consider. My data looks like this (created with the fitdistrplus::plotdist()  function):

[Figure: plotdist() output for DaysOfCare – histogram and empirical CDF]

One way to work out possible distributions is to use the decision tree I mentioned above, but an easier way is with the fitdistrplus::descdist() function. Running it for my data gives this plot:

[Figure: descdist() output – a Cullen and Frey skewness-kurtosis graph, with the observation plotted against the candidate distributions]

The distributions to consider are those closest to the blue Observation. So my data’s unlikely to be uniform, normal, or exponential. It lies in the beta range (but as my data isn’t all in the 0-1 range, I disregarded that). But it’s also fairly close to the gamma and lognormal lines, so I’ll check those out. I’ll also check out the Weibull distribution (as the note says, it’s close to gamma and lognormal), and just for contrast, I’ll leave the normal distribution in there too.
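For reference, the descdist() call itself is a one-liner – a sketch, assuming the trimmed data frame from above (the boot argument overlays bootstrapped samples on the plot):

descdist(data$DaysOfCare, boot = 100)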

It’s worth noting that the descdist() output only mentions a handful of distributions, but fitdistrplus can handle any distribution that has the d<name>, p<name>, and q<name> functions defined in R (such as dnorm, pnorm and qnorm for the normal distribution). That includes distributions in third-party packages. See the CRAN Distributions task view for a list, noting that fitdistrplus is for fitting univariate distributions.

Fitting Distributions

The fitting itself is done with the fitdistrplus::fitdist() function. Running it for the Weibull distribution gives something like this:

[Figure: fitdist() diagnostic plots for the Weibull fit – density, CDF, Q-Q and P-P panels]

Ideally, the “Empirical and theoretical dens” and “Empirical and theoretical CDFs” panels will show the empirical and fitted curves closely aligned, and the points in the Q-Q and P-P plots will lie close to the diagonal lines. According to the Journal of Statistical Software article, “the Q-Q plot emphasises the lack-of-fit at the distribution tails while the P-P plot emphasises the lack-of-fit at the distribution centre”.
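A minimal sketch of the call behind the figure above, again assuming the trimmed data frame from earlier:

fit_weibull <- fitdist(data$DaysOfCare, "weibull")
plot(fit_weibull)   # draws the four diagnostic panels just described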

The equivalent charts with all four distributions for comparison:
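A sketch of producing them: fit all four candidates into a named list (the same list feeds gofstat() in the next section) and use fitdistrplus's comparison functions:

fitted_distributions <- list(
  "Weibull"    = fitdist(data$DaysOfCare, "weibull"),
  "Log Normal" = fitdist(data$DaysOfCare, "lnorm"),
  "Normal"     = fitdist(data$DaysOfCare, "norm"),
  "Gamma"      = fitdist(data$DaysOfCare, "gamma")
)
denscomp(fitted_distributions, legendtext = names(fitted_distributions))
cdfcomp(fitted_distributions, legendtext = names(fitted_distributions))
qqcomp(fitted_distributions, legendtext = names(fitted_distributions))
ppcomp(fitted_distributions, legendtext = names(fitted_distributions))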

We can see that while the lognormal distribution seems to fit the data well in the centre of the distribution (as evidenced by the P-P plot), it fits badly at the extremes (see the Q-Q plot).

It’s not obvious from the plots which, if any, of the distributions fits the data. Time to look at the figures.

Goodness of Fit

The fitdistrplus::gofstat() function gives us numerical goodness-of-fit statistics for each of the fitted distributions. For all of these measures, lower values indicate a better fit.

> gofstat(fitted_distributions, fitnames = names(fitted_distributions))
Goodness-of-fit statistics
                                  Weibull   Log Normal       Normal        Gamma
Kolmogorov-Smirnov statistic 7.438194e-02   0.01543517    0.1495439   0.06032522
Cramer-von Mises statistic   2.253009e+02   9.55129061 1107.1847636 136.42824146
Anderson-Darling statistic   1.413476e+03 116.62285733 6430.9223396 794.34782377

Goodness-of-fit criteria
                                Weibull Log Normal   Normal    Gamma
Akaike's Information Criterion 638172.7   623553.8 705760.7 630117.3
Bayesian Information Criterion 638192.4   623573.6 705780.5 630137.1

On every one of those measures, the lognormal distribution is the closest fit to our data. But is our data truly described by a lognormal distribution? The easiest way to tell is to look at the kstest property of the object returned by gofstat(), which performs a Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test examines the hypothesis that the empirical data could have been drawn from the fitted distribution.

Note that for discrete, rather than continuous, distributions, a Chi-squared test should be used instead.

> gofstat(fitted_distributions, fitnames = names(fitted_distributions))$kstest
   Weibull Log Normal     Normal      Gamma 
"rejected" "rejected" "rejected" "rejected" 

 

So sadly, even the best-fitting distribution fails to describe our data. Happily, I had more success with my client’s data.

R, Tidyverse and Databases

This semester at uni I’ve been doing a capstone project (an iLab). Due to the volume of data, I’ve slapped it into a local MySQL instance (as I already had MySQL installed). To get the dataset down to a manageable size before loading it into R, I’ve been toying around with tidyverse’s dbplyr. It uses the same standard approach as the rest of the tidyverse family, something I’m already comfortable with.

This post does assume some familiarity with SQL. Rather than risk violating my client’s non-disclosure agreement, I’ve created a small database of the early days of NASA, with five tables: programs, missions, astronauts, roles, and an astronaut_missions link table.

If you want to follow along, the code to create and populate the MySQL database is here, and the R code is here. In both cases, change the password from xxxxxxxx to something more secure.

You’ll need to install the dbplyr package too.
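Something like this should cover it (a sketch – the MySQL connection here also relies on the RMySQL package):

install.packages(c("dbplyr", "RMySQL"))
library(dplyr)   # dbplyr is used behind the scenes for database-backed tbls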

Connecting to the database

# Connect to the database (MYSQL_* constants have been previously defined)
dbcon <- src_mysql(
  dbname   = MYSQL_DB,
  host     = MYSQL_HOST,
  port     = MYSQL_PORT,
  user     = MYSQL_USER,
  password = MYSQL_PASS)

# Create references to our database tables (TABLE_NAME_* constants are strings already defined)
programs <- tbl(dbcon, TABLE_NAME_PROGRAMS)
astronauts <- tbl(dbcon, TABLE_NAME_ASTRONAUTS)
missions <- tbl(dbcon, TABLE_NAME_MISSIONS)
roles <- tbl(dbcon, TABLE_NAME_ROLES)
astronaut_missions <- tbl(dbcon, TABLE_NAME_ASTRONAUT_MISSIONS)

Lazy loading

Once defined, our tbls can be used much like any other R dataframe.

There are, however, a few things to be aware of. First, the tbls don’t contain the complete dataset at this point – it’s easier to think of them as promises to fetch the data from the database when needed. Looking more closely at the tbl makes this clearer:

str(astronauts)

## List of 2
## $ src:List of 2
## ..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot
## .. .. ..@ Id: int [1:2] 0 0
## ..$ disco:
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x :Classes 'ident', 'character' chr "astronauts"
## ..$ vars: chr [1:3] "id" "surname" "first_name"
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

 

We see it’s not a standard R dataframe, but an implementation of tbl called tbl_lazy. It’s the “lazy” part of the name that signifies the data won’t be loaded until required.
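You can see the same thing more directly from the class attribute at the bottom of that str() output:

class(astronauts)
## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl"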

Generated SQL

The show_query()  function illustrates exactly what form the SQL will take when it’s run. For example:

show_query(astronauts)
## <SQL>
## SELECT *
## FROM `astronauts`

or

astronauts %>%
  count() %>%
  show_query()

## <SQL>
## SELECT count(*) AS `n`
## FROM `astronauts`

 

As the second example shows, dbplyr  will convert chained commands into a single piece of SQL where possible, having the database do the work, rather than retrieving data into memory and manipulating it there.

Examining the data

As mentioned, dbplyr won’t fetch the data from the database until it’s needed. But it will pull in a little so we can examine the data. For example, looking at the astronauts tbl:

astronauts
## # Source:   table<astronauts> [?? x 3]
## # Database: mysql 5.7.11-log [nasa@localhost:/nasa]
##       id surname    first_name
##    <int> <chr>      <chr>     
##  1    19 Aldrin     Buzz      
##  2    23 Anders     Bill      
##  3    14 Armstrong  Neil      
##  4    25 Bean       Alan      
##  5    12 Borman     Frank     
##  6     4 Carpenter  Scott     
##  7    16 Cernan     Eugene    
##  8    20 Chaffee    Roger     
##  9    17 Collins    Michael   
## 10    10 Conrad     Pete      
## # ... with more rows

Note that the first line says [?? x 3], and the last line says # … with more rows, without specifying how many more rows. These are both indications that the full query hasn’t been run. In fact, looking at the database logs, the query that has been run is SELECT * FROM `astronauts` LIMIT 10. The LIMIT clause is dbplyr’s way of getting enough data to display, without needing to pull in all the data.
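The same mechanism is easy to trigger deliberately – for example, head() on a lazy tbl becomes a LIMIT clause (a sketch):

astronauts %>%
  head(3) %>%
  show_query()

## <SQL>
## SELECT *
## FROM `astronauts`
## LIMIT 3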

Note that the glimpse()  function in the dplyr  package isn’t quite as helpful:

glimpse(astronauts)
## Observations: 25
## Variables: 3
## $ id         <int> 19, 23, 14, 25, 12, 4, 16, 20, 17, 10, 6, 22, 33, 2...
## $ surname    <chr> "Aldrin", "Anders", "Armstrong", "Bean", "Borman", ...
## $ first_name <chr> "Buzz", "Bill", "Neil", "Alan", "Frank", "Scott", "...

Observations: 25 gives the impression that the size of the dataset is known, when it is not. This time, the database logs show SELECT * FROM `astronauts` LIMIT 25, and

astronauts %>%
  count() %>%
  collect()

## # A tibble: 1 x 1
##       n
##   <dbl>
## 1    35

shows there are in fact 35 astronauts in the table.

Forcing data to be loaded

The previous command shows how to force R to load the data into memory: the collect()  function. It converts the promise of data into a concrete data frame. Compare

str(astronauts)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot
## .. .. ..@ Id: int [1:2] 0 0
## ..$ disco:
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x :Classes 'ident', 'character' chr "astronauts"
## ..$ vars: chr [1:3] "id" "surname" "first_name"
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

with

str(astronauts %>% collect())
## Classes 'tbl_df', 'tbl' and 'data.frame': 35 obs. of 3 variables:
## $ id : int 19 23 14 25 12 4 16 20 17 10 ...
## $ surname : chr "Aldrin" "Anders" "Armstrong" "Bean" ...
## $ first_name: chr "Buzz" "Bill" "Neil" "Alan" ...

Complex queries

As mentioned above, dbplyr  will try to do as much processing as possible in the database. For example, in order to determine the astronauts who flew multiple missions in Gemini or Apollo, you’d do something like this with dplyr:

multiple_missions <- astronaut_missions %>%
    inner_join(missions, by=c("mission_id" = "id"), suffix = c("_am", "_m")) %>%
    inner_join(programs, by=c("program_id" = "id"), suffix = c("_m", "_p")) %>%
    filter(name_p %in% c("Gemini", "Apollo")) %>%
    group_by(name_p, astronaut_id) %>%
    summarise(flights = n()) %>%
    ungroup() %>%
    filter(flights >= 2) %>%
    inner_join(astronauts, by=c("astronaut_id" = "id"), suffix = c("_am", "_a")) %>%
    arrange(desc(flights), name_p, surname, first_name) %>%
    select(program_name = name_p, surname, first_name, flights)

 

multiple_missions

## # Source:     lazy query [?? x 4]
## # Database:   mysql 5.7.11-log [nasa@localhost:/nasa]
## # Ordered by: desc(flights), name_p, surname, first_name
##   program_name surname  first_name flights
##   <chr>        <chr>    <chr>        <dbl>
## 1 Apollo       Cernan   Eugene           2
## 2 Apollo       Lovell   Jim              2
## 3 Apollo       Scott    David            2
## 4 Apollo       Young    John             2
## 5 Gemini       Conrad   Pete             2
## 6 Gemini       Lovell   Jim              2
## 7 Gemini       Stafford Tom              2
## 8 Gemini       Young    John             2

 

Looking at how dbplyr  handles this, we find it’s all pushed into the database, and run as a single query:

multiple_missions %>%
    show_query()

 

## <SQL>
## SELECT `name_p` AS `program_name`, `surname` AS `surname`, `first_name` AS `first_name`, `flights` AS `flights`
## FROM (SELECT *
## FROM (SELECT `TBL_LEFT`.`name_p` AS `name_p`, `TBL_LEFT`.`astronaut_id` AS `astronaut_id`, `TBL_LEFT`.`flights` AS `flights`, `TBL_RIGHT`.`surname` AS `surname`, `TBL_RIGHT`.`first_name` AS `first_name`
## FROM (SELECT *
## FROM (SELECT `name_p`, `astronaut_id`, count(*) AS `flights`
## FROM (SELECT *
## FROM (SELECT `TBL_LEFT`.`mission_id` AS `mission_id`, `TBL_LEFT`.`role_id` AS `role_id`, `TBL_LEFT`.`astronaut_id` AS `astronaut_id`, `TBL_LEFT`.`name` AS `name_m`, `TBL_LEFT`.`program_id` AS `program_id`, `TBL_LEFT`.`class` AS `class`, `TBL_LEFT`.`command_module` AS `command_module`, `TBL_LEFT`.`lunar_module` AS `lunar_module`, `TBL_LEFT`.`launch_date` AS `launch_date`, `TBL_LEFT`.`splashdown_date` AS `splashdown_date`, `TBL_LEFT`.`lunar_landing_date` AS `lunar_landing_date`, `TBL_LEFT`.`lunar_liftoff_date` AS `lunar_liftoff_date`, `TBL_RIGHT`.`name` AS `name_p`
## FROM (SELECT `TBL_LEFT`.`mission_id` AS `mission_id`, `TBL_LEFT`.`role_id` AS `role_id`, `TBL_LEFT`.`astronaut_id` AS `astronaut_id`, `TBL_RIGHT`.`name` AS `name`, `TBL_RIGHT`.`program_id` AS `program_id`, `TBL_RIGHT`.`class` AS `class`, `TBL_RIGHT`.`command_module` AS `command_module`, `TBL_RIGHT`.`lunar_module` AS `lunar_module`, `TBL_RIGHT`.`launch_date` AS `launch_date`, `TBL_RIGHT`.`splashdown_date` AS `splashdown_date`, `TBL_RIGHT`.`lunar_landing_date` AS `lunar_landing_date`, `TBL_RIGHT`.`lunar_liftoff_date` AS `lunar_liftoff_date`
## FROM `astronaut_missions` AS `TBL_LEFT`
## INNER JOIN `missions` AS `TBL_RIGHT`
## ON (`TBL_LEFT`.`mission_id` = `TBL_RIGHT`.`id`)
## ) `TBL_LEFT`
## INNER JOIN `programs` AS `TBL_RIGHT`
## ON (`TBL_LEFT`.`program_id` = `TBL_RIGHT`.`id`)
## ) `mcqvbvfalf`
## WHERE (`name_p` IN ('Gemini', 'Apollo'))) `gnbzqrugnr`
## GROUP BY `name_p`, `astronaut_id`) `jsfgxpaqfn`
## WHERE (`flights` >= 2.0)) `TBL_LEFT`
## INNER JOIN `astronauts` AS `TBL_RIGHT`
## ON (`TBL_LEFT`.`astronaut_id` = `TBL_RIGHT`.`id`)
## ) `bnlhfwyrus`
## ORDER BY `flights` DESC, `name_p`, `surname`, `first_name`) `zhxhmqlgye`

 

A big, ugly query, to be sure. And while I could write it more succinctly by hand, I’ve found the query plans are identical, and I don’t need to write SQL by hand when I already know dplyr.

Gotchas

dplyr users will find it’s not a completely painless transition to dbplyr, however. Take, for example, finding all the astronauts whose surnames start with “C”. In dplyr, you might use something like:

astronauts %>%
    filter(grepl("^C", surname))

Unfortunately, this will return an error reporting that FUNCTION nasa.GREPL does not exist.

One alternative is to collect()  the data before applying the filter:

astronauts %>%
    collect() %>%
    filter(grepl("^C", surname))

That may not be efficient if we’re doing a lot of processing after the filter()  command, and it flies in the face of our goal to do as much processing as possible in the database.

In this case, we’d like to use the SQL LIKE operator. We could guess that something like filter(surname %like% 'C%') might do the trick. Rather than running the full (possibly expensive) query, we can use dbplyr’s translate_sql() function, which shows how an R expression will be converted to SQL. For example:

dbplyr::translate_sql(x == 1 && (y < 3))
## <SQL> "x" = 1.0 AND ("y" < 3.0)

If translate_sql()  doesn’t know what to do with a piece of code, it’ll pass the code to the database more or less as is. This was the cause of the FUNCTION nasa.GREPL does not exist  error we got above:

dbplyr::translate_sql(grepl("^C", surname))
## <SQL> GREPL('^C', "surname")

But it does mean that even if translate_sql() doesn’t know about the SQL LIKE operator, our guess of filter(surname %like% 'C%') will be passed through to the database as we need:

dbplyr::translate_sql(surname %like% 'C%')
## <SQL> "surname" LIKE 'C%'

 

astronauts %>%
  filter(surname %like% 'C%')

## # Source:   lazy query [?? x 3]
## # Database: mysql 5.7.11-log [nasa@localhost:/nasa]
##      id surname    first_name
##   <int> <chr>      <chr>     
## 1     4 Carpenter  Scott     
## 2    16 Cernan     Eugene    
## 3    20 Chaffee    Roger     
## 4    17 Collins    Michael   
## 5    10 Conrad     Pete      
## 6     6 Cooper     Gordo     
## 7    22 Cunningham Walt      

 

Query plans

As well as show_query(), dbplyr provides the explain() function, which gives details about how the database optimiser will run the query. This is useful for checking that (for example) the database will make efficient use of indices.

astronauts %>%
    filter(surname %like% 'C%') %>%
    explain()

 

## <SQL>
## SELECT *
## FROM `astronauts`
## WHERE (`surname` LIKE 'C%')

##

## <PLAN>
##   id select_type      table partitions  type  possible_keys            key
## 1  1      SIMPLE astronauts       <NA> range astronaut_name astronaut_name
##   key_len  ref rows filtered                    Extra
## 1      62 <NA>    7      100 Using where; Using index