lapply, models, and the call property

Note: I feel like I’ve read a similar article recently, but couldn’t track it down, so I figured I’d dig around and see what I could work out.

Some time last year I wanted to generate a bunch of different linear models against a dataset. No problem – I created a list of the formulae, and used lapply to create the models. Something like this:

model_formulae <- list(
  'Y ~ X1',
  'Y ~ I(X2*X2)',
  'Y ~ X1 + I(X2*X2)'
)

trained_models <- lapply(model_formulae,
                         function(formula, data) { lm(formula, data) },
                         data)

So far, so good. But when I looked at the results:

trained_models[[1]]

##
## Call:
## lm(formula = formula, data = data)
##
## Coefficients:
## (Intercept)           X1
##     -108.48        89.19

The call property didn’t give me the details of which formula was used for that model. Sure, I could work it out from the indices of model_formulae and trained_models. I could even name the elements of model_formulae to make it easier. But naming things is hard, and frankly I feel like I shouldn’t have to.

I tried a few things, such as the answers to this Stack Overflow question. While the question wasn’t about my specific need, it seemed close enough that one of the answers might help. Sadly, none of them did.

I was starting to worry that I’d have to fall back on eval(parse(...)), a construct that’s a pet hate of mine (and I’m not alone): it’s harder to debug than normal code, and an IDE can’t do much to help find errors. Every time I’ve seen it (thankfully not often), there’s been a better way to code the same logic without it. So I wanted to avoid it if possible.

A bit more digging led to this discussion, which held the solution – using do.call():

trained_models <- lapply(model_formulae,
                         function(model_formula, data) {
                           do.call("lm", list(formula = as.formula(model_formula), data = quote(data)))
                         },
                         data)

It’s a bit more verbose*, and I concede less readable, than the original version. But the results are what I’m after:

trained_models[[1]]

##
## Call:
## lm(formula = Y ~ X1, data = data)
##
## Coefficients:
## (Intercept)           X1
##     -108.48        89.19

so I’ll use it next time I hit this problem.

* The as.formula() call isn’t strictly necessary, but I prefer the output to show the formula as a formula, not a string.