This semester at uni I’ve been doing a capstone project (an iLab). Due to the volume of data, I’ve slapped it into a local MySQL instance (as I already had MySQL installed). To get the dataset down to a manageable size before loading it into R, I’ve been toying around with tidyverse’s
dbplyr. It uses the same standard approach as the rest of the tidyverse family, something I’m already comfortable with.
This post does assume some familiarity with SQL. Rather than risk violating my client’s non-disclosure agreement, I’ve created a small database of the early days of NASA, with the following structure:

If you want to follow along, the code to create and populate the MySQL database is here, and the R code is here. In both cases, change the password from
xxxxxxxx to something more secure.
You’ll need to install the
dbplyr dbplyr package too.
Connecting to the database
|
# Connect to the database (MYSQL* constants have been previously defined) dbcon <- src_mysql( dbname = MYSQL_DB, host = MYSQL_HOST, port = MYSQL_PORT, user = MYSQL_USER, password = MYSQL_PASS) |
|
# Create references to our database tables (TABLE_NAME_* constants are strings already defined) programs <- tbl(dbcon, TABLE_NAME_PROGRAMS) astronauts <- tbl(dbcon, TABLE_NAME_ASTRONAUTS) missions <- tbl(dbcon, TABLE_NAME_MISSIONS) roles <- tbl(dbcon, TABLE_NAME_ROLES) astronaut_missions <- tbl(dbcon, TABLE_NAME_ASTRONAUT_MISSIONS) |
Lazy loading
Once defined, our
tbls can be used much like any other R dataframe.
There are, however, a few things to be aware of. First, the
tbls don’t contain the complete dataset at this point – it’s easier to think of them as promises to fetch the data from the database when needed. Looking more closely at the
tbl makes this clearer:
|
str(astronauts) ## List of 2 ## $ src:List of 2 ## ..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot ## .. .. ..@ Id: int [1:2] 0 0 ## ..$ disco: ## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src" ## $ ops:List of 2 ## ..$ x :Classes 'ident', 'character' chr "astronauts" ## ..$ vars: chr [1:3] "id" "surname" "first_name" ## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op" ## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" |
We see it’s not a standard R dataframe, but an implementation of
tbl called
tbl_lazy . It’s the “lazy” part of the name that signifies the data won’t be loaded until required.
Generated SQL
The
show_query() function illustrates exactly what form the SQL will take when it’s run. For example:
|
show_query(astronauts) ## <SQL> ## SELECT * ## FROM `astronauts` |
or
|
astronauts %>% count() %>% show_query() ## <SQL> ## SELECT count(*) AS `n` ## FROM `astronauts` |
As the second example shows,
dbplyr will convert chained commands into a single piece of SQL where possible, having the database do the work, rather than retrieving data into memory and manipulating it there.
Examining the data
As mentioned,
dbplyr won’t fetch the data from the database until it’s needed. But it will pull in a little so we can examine the data. For example, looking at the astronauts
tbl :
|
astronauts ## # Source: table [?? x 3] ## # Database: mysql 5.7.11-log [nasa@localhost:/nasa] ## id surname first_name ## ## 1 19 Aldrin Buzz ## 2 23 Anders Bill ## 3 14 Armstrong Neil ## 4 25 Bean Alan ## 5 12 Borman Frank ## 6 4 Carpenter Scott ## 7 16 Cernan Eugene ## 8 20 Chaffee Roger ## 9 17 Collins Michael ## 10 10 Conrad Pete ## # ... with more rows |
Note that the first line says
[?? x 3] , and the last line says
# ... with more rows , without specifying how many more rows. These are both indications that the full query hasn’t been run. In fact, looking at the database logs, the query that has been run is
SELECT * FROM ``astronauts`` LIMIT 10 . The use of the limit clause is
dbplyr ’s way of getting enough data to display, without needing to pull in all the data.
Note that the
glimpse() function in the
dplyr package isn’t quite as helpful:
|
glimpse(astronauts) ## Observations: 25 ## Variables: 3 ## $ id 19, 23, 14, 25, 12, 4, 16, 20, 17, 10, 6, 22, 33, 2... ## $ surname "Aldrin", "Anders", "Armstrong", "Bean", "Borman", ... ## $ first_name "Buzz", "Bill", "Neil", "Alan", "Frank", "Scott", "... |
Observations: 25 gives the impression that the size of the dataset is known, where it is not. This time, the database logs show
SELECT * FROM ``astronauts`` LIMIT 25 , and
|
astronauts %>% count() %>% collect() ## # A tibble: 1 x 1 ## n ## ## 1 35 |
shows there are in fact 35 astronauts in the table.
Forcing data to be loaded
The previous command shows how to force R to load the data into memory: the
collect() function. It converts the promise of data into a concrete data frame. Compare
|
str(astronauts) ## List of 2 ## $ src:List of 2 ## ..$ con :Formal class 'MySQLConnection' [package "RMySQL"] with 1 slot ## .. .. ..@ Id: int [1:2] 0 0 ## ..$ disco: ## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src" ## $ ops:List of 2 ## ..$ x :Classes 'ident', 'character' chr "astronauts" ## ..$ vars: chr [1:3] "id" "surname" "first_name" ## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op" ## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" |
with
|
str(astronauts %>% collect()) ## Classes 'tbl_df', 'tbl' and 'data.frame': 35 obs. of 3 variables: ## $ id : int 19 23 14 25 12 4 16 20 17 10 ... ## $ surname : chr "Aldrin" "Anders" "Armstrong" "Bean" ... ## $ first_name: chr "Buzz" "Bill" "Neil" "Alan" ... |
Complex queries
As mentioned above,
dbplyr will try to do as much processing as possible in the database. For example, in order to determine the astronauts who flew multiple missions in Gemini or Apollo, you’d do something like this with dplyr:
|
multiple_missions <- astronaut_missions %>% inner_join(missions, by=c("mission_id" = "id"), suffix = c("_am", "_m")) %>% inner_join(programs, by=c("program_id" = "id"), suffix = c("_m", "_p")) %>% filter(name_p %in% c("Gemini", "Apollo")) %>% group_by(name_p, astronaut_id) %>% summarise(flights = n()) %>% ungroup() %>% filter(flights >= 2) %>% inner_join(astronauts, by=c("astronaut_id" = "id"), suffix = c("_am", "_a")) %>% arrange(desc(flights), name_p, surname, first_name) %>% select(program_name = name_p, surname, first_name, flights) |
|
multiple_missions ## # Source: lazy query [?? x 4] ## # Database: mysql 5.7.11-log [nasa@localhost:/nasa] ## # Ordered by: desc(flights), name_p, surname, first_name ## program_name surname first_name flights ## ## 1 Apollo Cernan Eugene 2 ## 2 Apollo Lovell Jim 2 ## 3 Apollo Scott David 2 ## 4 Apollo Young John 2 ## 5 Gemini Conrad Pete 2 ## 6 Gemini Lovell Jim 2 ## 7 Gemini Stafford Tom 2 ## 8 Gemini Young John 2 |
Looking at how
dbplyr handles this, we find it’s all pushed into the database, and run as a single query:
|
multiple_missions %>% show_query() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
|
## <SQL> ## SELECT `name_p` AS `program_name`, `surname` AS `surname`, `first_name` AS `first_name`, `flights` AS `flights` ## FROM (SELECT * ## FROM (SELECT `TBL_LEFT`.`name_p` AS `name_p`, `TBL_LEFT`.`astronaut_id` AS `astronaut_id`, `TBL_LEFT`.`flights` AS `flights`, `TBL_RIGHT`.`surname` AS `surname`, `TBL_RIGHT`.`first_name` AS `first_name` ## FROM (SELECT * ## FROM (SELECT `name_p`, `astronaut_id`, count(*) AS `flights` ## FROM (SELECT * ## FROM (SELECT `TBL_LEFT`.`mission_id` AS `mission_id`, `TBL_LEFT`.`role_id` AS `role_id`, `TBL_LEFT`.`astronaut_id` AS `astronaut_id`, `TBL_LEFT`.`name` AS `name_m`, `TBL_LEFT`.`program_id` AS `program_id`, `TBL_LEFT`.`class` AS `class`, `TBL_LEFT`.`command_module` AS `command_module`, `TBL_LEFT`.`lunar_module` AS `lunar_module`, `TBL_LEFT`.`launch_date` AS `launch_date`, `TBL_LEFT`.`splashdown_date` AS `splashdown_date`, `TBL_LEFT`.`lunar_landing_date` AS `lunar_landing_date`, `TBL_LEFT`.`lunar_liftoff_date` AS `lunar_liftoff_date`, `TBL_RIGHT`.`name` AS `name_p` ## FROM (SELECT `TBL_LEFT`.`mission_id` AS `mission_id`, `TBL_LEFT`.`role_id` AS `role_id`, `TBL_LEFT`.`astronaut_id` AS `astronaut_id`, `TBL_RIGHT`.`name` AS `name`, `TBL_RIGHT`.`program_id` AS `program_id`, `TBL_RIGHT`.`class` AS `class`, `TBL_RIGHT`.`command_module` AS `command_module`, `TBL_RIGHT`.`lunar_module` AS `lunar_module`, `TBL_RIGHT`.`launch_date` AS `launch_date`, `TBL_RIGHT`.`splashdown_date` AS `splashdown_date`, `TBL_RIGHT`.`lunar_landing_date` AS `lunar_landing_date`, `TBL_RIGHT`.`lunar_liftoff_date` AS `lunar_liftoff_date` ## FROM `astronaut_missions` AS `TBL_LEFT` ## INNER JOIN `missions` AS `TBL_RIGHT` ## ON (`TBL_LEFT`.`mission_id` = `TBL_RIGHT`.`id`) ## ) `TBL_LEFT` ## INNER JOIN `programs` AS `TBL_RIGHT` ## ON (`TBL_LEFT`.`program_id` = `TBL_RIGHT`.`id`) ## ) `mcqvbvfalf` ## WHERE (`name_p` IN ('Gemini', 'Apollo'))) `gnbzqrugnr` ## GROUP BY `name_p`, `astronaut_id`) `jsfgxpaqfn` ## WHERE (`flights` >= 2.0)) `TBL_LEFT` ## INNER JOIN `astronauts` AS `TBL_RIGHT` ## ON (`TBL_LEFT`.`astronaut_id` = `TBL_RIGHT`.`id`) ## ) `bnlhfwyrus` ## ORDER BY `flights` DESC, `name_p`, `surname`, `first_name`) `zhxhmqlgye` |
A big, ugly, query to be sure. And while I could write it more succinctly by hand, I’ve found the query plans are identical, and I need not use SQL if I already know
dplyr .
Gotchas
dplyr users will find it’s not a completely painless transition to
dbplyr however. Take, for example, finding all the astronauts whose surnames start with “C”. In
dplyr , you might use something like:
|
astronauts %>% filter(grepl("^C", surname)) |
Unfortunately, this will return an error reporting that
FUNCTION nasa.GREPL does not exist .
One alternative is to
collect() the data before applying the filter:
|
astronauts %>% collect() %>% filter(grepl("^C", surname)) |
That may not be efficient if we’re doing a lot of processing after the
filter() command, and it flies in the face of our goal to do as much processing as possible in the database.
In this case, we’d like to use the SQL
LIKE operator. We could guess that something like
filter(surname %like% 'C%') might do the trick. Rather than running the full (possibly expensive) query, we can use
dbplyr ’s
translate_sql() function, which will show how an R command will be converted to SQL. For example:
|
dbplyr::translate_sql(x == 1 && (y 3)) ## <SQL> "x" = 1.0 AND ("y" 3.0) |
If
translate_sql() doesn’t know what to do with a piece of code, it’ll pass the code to the database more or less as is. This was the cause of the
FUNCTION nasa.GREPL does not exist error we got above:
|
dbplyr::translate_sql(grepl("^C", surname)) ## <SQL> GREPL('^C', "surname") |
But it does mean that even if
translate_sql() doesn’t know about the SQL
LIKE operator, our guess of
filter(surname %like% 'C%') will be passed through to the database as we need:
|
dbplyr::translate_sql(surname %like% 'C%') ## <SQL> "surname" LIKE 'C%' |
|
astronauts %>% filter(surname %like% 'C%') ## # Source: lazy query [?? x 3] ## # Database: mysql 5.7.11-log [nasa@localhost:/nasa] ## id surname first_name ## ## 1 4 Carpenter Scott ## 2 16 Cernan Eugene ## 3 20 Chaffee Roger ## 4 17 Collins Michael ## 5 10 Conrad Pete ## 6 6 Cooper Gordo ## 7 22 Cunningham Walt |
Query plans
As well as
show_query() ,
dbplyr provides the
explain() function, which gives detail about how the database optimiser will run the query. This is useful to check that (for example) the database will make efficient use of indices.
|
astronauts %>% filter(surname %like% 'C%') %>% explain() |
|
## <SQL> ## SELECT * ## FROM `astronauts` ## WHERE (`surname` LIKE 'C%') ## ## <PLAN> ## id select_type table partitions type possible_keys key ## 1 1 SIMPLE astronauts range astronaut_name astronaut_name ## key_len ref rows filtered Extra ## 1 62 7 100 Using where; Using index |