class: left, middle, inverse, title-slide # Week 3: Introduction to SQL ## And Databases with R ### Joshua Goldberg
Data Understanding and Preparation: DABP12000 20A7
--- class: text-slide # Topics - Structured Query Language (SQL) Basics - Connect/Read/Write R ↔ MySQL - Comparing SQL and R (dplyr) --- class: text-slide, main-slide, center, middle, hide-count # Structured Query Language (SQL) Basics --- class: text-slide SQL is a declarative language: most programming languages are imperative, but SQL focuses on declarations (i.e., I want these columns from this table, but exclude these results) -- ```sql SELECT origin ,dest ,air_time FROM flights WHERE air_time > 300 ``` -- As a user of SQL, you get to focus on what you want from the language. <div class="my-footer"><span>https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql/</span></div> --- class: text-slide SQL syntax has a few orders to consider -- .box-1[LEXICAL] ```sql SELECT [DISTINCT] FROM WHERE GROUP BY HAVING UNION ORDER BY ``` -- .box-1[LOGICAL] ```sql FROM WHERE GROUP BY HAVING SELECT DISTINCT ORDER BY ``` <div class="my-footer"><span>https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql/</span></div> --- class: text-slide # 🚫 ```sql SELECT dep_delay / arr_delay AS airline_metric FROM airports -- airline_metric is not available in this expression! WHERE airline_metric = 10 GROUP BY tzone ``` -- # ✅ ```sql SELECT * FROM (SELECT dep_delay / arr_delay AS airline_metric FROM airports GROUP BY tzone) a WHERE a.airline_metric = 10 ``` <div class="my-footer"><span>https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql/</span></div> ??? Because of logical (execution) order `SELECT` comes after `WHERE` so you cannot filter on a projection that does not exist yet. Note, not all databases implement things the same way. Rule number 2, for instance, does not apply exactly in the above way to MySQL, PostgreSQL, and SQLite. --- class: text-slide # Tables are 👑 .pull-left[ ```sql SELECT * FROM airports FROM flights FROM weather FROM ... LIMIT 3; ``` ``` ## # A tibble: 3 × 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New… ## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chi… ## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chi… ``` ] -- .pull-right[...And are like variables in R ```sql SELECT * (SELECT * FROM airports WHERE tzone = 'America/New_York') new_york ``` ] <div class="my-footer"><span>https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql/</span></div> ??? SQL is mostly about table references, not columns. Make use of them. Don’t be afraid of writing derived tables or other complex table references. --- class: text-slide # `GROUP BY` modifies table references ```sql SELECT year ,carrier ,AVG(dep_delay) FROM flights GROUP BY year, carrier ``` - `GROUP BY` columns only are available for `SELECT` due to logical order - ⚠️ [However, MySQL does not follow this standard](https://blog.jooq.org/2012/08/05/mysql-bad-idea-384/) <div class="my-footer"><span>https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete-understanding-of-sql/</span></div> --- class: text-slide # SQL Code Style Can you quickly read this code easily? ```sql SELECT faa, flight, hour, minute, day, year, lat, lon, alt, tz, dst, tzone, dep_delay, AVG(arr_delay) FROM flights, airports, airline, weather WHERE origin = '04G' | origin = '09J' & dest = '0S9' airline IN (SELECT carrier FROM airlines ORDER BY carrier LIMIT 1) GROUP BY ... ``` -- - Consistency is key -- - Follow a style [guide](https://about.gitlab.com/handbook/business-ops/data-team/platform/sql-style-guide/) --- class: text-slide, main-slide, center, middle, hide-count # Connect/Read/Write R ↔ MySQL .pull-left[.center[ <figure> <img src="Rlogo.png" alt="R logo" title="R logo" width="100%"> </figure> ]] .pull-right[.center[ <figure> <img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" height="50%" width="100%"> </figure> ]] --- class: text-slide .pull-left[ 🚫 Password in scripts ```r # Define a connection variable used con_mydb <- dbConnect( MySQL(), user = key_list("servicename")$username, # Never type your password in scripts * password = "myVeryHardPassword123", dbname = "mydb", host = "localhost" ) ``` ] .pull-right[ ✅ Use secure password retrieval ```r # Define a connection variable con_mydb <- dbConnect( MySQL(), # More secure method user = key_list("servicename")$username, * password = key_get("servicename"), # password = rstudioapi::askForPassword("Database password"), dbname = "mydb", host = "localhost" ) ``` ] - [RStudio securing credentials](https://db.rstudio.com/best-practices/managing-credentials) - [keyring R package](https://github.com/r-lib/keyring): `install.packages("keyring")` - `rstudioapi::askForPassword("Database password")` --- class: text-slide # A brief note on MySQL and Passwords Run this command from MySQL Workbench: ```sql -- Assumes username is root; replace yourpassword with a password of your choosing (make it a strong password) ALTER USER root@localhost IDENTIFIED WITH mysql_native_password BY 'yourpassword'; ``` MySQL remote access is [disabled by default](https://stackoverflow.com/questions/14779104/how-to-allow-remote-connection-to-mysql), but still pick a secure password with the above SQL command to protect from any potential bad actors. Always use strong passwords. --- class: text-slide # R and Database Backends - __DBI__: R Database Interface - __dbplyr__: A 'dplyr' Back End for Databases --- class: text-slide .pull-left[ The __DBI__ package allows you to interface with the database in R ```r # Print table names in database mydb DBI::dbListTables(con_mydb) ``` ``` ## [1] "airlines" "airports" "flights" "planes" "weather" ``` ```r dbReadTable(con_mydb, "airlines") ``` ] -- .pull-right[ Output from `dbReadTable`: <table> <thead> <tr> <th style="text-align:left;"> carrier </th> <th style="text-align:left;"> name </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 9E </td> <td style="text-align:left;"> Endeavor Air Inc. </td> </tr> <tr> <td style="text-align:left;"> AA </td> <td style="text-align:left;"> American Airlines Inc. </td> </tr> <tr> <td style="text-align:left;"> AS </td> <td style="text-align:left;"> Alaska Airlines Inc. </td> </tr> <tr> <td style="text-align:left;"> B6 </td> <td style="text-align:left;"> JetBlue Airways </td> </tr> <tr> <td style="text-align:left;"> DL </td> <td style="text-align:left;"> Delta Air Lines Inc. </td> </tr> </tbody> </table> ] -- ```r DBI::dbCreateTable(con_mydb, "iris", iris) # Set append = TRUE if table exists; you can also set overwrite = TRUE instead DBI::dbWriteTable(con_mydb, "iris", iris, append = TRUE, row.names = FALSE) # Drop table DBI::dbRemoveTable(con_mydb, "iris") # Query table DBI::dbGetQuery(con_mydb, "SELECT * FROM iris LIMIT 10") # Disconnect when you are finished dbDisconnect(con_mydb) ``` --- class: text-slide .pull-left[ The __dbplyr__ package allows you to interact with the database with dplyr code ```r tbl(con_mydb, "airports") ``` ``` *## # Source: table<airports> [?? x 8] *## # Database: mysql 8.0.23 [@localhost:/mydb] ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/… ## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/… ## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/… ## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/… ## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/… ## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/… ## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/… ## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/… ## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/… ## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/… ## # … with more rows ``` ] .pull-right[ When working with databases, __dbplyr__ tries to be as lazy as possible (more on this later) ] --- class: text-slide, main-slide, center, middle, hide-count # Connecting concepts .pull-left[ <figure> <img src="Rlogo.png" alt="R logo" title="R logo" width="100%"> </figure> ] .pull-right[ <figure> <img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" width="100%"> </figure> ] --- class: text-slide # Focusing on Data Manipulation Language (DML) -- - `SELECT`: modify tables by selecting specific columns, changing column names, or deriving new columns -- - `DISTINCT` -- - `CASE WHEN` -- - `COUNT`, `MIN`, `MAX`, `SUM`, `AVG` ??? SQL Command Types Data Definition Language (DDL) CREATE, ALTER, DROP, TRUNCATE, COMMENT, RENAME Data Manipulation Language (DML) SELECT, INSERT, UPDATE, DELETE, MERGE, CALL, DESCRIBE, EXPLAIN, LOCK TABLE Data Control Language (DCL) GRANT, REVOKE Transaction Control Language (TCL) COMMIT, SAVEPOINT, ROLLBACK, SET TRANSACTION --- class: text-slide # Focusing on Data Manipulation Language (DML) - `WHERE`: filter tables with comparison operators -- - `LIMIT`: control the number observations displayed after your query executes -- - `GROUP BY`: modify table based on groups -- - `HAVING`: filter groups based on condition of aggregation -- - `ORDER BY`: sort our data (ascending or descending) --- name: memory-lane class: hide-count background-image: url(memory-lane.jpeg) background-size: cover --- class: text-slide # Tidy data in R (Essentially third normal form) - Each variable must have its own column - Each observation must have its own row - Each value must have its own cell .center[ <img src="tidy-1.png" alt="Tidy data" title="Tidy data" width="75%"> ] <div class="my-footer"><span>Image from R4DS: https://r4ds.had.co.nz/tidy-data.html</span></div> ??? Examples now of mapping R to SQL; we do not have to re-learn concepts! --- class: text-slide # Mapping R (dplyr) to SQL: `SELECT`, `WHERE` Query the `nycflights13::flights` table to determine all the unique `origin`, `dest` combintions in the `year` 2013. Limit the rows to 10 observartions. ```r flights_db <- tbl(con_mydb, "flights") flights_db %>% head(2) ``` ``` ## # Source: lazy query [?? x 19] ## # Database: mysql 8.0.23 [@localhost:/mydb] ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <chr> ``` --- class: text-slide # Mapping R (dplyr) to SQL: `SELECT`, `WHERE` .pull-left[ .center[<img src="Rlogo.png" alt="R logo" title="R logo" height="100px" width="100px">] ```r flights_db %>% filter(year == 2013) %>% select(origin, dest) %>% distinct() %>% head(5) ``` ``` ## # Source: lazy query [?? x 2] ## # Database: mysql 8.0.23 [@localhost:/mydb] ## origin dest ## <chr> <chr> ## 1 EWR IAH ## 2 LGA IAH ## 3 JFK MIA ## 4 JFK BQN ## 5 LGA ATL ``` ] .pull-right[ .center[<img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" height="100px" width="100px">] ```sql SELECT DISTINCT origin ,dest FROM flights WHERE year = 2013 LIMIT 5 ``` ``` ## # A tibble: 5 × 2 ## origin dest ## <chr> <chr> ## 1 EWR IAH ## 2 LGA IAH ## 3 JFK MIA ## 4 JFK BQN ## 5 LGA ATL ``` ] --- name: confused-gif class: hide-count background-image: url(confused.gif) background-size: cover --- class: text-slide # dbplyr translates dplyr code to SQL .pull-left[ ```r flights_db %>% filter(year == 2013) %>% select(origin, dest) %>% distinct() %>% head(5) %>% show_query() ``` ``` ## <SQL> ## SELECT DISTINCT `origin`, `dest` ## FROM `flights` ## WHERE (`year` = 2013.0) ## LIMIT 5 ``` ] -- .pull-right[ - [Verb translation](https://dbplyr.tidyverse.org/articles/translation-verb.html) - [Expression translation](https://dbplyr.tidyverse.org/reference/translate_sql.html) - [Function translation](https://dbplyr.tidyverse.org/articles/translation-function.html) ] --- class: text-slide # Mapping R (dplyr) to SQL: `MAX`, `ORDER BY` Query the `nycflights13::flights` table to return the top five `arr_delay` by `year`, `origin`, and `dest`. --- class: text-slide # Mapping R (dplyr) to SQL: `MAX`, `ORDER BY` .pull-left[ .center[<img src="Rlogo.png" alt="R logo" title="R logo" height="100px" width="100px">] ```r flights_db %>% group_by(year, origin, dest) %>% summarise(max_arr_delay = max(arr_delay)) %>% arrange(desc(max_arr_delay)) %>% head(5) ``` ] .pull-right[ .center[<img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" height="100px" width="100px">] ```sql SELECT year ,origin ,dest ,MAX(arr_delay) AS max_arr_delay FROM flights a GROUP BY year ,origin ,dest ORDER BY max_arr_delay DESC LIMIT 5 ``` ] --- class: text-slide # Mapping R (dplyr) to SQL: `MAX`, `ORDER BY` .pull-left[ .center[<img src="Rlogo.png" alt="R logo" title="R logo" height="100px" width="100px">] ``` ## # Source: lazy query [?? x 4] ## # Database: mysql 8.0.23 [@localhost:/mydb] ## # Groups: year, origin ## # Ordered by: desc(max_arr_delay) ## year origin dest max_arr_delay ## <int> <chr> <chr> <dbl> ## 1 2013 JFK HNL 1272 ## 2 2013 JFK CMH 1127 ## 3 2013 EWR ORD 1109 ## 4 2013 JFK SFO 1007 ## 5 2013 JFK CVG 989 ``` ] .pull-right[ .center[<img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" height="100px" width="100px">] ``` ## # A tibble: 5 × 4 ## year origin dest max_arr_delay ## <int> <chr> <chr> <dbl> ## 1 2013 JFK HNL 1272 ## 2 2013 JFK CMH 1127 ## 3 2013 EWR ORD 1109 ## 4 2013 JFK SFO 1007 ## 5 2013 JFK CVG 989 ``` ] --- class: text-slide # dbplyr translation: `MAX`, `ORDER BY` .pull-left[ ```r flights_db %>% group_by(year, origin, dest) %>% summarise(max_arr_delay = max(arr_delay)) %>% arrange(desc(max_arr_delay)) %>% head(5) %>% show_query() ``` ``` ## <SQL> ## SELECT `year`, `origin`, `dest`, MAX(`arr_delay`) AS `max_arr_delay` ## FROM `flights` ## GROUP BY `year`, `origin`, `dest` ## ORDER BY `max_arr_delay` DESC ## LIMIT 5 ``` ] --- class: text-slide # Mapping R (dplyr) to SQL: `CASE`, `DISTINCT` Query the `nycflights13::flights` and modify `month` to be the name of the month rather than a number. Display the distinct months with the changes. --- class: text-slide # Mapping R (dplyr) to SQL: `CASE`, `DISTINCT` .pull-left[ .center[<img src="Rlogo.png" alt="R logo" title="R logo" height="100px" width="100px">] ```r flights_db %>% distinct(month) %>% mutate(month_name = case_when( month == 1 ~ "January", month == 2 ~ "February", month == 3 ~ "March", month == 4 ~ "April", month == 5 ~ "May", month == 6 ~ "June", # You get the point TRUE ~ "Other month" )) %>% arrange(month) ``` ] .pull-right[ .center[<img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" height="100px" width="100px">] ```sql SELECT DISTINCT month, CASE WHEN month = 1 THEN 'January' WHEN month = 2 THEN 'February' WHEN month = 3 THEN 'March' WHEN month = 4 THEN 'April' WHEN month = 5 THEN 'May' WHEN month = 6 THEN 'June' -- You get the point ELSE 'Other month' END month_name FROM flights ORDER BY month ``` ] --- class: text-slide # Mapping R (dplyr) to SQL: `CASE`, `DISTINCT` .pull-left[ .center[<img src="Rlogo.png" alt="R logo" title="R logo" height="100px" width="100px">] ``` ## # Source: lazy query [?? x 2] ## # Database: mysql 8.0.23 [@localhost:/mydb] ## # Ordered by: month ## month month_name ## <int> <chr> ## 1 1 January ## 2 2 February ## 3 3 March ## 4 4 April ## 5 5 May ## 6 6 June ## 7 7 Other month ## 8 8 Other month ## 9 9 Other month ## 10 10 Other month ## # … with more rows ``` ] .pull-right[ .center[<img class="mysql-image" src="mysql-logo.png" alt="MySQL logo" title="MySQL logo" height="100px" width="100px">] ``` ## # A tibble: 12 × 2 ## month month_name ## <int> <chr> ## 1 1 January ## 2 2 February ## 3 3 March ## 4 4 April ## 5 5 May ## 6 6 June ## 7 7 Other month ## 8 8 Other month ## 9 9 Other month ## 10 10 Other month ## 11 11 Other month ## 12 12 Other month ``` ] --- class: text-slide # dbplyr translation: `CASE`, `DISTINCT` .pull-left[ ``` ## <SQL> ## SELECT `month`, CASE ## WHEN (`month` = 1.0) THEN ('January') ## WHEN (`month` = 2.0) THEN ('February') ## WHEN (`month` = 3.0) THEN ('March') ## WHEN (`month` = 4.0) THEN ('April') ## WHEN (`month` = 5.0) THEN ('May') ## WHEN (`month` = 6.0) THEN ('June') ## ELSE ('Other month') ## END AS `month_name` ## FROM (SELECT DISTINCT `month` ## FROM `flights`) `q01` ## ORDER BY `month` ``` ] --- class: text-slide # Why use R <del>instead</del> of SQL? Do not think of using R as a trade-off between SQL. Remember, R and SQL are both popular languages for dealing with data (data analysis, retrieval, manipulation, etc). Learning both R and SQL and how they can interface should provide a considered boost in your productivity and skills as a citizen of the data world. Utilize both R and SQL by their strengths. In some cases, using only R is the most effective. In other cases, you will want to use SQL. In many cases, I suspect, you will end up using _both_. --- class: text-slide # Why use R <del>instead</del> of SQL? R can also help you reuse SQL code by [placing SQL code in R functions](https://glue.tidyverse.org/reference/glue_sql.html). For example, here's a query inside an R function to pull sales data and subset by region: ```r query_sales_data <- function(state, city, con_mydb) { query <- glue_sql("SELECT * FROM sales WHERE state = {state}", .con = con_mydb) DBI::dbGetQuery(con_mydb, query) } ``` See this [vignette](https://dbplyr.tidyverse.org/articles/sql.html) for more examples of using dbplyr. --- class: text-slide # One difference between dplyr and SQL dplyr's lexical and logical order are identical, while SQL's lexical and logical order are different (covered earlier in these slides). As a result, running R code may feel somewhat more intuitive since the steps of code execution are sequential (the same order that you write your R code). --- class: text-slide, last-slide # Conclusion - Understanding lexical and logical order for SQL can help prevent common bugs in code - R can automate SQL code generation and facilitate reusing SQL code by placing SQL code in an R functions - You can download/upload data between R and SQL; automating the download/upload process with a programming language can save time