dplyr with NYC flights

September 28, 2016

What is dplyr?

Tools for data exploration and transformation
Intuitive to write and easy to read
Super-fast on data frames

Installing

if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}
install.packages(c("dplyr","nycflights13"))

Load our data package

library(nycflights13)
suppressMessages(library(dplyr))

Package nycflights13 has data about all flights that departed NYC in 2013 - 5 datasets

flights: Flights data
airlines: Airline names
airports: Airport metadata
planes : Plane metadata
weather: Hourly weather data

Look at `flights` data

# explore data
data(flights)
dim(flights)

## [1] 336776     19

head(flights,3)

## # A tibble: 3 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

336,776 rows… because of its size, this particular dataset has been wrapped, by default, using tbl_df (what's that?)

tbl_df

Prints nicely and prevents an accidental display of the whole dataset

tblflights <- tbl_df(flights)
head(tblflights,3) # Can also use print(flights,3) instead

## # A tibble: 3 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

But if one insists, we can always revert back to a regular dataframe

# convert to a normal data frame to see all of the columns
head(data.frame(tblflights),3)

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
##   arr_delay carrier flight tailnum origin dest air_time distance hour
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5
##   minute           time_hour
## 1     15 2013-01-01 05:00:00
## 2     29 2013-01-01 05:00:00
## 3     40 2013-01-01 05:00:00

Basic single table (df) verbs

filter: for subsetting variables
select: for subsetting rows
arrange: for re-ordering rows
mutate: for adding new columns
summarise or summarize: for reducing each group to a smaller number of summary statistics

filter: Keep rows matching criteria

# base R: tblflights[tblflights$carrier=="AA" & tblflights$origin=="LGA", ]
filter(tblflights, carrier=="AA" & origin=="LGA")

## # A tibble: 15,459 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      558            600        -2      753
## 2   2013     1     1      559            600        -1      941
## 3   2013     1     1      623            610        13      920
## 4   2013     1     1      629            630        -1      824
## 5   2013     1     1      635            635         0     1028
## 6   2013     1     1      656            700        -4      854
## 7   2013     1     1      659            700        -1     1008
## 8   2013     1     1      724            730        -6     1111
## 9   2013     1     1      739            745        -6      918
## 10  2013     1     1      753            755        -2     1056
## # ... with 15,449 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

# same as filter(tblflights, carrier=="AA", origin=="LGA")

filter again

filter(tblflights, carrier=="AA" | carrier=="UA")

## # A tibble: 91,394 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      554            558        -4      740
## 5   2013     1     1      558            600        -2      753
## 6   2013     1     1      558            600        -2      924
## 7   2013     1     1      558            600        -2      923
## 8   2013     1     1      559            600        -1      941
## 9   2013     1     1      559            600        -1      854
## 10  2013     1     1      606            610        -4      858
## # ... with 91,384 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

filter again

filter(tblflights, carrier %in% c("AA", "UA"))

## # A tibble: 91,394 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      554            558        -4      740
## 5   2013     1     1      558            600        -2      753
## 6   2013     1     1      558            600        -2      924
## 7   2013     1     1      558            600        -2      923
## 8   2013     1     1      559            600        -1      941
## 9   2013     1     1      559            600        -1      854
## 10  2013     1     1      606            610        -4      858
## # ... with 91,384 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

select: Pick columns by name

# base R approach to select DepTime, ArrTime, and FlightNum columns
head(tblflights[, c("dep_time", "arr_time", "flight")])

## # A tibble: 6 × 3
##   dep_time arr_time flight
##      <int>    <int>  <int>
## 1      517      830   1545
## 2      533      850   1714
## 3      542      923   1141
## 4      544     1004    725
## 5      554      812    461
## 6      554      740   1696

select again

# dplyr approach
print(select(tblflights, dep_time, arr_time, flight),n=6)

## # A tibble: 336,776 × 3
##   dep_time arr_time flight
##      <int>    <int>  <int>
## 1      517      830   1545
## 2      533      850   1714
## 3      542      923   1141
## 4      544     1004    725
## 5      554      812    461
## 6      554      740   1696
## # ... with 3.368e+05 more rows

select again

# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
head(select(tblflights, flight:dest, contains("arr"), contains("dep")))

## # A tibble: 6 × 11
##   flight tailnum origin  dest arr_time sched_arr_time arr_delay carrier
##    <int>   <chr>  <chr> <chr>    <int>          <int>     <dbl>   <chr>
## 1   1545  N14228    EWR   IAH      830            819        11      UA
## 2   1714  N24211    LGA   IAH      850            830        20      UA
## 3   1141  N619AA    JFK   MIA      923            850        33      AA
## 4    725  N804JB    JFK   BQN     1004           1022       -18      B6
## 5    461  N668DN    LGA   ATL      812            837       -25      DL
## 6   1696  N39463    EWR   ORD      740            728        12      UA
## # ... with 3 more variables: dep_time <int>, sched_dep_time <int>,
## #   dep_delay <dbl>

Chaining over Nesting?

# nesting method to select carrier and dep_delay columns and filter for delays over 60 minutes
head(filter(select(tblflights, carrier, dep_delay), dep_delay > 60))

## # A tibble: 6 × 2
##   carrier dep_delay
##     <chr>     <dbl>
## 1      MQ       101
## 2      AA        71
## 3      MQ       853
## 4      UA       144
## 5      UA       134
## 6      EV        96

Chaining over Nesting

tblflights %>%
    select(carrier, dep_delay) %>%
    filter(dep_delay > 60) %>%
    head()

## # A tibble: 6 × 2
##   carrier dep_delay
##     <chr>     <dbl>
## 1      MQ       101
## 2      AA        71
## 3      MQ       853
## 4      UA       144
## 5      UA       134
## 6      EV        96

Chaining

Chaining increases readability significantly when there are many commands
Operator is automatically imported from the magrittr package
Can be used to replace nesting in R commands outside of dplyr

Chaining

# create two vectors and calculate Euclidian distance between them
x1 <- 1:5; x2 <- 2:6

# Usual 
sqrt(sum((x1-x2)^2))

## [1] 2.236068

# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()

## [1] 2.236068

arrange: Reorder rows

# base R approach to select carrier and dep_delay columns and sort by dep_delay
head(tblflights[order(tblflights$dep_delay), c("carrier", "dep_delay")])

## # A tibble: 6 × 2
##   carrier dep_delay
##     <chr>     <dbl>
## 1      B6       -43
## 2      DL       -33
## 3      EV       -32
## 4      DL       -30
## 5      F9       -27
## 6      MQ       -26

arrange

# dplyr approach
tblflights %>%
    select(carrier, dep_delay) %>%
    arrange(dep_delay) %>% # arrange(desc(dep_delay)) for descending order
    head()

## # A tibble: 6 × 2
##   carrier dep_delay
##     <chr>     <dbl>
## 1      B6       -43
## 2      DL       -33
## 3      EV       -32
## 4      DL       -30
## 5      F9       -27
## 6      MQ       -26

mutate: create new variables that are functions of existing variables

# base R approach to create a new variable - sum of squares of delays (arr and dep)
tblflights$delaysquare <- tblflights$dep_delay^2 + tblflights$arr_delay^2
head(tblflights[, c("dep_delay", "arr_delay", "delaysquare")])

## # A tibble: 6 × 3
##   dep_delay arr_delay delaysquare
##       <dbl>     <dbl>       <dbl>
## 1         2        11         125
## 2         4        20         416
## 3         2        33        1093
## 4        -1       -18         325
## 5        -6       -25         661
## 6        -4        12         160

mutate

# dplyr approach (prints the new variable but does not store it)
tblflights %>%
    select(dep_delay, arr_delay) %>%
    mutate(delaysquare = dep_delay^2 + arr_delay^2) %>%
    head()

## # A tibble: 6 × 3
##   dep_delay arr_delay delaysquare
##       <dbl>     <dbl>       <dbl>
## 1         2        11         125
## 2         4        20         416
## 3         2        33        1093
## 4        -1       -18         325
## 5        -6       -25         661
## 6        -4        12         160

# store the new variable: tblflights <- tblflights %>% 
#         mutate(delaysquare = dep_delay^2 + arr_delay^2)

summarise/summarize: Reduce multiple variables to values

Primarily useful with data that has been grouped by one or more variables
group_by creates the groups that will be operated on
summarise uses the provided aggregation function to summarise each group

# base R approaches to calculate the mean arrival delays at different airports
aggregate(arr_delay ~ origin, tblflights, mean)

##   origin arr_delay
## 1    EWR  9.107055
## 2    JFK  5.551481
## 3    LGA  5.783488

# or with(tblflights, tapply(arr_delay, origin, mean, na.rm=TRUE))

summarise

# dplyr approach: create a table grouped by origin, and then summarise each group by taking the mean of arr_delay
tblflights %>%
    group_by(origin) %>%
    summarise(avg_delay = mean(arr_delay, na.rm=TRUE))

## # A tibble: 3 × 2
##   origin avg_delay
##    <chr>     <dbl>
## 1    EWR  9.107055
## 2    JFK  5.551481
## 3    LGA  5.783488

summarise_each/mutate_each: apply the same summary/mutate function(s) to multiple columns at once

# for each carrier, calculate the mean arrival and departure delays at the different origin airports
tblflights %>%
    group_by(origin) %>%
    summarise_each(funs(mean(.,na.rm=TRUE)), arr_delay, dep_delay) %>%
    head()

## # A tibble: 3 × 3
##   origin arr_delay dep_delay
##    <chr>     <dbl>     <dbl>
## 1    EWR  9.107055  15.10795
## 2    JFK  5.551481  12.11216
## 3    LGA  5.783488  10.34688

summarise_each

# for each carrier, calculate the minimum and maximum of arrival and departure delays
tblflights %>%
    group_by(carrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("_delay")) %>%
    head()

## # A tibble: 6 × 5
##   carrier dep_delay_min arr_delay_min dep_delay_max arr_delay_max
##     <chr>         <dbl>         <dbl>         <dbl>         <dbl>
## 1      9E           -24           -68           747           744
## 2      AA           -24           -75          1014          1007
## 3      AS           -21           -74           225           198
## 4      B6           -43           -71           502           497
## 5      DL           -33           -71           960           931
## 6      EV           -32           -62           548           577

mutate_each

tblflights %>%
    select(matches("_delay")) %>%
    head(3)

## # A tibble: 3 × 2
##   dep_delay arr_delay
##       <dbl>     <dbl>
## 1         2        11
## 2         4        20
## 3         2        33

tblflights %>%
    select(matches("_delay")) %>%
    mutate_each(funs(half=./2)) %>%
    head(3)

## # A tibble: 3 × 4
##   dep_delay arr_delay dep_delay_half arr_delay_half
##       <dbl>     <dbl>          <dbl>          <dbl>
## 1         2        11              1            5.5
## 2         4        20              2           10.0
## 3         2        33              1           16.5

n(): counts the number of rows in a group

# for each month, count the total number of flights each day and sort in order of busiest days
tblflights %>%
    group_by(month,day) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count)) %>%
    head()

## Source: local data frame [6 x 3]
## Groups: month [3]
## 
##   month   day flight_count
##   <int> <int>        <int>
## 1    11    27         1014
## 2     7    11         1006
## 3     7     8         1004
## 4     7    10         1004
## 5    12     2         1004
## 6     7    18         1003

# rewrite more simply with the `tally` function
tblflights %>%
    group_by(month, day) %>%
    tally(sort = TRUE)

n_distinct(vector): counts the number of unique items in that vector

# for each destination, count the total number of flights and the number of distinct planes that flew there
tblflights %>%
    group_by(dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(tailnum)) %>%
    head()

## # A tibble: 6 × 3
##    dest flight_count plane_count
##   <chr>        <int>       <int>
## 1   ABQ          254         108
## 2   ACK          265          58
## 3   ALB          439         172
## 4   ANC            8           6
## 5   ATL        17215        1180
## 6   AUS         2439         993

Grouping without summarising

# for each destination, show the number of flights from the 3 origin airports 
tblflights %>%
    group_by(dest) %>%
    select(origin) %>%
    table() %>%
    head()

## Adding missing grouping variables: `dest`

##      origin
## dest    EWR   JFK   LGA
##   ABQ     0   254     0
##   ACK     0   265     0
##   ALB   439     0     0
##   ANC     8     0     0
##   ATL  5022  1930 10263
##   AUS   968  1471     0

Window Functions

Aggregation function (like mean) takes n inputs and returns 1 value
Window function takes n inputs and returns n values
Includes ranking and ordering functions (like min_rank), offset functions (lead and lag), and cumulative aggregates (like cummean).

Try this

# for each carrier, calculate which two days of the year they had their longest departure delays
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
tblflights %>%
    group_by(carrier) %>%
    select(month, day, dep_delay) %>%
    filter(min_rank(desc(dep_delay)) <= 2 & dep_delay!="NA") %>%
    arrange(carrier, desc(dep_delay)) %>%
    head()

## Adding missing grouping variables: `carrier`

## Source: local data frame [6 x 4]
## Groups: carrier [3]
## 
##   carrier month   day dep_delay
##     <chr> <int> <int>     <dbl>
## 1      9E     2    16       747
## 2      9E     7    24       430
## 3      AA     9    20      1014
## 4      AA    12     5       896
## 5      AS     5    23       225
## 6      AS     1    20       222

Other things to play with

# for each carrier, calculate which two days of the year they had their longest departure delays rewrite previous with the `top_n` function
tblflights %>%
    group_by(carrier) %>%
    select(month, day, dep_delay) %>%
    filter(dep_delay!="NA") %>%
    top_n(2) %>%
    arrange(carrier, desc(dep_delay)) %>% head()

# for each month, calculate the number of flights and the change from the previous month
tblflights %>%
    group_by(month) %>%
    summarise(flight_count = n()) %>%
    mutate(change = flight_count - lag(flight_count))

# rewrite previous with the `tally` function
tblflights %>%
    group_by(month) %>%
    tally() %>%
    mutate(change = n - lag(n))

Other Useful Convenience Functions

# randomly sample a fixed number of rows, without replacement
tblflights %>% sample_n(5)

# randomly sample a fraction of rows, with replacement
tblflights %>% sample_frac(0.25, replace=TRUE)

# base R approach to view the structure of an object
str(tblflights)

# dplyr approach: better formatting, and adapts to your screen width
glimpse(tblflights)

do : for doing arbitrary operations

model=tblflights %>% group_by(origin) %>% do(lm=lm(dep_delay~arr_delay+carrier,data=.)) 
model %>% summarise(rsq=summary(lm)$r.squared)

## # A tibble: 3 × 1
##         rsq
##       <dbl>
## 1 0.8635791
## 2 0.8181881
## 3 0.8416845

Binary verbs

4 joins from SQL

inner_join(x, y): matching x + y
left_join(x, y) : all x + matching y
semi_join(x, y) : all x with match in y
anti_join(x, y) : all x without match in y

Quick look at 3 of 5 nycflights13 datasets

tblflights %>% head(1)

## # A tibble: 1 × 20
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## # ... with 13 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, delaysquare <dbl>

airlines %>% head(1)

## # A tibble: 1 × 2
##   carrier              name
##     <chr>             <chr>
## 1      9E Endeavor Air Inc.

airports %>% head(1)

## # A tibble: 1 × 7
##     faa              name      lat       lon   alt    tz   dst
##   <chr>             <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1   04G Lansdowne Airport 41.13047 -80.61958  1044    -5     A

inner_join(x,y): matching x + y

tblflights %>% inner_join(airlines) %>% head(1)

## Joining, by = "carrier"

## # A tibble: 1 × 21
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## # ... with 14 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, delaysquare <dbl>, name <chr>

tblflights %>% inner_join(airlines) %>% select(distance:name) %>% head(1)

## Joining, by = "carrier"

## # A tibble: 1 × 6
##   distance  hour minute           time_hour delaysquare
##      <dbl> <dbl>  <dbl>              <dttm>       <dbl>
## 1     1400     5     15 2013-01-01 05:00:00         125
## # ... with 1 more variables: name <chr>

left_join(x,y): all x + matching y

faatblflights=tblflights %>% select(origin) %>% mutate(faa=origin)
faatblflights %>% left_join(airports) %>% head(2)

## Joining, by = "faa"

## # A tibble: 2 × 8
##   origin   faa                name      lat       lon   alt    tz   dst
##    <chr> <chr>               <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1    EWR   EWR Newark Liberty Intl 40.69250 -74.16867    18    -5     A
## 2    LGA   LGA          La Guardia 40.77725 -73.87261    22    -5     A

airports %>% left_join(faatblflights) %>% head(2)

## Joining, by = "faa"

## # A tibble: 2 × 8
##     faa                          name      lat       lon   alt    tz   dst
##   <chr>                         <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1   04G             Lansdowne Airport 41.13047 -80.61958  1044    -5     A
## 2   06A Moton Field Municipal Airport 32.46057 -85.68003   264    -5     A
## # ... with 1 more variables: origin <chr>

semi_join(x, y) : all x with match in y

airports %>% left_join(faatblflights) %>% head(2)

## Joining, by = "faa"

## # A tibble: 2 × 8
##     faa                          name      lat       lon   alt    tz   dst
##   <chr>                         <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1   04G             Lansdowne Airport 41.13047 -80.61958  1044    -5     A
## 2   06A Moton Field Municipal Airport 32.46057 -85.68003   264    -5     A
## # ... with 1 more variables: origin <chr>

airports %>% semi_join(faatblflights) %>% head(2)

## Joining, by = "faa"

## # A tibble: 2 × 7
##     faa                name      lat       lon   alt    tz   dst
##   <chr>               <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1   EWR Newark Liberty Intl 40.69250 -74.16867    18    -5     A
## 2   LGA          La Guardia 40.77725 -73.87261    22    -5     A

anti_join(x, y) : all x without match in y

faatblflights %>% anti_join(airports)

## Joining, by = "faa"

## # A tibble: 0 × 2
## # ... with 2 variables: origin <chr>, faa <chr>

airports %>% anti_join(faatblflights)

## Joining, by = "faa"

## # A tibble: 1,393 × 7
##      faa                           name      lat        lon   alt    tz
##    <chr>                          <chr>    <dbl>      <dbl> <int> <dbl>
## 1    04G              Lansdowne Airport 41.13047  -80.61958  1044    -5
## 2    06A  Moton Field Municipal Airport 32.46057  -85.68003   264    -5
## 3    06C            Schaumburg Regional 41.98934  -88.10124   801    -6
## 4    06N                Randall Airport 41.43191  -74.39156   523    -5
## 5    09J          Jekyll Island Airport 31.07447  -81.42778    11    -4
## 6    0A9 Elizabethton Municipal Airport 36.37122  -82.17342  1593    -4
## 7    0G6        Williams County Airport 41.46731  -84.50678   730    -5
## 8    0G7  Finger Lakes Regional Airport 42.88356  -76.78123   492    -5
## 9    0P2   Shoestring Aviation Airfield 39.79482  -76.64719  1000    -5
## 10   0S9          Jefferson County Intl 48.05381 -122.81064   108    -8
## # ... with 1,383 more rows, and 1 more variables: dst <chr>

The fun has to stop here…

Source Code of this slidify presentation
Kevin Markham's tutorial
Official dplyr reference manual and vignettes on CRAN: vignettes are well-written and cover many aspects of dplyr
July 2014 webinar about dplyr (and ggvis) by Hadley Wickham and related slides/code: mostly conceptual, with a bit of code
dplyr tutorial by Hadley Wickham at the useR! 2014 conference: excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)
dplyr GitHub repo and list of releases

What is dplyr?

Installing

Load our data package

Look at flights data

tbl_df

But if one insists, we can always revert back to a regular dataframe

Basic single table (df) verbs

filter: Keep rows matching criteria

filter again

filter again

select: Pick columns by name

select again

select again

Chaining over Nesting?

Chaining over Nesting

Chaining

Chaining

arrange: Reorder rows

arrange

mutate: create new variables that are functions of existing variables

mutate

summarise/summarize: Reduce multiple variables to values

summarise

summarise_each/mutate_each: apply the same summary/mutate function(s) to multiple columns at once

summarise_each

mutate_each

n(): counts the number of rows in a group

n_distinct(vector): counts the number of unique items in that vector

Grouping without summarising

Window Functions

Try this

Other things to play with

Other Useful Convenience Functions

do : for doing arbitrary operations

Binary verbs

Quick look at 3 of 5 nycflights13 datasets

inner_join(x,y): matching x + y

left_join(x,y): all x + matching y

semi_join(x, y) : all x with match in y

anti_join(x, y) : all x without match in y

The fun has to stop here…

Look at `flights` data