dplyr is one of the R packages developed by Hadley Wickham to manipulate data stored in data frames. Data frame is a two-dimensional data structure, where each column can contain a different type of data, like numerical, character and factors. In case you wondered the meaning of the word “dplyr”, it is like “pliers” for data frames.
dplyr package offers numerous powerful functions to manipulate data in data frames and it is a must tool if you want to learn data science.
Here are the 6 basic dplyr functions that are most useful and gets you started with learning dplyr to make sense of your data. These 6 dplyr commands or verbs will help you solve most of the problems in data manipulations.
One of the coolest things in dplyr is this pipe operator written as “%>%”. This pipe operator allows us chain multiple dplyr commands together such that it takes output from one command and feeds as input to the next command.
Let us jump in to learning the 6 most useful dplyr commands.
To illustrate the capabilities of these dplyr functions, we will be using a real world data set on New York city flights in 2003. You can easily obtain the data from this R package nycflights13. Hadley Wickham uses this rich data set extensively in his book R for Data Science
# install dplyr package install.packages(dplyr) # install NYC flight package install.packages(nycflights13) # load dplyr package library(dplyr) # load NYC flight package library(nycflights13)
Exploring the data
Let us just check the dimension of the data set with “dim” function. We can see that the data frame has over 336,000 rows and 19 columns.
dim(flights) [1] 336776 19
And we can take quick glance at the data frame with head function
head(flights)
# A tibble: 6 x 19 year month day dep_t… sched_d… dep_d… arr_… 1 2013 1 1 517 515 2.00 830 2 2013 1 1 533 529 4.00 850 3 2013 1 1 542 540 2.00 923 4 2013 1 1 544 545 -1.00 1004 5 2013 1 1 554 600 -6.00 812 6 2013 1 1 554 558 -4.00 740 # ... with 12 more variables: sched_arr_time # , arr_delay , carrier , # flight , tailnum , origin , # dest , air_time , distance , # hour , minute , time_hour
filter(): Select rows based on their values
Let us filter the data frame for flights in the month November. Here we provide the whole data frame ‘flights’ as input to the filter function using the ‘pipe or magittr’ operator. Inside the filter function, we specify the condition, here we want select the rows with month values equal to 11.
nov_flights = flights %>% filter(month == 11) head(nov_flights,3)
# A tibble: 3 x 19 year month day dep_t… sched_… dep_del… arr_… sched… arr_d… 1 2013 11 1 5 2359 6.00 352 345 7.00 2 2013 11 1 35 2250 105 123 2356 87.0 3 2013 11 1 455 500 - 5.00 641 651 -10.0 # ... with 10 more variables: carrier , flight , # tailnum , origin , dest , air_time , # distance , hour , minute , time_hour
filter() with multiple conditions
The filter verb can be used to filter multiple conditions. Let us use filter function to get flights on Thanksgiving day in 2013. The filter verb can be used to filter multiple condition. Let us use filter function to get flights on November 28, Thanksgiving day in 2013
nov28 <- flights %>% filter(month == 11, day == 28)
select(): Select columns/variables based on their names
Let us select a few columns from the data frame. We can use the function ‘select()’ to do the selection. Just as before, we can provide the whole data frame ‘flights’ as input to the select() function using the ‘pipe or magittr’ operator. Inside the select function, we specify the columns that we want to select.
Let us first select one column and create a new data frame with just column.
nov28_flight = nov28 %>% select(flight) dim(nov28_flight)
select(): selelct multiple columns
Here is an example selecting two columns from a data frame.
nov28_flight_origin = nov28 %>% select(flight, origin) dim(nov28_flight_origin)
Yes, we can select as many columns as we like. Here is an example selecting four columns from a data frame.
nov28_4 = nov28 %>% select(flight, origin, distance, air_time) dim(nov28_4)
# A tibble: 3 x 4 flight origin distance air_time 1 1545 EWR 1400 227 2 1714 LGA 1416 227 3 1141 JFK 1089 160
mutate(): Create a new column/variable using other variables
dplyr’s mutate() function lets you create new columns from existing columns. For example, we can use existing columns; distance and air_time to get the speed. The mutate verb adds the new column to the input data frame.
nov28_with_speed = nov28_4 %>% mutate(speed = distance / air_time * 60)
# A tibble: A tibble: 634 x 5 flight origin distance air_time speed 1 1545 EWR 1400 227 370 2 1714 LGA 1416 227 374 3 1141 JFK 1089 160 408
arrange(): arranges rows
The arrange() function in dplyr allows you to arrange the rows based on specified column’s values. For example, we can use arrange() function with a column variable to sort the data frame. Let us use dplyr’s arrange() function to sort the Thanksgiving flights based on the speed column.
nov28_with_speed %>% arrange(speed)
We can see that we get a new data frame containing flights with slowest speed first.
# A tibble: 634 x 5 flight origin distance air_time speed 1 2028 LGA 96.0 47.0 123 2 1909 LGA 96.0 38.0 152 3 2941 JFK 94.0 25.0 226
By default, arrange() orders in ascending order. We can arrange by descending order, by using the function desc() with in arrange(). For example to order the flights with highest speed first, we would do
nov28_with_speed %>% arrange(desc(speed))
We can see the fastest flights are in the beginning.
# A tibble: 634 x 5 flight origin distance air_time speed 1 15 EWR 4963 602 495 2 5229 LGA 296 37.0 480 3 347 JFK 1598 201 477 4 936 JFK 1623 207 470 5 1 JFK 1598 204 470
arrange() can take multiple columns as input. In that case, “each additional column will be used to break ties in the values of preceding columns”. Let us see an example of arrange function with air_distance and speed as input.
nov28_with_speed %>% arrange(distance, speed)
We can see that flights with same distance is ordered by their speed.
# A tibble: 634 x 5 flight origin distance air_time speed 1 2941 JFK 94.0 25.0 226 2 2028 LGA 96.0 47.0 123 3 1909 LGA 96.0 38.0 152
summarize(): Get a single summary value from multiple values
summarize() function lets you get a summary of variables in the data frame. For example, if we want to get the mean speed of the 2013 Thanksgiving flights, we can feed the input data frame with speed to summarize function. Inside the summarize function, we write which variable we want to summarize and how we want to do it. In this case, we compute mean speed and assign to a new variable
nov28_with_speed %>% summarize(avg_speed = mean(speed,na.rm=TRUE))
# A tibble: 1 x 1 avg_speed 1 388
Just like before, we can summarize multiple quantities, like we can get both average speed and distance on the Thanksgiving day flights.
nov28_with_speed %>% summarize(avg_speed = mean(speed,na.rm=TRUE), avg_dist = mean(distance,na.rm=TRUE))
# A tibble: 1 x 2 avg_speed avg_dist 1 388 1056
We can summarize more than one variable. Here is example to summarize mean speed and mean distance of 2013 Thanksgiving day flights.
nov28_with_speed %>% summarize(avg_speed = mean(speed,na.rm=TRUE), avg_dist = mean(distance,na.rm=TRUE))
# A tibble: 1 x 2 avg_speed avg_dist 1 388 1056
group_by() function: group your data frame by a column(s)
group_by() function allows us to group the input data frame based on a single column or multiple columns and manipulate each such grouped data frame/structure. group_by() function adds a new dimension to your data manipulation or data science skills, when combined with the above five dplyr commands summarize(), select(), filter(), mutate(), and arrange().
Let us use group_() and summarize() in a simple example. Let us say we want to find mean speed of thanksgiving flights from each airport. group_by() function with column “origin” splits the data frame in to multiple smaller data frames under the hood, such that each small data frame contains only the rows corresponding to the a particular value of ‘origin’. In this example, we can imagine that the group_by() statement creates three smaller data frames for each value of the flight origin, EWR, JFK, and LGA.
After group_by(), we can summarize mean distance from each group by chaining to the summarize() verb.
nov28_with_speed %>% group_by(origin) %>% summarize(mean_dist = mean(distance))
This yields a table with three rows, one each for each NY airport, with mean distance values.
# A tibble: 3 x 2 origin mean_dist 1 EWR 1053 2 JFK 1245 3 LGA 823
We can use group_by() with multiple variable and then apply summarize on a column of interest. For example, to compute mean distance for all days in November from each flight origin, we can group_by() with day and origin as follows
nov_flights %>% group_by(day,origin) %>% summarize(mean_dist = mean(distance))
This will yield a table three rows (corresponding to airports) for each day in the month November.
# A tibble: 90 x 3 # Groups: day [?] day origin mean_dist 1 1 EWR 1078 2 1 JFK 1310 3 1 LGA 755 4 2 EWR 1132 5 2 JFK 1240 6 2 LGA 829