6 Most Useful dplyr Commands to Manipulate a Data Frame in R

dplyr is one of the R packages developed by Hadley Wickham to manipulate data stored in data frames. Data frame is a two-dimensional data structure, where each column can contain a different type of data, like numerical, character and factors. In case you wondered the meaning of the word “dplyr”, it is like “pliers” for data frames.

dplyr package offers numerous powerful functions to manipulate data in data frames and it is a must tool if you want to learn data science.

Here are the 6 basic dplyr functions that are most useful and gets you started with learning dplyr to make sense of your data. These 6 dplyr commands or verbs will help you solve most of the problems in data manipulations.

One of the coolest things in dplyr is this pipe operator written as “%>%”. This pipe operator allows us chain multiple dplyr commands together such that it takes output from one command and feeds as input to the next command.

Let us jump in to learning the 6 most useful dplyr commands.

To illustrate the capabilities of these dplyr functions, we will be using a real world data set on New York city flights in 2003. You can easily obtain the data from this R package nycflights13. Hadley Wickham uses this rich data set extensively in his book R for Data Science

. The flights data is available in the data frame named flights and it contains all 336,776 flights that departed from New York City in 2013.

# install dplyr package
install.packages(dplyr)
# install NYC flight package
install.packages(nycflights13)
# load dplyr package
library(dplyr)
# load NYC flight package
library(nycflights13)

Exploring the data

Let us just check the dimension of the data set with “dim” function. We can see that the data frame has over 336,000 rows and 19 columns.

dim(flights)
[1] 336776 19

And we can take quick glance at the data frame with head function

head(flights)

# A tibble: 6 x 19
   year month   day dep_t… sched_d… dep_d… arr_…
             
1  2013     1     1    517      515   2.00   830
2  2013     1     1    533      529   4.00   850
3  2013     1     1    542      540   2.00   923
4  2013     1     1    544      545  -1.00  1004
5  2013     1     1    554      600  -6.00   812
6  2013     1     1    554      558  -4.00   740
# ... with 12 more variables: sched_arr_time
#   , arr_delay , carrier ,
#   flight , tailnum , origin ,
#   dest , air_time , distance ,
#   hour , minute , time_hour

filter(): Select rows based on their values

Let us filter the data frame for flights in the month November. Here we provide the whole data frame ‘flights’ as input to the filter function using the ‘pipe or magittr’ operator. Inside the filter function, we specify the condition, here we want select the rows with month values equal to 11.

nov_flights = flights %>% filter(month == 11)
head(nov_flights,3)

# A tibble: 3 x 19
   year month   day dep_t… sched_… dep_del… arr_… sched… arr_d…
                  
1  2013    11     1      5    2359     6.00   352    345   7.00
2  2013    11     1     35    2250   105      123   2356  87.0 
3  2013    11     1    455     500  -  5.00   641    651 -10.0 
# ... with 10 more variables: carrier , flight ,
#   tailnum , origin , dest , air_time ,
#   distance , hour , minute , time_hour

filter() with multiple conditions

The filter verb can be used to filter multiple conditions. Let us use filter function to get flights on Thanksgiving day in 2013. The filter verb can be used to filter multiple condition. Let us use filter function to get flights on November 28, Thanksgiving day in 2013

nov28 <- flights %>% filter(month == 11, day == 28)

select(): Select columns/variables based on their names

Let us select a few columns from the data frame. We can use the function ‘select()’ to do the selection. Just as before, we can provide the whole data frame ‘flights’ as input to the select() function using the ‘pipe or magittr’ operator. Inside the select function, we specify the columns that we want to select.

Let us first select one column and create a new data frame with just column.

nov28_flight = nov28 %>% select(flight)
dim(nov28_flight)

select(): selelct multiple columns

Here is an example selecting two columns from a data frame.

nov28_flight_origin = nov28 %>% 
          select(flight, origin)
dim(nov28_flight_origin)

Yes, we can select as many columns as we like. Here is an example selecting four columns from a data frame.

nov28_4 = nov28 %>% 
          select(flight, origin, distance, air_time)
dim(nov28_4)

# A tibble: 3 x 4
  flight origin distance air_time
1   1545 EWR        1400      227
2   1714 LGA        1416      227
3   1141 JFK        1089      160

mutate(): Create a new column/variable using other variables

dplyr’s mutate() function lets you create new columns from existing columns. For example, we can use existing columns; distance and air_time to get the speed. The mutate verb adds the new column to the input data frame.

nov28_with_speed = nov28_4 %>% 
  mutate(speed = distance / air_time * 60)

# A tibble: A tibble: 634 x 5
   flight origin distance air_time speed
               
1   1545 EWR        1400    227     370
2   1714 LGA        1416    227     374
3   1141 JFK        1089    160     408

arrange(): arranges rows

The arrange() function in dplyr allows you to arrange the rows based on specified column’s values. For example, we can use arrange() function with a column variable to sort the data frame. Let us use dplyr’s arrange() function to sort the Thanksgiving flights based on the speed column.

nov28_with_speed %>% arrange(speed)

We can see that we get a new data frame containing flights with slowest speed first.

# A tibble: 634 x 5
   flight origin distance air_time speed
               
 1   2028 LGA        96.0     47.0   123
 2   1909 LGA        96.0     38.0   152
 3   2941 JFK        94.0     25.0   226

By default, arrange() orders in ascending order. We can arrange by descending order, by using the function desc() with in arrange(). For example to order the flights with highest speed first, we would do

nov28_with_speed %>% arrange(desc(speed))

We can see the fastest flights are in the beginning.

# A tibble: 634 x 5
   flight origin distance air_time speed
               
 1     15 EWR        4963    602     495
 2   5229 LGA         296     37.0   480
 3    347 JFK        1598    201     477
 4    936 JFK        1623    207     470
 5      1 JFK        1598    204     470

arrange() can take multiple columns as input. In that case, “each additional column will be used to break ties in the values of preceding columns”. Let us see an example of arrange function with air_distance and speed as input.

nov28_with_speed %>% arrange(distance, speed)

We can see that flights with same distance is ordered by their speed.

# A tibble: 634 x 5
   flight origin distance air_time speed
               
 1   2941 JFK        94.0     25.0   226
 2   2028 LGA        96.0     47.0   123
 3   1909 LGA        96.0     38.0   152

summarize(): Get a single summary value from multiple values

summarize() function lets you get a summary of variables in the data frame. For example, if we want to get the mean speed of the 2013 Thanksgiving flights, we can feed the input data frame with speed to summarize function. Inside the summarize function, we write which variable we want to summarize and how we want to do it. In this case, we compute mean speed and assign to a new variable

nov28_with_speed %>%
 summarize(avg_speed = mean(speed,na.rm=TRUE))

# A tibble: 1 x 1
  avg_speed
      
1       388

Just like before, we can summarize multiple quantities, like we can get both average speed and distance on the Thanksgiving day flights.

nov28_with_speed %>%
  summarize(avg_speed = mean(speed,na.rm=TRUE), 
            avg_dist = mean(distance,na.rm=TRUE))

# A tibble: 1 x 2
  avg_speed avg_dist
          
1       388     1056

We can summarize more than one variable. Here is example to summarize mean speed and mean distance of 2013 Thanksgiving day flights.

nov28_with_speed %>%
  summarize(avg_speed = mean(speed,na.rm=TRUE), 
            avg_dist = mean(distance,na.rm=TRUE))

# A tibble: 1 x 2
  avg_speed avg_dist
          
1       388     1056

group_by() function: group your data frame by a column(s)

group_by() function allows us to group the input data frame based on a single column or multiple columns and manipulate each such grouped data frame/structure. group_by() function adds a new dimension to your data manipulation or data science skills, when combined with the above five dplyr commands summarize(), select(), filter(), mutate(), and arrange().

Let us use group_() and summarize() in a simple example. Let us say we want to find mean speed of thanksgiving flights from each airport. group_by() function with column “origin” splits the data frame in to multiple smaller data frames under the hood, such that each small data frame contains only the rows corresponding to the a particular value of ‘origin’. In this example, we can imagine that the group_by() statement creates three smaller data frames for each value of the flight origin, EWR, JFK, and LGA.

After group_by(), we can summarize mean distance from each group by chaining to the summarize() verb.

nov28_with_speed %>% 
  group_by(origin) %>%
  summarize(mean_dist = mean(distance))

This yields a table with three rows, one each for each NY airport, with mean distance values.

# A tibble: 3 x 2
  origin mean_dist
        
1 EWR         1053
2 JFK         1245
3 LGA          823

We can use group_by() with multiple variable and then apply summarize on a column of interest. For example, to compute mean distance for all days in November from each flight origin, we can group_by() with day and origin as follows

nov_flights %>% 
  group_by(day,origin) %>%
  summarize(mean_dist = mean(distance))

This will yield a table three rows (corresponding to airports) for each day in the month November.

# A tibble: 90 x 3
# Groups: day [?]
     day origin mean_dist
          
 1     1 EWR         1078
 2     1 JFK         1310
 3     1 LGA          755
 4     2 EWR         1132
 5     2 JFK         1240
 6     2 LGA          829