4 ways to select columns from a dataframe with dplyr's select()

dplyr’s select() function is one of the package’s core verbs: it lets us select one or more columns from a dataframe.

With version 1.0.0 of dplyr, the select() function gained new functionality that makes it easy to select columns in multiple ways. One of the most common ways to select columns is to use their names. However, we can also select columns by their location, by their type, and by patterns in their names.

In this post, we will see examples of four ways to select columns from a dataframe. We will start with selecting columns by name, and then see examples of selecting columns by position, selecting columns by type, and selecting columns with functions that look for patterns in the names.

Let us load tidyverse and make sure dplyr’s version is 1.0.0+.

library(tidyverse)
packageVersion("dplyr")
[1] ‘1.0.0’
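If a script relies on the newer select() features, the version check can also be made explicit in code. This is only a minimal sketch, assuming dplyr is installed; the 1.0.0 threshold is the one mentioned above, and the variable name dplyr_version is just for illustration.

```r
# Stop early with a clear message if the installed dplyr is older than 1.0.0.
# packageVersion() returns a version object that can be compared to a string.
dplyr_version <- packageVersion("dplyr")

if (dplyr_version < "1.0.0") {
  stop("dplyr >= 1.0.0 is required, found ", as.character(dplyr_version))
}
```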

We will use the penguins dataset to select columns in four different ways using the select() function.

path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)

We can see that we have columns of different types.

## Parsed with column specification:
## cols(
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character()
## )

dplyr select(): How To Select Columns By Names?

Let us start by selecting columns of a dataframe by name, which is the most common way to select columns.

# select columns by name (head() keeps the first six rows, as in the output below)
penguins %>%
  dplyr::select(species, island, flipper_length_mm) %>%
  head()

## # A tibble: 6 x 3
##   species island    flipper_length_mm
##   <chr>   <chr>                 <dbl>
## 1 Adelie  Torgersen               181
## 2 Adelie  Torgersen               186
## 3 Adelie  Torgersen               195
## 4 Adelie  Torgersen                NA
## 5 Adelie  Torgersen               193
## 6 Adelie  Torgersen               190
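Selecting by name is not limited to listing every column one by one. The sketch below shows a few variations that we believe work with dplyr 1.0.0 and later: a range of adjacent names, names stored in a character vector (wrapped in all_of()), and dropping columns by name. To keep it self-contained, it uses penguins_small, a hypothetical stand-in built from the six rows printed in this post (the sex column is omitted because its values are not shown here).

```r
library(dplyr)

# Stand-in for the penguins data: the six rows printed in this post.
penguins_small <- tibble(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# A range of adjacent columns, from bill_length_mm through flipper_length_mm
by_range <- penguins_small %>%
  select(bill_length_mm:flipper_length_mm)

# Column names stored in a character vector, wrapped in all_of()
wanted <- c("species", "island", "flipper_length_mm")
by_vector <- penguins_small %>%
  select(all_of(wanted))

# Everything except the two bill columns
without_bills <- penguins_small %>%
  select(-bill_length_mm, -bill_depth_mm)
```

Wrapping the vector in all_of() makes it explicit that wanted is a variable holding column names rather than a column itself.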

dplyr select(): How To Select Columns By Their Positions?

Let us select the same columns as in the previous example, but this time using their positions in the dataframe. The column species is the first column in the dataframe, island is the second, and flipper_length_mm is the fifth.

We can simply specify the column positions as arguments to the select() function.

# dplyr select columns by position (head() keeps the first six rows)
penguins %>%
  select(1, 2, 5) %>%
  head()

And we get the same results as above.

## # A tibble: 6 x 3
##   species island    flipper_length_mm
##   <chr>   <chr>                 <dbl>
## 1 Adelie  Torgersen               181
## 2 Adelie  Torgersen               186
## 3 Adelie  Torgersen               195
## 4 Adelie  Torgersen                NA
## 5 Adelie  Torgersen               193
## 6 Adelie  Torgersen               190
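Positions can also be given as ranges, counted from the end, or negated. The following is a small sketch of those variations, not an exhaustive list. It uses penguins_small, a hypothetical stand-in built from the six rows printed in this post (without the sex column), so the last column here is body_mass_g rather than sex.

```r
library(dplyr)

# Stand-in for the penguins data: the six rows printed in this post.
penguins_small <- tibble(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Positions 1 to 2 as a range, plus position 5
by_position_range <- penguins_small %>%
  select(1:2, 5)

# The last column, whatever its position happens to be
last_column <- penguins_small %>%
  select(last_col())

# A negative position drops that column and keeps the rest
drop_first <- penguins_small %>%
  select(-1)
```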

One of the nice things (or bad?) about selecting columns by position is how select() treats position zero. No column sits at position 0, but select() does not complain: it silently ignores the zero and returns the columns at the remaining valid positions. Note that this leniency applies to zero only; a position beyond the last column results in an error instead.

For example, here we specify column position zero, which does not exist. The select() function does not fail but gives results from the remaining valid column positions.

# dplyr select by position: position 0 is silently ignored
penguins %>%
  select(0, 2, 5) %>%
  head()

The resulting tibble has skipped the column at position 0 that we requested.

## # A tibble: 6 x 2
##   island    flipper_length_mm
##   <chr>                 <dbl>
## 1 Torgersen               181
## 2 Torgersen               186
## 3 Torgersen               195
## 4 Torgersen                NA
## 5 Torgersen               193
## 6 Torgersen               190
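As a quick check of this behaviour, the sketch below compares position 0 with a position past the last column. As far as we know, the two cases are treated differently in dplyr 1.0.0 and later: the zero is dropped silently, while an out-of-range position raises an error. It uses penguins_small, a hypothetical stand-in built from the six rows printed in this post (without the sex column).

```r
library(dplyr)

# Stand-in for the penguins data: the six rows printed in this post.
penguins_small <- tibble(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Position 0 is ignored; positions 2 and 5 are returned
zero_ignored <- penguins_small %>%
  select(0, 2, 5)

# Position 10 does not exist in a six-column tibble.
# tryCatch() turns the resulting error into a marker string.
out_of_range <- tryCatch(
  penguins_small %>% select(2, 10),
  error = function(e) "error"
)
```

So a typo such as 0 instead of 1 goes unnoticed, which is worth keeping in mind when positions are computed rather than typed by hand.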

dplyr select(): How To Select Columns By Their Types?

Often it is useful to select columns by their types. For example, you might want to select all columns that are numeric for further analysis.

To get all columns that are numeric, we can use where(is.numeric) as the argument to the select() function.

# dplyr select all columns that are numeric (head() keeps the first six rows)
penguins %>%
  select(where(is.numeric)) %>%
  head()

And we get all columns that are numeric.

## # A tibble: 6 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
## 4           NA            NA                  NA          NA
## 5           36.7          19.3               193        3450
## 6           39.3          20.6               190        3650

Similarly, we can select all columns that are factors using “where(is.factor)” and all columns that are characters using “where(is.character)”. In our data, species, island and sex were read in as character columns, so where(is.factor) would select no columns here.

Note that the where() function for selecting columns is new in dplyr 1.0.0.
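where() is not limited to the built-in is.*() checks: it takes any function that returns TRUE or FALSE for a column. Here is a small sketch under that assumption. The predicate is_large_numeric is a made-up helper for illustration, and penguins_small is a hypothetical stand-in built from the six rows printed in this post (without the sex column).

```r
library(dplyr)

# Stand-in for the penguins data: the six rows printed in this post.
penguins_small <- tibble(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# All character columns
character_cols <- penguins_small %>%
  select(where(is.character))

# Hypothetical predicate: numeric columns whose mean (ignoring NA) exceeds 100
is_large_numeric <- function(x) is.numeric(x) && mean(x, na.rm = TRUE) > 100

large_numeric_cols <- penguins_small %>%
  select(where(is_large_numeric))
```

The is.numeric(x) check comes first so that mean() is never called on a character column.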

dplyr select(): How To Select Columns By Function of Names?

The fourth way to select columns from a dataframe is to look for a string or a pattern in the column names. For example, we often want to select columns whose names start with or end with a certain string.

dplyr has special helper functions for that: starts_with() selects columns whose names start with a given string, and ends_with() selects columns whose names end with a given string.

dplyr starts_with(): How To Select Columns Whose Names Start With a String?

Here we select the columns whose names start with the string “bill”.

# dplyr select columns whose names start with a string
penguins %>%
  select(starts_with("bill")) %>%
  head()

## # A tibble: 6 x 2
##   bill_length_mm bill_depth_mm
##            <dbl>         <dbl>
## 1           39.1          18.7
## 2           39.5          17.4
## 3           40.3          18  
## 4           NA            NA  
## 5           36.7          19.3
## 6           39.3          20.6

dplyr ends_with(): How To Select Columns Whose Names End With a String?

Here is an example where we select columns whose names end with the string “mm”.

# dplyr select columns whose names end with a string
penguins %>%
  select(ends_with("mm")) %>%
  head()

Now we have all columns whose names end with “mm”.

## # A tibble: 6 x 3
##   bill_length_mm bill_depth_mm flipper_length_mm
##            <dbl>         <dbl>             <dbl>
## 1           39.1          18.7               181
## 2           39.5          17.4               186
## 3           40.3          18                 195
## 4           NA            NA                  NA
## 5           36.7          19.3               193
## 6           39.3          20.6               190
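These name helpers can also take more than one string at a time. The sketch below passes a character vector to starts_with(), and also uses contains(), a related helper from the same family that this post does not otherwise cover. It uses penguins_small, a hypothetical stand-in built from the six rows printed in this post (without the sex column).

```r
library(dplyr)

# Stand-in for the penguins data: the six rows printed in this post.
penguins_small <- tibble(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Columns whose names start with "bill" or with "body"
bill_or_body <- penguins_small %>%
  select(starts_with(c("bill", "body")))

# Columns whose names contain "length" anywhere
length_cols <- penguins_small %>%
  select(contains("length"))
```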

And that is not all. As dplyr’s documentation page suggests, we can also combine any of the above approaches with boolean operators to select columns.

df %>% select(!where(is.factor)): selects all non-factor variables.

df %>% select(where(is.numeric) & starts_with("x")): selects all numeric variables whose names start with “x”.

df %>% select(starts_with("a") | ends_with("z")): selects all variables whose names start with “a” or end with “z”.
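To see these combinations on columns like the ones in this post, here is a minimal sketch that applies the same three operators to the penguin column names. It uses penguins_small, a hypothetical stand-in built from the six rows printed in this post (without the sex column).

```r
library(dplyr)

# Stand-in for the penguins data: the six rows printed in this post.
penguins_small <- tibble(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# AND: numeric columns whose names end with "mm"
numeric_mm <- penguins_small %>%
  select(where(is.numeric) & ends_with("mm"))

# NOT: every column that is not numeric
non_numeric <- penguins_small %>%
  select(!where(is.numeric))

# OR: names that start with "bill" or end with "_g"
bill_or_grams <- penguins_small %>%
  select(starts_with("bill") | ends_with("_g"))
```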