dplyr’s select() function is one of the core functions of dplyr; it lets you select one or more columns from a dataframe.
With version 1.0.0, dplyr’s select() function has gained new functionality that makes it easy to select columns in multiple ways. One of the most common ways to select columns is to use their names. With dplyr version 1.0.0, we can also select columns by their location.
In this post, we will see examples of four ways to select columns from a dataframe. We will start with selecting columns by names, and then see examples of selecting columns by positions, selecting columns by their types, and selecting columns by using functions that look for patterns in names.
Let us load tidyverse and make sure dplyr’s version is 1.0.0+.
library(tidyverse)
packageVersion("dplyr")
[1] ‘1.0.0’
We will use the penguins dataset to select columns in 4 different ways using select() function.
path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)
We can see that we have columns of different types.
## Parsed with column specification:
## cols(
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character()
## )
dplyr select(): How To Select Columns By Names?
Let us start by selecting columns of a dataframe by name, which is the most common way to select columns.
# select columns by names
penguins %>%
  dplyr::select(species, island, flipper_length_mm)
## # A tibble: 6 x 3
##   species island    flipper_length_mm
##   <chr>   <chr>                 <dbl>
## 1 Adelie  Torgersen               181
## 2 Adelie  Torgersen               186
## 3 Adelie  Torgersen               195
## 4 Adelie  Torgersen                NA
## 5 Adelie  Torgersen               193
## 6 Adelie  Torgersen               190
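Selecting by name also works with a range of adjacent columns and with exclusions. The sketch below is only illustrative: `toy` is a small hand-typed stand-in for the penguins data (not the real dataset), and the object names `by_range` and `without_bill` are made up for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (hand-typed, illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g = c(3750, 3800, 3250),
  sex = c("male", "female", "female")
)

# A range of adjacent columns, from species through bill_length_mm
by_range <- toy %>%
  select(species:bill_length_mm)

# Every column except the two bill measurements
without_bill <- toy %>%
  select(-bill_length_mm, -bill_depth_mm)

names(by_range)
names(without_bill)
```

The range form depends on the column order in the dataframe, so it is most useful when that order is stable.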
dplyr select(): How To Select Columns By Their Positions?
Let us select the same columns as in the previous example, but this time use their position in the dataframe. For example, the column species is the first column in the dataframe and island is the second column in the dataframe.
We can simply specify the column positions or locations as arguments to the select() function.
# dplyr select columns by position
penguins %>%
  select(1, 2, 5)
And we get the same results as above.
## # A tibble: 6 x 3
##   species island    flipper_length_mm
##   <chr>   <chr>                 <dbl>
## 1 Adelie  Torgersen               181
## 2 Adelie  Torgersen               186
## 3 Adelie  Torgersen               195
## 4 Adelie  Torgersen                NA
## 5 Adelie  Torgersen               193
## 6 Adelie  Torgersen               190
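As a quick check that names and positions pick the same columns, here is a minimal sketch on a hand-typed stand-in for the penguins data. The `toy` tibble and the object names are assumptions made for this example; `last_col()` is a selection helper that dplyr re-exports for picking the final column without knowing its position.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (hand-typed, illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g = c(3750, 3800, 3250),
  sex = c("male", "female", "female")
)

# The same three columns, once by name and once by position
by_name <- toy %>% select(species, island, flipper_length_mm)
by_position <- toy %>% select(1, 2, 5)

# Compare the column names of the two results
identical(names(by_name), names(by_position))

# A range of positions plus the last column, whatever its index is
first_two_and_last <- toy %>% select(1:2, last_col())
names(first_two_and_last)
```

Positions are convenient for quick exploration, but names tend to be more robust when the column order of the input may change.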
One of the nice things (or bad?) about selecting columns by position is that if you specify position 0, which does not exist because column positions start at 1, dplyr’s select() function silently ignores it and returns the columns at the remaining valid positions. Note that this applies to position 0 only; a position larger than the number of columns raises an error.
For example, here we specify column position zero, which does not exist. The select() function does not fail but returns the columns at the remaining valid positions.
# dplyr select columns by position ignores position 0
penguins %>%
  select(0, 2, 5)
The resulting tibble skips the zeroth position that we requested.
## # A tibble: 6 x 2
##   island    flipper_length_mm
##   <chr>                 <dbl>
## 1 Torgersen               181
## 2 Torgersen               186
## 3 Torgersen               195
## 4 Torgersen                NA
## 5 Torgersen               193
## 6 Torgersen               190
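It is worth separating position 0 from a position past the last column, since the two behave differently. The sketch below uses a hand-typed stand-in for the penguins data (`toy` and the other object names are assumptions for illustration) and wraps the out-of-range request in tryCatch() so the script keeps running.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (hand-typed, illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g = c(3750, 3800, 3250),
  sex = c("male", "female", "female")
)

# Position 0 is dropped silently; positions 2 and 5 are returned
with_zero <- toy %>% select(0, 2, 5)
names(with_zero)

# Position 99 is past the last column, so select() signals an error;
# tryCatch() turns that error into a plain marker string here
out_of_range <- tryCatch(
  toy %>% select(2, 99),
  error = function(e) "error"
)
out_of_range
```

Because position 0 is dropped without a message, an off-by-one mistake in computed positions can go unnoticed, so checking names() on the result is a cheap safeguard.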
dplyr select(): How To Select Columns By Their Types?
Often it is useful to select columns by their types. For example, you might want to select all columns that are numeric for further analysis.
To get all columns that are numeric, we can use where(is.numeric) as an argument to the select() function.
# dplyr select all columns that are numeric
penguins %>%
  select(where(is.numeric))
And we get all columns that are numeric.
## # A tibble: 6 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
## 4           NA            NA                  NA          NA
## 5           36.7          19.3               193        3450
## 6           39.3          20.6               190        3650
Similarly, we can select all columns that are factors using “where(is.factor)” and all columns that are characters using “where(is.character)”.
Note that using the where() function to select columns is new in dplyr 1.0.0.
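The factor and character cases can be sketched in the same way. The example below is illustrative only: `toy` is a hand-typed stand-in for the penguins data, and converting species to a factor is an assumption made here so that where(is.factor) has something to match.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (hand-typed, illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g = c(3750, 3800, 3250),
  sex = c("male", "female", "female")
)

# Turn one character column into a factor so both types are present
toy_fct <- toy %>% mutate(species = factor(species))

# Columns for which is.factor() returns TRUE
fct_cols <- toy_fct %>% select(where(is.factor))

# Columns for which is.character() returns TRUE
chr_cols <- toy_fct %>% select(where(is.character))

names(fct_cols)
names(chr_cols)
```

where() accepts any function that returns a single TRUE or FALSE for a column, so the same pattern extends to other type checks.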
dplyr select(): How To Select Columns By Function of Names?
The fourth way to select columns from a dataframe is to look for a string or a pattern in the column names. For example, we might often want to select columns whose names start with or end with a given string.
dplyr has special functions for that. We can select columns whose names start with a string using the starts_with() function, and similarly we can select columns whose names end with a certain string using the ends_with() function.
dplyr starts_with(): How To Select Columns Whose Names Start With a String?
# dplyr select columns whose names start with "bill"
penguins %>%
  select(starts_with("bill"))
## # A tibble: 6 x 2
##   bill_length_mm bill_depth_mm
##            <dbl>         <dbl>
## 1           39.1          18.7
## 2           39.5          17.4
## 3           40.3          18
## 4           NA            NA
## 5           36.7          19.3
## 6           39.3          20.6
dplyr ends_with(): How To Select Columns Whose Names End With a String?
Here is an example where we select columns whose names end with the string “mm”.
# dplyr select columns whose names end with "mm"
penguins %>%
  select(ends_with("mm"))
Now we have all columns whose names end with “mm”.
## # A tibble: 6 x 3
##   bill_length_mm bill_depth_mm flipper_length_mm
##            <dbl>         <dbl>             <dbl>
## 1           39.1          18.7               181
## 2           39.5          17.4               186
## 3           40.3          18                 195
## 4           NA            NA                  NA
## 5           36.7          19.3               193
## 6           39.3          20.6               190
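Besides starts_with() and ends_with(), the same family of helpers includes contains() for a literal substring anywhere in the name and matches() for a regular expression. The sketch below is illustrative: `toy` is a hand-typed stand-in for the penguins data, and the object names and the regular expression are chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (hand-typed, illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g = c(3750, 3800, 3250),
  sex = c("male", "female", "female")
)

# Names containing the literal string "length" anywhere
length_cols <- toy %>% select(contains("length"))

# Names matching a regular expression: start with "bill_" and end with "_mm"
bill_mm_cols <- toy %>% select(matches("^bill_.*_mm$"))

names(length_cols)
names(bill_mm_cols)
```

contains() treats its argument as plain text, while matches() interprets it as a regular expression, so characters such as "." carry different meanings in the two helpers.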
And that is not all. As dplyr’s documentation suggests, we can also combine any of the above approaches with boolean operators to select columns.
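A minimal sketch of such combinations is shown below, using `&` (and), `|` (or), and `!` (not) inside select(). The `toy` tibble is a hand-typed stand-in for the penguins data, and the object names are assumptions made for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (hand-typed, illustrative only)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g = c(3750, 3800, 3250),
  sex = c("male", "female", "female")
)

# Numeric columns AND names ending in "mm"
numeric_mm <- toy %>%
  select(where(is.numeric) & ends_with("mm"))

# Numeric columns that do NOT start with "bill"
numeric_not_bill <- toy %>%
  select(where(is.numeric) & !starts_with("bill"))

# Columns starting with "bill" OR character columns
bill_or_chr <- toy %>%
  select(starts_with("bill") | where(is.character))

names(numeric_mm)
names(numeric_not_bill)
names(bill_or_chr)
```

Combining a type check with a name helper is a handy way to narrow a selection without listing every column by hand.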