dplyr matches(): select columns using regular expression

This quick post has an example using a neat dplyr function matches() to select columns using regular expressions.

dplyr has a number of helper functions, contains(), starts_with() and others, for selecting columns based on certain condition. For example if you interested selecting columns based on how its starts with we can use start_with() function. However, these helpe function cannot handle regular expression. And I just came across matches() function from dplyr, the only function that can handle regular expression to select matching columns.

Here is a quick example showing how it works. First, load the packages needed.

library(tidyverse)
library(palmerpenguins)
packageVersion("dplyr")
## [1] '1.0.9'

We will use penguins dataset from palmer penguins R package. And to illustrate matches using regular expression, let us make a small change to one of the column names.

 penguins = penguins %>% 
  rename(cill_length_mm=bill_length_mm)

Penguins data looks like this. Notice the spelling mistake “bill” to “cill” for bill_length_mm column that we created.

penguins %>% head()

## # A tibble: 5 × 8
##   species island cill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## # … with 1 more variable: year <int>

dplyr matches(): How to Select columns whose names matches a regular expression?

We can use dplyr’s matches() function to select columns whose names matches with a string or a pattern using regular expression. The difference between matches() and contains() is that matches can take regular expression and match the pattern, while contains cannot do it.

We can use matches to select columns that contains matching string. The matching string does not have to be in the beginning or at the end like starts_with() or ends_with() functions. In the example below, we select a column based on string that is in the middle of the name.

penguins %>%
  select(matches("bill")) 

In this example, we have not used any regular expression. Therefore we could have used contains() function to get the same results.

## # A tibble: 344 × 1
##    bill_depth_mm
##            <dbl>
##  1          18.7
##  2          17.4
##  3          18  
##  4          NA  
##  5          19.3
##  6          20.6
##  7          17.8
##  8          19.6
##  9          18.1
## 10          20.2
## # … with 334 more rows

Using simple regular expression like “*” we can use matches() function to select two columns that matches our pattern.

penguins %>% 
  select(matches("*ill"))

## # A tibble: 344 × 2
##    cill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # … with 334 more rows

We can also add a bit more specificities if needed. For example, we select columns matching “cill” or “bill” using regular expression “[cb]ill” to select the two columns.

penguins %>% 
  select(matches("[cb]ill"))

## # A tibble: 344 × 2
##    cill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # … with 334 more rows