dplyr, an R package that is part of the tidyverse, provides a set of tools for manipulating tabular datasets. It is built around a handful of core functions for “data munging”. Here is the list of core functions from dplyr:
- select() picks variables based on their names.
- mutate() adds new variables that are functions of existing variables.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
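As a rough illustration of the five verbs listed above, here is a minimal sketch on a small hand-typed tibble. The object `toy` and the new column name `body_mass_kg` are made up for this example; they are not part of the penguins data loaded later in the tutorial.

```r
library(dplyr)

# Toy data (assumption): three hand-typed rows, not the real dataset
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie"),
  body_mass_g = c(3750, 3800, 3250),
  sex         = c("male", "female", "female")
)

# select(): keep only the named columns
picked <- select(toy, species, body_mass_g)

# mutate(): add a column computed from an existing one
added <- mutate(toy, body_mass_kg = body_mass_g / 1000)

# filter(): keep only the rows that satisfy a condition
females <- filter(toy, sex == "female")

# summarise(): collapse many values into a single summary row
avg <- summarise(toy, mean_mass = mean(body_mass_g))

# arrange(): reorder rows, here by ascending body mass
ordered <- arrange(toy, body_mass_g)
```

Each verb takes a dataframe as its first argument and returns a new dataframe, which is what makes them easy to combine.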
In this tidyverse tutorial, we will learn how to use dplyr’s select() function to select variables (columns) from a dataframe by name. We will first select a single variable by its name, and then see examples of selecting multiple variables by their names.
Let us get started by loading tidyverse.
library("tidyverse")
For this tutorial, we will use the Palmer Penguins dataset collated by Allison Horst to illustrate how to use dplyr’s select() function to select variables by their names. Let us load the data from cmdlinetips.com’s GitHub page.
path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)
We can take a quick look at the data using the glimpse() function.
glimpse(penguins)
Rows: 344
Columns: 7
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", …
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgers…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 1…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 33…
$ sex               <chr> "male", "female", "female", NA, "female", "male", "female", …
How To Select A Variable by name with dplyr select()?
We can select a variable from a data frame with the select() function in two ways. The first is to pass the dataframe as the first argument to select() and the name of the column we want as the second.
In the example below, we select the species column from the penguins data frame. One big advantage of dplyr/tidyverse is the ability to specify variable names without quotes.
select(penguins, species)
The result is a tibble, the tidyverse’s type of dataframe, containing just the one column we selected.
## # A tibble: 344 x 1
##    species
##    <chr>
##  1 Adelie
##  2 Adelie
##  3 Adelie
##  4 Adelie
##  5 Adelie
##  6 Adelie
##  7 Adelie
##  8 Adelie
##  9 Adelie
## 10 Adelie
## # … with 334 more rows
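To make the point about unquoted names concrete, here is a small sketch. It assumes a hand-typed two-row tibble called `toy` (a made-up stand-in for penguins) and shows that select() accepts the column name either bare or as a quoted string, and that the result is still a dataframe rather than a plain vector.

```r
library(dplyr)

# Toy data (assumption): a two-row stand-in for the penguins dataframe
toy <- tibble(
  species = c("Adelie", "Adelie"),
  island  = c("Torgersen", "Torgersen")
)

# Bare (unquoted) column name, as used throughout this tutorial
unquoted <- select(toy, species)

# The same column requested with a quoted string
quoted <- select(toy, "species")

# Compare the two results and check what kind of object we got back
identical(unquoted, quoted)
is.data.frame(unquoted)
ncol(unquoted)
```

The unquoted form is usually preferred in interactive work because it is shorter to type.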
The second way to select a column from a dataframe is to use the pipe operator %>%, available as part of tidyverse.
Here we first specify the name of the dataframe we want to work with, then use the pipe operator %>% followed by the select() function with the column name we want to select.
penguins %>% select(species)
We get the same single-column tibble as before.
## # A tibble: 344 x 1
##    species
##    <chr>
##  1 Adelie
##  2 Adelie
##  3 Adelie
##  4 Adelie
##  5 Adelie
##  6 Adelie
##  7 Adelie
##  8 Adelie
##  9 Adelie
## 10 Adelie
## # … with 334 more rows
The pipe operator is especially useful when we have further downstream operations to perform after selecting. Therefore, in the examples below we will use the pipe operator to select multiple columns.
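As a sketch of what such downstream operations might look like, the example below chains select() with filter() and arrange() on a hand-typed tibble. The object `toy` and the particular filter and sort steps are assumptions chosen for illustration, not steps from this tutorial.

```r
library(dplyr)

# Toy data (assumption): three hand-typed rows standing in for penguins
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie"),
  island      = c("Torgersen", "Torgersen", "Torgersen"),
  body_mass_g = c(3750, 3800, 3250),
  sex         = c("male", "female", "female")
)

# Piped version: reads top to bottom, one step per line
piped <- toy %>%
  select(species, body_mass_g, sex) %>%
  filter(sex == "female") %>%
  arrange(body_mass_g)

# Same steps without the pipe: calls are nested and read inside out
nested <- arrange(
  filter(select(toy, species, body_mass_g, sex), sex == "female"),
  body_mass_g
)

# Compare the two results
all.equal(piped, nested)
```

Both versions perform the same work; the piped form simply keeps the steps in the order they happen.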
How To Select Two Variables by name with dplyr select()?
If we want to select two variables/columns from a dataframe, we specify the two names as arguments. In this example we select species and island columns from the dataframe using the pipe operator.
penguins %>% select(species, island)
## # A tibble: 344 x 2
##    species island
##    <chr>   <chr>
##  1 Adelie  Torgersen
##  2 Adelie  Torgersen
##  3 Adelie  Torgersen
##  4 Adelie  Torgersen
##  5 Adelie  Torgersen
##  6 Adelie  Torgersen
##  7 Adelie  Torgersen
##  8 Adelie  Torgersen
##  9 Adelie  Torgersen
## 10 Adelie  Torgersen
## # … with 334 more rows
How To Select Multiple Variables by name with dplyr select()?
Similarly, if we have more variables to select, we specify their names as arguments to the select() function, as shown below.
penguins %>% select(species, body_mass_g, sex)
## # A tibble: 344 x 3
##    species body_mass_g sex
##    <chr>         <dbl> <chr>
##  1 Adelie         3750 male
##  2 Adelie         3800 female
##  3 Adelie         3250 female
##  4 Adelie           NA <NA>
##  5 Adelie         3450 female
##  6 Adelie         3650 male
##  7 Adelie         3625 female
##  8 Adelie         4675 male
##  9 Adelie         3475 <NA>
## 10 Adelie         4250 <NA>
## # … with 334 more rows
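One detail worth noting when selecting several columns: the result keeps the columns in the order they are named in select(), not the order they have in the original dataframe. The sketch below shows this on a hand-typed tibble; `toy` is a made-up stand-in for penguins.

```r
library(dplyr)

# Toy data (assumption): two hand-typed rows standing in for penguins
toy <- tibble(
  species     = c("Adelie", "Adelie"),
  island      = c("Torgersen", "Torgersen"),
  body_mass_g = c(3750, 3800),
  sex         = c("male", "female")
)

# Columns come back in the order they are listed in select()
reordered <- toy %>% select(sex, species, body_mass_g)

# Inspect the resulting column names
names(reordered)
```

This makes select() handy for reordering columns as well as for dropping the ones we do not need.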