In this tutorial, we will see examples of using the count() function from dplyr to explore variables in a dataframe. One of the first things to do after loading a dataset is to perform simple exploratory data analysis. One typically starts data exploration with a quick look at the data using functions like glimpse() or head().
As a next step, you might want to know more about a specific variable. For example, if you have a categorical variable, you might want to count the number of observations for each of its values. dplyr’s count() function makes it easy to count observations by one or more variables.
Let us first load the tidyverse suite of R packages.
library("tidyverse")
We will use the fantastic Palmer Penguins dataset to illustrate count(). Let us load the data from cmdlinetips.com's GitHub page.
path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)
We can see that some of the variables in the penguins dataframe, like species, island, and sex, are character variables.
## Parsed with column specification:
## cols(
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character()
## )
Count Observations by Single Group
Let us explore the categorical/character variables with count() in dplyr.
For example, if we want to know the number of observations for each penguin species, we can use the count() function as follows.
count(penguins, species)
And we get a new tibble with species as one column and the number of observations as another column.
## # A tibble: 3 x 2
##   species       n
##   <chr>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124
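For intuition, count() behaves much like grouping by a variable and then summarising each group with n(). Here is a minimal sketch of that comparison on a small made-up data frame. The `toy` object and its rows are assumptions for illustration, not the penguins data.

```r
library(dplyr)

# Toy data frame with made-up rows (not the real penguins data)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

# Shorthand: count rows for each value of species
by_count <- count(toy, species)

# Longhand: group by species, then count the rows in each group
by_summarise <- summarise(group_by(toy, species), n = n())

by_count
by_summarise
```

Both results should list the same species with the same counts in the n column.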
Count Observations by Single Group Using the Pipe Operator
There is another way to use tidyverse functions that can be extremely useful later. In the above example, we provided the name of the dataframe and the variable in the dataframe as inputs to the count() function to compute the number of penguins in each species.
Instead, we can use the pipe operator %>% to connect the dataframe to the count() function. We write the name of the dataframe first, then the pipe operator %>%, and then the count() function with the variable name inside. The way to understand this is that the pipe passes the dataframe to count() as its first argument.
penguins %>% count(species)
And we get exactly the same results as before.
## # A tibble: 3 x 2
##   species       n
##   <chr>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124
This framework can be extremely useful if we are performing multiple operations one after the other. We can simply feed the results from one to another using the %>% operator.
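As a small illustration of chaining, the sketch below drops rows with a missing value and then counts what remains. The `toy` data frame and the choice of filter() as the first step are assumptions for illustration. The same pattern should work on the penguins dataframe.

```r
library(dplyr)

# Toy data frame with made-up rows, including one missing sex value
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  sex     = c("male", NA, "female", "male", "female")
)

result <- toy %>%
  filter(!is.na(sex)) %>%  # step 1: keep rows where sex is known
  count(species)           # step 2: count the remaining rows per species

result
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations run.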
Count Observations by Single Group and Sort the Results
We can sort the results in descending order of count with the sort = TRUE argument.
penguins %>% count(species, sort=TRUE)
## # A tibble: 3 x 2
##   species       n
##   <chr>     <int>
## 1 Adelie      152
## 2 Gentoo      124
## 3 Chinstrap    68
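One way to think about sort = TRUE is as a shortcut for counting and then arranging by the n column in descending order. The sketch below compares the two on a made-up data frame. The `toy` object is an assumption for illustration, not the penguins data.

```r
library(dplyr)

# Toy data frame with made-up rows: 3 Adelie, 2 Gentoo, 1 Chinstrap
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie")
)

# Count and sort in a single call
sorted <- toy %>% count(species, sort = TRUE)

# Count first, then sort explicitly by n, largest first
arranged <- toy %>%
  count(species) %>%
  arrange(desc(n))

sorted
arranged
```

With no ties in the counts, both versions should put the most frequent species first.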
Count Observations by Two Groups
The count() function in dplyr can also count observations by multiple groups. Here is an example where we count observations by two variables.
penguins %>% count(species,island)
We get the number of observations for each combination of the two variables. In this example, we get the number of penguins of each species on each island.
## # A tibble: 5 x 3
##   species   island        n
##   <chr>     <chr>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    52
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      124
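Note that the result has five rows rather than nine (three species times three islands). For character variables, count() lists only the combinations that actually occur in the data. A minimal sketch of this behavior on a made-up data frame follows. The `toy` object is an assumption for illustration.

```r
library(dplyr)

# Toy data frame with made-up rows: two species, two islands,
# but no Gentoo penguin recorded on Dream
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  island  = c("Biscoe", "Dream", "Dream", "Biscoe", "Biscoe")
)

# Count rows for each species and island pair that appears in the data
pairs <- toy %>% count(species, island)

pairs
```

The Gentoo and Dream pair never appears in `toy`, so it gets no row in the result instead of a row with a count of zero.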
Count Observations by Two Groups and Sort the Results
With the sort = TRUE argument, we can also sort the results from count() with two groups.
penguins %>% count(species,island, sort=TRUE)
## # A tibble: 5 x 3
##   species   island        n
##   <chr>     <chr>     <int>
## 1 Gentoo    Biscoe      124
## 2 Chinstrap Dream        68
## 3 Adelie    Dream        56
## 4 Adelie    Torgersen    52
## 5 Adelie    Biscoe       44
For more on the count() function, check out dplyr’s website.