As part of exploring a new data, often you might want to count unique values of one or more columns in a dataframe. Pandas value_counts() can get counts of unique values of columns in a Pandas dataframe. Starting from Pandas version 1.1.0, we can use value_counts() on a Pandas Series and dataframe as well.
In this tutorial, we will see examples of using Pandas value_counts on a single variable in a dataframe (i.e. Series value_counts()) first and then see how to use value_counts on a dataframe, i.e. with multiple variables.
Let us load Pandas and check the version of Pandas. With Pandas version 1.1.0 and above we can use value_counts() on Pandas dataframe and Series.
import pandas as pd import numpy as np pd.__version__ 1.1.0
We will be using Penguins data set to count values for one or more variables. Let us load the data directly from github page.
data_url = "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv" df = pd.read_csv(data_url) df.head() species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 3 Adelie Torgersen NaN NaN NaN NaN NaN 4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female
For our examples illustrating value_counts(), we will mainly focus first on categorical variables and get their counts/frequency.
Pandas value_counts() on a single column
With Pandas value_counts() function we can compute the frequency of a variable from dataframe as shown below. In the example below, we are interested in “island” column and ask what are the counts of each unique island in the dataset. We can see that there are three different islands in the data and also count/frequency of each of them.
df.island.value_counts() Biscoe 168 Dream 124 Torgersen 52 Name: island, dtype: int64
By default, the resulting Series object from Pandas value_counts()
will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Here is another example of using value_counts() on another categorical variable and their counts.
df.species.value_counts() Adelie 152 Gentoo 124 Chinstrap 68 Name: species, dtype: int64
Pandas’ value_counts() to get proportion
By using normalize=True argument to Pandas value_counts() function, we can get the proportion of each value of the variable instead of the counts.
df.species.value_counts(normalize = True)
We can see that the resulting Series has relative frequencies of the unique values.
Adelie 0.441860 Gentoo 0.360465 Chinstrap 0.197674 Name: species, dtype: float64
How to convert value_counts() output into a data frame
Pandas’ value_counts() returns a Series object. Here is a way to convert the output Series from value_counts() into a Pandas Data Frame. We can use reset_index() function and a get data frame.
df.species.value_counts().reset_index() index species 0 Adelie 152 1 Gentoo 124 2 Chinstrap 68
Pandas reset_index() function uses the variable name for the counts. Often you would like to rename column names to more suitable names. Here we assign new column names using columns method on the dataframe.
df_species_counts = df.species.value_counts().reset_index() df_species_counts.columns = ["island", "n"]
Now the results from value_counts() is a dataframe with suitable column names.
df_species_counts island n 0 Biscoe 168 1 Dream 124 2 Torgersen 52
Pandas value_counts() on multiple columns (or on a dataframe)
Sometimes you might want to tabulate counts of multiple variables. With Pandas version 1.1.0 and above we can use Pandas’ value_coiunts() function to get counts for multiple variable. For example, if we want to know the counts of each island and species combination, we can use value_counts as follows.
df[["island", "species"]].value_counts()
Pandas value_counts() on dataframe gives the result as multi-index Series.
island species Biscoe Gentoo 124 Dream Chinstrap 68 Adelie 56 Torgersen Adelie 52 Biscoe Adelie 44 dtype: int64
We can also use Pandas to_markdown() fucntion to nicely tabulate the results from value_counts()
print(df[["island", "species"]].value_counts().to_markdown()) | | 0 | |:------------------------|----:| | ('Biscoe', 'Gentoo') | 124 | | ('Dream', 'Chinstrap') | 68 | | ('Dream', 'Adelie') | 56 | | ('Torgersen', 'Adelie') | 52 | | ('Biscoe', 'Adelie') | 44 |
If you want the results from value_counts() as a simple data frame, not multi-indexed one, we can use reset_index() function to get a simple dataframe.
df[["island", "species"]].value_counts(sort=False).reset_index() island species 0 0 Biscoe Adelie 44 1 Biscoe Gentoo 124 2 Dream Adelie 56 3 Dream Chinstrap 68 4 Torgersen Adelie 52
Another useful feature of value_counts() is that we can get normalized values of counts as proportions, the relative frequency by dividing each value by the sum of all values., using normalize=True argument.
df[["island", "species"]].value_counts(normalize=True) island species Biscoe Gentoo 0.360465 Dream Chinstrap 0.197674 Adelie 0.162791 Torgersen Adelie 0.151163 Biscoe Adelie 0.127907 dtype: float64
If we want the result as a simple dataframe, we can use reset_index() function to convert the multi-indexed dataframe to a simple one.
df[["island", "species"]].value_counts(normalize=True).reset_index() island species 0 0 Biscoe Gentoo 0.360465 1 Dream Chinstrap 0.197674 2 Dream Adelie 0.162791 3 Torgersen Adelie 0.151163 4 Biscoe Adelie 0.127907
Pandas value_counts() with groupby()
If you are using pandas version below 1.1.0 and stil want to compute counts of multiple variables, the solution is to use Pandas groupby function. We basically select the variables of interest from the data frame and use groupby on the variables and compute size. In the example below we have computed the counts for each combination of unique values of two variables.
df[["island", "species"]].groupby(["island", "species"]).size() island species Biscoe Adelie 44 Gentoo 124 Dream Adelie 56 Chinstrap 68 Torgersen Adelie 52 dtype: int64
You can see that we get the same results, but with different order. As both value_counts() and groupby() functions have different sort options.