In this post, we will learn how to filter Pandas dataframe by column values. More specifically, we will subset a pandas dataframe based on one or more values of a specific column.
In this tutorial, we will see SIX examples of using Pandas dataframe to filter rows or select rows based values of a column(s). These examples cover common filtering scenarios where we want to filter rows what a column value equals something, the column value does not equal something, column value is not NA, or if the column values are in a list or not in a list.
Let us first load gapminder data as a dataframe into pandas.
# load pandas import pandas as pd data_url = 'http://bit.ly/2cLzoxH' # read data from url as pandas dataframe gapminder = pd.read_csv(data_url)
This data frame has over 6000 rows and 6 columns. One of the columns is year. Let us look at the first three rows of the data frame.
print(gapminder.head(3)) country year pop continent lifeExp gdpPercap 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030 2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
Let us say we want to filter the data frame such that we get a smaller data frame with “year” values equal to 2002. That is, we want to subset the data frame based on values of year column. We keep the rows if its year value is 2002, otherwise we don’t.
1. How to Select Rows of Pandas Dataframe Based on a Single Value of a Column?
One way to filter by rows in Pandas is to use boolean expression. We first create a boolean variable by taking the column of interest and checking if its value equals to the specific value that we want to select/keep.
For example, let us filter the dataframe or subset the dataframe based on year’s value 2002. This conditional results in a boolean variable that has True when the value of year equals 2002, False otherwise.
# does year equals to 2002? # is_2002 is a boolean variable with True or False in it >is_2002 = gapminder['year']==2002
# is_2002 is a boolean variable with True or False in it >print(is_2002.head()) 0 False 1 False 2 False 3 False 4 False
We can then use this boolean variable to filter the dataframe. After subsetting we can see that new dataframe is much smaller in size.
# filter rows for year 2002 using the boolean variable >gapminder_2002 = gapminder[is_2002]
Checking the shape or dimension of the filtered dataframe
>print(gapminder_2002.shape) (142, 6)
We have successfully filtered pandas dataframe based on values of a column. Here, all the rows with year equals to 2002.
>print(gapminder_2002.head()) country year pop continent lifeExp gdpPercap 10 Afghanistan 2002 25268405.0 Asia 42.129 726.734055 22 Albania 2002 3508512.0 Europe 75.651 4604.211737 34 Algeria 2002 31287142.0 Africa 70.994 5288.040382 46 Angola 2002 10866106.0 Africa 41.003 2773.287312 58 Argentina 2002 38331121.0 Americas 74.340 8797.640716
In the above example, we used two steps, 1) create boolean variable satisfying the filtering condition 2) use boolean variable to filter rows. However, we don’t really have to create a new boolean variable and save it to do the filtering. Instead, we can directly give the boolean expression to subset the dataframe by column value as follows.
# filter rows for year 2002 using the boolean expression >gapminder_2002 = gapminder[gapminder['year']==2002] >print(gapminder_2002.shape) (142, 6)
How To Filter rows using Pandas chaining?
We can also use Pandas chaining operation, to access a dataframe’s column and to select rows like previous example. Pandas chaining makes it easy to combine one Pandas command with another Pandas command or user defined functions.
Here we use Pandas eq() function and chain it with the year series for checking element-wise equality to filter the data corresponding to year 2002.
# filter rows for year 2002 using the boolean expression >gapminder_2002 = gapminder[gapminder.year.eq(2002)] >print(gapminder_2002.shape) (142, 6)
In the above example, we checked for equality (year==2002) and kept the rows matching a specific value. We can use any other comparison operator like “less than” and “greater than” and create boolean expression to filter rows of pandas dataframe.
2. How to Select Rows of Pandas Dataframe Whose Column Value Does NOT Equal a Specific Value?
Sometimes, you may want tot keep rows of a data frame based on values of a column that does not equal something. Let us filter our gapminder dataframe whose year column is not equal to 2002. Basically we want to have all the years data except for the year 2002.
To check if the value of a variable equals to a specific value, we can access it in two ways. First we use the dot operate to access a variable.
# filter rows for year does not equal to 2002 >gapminder_not_2002 = gapminder[gapminder.year != 2002] >gapminder_not_2002.shape (1562, 6)
We can also use square bracket based selection to get a variable.
# filter rows for year does not equal to 2002 >gapminder_not_2002 = gapminder[gapminder['year']!=2002] >gapminder_not_2002.shape (1562, 6)
3. How to Select Rows of Pandas Dataframe Whose Column Value is NOT NA/NAN?
Often you may want to filter a Pandas dataframe such that you would like to keep the rows if values of certain column is NOT NA/NAN.
We can use Pandas notnull() method to filter based on NA/NAN values of a column.
# filter out rows ina . dataframe with column year values NA/NAN >gapminder_no_NA = gapminder[gapminder.year.notnull()]
4. How to Select Rows of Pandas Dataframe Based on a list?
Also in the above example, we selected rows based on single value, i.e. year == 2002. However, often we may have to select rows using multiple values present in an iterable or a list. For example, let us say we want select rows for years [1952, 2002].
Pandas dataframe’s isin() function allows us to select rows using a list or any iterable. If we use isin() with a single column, it will simply result in a boolean variable with True if the value matches and False if it does not.
#To select rows whose column value is in list years = [1952, 2007] gapminder.year.isin(years)
We can use the boolean array to select the rows like before
>gapminder_years= gapminder[gapminder.year.isin(years)] >gapminder_years.shape (284, 6)
We can make sure our new data frame contains row corresponding only the two years specified in the list. Let us use Pandas unique function to get the unique values of the column “year”
>gapminder_years.year.unique() array([1952, 2007])
5. How to Select Rows of Pandas Dataframe Based on Values NOT in a list?
We can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front. For example, to get rows of gapminder data frame whose column values not in the continent list, we will use
>continents = ['Asia','Africa', 'Americas', 'Europe'] >gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)] >gapminder_Ocean.shape (24,6)
This will result in a smaller dataframe with gapminder data for just Oceania continent. We can verify this again using Pandas’ unique function as before. We will just see “Oceania” continent.
>gapminder_Ocean.continent.unique() array(['Oceania'], dtype=object)
6. How to Select Rows of Pandas Dataframe using Multiple Conditions?
We can combine multiple conditions using & operator to select rows from a pandas data frame. For example, we can combine the above two conditions to get Oceania data from years 1952 and 2002.
gapminder[~gapminder.continent.isin(continents) & gapminder.year.isin(years)]
Now we will have rows corresponding to the Oceania continent for the years 1957 and 2007.
country year pop continent lifeExp gdpPercap 60 Australia 1952 8691212.0 Oceania 69.120 10039.59564 71 Australia 2007 20434176.0 Oceania 81.235 34435.36744 1092 New Zealand 1952 1994794.0 Oceania 69.390 10556.57566 1103 New Zealand 2007 4115771.0 Oceania 80.204 25185.00911
In this post we saw multiple ways to filter rows of Pandas dataframe. There is more. Check out the post on how to use Pandas query() function to select rows from Pandas data frame.
Are you new to Pandas? And getting started with Pandas recently? Check out our new Byte Sized Pandas 101 tutorials.