How To Filter Pandas Dataframe By Values of Column?

Pandas Filter/Select Rows Based on Column Values
Pandas Filter/Select Rows Based on Column Values

In this post, we will learn how to filter Pandas dataframe by column values. More specifically, we will subset a pandas dataframe based on one or more values of a specific column.

Pandas Filter/Select Rows Based on Column Values

In this tutorial, we will see SIX examples of using Pandas dataframe to filter rows or select rows based values of a column(s). These examples cover common filtering scenarios where we want to filter rows what a column value equals something, the column value does not equal something, column value is not NA, or if the column values are in a list or not in a list.

Let us first load gapminder data as a dataframe into pandas.

# load pandas
import pandas as pd
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

This data frame has over 6000 rows and 6 columns. One of the columns is year. Let us look at the first three rows of the data frame.

print(gapminder.head(3))
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

Let us say we want to filter the data frame such that we get a smaller data frame with “year” values equal to 2002. That is, we want to subset the data frame based on values of year column. We keep the rows if its year value is 2002, otherwise we don’t.

1. How to Select Rows of Pandas Dataframe Based on a Single Value of a Column?

One way to filter by rows in Pandas is to use boolean expression. We first create a boolean variable by taking the column of interest and checking if its value equals to the specific value that we want to select/keep.

For example, let us filter the dataframe or subset the dataframe based on year’s value 2002. This conditional results in a boolean variable that has True when the value of year equals 2002, False otherwise.

# does year equals to 2002?
# is_2002 is a boolean variable with True or False in it
>is_2002 =  gapminder['year']==2002
# is_2002 is a boolean variable with True or False in it
>print(is_2002.head())
0    False
1    False
2    False
3    False
4    False

We can then use this boolean variable to filter the dataframe. After subsetting we can see that new dataframe is much smaller in size.

# filter rows for year 2002 using  the boolean variable
>gapminder_2002 = gapminder[is_2002]

Checking the shape or dimension of the filtered dataframe

>print(gapminder_2002.shape)
(142, 6)

We have successfully filtered pandas dataframe based on values of a column. Here, all the rows with year equals to 2002.

>print(gapminder_2002.head())
        country  year         pop continent  lifeExp    gdpPercap
10  Afghanistan  2002  25268405.0      Asia   42.129   726.734055
22      Albania  2002   3508512.0    Europe   75.651  4604.211737
34      Algeria  2002  31287142.0    Africa   70.994  5288.040382
46       Angola  2002  10866106.0    Africa   41.003  2773.287312
58    Argentina  2002  38331121.0  Americas   74.340  8797.640716

In the above example, we used two steps, 1) create boolean variable satisfying the filtering condition 2) use boolean variable to filter rows. However, we don’t really have to create a new boolean variable and save it to do the filtering. Instead, we can directly give the boolean expression to subset the dataframe by column value as follows.

# filter rows for year 2002 using  the boolean expression
>gapminder_2002 = gapminder[gapminder['year']==2002]
>print(gapminder_2002.shape)
(142, 6)

How To Filter rows using Pandas chaining?

We can also use Pandas chaining operation, to access a dataframe’s column and to select rows like previous example. Pandas chaining makes it easy to combine one Pandas command with another Pandas command or user defined functions.

Here we use Pandas eq() function and chain it with the year series for checking element-wise equality to filter the data corresponding to year 2002.

# filter rows for year 2002 using  the boolean expression
>gapminder_2002 = gapminder[gapminder.year.eq(2002)]
>print(gapminder_2002.shape)
(142, 6)

In the above example, we checked for equality (year==2002) and kept the rows matching a specific value. We can use any other comparison operator like “less than” and “greater than” and create boolean expression to filter rows of pandas dataframe.

2. How to Select Rows of Pandas Dataframe Whose Column Value Does NOT Equal a Specific Value?

Sometimes, you may want tot keep rows of a data frame based on values of a column that does not equal something. Let us filter our gapminder dataframe whose year column is not equal to 2002. Basically we want to have all the years data except for the year 2002.

To check if the value of a variable equals to a specific value, we can access it in two ways. First we use the dot operate to access a variable.

# filter rows for year does not equal to 2002
>gapminder_not_2002 = gapminder[gapminder.year != 2002]
>gapminder_not_2002.shape
(1562, 6)

We can also use square bracket based selection to get a variable.

# filter rows for year does not equal to 2002
>gapminder_not_2002 = gapminder[gapminder['year']!=2002]
>gapminder_not_2002.shape
(1562, 6)

3. How to Select Rows of Pandas Dataframe Whose Column Value is NOT NA/NAN?

Often you may want to filter a Pandas dataframe such that you would like to keep the rows if values of certain column is NOT NA/NAN.

Pandas Filter Out Rows based on NAs in a column

We can use Pandas notnull() method to filter based on NA/NAN values of a column.

# filter out rows ina . dataframe with column year values NA/NAN
>gapminder_no_NA = gapminder[gapminder.year.notnull()]

4. How to Select Rows of Pandas Dataframe Based on a list?

Also in the above example, we selected rows based on single value, i.e. year == 2002. However, often we may have to select rows using multiple values present in an iterable or a list. For example, let us say we want select rows for years [1952, 2002].

Pandas dataframe’s isin() function allows us to select rows using a list or any iterable. If we use isin() with a single column, it will simply result in a boolean variable with True if the value matches and False if it does not.

#To select rows whose column value is in list 
years = [1952, 2007]
gapminder.year.isin(years)

We can use the boolean array to select the rows like before

>gapminder_years= gapminder[gapminder.year.isin(years)]
>gapminder_years.shape
(284, 6)

We can make sure our new data frame contains row corresponding only the two years specified in the list. Let us use Pandas unique function to get the unique values of the column “year”

>gapminder_years.year.unique()
array([1952, 2007])

5. How to Select Rows of Pandas Dataframe Based on Values NOT in a list?

We can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front. For example, to get rows of gapminder data frame whose column values not in the continent list, we will use

>continents = ['Asia','Africa', 'Americas', 'Europe']
>gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)]
>gapminder_Ocean.shape 
(24,6)

This will result in a smaller dataframe with gapminder data for just Oceania continent. We can verify this again using Pandas’ unique function as before. We will just see “Oceania” continent.

>gapminder_Ocean.continent.unique()
array(['Oceania'], dtype=object)

6. How to Select Rows of Pandas Dataframe using Multiple Conditions?

We can combine multiple conditions using & operator to select rows from a pandas data frame. For example, we can combine the above two conditions to get Oceania data from years 1952 and 2002.

gapminder[~gapminder.continent.isin(continents) & 
           gapminder.year.isin(years)]

Now we will have rows corresponding to the Oceania continent for the years 1957 and 2007.

          country  year         pop continent  lifeExp    gdpPercap
60      Australia  1952   8691212.0   Oceania   69.120  10039.59564
71      Australia  2007  20434176.0   Oceania   81.235  34435.36744
1092  New Zealand  1952   1994794.0   Oceania   69.390  10556.57566
1103  New Zealand  2007   4115771.0   Oceania   80.204  25185.00911

In this post we saw multiple ways to filter rows of Pandas dataframe. There is more. Check out the post on how to use Pandas query() function to select rows from Pandas data frame.

Are you new to Pandas? And getting started with Pandas recently? Check out our new Byte Sized Pandas 101 tutorials.