How To Filter Pandas Dataframe By Values of Column?

One of the biggest advantages of having the data as a Pandas Dataframe is that Pandas allows us to slice and dice the data in multiple ways.

Often, you may want to subset a pandas dataframe based on one or more values of a specific column. Essentially, we would like to select rows based on one value or multiple values present in a column.

Here are SIX examples of using Pandas dataframe to filter rows or select rows based values of a column(s).

Let us first load gapminder data as a dataframe into pandas.

# load pandas
import pandas as pd
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

This data frame has over 6000 rows and 6 columns. One of the columns is year. Let us look at the first three rows of the data frame.

print(gapminder.head(3))
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

Let us say we want to filter the data frame such that we get a smaller data frame with “year” values equal to 2002. That is, we want to subset the data frame based on values of year column. We keep the rows if its year value is 2002, otherwise we don’t.

1. How to Select Rows of Pandas Dataframe Based on a Single Value of a Column?

One way to filter by rows in Pandas is to use boolean expression. We first create a boolean variable by taking the column of interest and checking if its value equals to the specific value that we want to select/keep.

For example, let us filter the dataframe or subset the dataframe based on year’s value 2002. This conditional results in a boolean variable that has True when the value of year equals 2002, False otherwise.

# does year equals to 2002?
# is_2002 is a boolean variable with True or False in it
>is_2002 =  gapminder['year']==2002
>print(is_2002.head())
0    False
1    False
2    False
3    False
4    False

We can then use this boolean variable to filter the dataframe. After subsetting we can see that new dataframe is much smaller in size.

# filter rows for year 2002 using  the boolean variable
>gapminder_2002 = gapminder[is_2002]
>print(gapminder_2002.shape)
(142, 6)

We have successfully filtered pandas dataframe based on values of a column. Here, all the rows with year equals to 2002.

>print(gapminder_2002.head())
        country  year         pop continent  lifeExp    gdpPercap
10  Afghanistan  2002  25268405.0      Asia   42.129   726.734055
22      Albania  2002   3508512.0    Europe   75.651  4604.211737
34      Algeria  2002  31287142.0    Africa   70.994  5288.040382
46       Angola  2002  10866106.0    Africa   41.003  2773.287312
58    Argentina  2002  38331121.0  Americas   74.340  8797.640716

In the above example, we used two steps, 1) create boolean variable satisfying the filtering condition 2) use boolean variable to filter rows. However, we don’t really have to create a new boolean variable and save it to do the filtering. Instead, we can directly give the boolean expression to subset the dataframe by column value as follows.

# filter rows for year 2002 using  the boolean expression
>gapminder_2002 = gapminder[gapminder['year']==2002]
>print(gapminder_2002.shape)
(142, 6)

How To Filter rows using Pandas chaining?

We can also use Pandas chaining operation, to access a dataframe’s column and to select rows like previous example. Pandas chaining makes it easy to combine one Pandas command with another Pandas command or user defined functions.

# filter rows for year 2002 using  the boolean expression
>gapminder_2002 = gapminder[gapminder.year == 2002]
>print(gapminder_2002.shape)
(142, 6)

In the above example, we checked for equality (year==2002) and kept the rows matching a specific value. We can use any other comparison operator like “less than” and “greater than” and create boolean expression to filter rows of pandas dataframe.

2. How to Select Rows of Pandas Dataframe Whose Column Value Does NOT Equal a Specific Value?

Sometimes, you may want tot keep rows of a data frame based on values of a column that does not equal something. Let us filter our gapminder dataframe whose year column is not equal to 2002. Basically we want to have all the years data except for the year 2002.

# filter rows for year does not equal to 2002
>gapminder_not_2002 = gapminder[gapminder.year != 2002]
>gapminder_not_2002 = gapminder[gapminder['year']!=2002]
>gapminder_not_2002.shape
(1562, 6)

3. How to Select Rows of Pandas Dataframe Whose Column Value is NOT NA/NAN?

Often you may want to filter a Pandas dataframe such that you would like to keep the rows if values of certain column is NOT NA/NAN.

We can use Pandas notnull() method to filter based on NA/NAN values of a column.

# filter out rows ina . dataframe with column year values NA/NAN
>gapminder_no_NA = gapminder[gapminder.year.notnull()]

4. How to Select Rows of Pandas Dataframe Based on a list?

Also in the above example, we selected rows based on single value, i.e. year == 2002. However, often we may have to select rows using multiple values present in an iterable or a list. For example, let us say we want select rows for years [1952, 2002].

Pandas dataframe’s isin() function allows us to select rows using a list or any iterable. If we use isin() with a single column, it will simply result in a boolean variable with True if the value matches and False if it does not.

#To select rows whose column value is in list 
years = [1952, 2007]
gapminder.year.isin(years)

We can use the boolean array to select the rows like before

>gapminder_years= gapminder[gapminder.year.isin(years)]
>gapminder_years.shape
(284, 6)

We can make sure our new data frame contains row corresponding only the two years specified in the list. Let us use Pandas unique function to get the unique values of the column “year”

>gapminder_years.year.unique()
array([1952, 2007])

5. How to Select Rows of Pandas Dataframe Based on Values NOT in a list?

We can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front. For example, to get rows of gapminder data frame whose column values not in the continent list, we will use

>continents = ['Asia','Africa', 'Americas', 'Europe']
>gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)]
>gapminder_Ocean.shape 
(24,6)

This will result in a smaller dataframe with gapminder data for just Oceania continent. We can verify this again using Pandas’ unique function as before. We will just see “Oceania” continent.

>gapminder_Ocean.continent.unique()
array(['Oceania'], dtype=object)

6. How to Select Rows of Pandas Dataframe using Multiple Conditions?

We can combine multiple conditions using & operator to select rows from a pandas data frame. For example, we can combine the above two conditions to get Oceania data from years 1952 and 2002.

gapminder[~gapminder.continent.isin(continents) & 
           gapminder.year.isin(years)]

Now we will have rows corresponding to the Oceania continent for the years 1957 and 2007.

          country  year         pop continent  lifeExp    gdpPercap
60      Australia  1952   8691212.0   Oceania   69.120  10039.59564
71      Australia  2007  20434176.0   Oceania   81.235  34435.36744
1092  New Zealand  1952   1994794.0   Oceania   69.390  10556.57566
1103  New Zealand  2007   4115771.0   Oceania   80.204  25185.00911