Pandas offer many ways to select rows from a dataframe. One of the commonly used approach to filter rows of a dataframe is to use the indexing in multiple ways. For example, one can use label based indexing with loc function.
Introducing pandas query() function, Jake VanderPlas nicely explains,
While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.
Not just that, often this involve slightly messier code with a lot of repetition. A simpler alternative in Pandas to select or filter rows dataframe with specified condition is to use query function Pandas.
In this post, we will see multiple examples of using query function in Pandas to select or filter rows of Pandas data frame based values of columns.
Let us first load Pandas.
# import pandas import pandas as pd
Let us load gapminder dataset to work through examples of using query() to filter rows.
data_url = 'http://bit.ly/2cLzoxH' # read data from url as pandas dataframe gapminder = pd.read_csv(data_url) print(gapminder.head(3)) country year pop continent lifeExp gdpPercap 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030 2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
Filtering Rows of Pandas Dataframe – the usual way
Let us say we want to subset the gapminder dataframe such that we want all rows whose country value is United States. We can use Pandas indexing to subset the gapminder dataframe for United States as follows. Here we first create a boolean series and use it to filter the dataframe.
gapminder[gapminder.country=="United States"].head()
And we would get
country year pop continent lifeExp gdpPercap 1608 United States 1952 157553000.0 Americas 68.44 13990.48208 1609 United States 1957 171984000.0 Americas 69.49 14847.12712 1610 United States 1962 186538000.0 Americas 70.21 16173.14586 1611 United States 1967 198712000.0 Americas 70.76 19530.36557 1612 United States 1972 209896000.0 Americas 71.34 21806.03594
Filtering Rows of Pandas Dataframe by variable using query() function
In the above example, we can see that we have to create an intermediate boolean variable and also have to repeat “gapminder” two times.
Filtering Rows with Pandas query(): Example 1
A cleaner approach to filter Pandas dataframe is to use Pandas query() function and select rows. The way to query() function to filter rows is to specify the condition within quotes inside query().
# filter rows with Pandas query gapminder.query('country=="United States"').head()
And we would get the same answer as above.
Filtering Rows with Pandas query(): Example 2
In the above query() example we used string to select rows of a dataframe. We can also use it to select based on numerical values. For example, to select rows for year 1952, we can write
gapminder.query('year==1952').head()
And we would get a new dataframe for the year 1952.
country year pop continent lifeExp gdpPercap 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 12 Albania 1952 1282697.0 Europe 55.230 1601.056136 24 Algeria 1952 9279525.0 Africa 43.077 2449.008185 36 Angola 1952 4232095.0 Africa 30.015 3520.610273 48 Argentina 1952 17876956.0 Americas 62.485 5911.315053
Filtering Rows with Pandas query() multiple conditions: Example 3
Similarly, we use boolean operators to combine multiple conditions. For example, if want to select rows corresponding to US for the year greater than 1996,
gapminder.query('country=="United States" & year > 1996')
And we would get
country year pop continent lifeExp gdpPercap 1617 United States 1997 272911760.0 Americas 76.810 35767.43303 1618 United States 2002 287675526.0 Americas 77.310 39097.09955 1619 United States 2007 301139947.0 Americas 78.242 42951.65309
Filtering Rows with Pandas query() Multiple Conditions: Example 4
We can also use query() to check for matches with a list of values corresponding to a column. Here we use in operator to check for equality.
gapminder.query('country in ["United States", "United Kingdom"] & year > 2000')
And we would get
country year pop continent lifeExp gdpPercap 1606 United Kingdom 2002 59912431.0 Europe 78.471 29478.99919 1607 United Kingdom 2007 60776238.0 Europe 79.425 33203.26128 1618 United States 2002 287675526.0 Americas 77.310 39097.09955 1619 United States 2007 301139947.0 Americas 78.242 42951.65309
Filtering Rows with Pandas query(): Example 5
Starting with Pandas 1.0.0. query() function has expanded the functionalities of using backtick quoting for more than only spaces. In the simplest use case backticks quoted variable is useful for column names with spaces in it. For example, if we have data frame with column ‘C C’ with space
df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2), 'C C': range(10, 5, -1)})
We can use query function with backticks quoting as shown in Pandas documentation.
df.query('B == `C C`')