How to Select Top N Rows with the Largest Values in a Column(s) in Pandas?

Sometimes, while doing data wrangling, we might need to get a quick look at the top rows with the largest or smallest values in a column. This kind of quick glance at the data reveal interesting information in a dataframe. Pandas dataframe easily enables one to have a quick look at the top rows either with largest or smallest values in a column.

Pandas library has function called nlargest makes it really easy to look at the top or bottom rows. Let us first load Pandas library.

import pandas as pd

Let us use gapminder data. Let us load the data from Carpentry’s github page and look at the data corresponding to the year 2007 alone.

# Carpentry url containing data
data_url = 'http://bit.ly/2cLzoxH'
# Load the data from Carpentry url
gapminder = pd.read_csv(data_url)
# filter the data to contain just year=2007
gapminder_2007 = gapminder[gapminder.year==2007]

Pandas nlargest function can take the number of rows we need as argument and the column name for which we are looking for largest values. Pandas nlargest function

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

Let us look at the top 3 rows of the dataframe with the largest population values using the column variable “pop”.

gapminder_2007.nlargest(3,'pop')

We just get two rows and see that China, India, and US are the top 3 countries with largest population.

country	year	pop	continent	lifeExp	gdpPercap
299	China	2007	1.318683e+09	Asia	72.961	4959.114854
707	India	2007	1.110396e+09	Asia	64.698	2452.210407
1619	United States	2007	3.011399e+08	Americas	78.242	42951.653090

The function nlargest also has an argument keep that allows one to deal with duplicate values. keep can take {‘first’, ‘last’, ‘all’}, where

first : prioritize the first occurrence(s)
last : prioritize the last occurrence(s)
all : does not drop any duplicates

How to Get Top N Rows Based on Largest Values in Multiple Columns in Pandas?

In the above example we saw getting top rows ordered by values of a single column. Pandas nlargest function can take more than one variable to order the top rows.

We can give a list of variables as input to nlargest and get first n rows ordered by the list of columns in descending order.

# top n rows ordered by multiple columns
gapminder_2007.nlargest(3,['lifeExp','gdpPercap'])

Here we get top 3 rows with largest values in column “lifeExp” and then “gdpPercap”.

	country	year	pop	continent	lifeExp	gdpPercap
803	Japan	2007	127467972.0	Asia	82.603	31656.06806
671	Hong Kong China	2007	6980412.0	Asia	82.208	39724.97867
695	Iceland	2007	301931.0	Europe	81.757	36180.78919

How to Get Top N Rows Based on Smallest Values of a Column in Pandas?

Just as you guessed, Pandas has the function nsmallest to select top rows of smallest values in one or more column, in descending order.

Let use see an example of using nsmallest on gapminder data. Here is how to get top 3 countries with smallest lifeExp.

gapminder_2007.nsmallest(3,'liefExp')
	country	year	pop	continent	lifeExp	gdpPercap
1463	Swaziland	2007	1133066.0	Africa	39.613	4513.480643
1043	Mozambique	2007	19951656.0	Africa	42.082	823.685621
1691	Zambia	2007	11746035.0	Africa	42.384	1271.211593