Often you want to sort Pandas data frame in a specific way. Typically, one may want to sort pandas data frame based on the values of one or more columns or sort based on the values of row index or row names of pandas dataframe. Pandas data frame has two useful functions
- sort_values(): to sort pandas data frame by one or more columns
- sort_index(): to sort pandas data frame by row index
Each of these functions come with numerous options, like sorting the data frame in specific order (ascending or descending), sorting in place, sorting with missing values, sorting by specific algorithm and so on.
Here is a quick Pandas tutorial on multiple ways of using sort_values() and sort_index() to sort pandas data frame using a real data set (gapminder).
Let us first load the gapminder data from software carpentry URL.
data_url = 'http://bit.ly/2cLzoxH' # read data from url as pandas dataframe gapminder = pd.read_csv(data_url) # print the first three rows print(gapminder.head(n=3))
1. How to Sort Pandas Dataframe based on the values of a column?
We can sort pandas dataframe based on the values of a single column by specifying the column name wwe want to sort as input argument to sort_values(). For example, we can sort by the values of “lifeExp” column in the gapminder data like
sort_by_life = gapminder.sort_values('lifeExp')
print(sort_by_life.head(n=3)) country year pop continent lifeExp gdpPercap 1292 Rwanda 1992 7290203.0 Africa 23.599 737.068595 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 552 Gambia 1952 284320.0 Africa 30.000 485.230659
Note that by default sort_values sorts and gives a new data frame. The new sorted data frame is in ascending order (small values first and large values last). With head function we can see that the first rows have smaller life expectancy. Using tail function the sorted data frame, we can see that the last rows have higher life expectancy.
print(sort_by_life.tail(n=3)) country year pop continent lifeExp gdpPercap 802 Japan 2002 127065841.0 Asia 82.000 28604.59190 671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.97867 803 Japan 2007 127467972.0 Asia 82.603 31656.06806
2. How to Sort Pandas Dataframe based on the values of a column (Descending order)?
To sort a dataframe based on the values of a column but in descending order so that the largest values of the column are at the top, we can use the argument ascending=False.
sort_by_life = gapminder.sort_values('lifeExp', ascending=False)
In this example, we can see that after sorting the dataframe by lifeExp with ascending=False, the countries with largest life expectancy are at the top.
print(sort_by_life.head(n=3)) country year pop continent lifeExp gdpPercap 803 Japan 2007 127467972.0 Asia 82.603 31656.06806 671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.97867 802 Japan 2002 127065841.0 Asia 82.000 28604.59190
3. How to Sort Pandas Dataframe based on a column and put missing values first?
Often a data frame might contain missing values and when sorting a data frame on a column with missing value, we might want to have rows with missing values to be at the first or at the last.
We can specify the position we want for missing values using the argument na_position. With na_position=’first’, it will have the rows with missing values first.
sort_na_first = gapminder.sort_values('lifeExp', na_position='first')
In this example, there are NO missing values and that is why there is no na values at the top when sorted with the option na_position=’first’.
sort_na_first.head() country year pop continent lifeExp gdpPercap 1292 Rwanda 1992 7290203.0 Africa 23.599 737.068595 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 552 Gambia 1952 284320.0 Africa 30.000 485.230659
4. How to Sort Pandas Dataframe based on a column in place?
By default sorting pandas data frame using sort_values() or sort_index() creates a new data frame. If you don’t want create a new data frame after sorting and just want to do the sort in place, you can use the argument “inplace = True”. Here is an example of sorting a pandas data frame in place without creating a new data frame.
gapminder.sort_values('lifeExp', inplace=True, ascending=False)
We can see that the data frame sorted as lifeExp values at the top are smallest and the row indices are not in order.
print(gapminder.head(n=3)) country year pop continent lifeExp gdpPercap 1292 Rwanda 1992 7290203.0 Africa 23.599 737.068595 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 552 Gambia 1952 284320.0 Africa 30.000 485.230659
Note that, the row index of the sorted data frame is different from the data frame before sorting.
5. How to Sort Pandas Dataframe based on Index (in place)?
We can use sort_index() to sort pandas dataframe to sort by row index or names. In this example, row index are numbers and in the earlier example we sorted data frame by lifeExp and therefore the row index are jumbled up. We can sort by row index (with inplace=True option) and retrieve the original dataframe.
gapminder.sort_index(inplace=True)
Now we can see that row indices start from 0 and sorted in ascending order. Compare it to the previous example, where the first row index is 1292 and row indices are not sorted.
print(gapminder.head(n=3)) country year pop continent lifeExp gdpPercap 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030 2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
6. How to Sort Pandas Dataframe Based on the Values of Multiple Columns?
Often, you might want to sort a data frame based on the values of multiple columns. We can specify the columns we want to sort by as a list in the argument for sort_values(). For example, to sort by values of two columns, we can do.
sort_by_life_gdp = gapminder.sort_values(['lifeExp','gdpPercap'])
We can see that lifeExp column is sorted in ascending order and for each values of lifeExp, gdpPercap is sorted.
print(sort_by_life_gdp.head()) country year pop continent lifeExp gdpPercap 1292 Rwanda 1992 7290203.0 Africa 23.599 737.068595 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 552 Gambia 1952 284320.0 Africa 30.000 485.230659
Note that when sorting by multiple columns, pandas sort_value() uses the first variable first and second variable next. We can see the difference by switching the order of column names in the list.
sort_by_life_gdp = gapminder.sort_values(['gdpPercap','lifeExp'])
print(sort_by_life_gdp.head()) country year pop continent lifeExp gdpPercap 334 Congo Dem. Rep. 2002 55379852.0 Africa 44.966 241.165877 335 Congo Dem. Rep. 2007 64606759.0 Africa 46.462 277.551859 876 Lesotho 1952 748747.0 Africa 42.138 298.846212