One of the most common tasks in data science is to manipulate the data frame we have to a specific format. For example, sometime we may want to take data frame with fewer columns, say in long format, summarize and convert into a data frame with multiple columns, i.e. a wide data frame.
Python Pandas function pivot_table help us with the summarization and conversion of dataframe in long form to dataframe in wide form, in a variety of complex scenarios.
In Pandas, the pivot table function takes simple data frame as input, and performs grouped operations that provides a multidimensional summary of the data. Yes, in a way, it is related Pandas group_by function. Jake Vanderplas nicely explains pivot_table in his Python Data Science Handbook as
essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.
Here are 3 examples of using pivot in Pandas with pivot_Table. We will use Pandas’ pivot_table function to summarize and convert our two/three column dataframe to multiple column dataframe.
Let us firs load Python pandas.
import pandas as pd
Let us use the gapminder data first create a data frame with just two columns.
>data_url = 'http://bit.ly/2cLzoxH' >gapminder = pd.read_csv(data_url) >print(gapminder.head(3)) country year pop continent lifeExp gdpPercap 0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030 2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
1. Pandas pivot Simple Example
Let us see a simple example of Python Pivot using a dataframe with jus two columns. Let us subset our dataframe to contain just two columns, continent and lifeExp
# select two columns from gapminder dataframe >df = gapminder[['continent','lifeExp']] >print(df.shape) (1704, 2)
We can see that df is a data frame in long format with two columns. As a simple example, we can use Pandas pivot_table to convert the tall table to a wide table, computing the mean lifeExp across continents. To do that, we will use pd.pivot_table with the data frame as one of the arguments and specify which variable we would like use for columns and which variable we would like to summarize. One of the arguments of pivot_table, agg_func has mean as default.
# simple example with pivot_table >pd.pivot_table(df, values='lifeExp', columns='continent') continent Africa Americas Asia Europe Oceania lifeExp 48.86533 64.658737 60.064903 71.903686 74.326208
Let us see another simple example of pivot_table. In the above example we used pvot_table to compute mean lifeExp for each continent. We can compute mean lifeExp for each country, by simply specifying “country” for columns argument.
# simple example with pivot_table >df = gapminder[['country','lifeExp']] # print dataframe shape >print(df.shape) # pivot_table example >pd.pivot_table(df, values='lifeExp', columns='country')
And we will get mean lifeExp for each country in wide format.
2. Pandas pivot_table on a data frame with three columns
Pandas pivot_table gets more useful when we try to summarize and convert a tall data frame with more than two variables into a wide data frame.
Let us say we have dataframe with three columns/variables and we want to convert this into a wide data frame have one of the variables summarized for each value of the other two variables.
Let us use three columns; continent, year, and lifeExp, from gapminder data and use pivot_table to compute mean lifeExp for each continent and year. Let us use ‘continent’ on columns and year as ‘index’
# select three columns from gapminder dataframe >df1 = gapminder[['continent', 'year','lifeExp']] # pivot table example with three columns >pd.pivot_table(df1, values='lifeExp', index=['year'], columns='continent') continent Africa Americas Asia Europe Oceania year 1952 39.135500 53.27984 46.314394 64.408500 69.255 1957 41.266346 55.96028 49.318544 66.703067 70.295 1962 43.319442 58.39876 51.563223 68.539233 71.085 1967 45.334538 60.41092 54.663640 69.737600 71.310 1972 47.450942 62.39492 57.319269 70.775033 71.910
3. Pandas pivot_table with Different Aggregating Function
As mentioned before, pivot_table uses mean function for aggregating or summarizing data by default. We can change the aggregating function, if needed. For example, we can use aggfunc=’min’ to compute “minimum” lifeExp instead of “mean” lifeExp for each year and continent values.
# pivot table example with aggfunc='min' >pd.pivot_table(df1, values='lifeExp', index=['year'], columns='continent', aggfunc='min') continent Africa Americas Asia Europe Oceania year 1952 30.000 37.579 28.801 43.585 69.12 1957 31.570 40.696 30.332 48.079 70.26 1962 32.767 43.428 31.997 52.098 70.93 1967 34.113 45.032 34.020 54.336 71.10 1972 35.400 46.714 36.088 57.005 71.89
Not only we can specify what aggregating function we want, we can also specify more than one aggregating functions. For example, if we are interested in bot minimum and maximum values of lifeExp for each year and continent, we can specify the functions as a list to the argument ‘aggfunc’
# pivot table example with multiple aggregating functions >pd.pivot_table(df1, values='lifeExp', index=['year'], columns='continent', aggfunc=[min,max]) min max continent Africa Americas Asia Europe Oceania Africa Americas Asia Europe Oceania year 1952 30.000 37.579 28.801 43.585 69.12 52.724 68.75 65.39 72.67 69.39 1957 31.570 40.696 30.332 48.079 70.26 58.089 69.96 67.84 73.47 70.33 1962 32.767 43.428 31.997 52.098 70.93 60.246 71.30 69.39 73.68 71.24
When we give multiple aggregating functions, we would get a multi-indexed data frame as output.