How To Reshape Pandas Dataframe with melt and wide_to_long()?

Pandas melt to reshape dataframe
Pandas melt to reshape dataframe

Reshaping data frames into tidy format is probably one of the most frequent things you would do in data wrangling. In this post, we will learn how to use Pandas melt() function and wide_long_long() function to reshape Pandas dataframe in wide form to long tidy form.

A data frame is tidy when it satisfies the following rules.

  • Each variable in the data set is placed in its own column
  • Each observation is placed in its own row
  • Each value is placed in its own cell

Take a look the nice visual representation of the tidy data principle from Garrett.

Tidy Data (Credit: https://garrettgman.github.io/tidying/)

Often your data may be in a wider format that is pleasing visually, but not that easy to do data analysis. Pandas has excellent tool sets to wrangle data and reshape it to tidy format. In this post, we will see three examples of tidying data by reshaping data frame in wide form to long form. The three examples aim to reshape the data as shown below, but with different levels of complexities.

Pandas melt to reshape dataframe: Wide to Tidy

In the first example we will see a simple example of data frame in wider form and use Pandas melt function to reshape it into longer tidier form.

Let us start with a toy data frame made from scratch. We will create a data frame from a dictionary.

data = {'Name': ['John', 'Smith', 'Liz',], 
        'Weight': [150, 170, 110], 
        'BP': [120, 130, 100]}

It is a toy data frame with data that is useful for eyes.

wide_df = pd.DataFrame(data)
wide_df

	Name	Weight	BP
0	John	150	120
1	Smith	170	130
2	Liz	110	100

Reshaping with Pandas Melt

We can use Pandas melt function to reshape the data frame to a longer form that satisfies the tidy data principles. To Pandas melt function, we need to specify which variable we need to keep in the long tidy data frame and optionally we can specify the names for variable and the values. In this example, we simply name the variable as key and value as value.

wide_df.melt(id_vars='Name', var_name='key', value_name='value')

	Name	key	value
0	John	Weight	150
1	Smith	Weight	170
2	Liz	Weight	110
3	John	BP	120
4	Smith	BP	130
5	Liz	BP	100

Reshape: Wide to Long with Pandas Melt

Let us use gapminder data in wide form to reshape dataframe in a wide form to long tidy form.

data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))

  continent  country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Africa  Algeria     2449.008185     3013.976023     2550.816880   
1    Africa   Angola     3520.610273     3827.940465     4269.276742   
2    Africa    Benin     1062.752200      959.601080      949.499064   

The gapminder data in wide form contains lifeExp, gdpPercap and population per year as columns. For this example, let us work only with lifeExp columns. We will filter our dataframe to contain life Expectancy values per year using Pandas loc and string matching.

lifeExp = gapminder.loc[:, gapminder.columns.str.contains('^life|^c')]
print(lifeExp.head(n=3))

  continent  country  lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  \
0    Africa  Algeria        43.077        45.685        48.303        51.407   
1    Africa   Angola        30.015        31.999        34.000        35.985   
2    Africa    Benin        38.223        40.358        42.618        44.885  

Now our dataframe has country, continent and lifeExp per year in each column. Note that for each row country value is unique. We can use pandas melt function to convert this wide data frame to a data frame in long form. In this example, we would like to keep both continent and country as columns, so we specify that using ‘id_vars’ argument. In th long tidy form we want year and lifeExp as our additional columns. We can specify ‘var_name=year’ and value_name=’lifeExp’

gapminder_tidy = lifeExp.melt(id_vars=["continent", "country"], 
                              var_name="year", 
                              value_name="lifeExp")
gapminder_tidy.head(n=3)

We get the gapminder data in long form, with each variable as separate columns. However, note that the year column contains the prefix “lifeExp_”.

gapminder_tidy.head()

	continent	country	year	lifeExp
0	Africa	Algeria	lifeExp_1952	43.077
1	Africa	Angola	lifeExp_1952	30.015
2	Africa	Benin	lifeExp_1952	38.223
3	Africa	Botswana	lifeExp_1952	47.622
4	Africa	Burkina Faso	lifeExp_1952	31.975

We can remove the prefix using Pandas str.replace function. And we get the tidy long data frame as we wanted.

gapminder_tidy.year = gapminder_tidy.year.str.replace("lifeExp_","")

	continent	country	year	lifeExp
0	Africa	Algeria	1952	43.077
1	Africa	Angola	1952	30.015
2	Africa	Benin	1952	38.223
3	Africa	Botswana	1952	47.622
4	Africa	Burkina Faso	1952	31.975

Reshaping Pandas Dataframe with wide_to_long()

In addition to melt, Pandas also another function called “wide_to_long”. We can use Pandas’ wide_to_long() to reshape the wide dataframe into long/tall dataframe. Another benefit of using Pandas wide_to_long() is that we can easily take care of the prefix in the column names.

We need to specify “stubnames” to extract the prefix from column variable names. In our example, ‘stubnames=[‘lifeExp’]’. We also need to specify how the column names are separated. We also need to specify a column that is unique to each row. IN our example, “i=’country'” is unique to each row. And finally, we need to specify the new variable name we create. In our example “j=’year'”.

lifeExp_tidy = pd.wide_to_long(lifeExp, stubnames=['lifeExp'], i='country',
                 j='year', sep='_')

Now we have reshaped the wide dataframe to long dataframe using Pandas wide_to_long() function.

lifeExp_tidy.head(n=3)

continent	lifeExp
country	year		
Algeria	1952	Africa	43.077
Angola	1952	Africa	30.015
Benin	1952	Africa	38.223

We can reindex to change the multi-indexed column names and get the data in long form.

lifeExp_tidy.reset_index(inplace=True)
lifeExp_tidy.head(n=3)


country	year	continent	lifeExp
0	Algeria	1952	Africa	43.077
1	Angola	1952	Africa	30.015
2	Benin	1952	Africa	38.223

Under the hood Pandas wide_to_long() uses Pandas’ melt function. However, Pandas’ wide_to_long() is easier to use. For example, the original gapminder dataframe in wide form contain multiple variables over time, lifeExp, gdpPercap, and pop. We can use Pandas’ wide_to_long()’s argument stubnames to specify multiple variables that we want to reshape to long form. For example, to reshape all three variables over time in gapminder dataframe in wide form, we specify the prefixes with stubnames=[‘lifeExp’, ‘gdpPercap’,’pop’].


gapminder_long = pd.wide_to_long(gapminder, ['lifeExp', 'gdpPercap','pop'], 
                 i='id',
                 j='year', sep='_')
gapminder_long.head()

In a single line using Pandas’ wide_to_long() we have reshaped gapminder dataframe with multiple variables over time to long form.

		continent	country	lifeExp	gdpPercap	pop
id	year					
0	1952	Africa	Algeria	43.077	2449.008185	9279525.0
1	1952	Africa	Angola	30.015	3520.610273	4232095.0
2	1952	Africa	Benin	38.223	1062.752200	1738315.0
3	1952	Africa	Botswana	47.622	851.241141	442308.0
4	1952	Africa	Burkina Faso	31.975	543.255241	4469979.0

In summary, we saw multiple examples using Pandas’ melt and Pandas wide_to_long() functions to reshape a dataframe in wide form to long form. If you are interested in reshaping a dataframe in long form or tidy to wide form, check out the post on using pivot_table in Pandas to do that