Reshaping data frames into tidy format is probably one of the most frequent things you would do in data wrangling. In this post, we will learn how to use Pandas melt() function and wide_long_long() function to reshape Pandas dataframe in wide form to long tidy form.
A data frame is tidy when it satisfies the following rules.
- Each variable in the data set is placed in its own column
- Each observation is placed in its own row
- Each value is placed in its own cell
Take a look the nice visual representation of the tidy data principle from Garrett.
Often your data may be in a wider format that is pleasing visually, but not that easy to do data analysis. Pandas has excellent tool sets to wrangle data and reshape it to tidy format. In this post, we will see three examples of tidying data by reshaping data frame in wide form to long form. The three examples aim to reshape the data as shown below, but with different levels of complexities.
In the first example we will see a simple example of data frame in wider form and use Pandas melt function to reshape it into longer tidier form.
Let us start with a toy data frame made from scratch. We will create a data frame from a dictionary.
data = {'Name': ['John', 'Smith', 'Liz',], 'Weight': [150, 170, 110], 'BP': [120, 130, 100]}
It is a toy data frame with data that is useful for eyes.
wide_df = pd.DataFrame(data) wide_df Name Weight BP 0 John 150 120 1 Smith 170 130 2 Liz 110 100
Reshaping with Pandas Melt
We can use Pandas melt function to reshape the data frame to a longer form that satisfies the tidy data principles. To Pandas melt function, we need to specify which variable we need to keep in the long tidy data frame and optionally we can specify the names for variable and the values. In this example, we simply name the variable as key and value as value.
wide_df.melt(id_vars='Name', var_name='key', value_name='value') Name key value 0 John Weight 150 1 Smith Weight 170 2 Liz Weight 110 3 John BP 120 4 Smith BP 130 5 Liz BP 100
Reshape: Wide to Long with Pandas Melt
Let us use gapminder data in wide form to reshape dataframe in a wide form to long tidy form.
data_url = "https://goo.gl/ioc2Td" gapminder = pd.read_csv(data_url) print(gapminder.head(3)) continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \ 0 Africa Algeria 2449.008185 3013.976023 2550.816880 1 Africa Angola 3520.610273 3827.940465 4269.276742 2 Africa Benin 1062.752200 959.601080 949.499064
The gapminder data in wide form contains lifeExp, gdpPercap and population per year as columns. For this example, let us work only with lifeExp columns. We will filter our dataframe to contain life Expectancy values per year using Pandas loc and string matching.
lifeExp = gapminder.loc[:, gapminder.columns.str.contains('^life|^c')] print(lifeExp.head(n=3)) continent country lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 \ 0 Africa Algeria 43.077 45.685 48.303 51.407 1 Africa Angola 30.015 31.999 34.000 35.985 2 Africa Benin 38.223 40.358 42.618 44.885
Now our dataframe has country, continent and lifeExp per year in each column. Note that for each row country value is unique. We can use pandas melt function to convert this wide data frame to a data frame in long form. In this example, we would like to keep both continent and country as columns, so we specify that using ‘id_vars’ argument. In th long tidy form we want year and lifeExp as our additional columns. We can specify ‘var_name=year’ and value_name=’lifeExp’
gapminder_tidy = lifeExp.melt(id_vars=["continent", "country"], var_name="year", value_name="lifeExp") gapminder_tidy.head(n=3)
We get the gapminder data in long form, with each variable as separate columns. However, note that the year column contains the prefix “lifeExp_”.
gapminder_tidy.head() continent country year lifeExp 0 Africa Algeria lifeExp_1952 43.077 1 Africa Angola lifeExp_1952 30.015 2 Africa Benin lifeExp_1952 38.223 3 Africa Botswana lifeExp_1952 47.622 4 Africa Burkina Faso lifeExp_1952 31.975
We can remove the prefix using Pandas str.replace function. And we get the tidy long data frame as we wanted.
gapminder_tidy.year = gapminder_tidy.year.str.replace("lifeExp_","") continent country year lifeExp 0 Africa Algeria 1952 43.077 1 Africa Angola 1952 30.015 2 Africa Benin 1952 38.223 3 Africa Botswana 1952 47.622 4 Africa Burkina Faso 1952 31.975
Reshaping Pandas Dataframe with wide_to_long()
In addition to melt, Pandas also another function called “wide_to_long”. We can use Pandas’ wide_to_long() to reshape the wide dataframe into long/tall dataframe. Another benefit of using Pandas wide_to_long() is that we can easily take care of the prefix in the column names.
We need to specify “stubnames” to extract the prefix from column variable names. In our example, ‘stubnames=[‘lifeExp’]’. We also need to specify how the column names are separated. We also need to specify a column that is unique to each row. IN our example, “i=’country'” is unique to each row. And finally, we need to specify the new variable name we create. In our example “j=’year'”.
lifeExp_tidy = pd.wide_to_long(lifeExp, stubnames=['lifeExp'], i='country', j='year', sep='_')
Now we have reshaped the wide dataframe to long dataframe using Pandas wide_to_long() function.
lifeExp_tidy.head(n=3) continent lifeExp country year Algeria 1952 Africa 43.077 Angola 1952 Africa 30.015 Benin 1952 Africa 38.223
We can reindex to change the multi-indexed column names and get the data in long form.
lifeExp_tidy.reset_index(inplace=True) lifeExp_tidy.head(n=3) country year continent lifeExp 0 Algeria 1952 Africa 43.077 1 Angola 1952 Africa 30.015 2 Benin 1952 Africa 38.223
Under the hood Pandas wide_to_long() uses Pandas’ melt function. However, Pandas’ wide_to_long() is easier to use. For example, the original gapminder dataframe in wide form contain multiple variables over time, lifeExp, gdpPercap, and pop. We can use Pandas’ wide_to_long()’s argument stubnames to specify multiple variables that we want to reshape to long form. For example, to reshape all three variables over time in gapminder dataframe in wide form, we specify the prefixes with stubnames=[‘lifeExp’, ‘gdpPercap’,’pop’].
gapminder_long = pd.wide_to_long(gapminder, ['lifeExp', 'gdpPercap','pop'], i='id', j='year', sep='_') gapminder_long.head()
In a single line using Pandas’ wide_to_long() we have reshaped gapminder dataframe with multiple variables over time to long form.
continent country lifeExp gdpPercap pop id year 0 1952 Africa Algeria 43.077 2449.008185 9279525.0 1 1952 Africa Angola 30.015 3520.610273 4232095.0 2 1952 Africa Benin 38.223 1062.752200 1738315.0 3 1952 Africa Botswana 47.622 851.241141 442308.0 4 1952 Africa Burkina Faso 31.975 543.255241 4469979.0
In summary, we saw multiple examples using Pandas’ melt and Pandas wide_to_long() functions to reshape a dataframe in wide form to long form. If you are interested in reshaping a dataframe in long form or tidy to wide form, check out the post on using pivot_table in Pandas to do that