Pandas Melt: Reshape Wide to Tidy with identifiers

Pandas Melt Example with Identifiers
Pandas Melt Example with Identifiers

Pandas melt() function is a versatile function to reshape Pandas dataframe. Earlier, we saw how to use Pandas melt() function to reshape a wide dataframe into long tidy dataframe, with a simple use case. Often while reshaping dataframe, you might want to reshape part of the columns in your data and keep one or more columns as it it is as identifiers.

In this tutorial, we will examples of Pandas melt() function to reshape a wide dataframe, but with keeping identifiers. We will first create a toy dataframe with single identifier and use melt() to reshape wide to tidy form with single identifier. Next, we will see an example using real world Penguins dataset to reshape wide to tidy form with two identifiers.

Pandas Melt Example with Identifiers

Pandas Melt Example with Identifiers on a Toy Dataframe

Let us load the modules needed for simulating a toy dataframe. We use random and string modules to create alphabets as identifiers. We generate random numbers from Scipy.stats module for the dataframe.

import pandas as pd
import random
import string    
from scipy.stats import poisson

Let us create the coliumns for the toy dataframe as lists.

id=random.choices(string.ascii_lowercase,k=3)
c1= poisson.rvs(mu=10, size=3)
c2= poisson.rvs(mu=15, size=3)
c3= poisson.rvs(mu=20, size=3)

And use Pandas’ DataFrame() function to create a dataframe from a dictionary.

df=pd.DataFrame({"pID":id,
                 "C1":c1,
                 "C2":c2,
                 "C3":c3})

Our toy dataframe is in wide form and looks like this with one identifier column and three numerical columns.

df

     pID	C1	C2	C3
0	h	13	15	12
1	n	16	16	16
2	w	13	22	18

We can reshape the wide dataframe into a tidy long data frame keeping one identifier with Pandas melt() function. We need to specify which variable we want as identifier using “id_vars” argument to melt() function.

df.melt(id_vars=["pID"],var_name="Sample", value_name="Count")

Now we have reshaped the data frame into tidy form. Now the tidy dataframe has three columns and 8 rows. Note that our identifier column has expanded the rows for each combination.


pID	Sample	Count
0	h	C1	13
1	n	C1	16
2	w	C1	13
3	h	C2	15
4	n	C2	16
5	w	C2	22
6	h	C3	12
7	n	C3	16
8	w	C3	18

Pandas Melt Example with Two Identifiers using Penguins Data

In the above toy example of using Pandas melt(), we reshaped a wide data frame into a long form with just one identifier. Often while doing real data analysis, you might have multiple columns and would like to keep more than one column as identifiers.

Let us see an example of reshaping a wide dataframe, but this time with real data and keep two columns as identifiers. We will use Penguins dataset to illustrate Pandas melt() usage in our example and load the data from github page.

p2data = "https://raw.githubusercontent.com/cmdlinetips/palmerpenguins/master/data-raw/penguins_raw.csv"
penguins_raw = pd.read_csv(p2data)
penguins_raw.columns
Index(['studyName', 'Sample Number', 'Species', 'Region', 'Island', 'Stage',
       'Individual ID', 'Clutch Completion', 'Date Egg', 'Culmen Length (mm)',
       'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex',
       'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'],
      dtype='object')

As you can see the Penguins data has multiple columns. For the sake of simplicity, let us work with a subset of columns. Here we select two string variables and three numerical variables from the Penguins data.

columns_of_interest = ['Species',"Sex", "Culmen Length (mm)", 
                       "Flipper Length (mm)", 
                       "Body Mass (g)" ]
penguins_df = penguins_raw.loc[:,columns_of_interest]
penguins_df.head()

Species	Sex	Culmen Length (mm)	Flipper Length (mm)	Body Mass (g)
0	Adelie Penguin (Pygoscelis adeliae)	MALE	39.1	181.0	3750.0
1	Adelie Penguin (Pygoscelis adeliae)	FEMALE	39.5	186.0	3800.0
2	Adelie Penguin (Pygoscelis adeliae)	FEMALE	40.3	195.0	3250.0
3	Adelie Penguin (Pygoscelis adeliae)	NaN	NaN	NaN	NaN
4	Adelie Penguin (Pygoscelis adeliae)	FEMALE	36.7	193.0	3450.0

We will use the two columns with string variables as identifiers and reshape the remaining numerical columns. To keep two variables as identifiers, we specify the name of columns as argument to “id_vars”. And specify the new variable name for the column names using “var_name” and a variable name for the numerical values with “value_name”

penguins_df.melt(id_vars=["Species","Sex"],
        var_name="Measurement", 
        value_name="Values")

We get a reshaped dataframe with four columns in tidy form. The first two columns are identifiers and the same as the original dataframe and then we have two columns with reshaped data.

Species	Sex	Measurement	Values
0	Adelie Penguin (Pygoscelis adeliae)	MALE	Culmen Length (mm)	39.1
1	Adelie Penguin (Pygoscelis adeliae)	FEMALE	Culmen Length (mm)	39.5
2	Adelie Penguin (Pygoscelis adeliae)	FEMALE	Culmen Length (mm)	40.3
3	Adelie Penguin (Pygoscelis adeliae)	NaN	Culmen Length (mm)	NaN
4	Adelie Penguin (Pygoscelis adeliae)	FEMALE	Culmen Length (mm)	36.7
...	...	...	...	...
1027	Chinstrap penguin (Pygoscelis antarctica)	MALE	Body Mass (g)	4000.0
1028	Chinstrap penguin (Pygoscelis antarctica)	FEMALE	Body Mass (g)	3400.0
1029	Chinstrap penguin (Pygoscelis antarctica)	MALE	Body Mass (g)	3775.0
1030	Chinstrap penguin (Pygoscelis antarctica)	MALE	Body Mass (g)	4100.0
1031	Chinstrap penguin (Pygoscelis antarctica)	FEMALE	Body Mass (g)	3775.0
1032 rows × 4 columns

This post is part of the series on Byte Size Pandas: Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.