Pandas melt() function is a versatile function to reshape Pandas dataframe. Earlier, we saw how to use Pandas melt() function to reshape a wide dataframe into long tidy dataframe, with a simple use case. Often while reshaping dataframe, you might want to reshape part of the columns in your data and keep one or more columns as it it is as identifiers.
In this tutorial, we will examples of Pandas melt() function to reshape a wide dataframe, but with keeping identifiers. We will first create a toy dataframe with single identifier and use melt() to reshape wide to tidy form with single identifier. Next, we will see an example using real world Penguins dataset to reshape wide to tidy form with two identifiers.
Pandas Melt Example with Identifiers on a Toy Dataframe
Let us load the modules needed for simulating a toy dataframe. We use random and string modules to create alphabets as identifiers. We generate random numbers from Scipy.stats module for the dataframe.
import pandas as pd import random import string from scipy.stats import poisson
Let us create the coliumns for the toy dataframe as lists.
id=random.choices(string.ascii_lowercase,k=3) c1= poisson.rvs(mu=10, size=3) c2= poisson.rvs(mu=15, size=3) c3= poisson.rvs(mu=20, size=3)
And use Pandas’ DataFrame() function to create a dataframe from a dictionary.
df=pd.DataFrame({"pID":id, "C1":c1, "C2":c2, "C3":c3})
Our toy dataframe is in wide form and looks like this with one identifier column and three numerical columns.
df pID C1 C2 C3 0 h 13 15 12 1 n 16 16 16 2 w 13 22 18
We can reshape the wide dataframe into a tidy long data frame keeping one identifier with Pandas melt() function. We need to specify which variable we want as identifier using “id_vars” argument to melt() function.
df.melt(id_vars=["pID"],var_name="Sample", value_name="Count")
Now we have reshaped the data frame into tidy form. Now the tidy dataframe has three columns and 8 rows. Note that our identifier column has expanded the rows for each combination.
pID Sample Count 0 h C1 13 1 n C1 16 2 w C1 13 3 h C2 15 4 n C2 16 5 w C2 22 6 h C3 12 7 n C3 16 8 w C3 18
Pandas Melt Example with Two Identifiers using Penguins Data
In the above toy example of using Pandas melt(), we reshaped a wide data frame into a long form with just one identifier. Often while doing real data analysis, you might have multiple columns and would like to keep more than one column as identifiers.
Let us see an example of reshaping a wide dataframe, but this time with real data and keep two columns as identifiers. We will use Penguins dataset to illustrate Pandas melt() usage in our example and load the data from github page.
p2data = "https://raw.githubusercontent.com/cmdlinetips/palmerpenguins/master/data-raw/penguins_raw.csv"
penguins_raw = pd.read_csv(p2data) penguins_raw.columns Index(['studyName', 'Sample Number', 'Species', 'Region', 'Island', 'Stage', 'Individual ID', 'Clutch Completion', 'Date Egg', 'Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'], dtype='object')
As you can see the Penguins data has multiple columns. For the sake of simplicity, let us work with a subset of columns. Here we select two string variables and three numerical variables from the Penguins data.
columns_of_interest = ['Species',"Sex", "Culmen Length (mm)", "Flipper Length (mm)", "Body Mass (g)" ] penguins_df = penguins_raw.loc[:,columns_of_interest]
penguins_df.head() Species Sex Culmen Length (mm) Flipper Length (mm) Body Mass (g) 0 Adelie Penguin (Pygoscelis adeliae) MALE 39.1 181.0 3750.0 1 Adelie Penguin (Pygoscelis adeliae) FEMALE 39.5 186.0 3800.0 2 Adelie Penguin (Pygoscelis adeliae) FEMALE 40.3 195.0 3250.0 3 Adelie Penguin (Pygoscelis adeliae) NaN NaN NaN NaN 4 Adelie Penguin (Pygoscelis adeliae) FEMALE 36.7 193.0 3450.0
We will use the two columns with string variables as identifiers and reshape the remaining numerical columns. To keep two variables as identifiers, we specify the name of columns as argument to “id_vars”. And specify the new variable name for the column names using “var_name” and a variable name for the numerical values with “value_name”
penguins_df.melt(id_vars=["Species","Sex"], var_name="Measurement", value_name="Values")
We get a reshaped dataframe with four columns in tidy form. The first two columns are identifiers and the same as the original dataframe and then we have two columns with reshaped data.
Species Sex Measurement Values 0 Adelie Penguin (Pygoscelis adeliae) MALE Culmen Length (mm) 39.1 1 Adelie Penguin (Pygoscelis adeliae) FEMALE Culmen Length (mm) 39.5 2 Adelie Penguin (Pygoscelis adeliae) FEMALE Culmen Length (mm) 40.3 3 Adelie Penguin (Pygoscelis adeliae) NaN Culmen Length (mm) NaN 4 Adelie Penguin (Pygoscelis adeliae) FEMALE Culmen Length (mm) 36.7 ... ... ... ... ... 1027 Chinstrap penguin (Pygoscelis antarctica) MALE Body Mass (g) 4000.0 1028 Chinstrap penguin (Pygoscelis antarctica) FEMALE Body Mass (g) 3400.0 1029 Chinstrap penguin (Pygoscelis antarctica) MALE Body Mass (g) 3775.0 1030 Chinstrap penguin (Pygoscelis antarctica) MALE Body Mass (g) 4100.0 1031 Chinstrap penguin (Pygoscelis antarctica) FEMALE Body Mass (g) 3775.0 1032 rows × 4 columns
This post is part of the series on Byte Size Pandas: Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.