Often you may want to collapse two or multiple columns in a Pandas data frame into one column. For example, you may have a data frame with data for each year as columns and you might want to get a new column which summarizes multiple columns. One may need to have flexibility of collapsing columns of interest into one. If you guess, this is kind of “groupby operation” you are right.
Let us see an pictorial example of what we aim to do. Here we have a data frame with five columns and four of them are data from first two weeks of Jan and Feb. We want to create a new dataframe with just two columns one for Jan and the other for Feb, i.e combining the weekly data to monthly data.
Let us see a small example of collapsing columns of Pandas dataframe by combining multiple columns into one.
Let us first load NumPy and Pandas.
import numpy as np import pandas as pd
We will use NumPy’s random module to create random data and use them to create a pandas data frame.
n = 3 Jan_week1 = np.random.randint(100, size=n).tolist() Jan_week2 = np.random.randint(100, size=n).tolist() Feb_week1 = np.random.randint(100, size=n).tolist() Feb_week2 = np.random.randint(100, size=n).tolist()
sample_id = zip(["S"]*n,list(range(1,n+1))) s_names=[''.join([w[0],str(w[1])]) for w in sample_id] s_names
Now we have all the columns we needed as lists. Let us convert multiple lists into a single data frame, first by creating a dictionary for each list with a name.
d = {'s_names':s_names, 'Jan_week1':Jan_week1,'Jan_week2':Jan_week2, 'Feb_week1':Feb_week1, 'Feb_week2':Feb_week2}
d {'Feb_week1': [32, 20, 38], 'Feb_week2': [68, 7, 82], 'Jan_week1': [8, 21, 65], 'Jan_week2': [42, 33, 2], 's_names': ['S1', 'S2', 'S3']}
Then use Pandas dataframe into dict. Now we get a data frame with four columns of data and one column for names.
df_reps = pd.DataFrame(d) df_reps Feb_week1 Feb_week2 Jan_week1 Jan_week2 s_names 0 32 68 8 42 S1 1 20 7 21 33 S2 2 38 82 65 2 S3
How to Collapse/Combine Columns in Pandas Data Frame?
Now we have data frame with multiple columns and we want to collapse or combine multiple columns using specific rule. Ideally, we would like to clearly specify which columns we want to combine or collapse. We can do that by specifying the mapping as a dictionary, where the keys are the names of columns we would like to combine and the values are the names of resulting column.
mapping = {'Jan_week1':'Jan','Jan_week2':'Jan', 'Feb_week1':'Feb', 'Feb_week2':'Feb'} mapping {'Feb_week1': 'Feb', 'Feb_week2': 'Feb', 'Jan_week1': 'Jan', 'Jan_week2': 'Jan'}
We can use the mapping dictionary with in groupby function and specify axis=1 to groupby columns. Since we only want to collapse multiple columns of data not the “name” column, we first set it as row index and reset it later.
df =df_reps.set_index('s_names').groupby(mapping, axis=1).sum() df.reset_index(level=0) s_names Feb Jan 0 S1 100 50 1 S2 27 54 2 S3 120 67
Now we have successfully combined multiple columns and have the collapsed data frame we wanted.
In the above example we collapsed multiple numerical columns into a single column. If you want to collapse multiple text columns into a single, check this post