While doing data wrangling or data manipulation, one often wants to add a new column or variable to an existing Pandas dataframe without changing anything else. The new column must have the same number of elements as the data frame has rows.
Let us see examples of three ways to add new columns to a Pandas data frame.
Let us first load the pandas library.
import pandas as pd
Let us use the gapminder data set to add a new column or new variable in our examples. We will use the gapminder data from the Software Carpentry website, available at data_url below.
data_url = 'http://bit.ly/2cLzoxH'
# load the gapminder data from the web as a data frame
gapminder = pd.read_csv(data_url)
# select four columns
gapminder = gapminder[['country', 'year', 'gdpPercap', 'pop']]
# view the first few rows of the data frame
print(gapminder.head(3))
       country  year   gdpPercap         pop
0  Afghanistan  1952  779.445314   8425333.0
1  Afghanistan  1957  820.853030   9240934.0
2  Afghanistan  1962  853.100710  10267083.0
How To Add New Column to Pandas Dataframe by Indexing: Example 1
Let us say we want to create a new column from an existing column in the data frame. We can create a new column by indexing, using the same square bracket notation we use to access an existing column.
For example, we can create a new column with population values in millions, in addition to the original variable, as follows.
# add new column using square bracket notation
gapminder['pop_in_millions'] = gapminder['pop']/1e06
# view the first few rows, now with the new column
print(gapminder.head(3))
       country  year   gdpPercap         pop  pop_in_millions
0  Afghanistan  1952  779.445314   8425333.0         8.425333
1  Afghanistan  1957  820.853030   9240934.0         9.240934
2  Afghanistan  1962  853.100710  10267083.0        10.267083
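To make the length requirement concrete, here is a minimal sketch on a small made-up data frame (the toy values and the column names `source` and `bad` are assumptions for illustration, not part of the gapminder data). It shows that a Series derived from an existing column fits automatically, a scalar is repeated for every row, and a list of the wrong length is rejected.

```python
import pandas as pd

# Toy data invented for illustration (not the gapminder values).
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "pop": [1_000_000.0, 2_500_000.0, 4_000_000.0],
})

# A Series computed from an existing column has the same length and index.
df["pop_in_millions"] = df["pop"] / 1e6

# A scalar value is broadcast to every row.
df["source"] = "toy"

# A list whose length differs from the number of rows raises ValueError.
try:
    df["bad"] = [1, 2]
    length_error = None
except ValueError as err:
    length_error = str(err)

print(df)
print("Error for mismatched length:", length_error)
```

If the lengths differ, the bracket assignment fails rather than silently padding or truncating the values.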
How To Add New Column to Pandas Dataframe using loc: Example 2
Another way to add a new column to a dataframe is to use the “loc” indexer. Here we select all rows with “:”, name the new column, and assign its values.
gapminder.loc[:,'pop_in_millions'] = gapminder['pop']/1e06
gapminder.head(3)
       country  year   gdpPercap         pop  pop_in_millions
0  Afghanistan  1952  779.445314   8425333.0         8.425333
1  Afghanistan  1957  820.853030   9240934.0         9.240934
2  Afghanistan  1962  853.100710  10267083.0        10.267083
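As a quick check that the bracket and loc approaches give the same result for a whole new column, the following sketch applies both to copies of a small made-up data frame (the toy values and the names `df_bracket` and `df_loc` are hypothetical) and compares the results.

```python
import pandas as pd

# Toy data invented for illustration.
df_bracket = pd.DataFrame({"pop": [1_000_000.0, 2_500_000.0]})
df_loc = df_bracket.copy()

# Square bracket notation.
df_bracket["pop_in_millions"] = df_bracket["pop"] / 1e6

# loc with ":" for all rows and the new column label.
df_loc.loc[:, "pop_in_millions"] = df_loc["pop"] / 1e6

# Both data frames should now hold the same columns and values.
same = df_bracket.equals(df_loc)
print(same)
```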
How To Add New Column to Pandas Dataframe using assign: Example 3
Inspired by dplyr’s mutate function in R, Pandas provides the “assign” method (available since version 0.16) to add new columns. Because “assign” returns a data frame, we can simply chain it with other methods.
gapminder.assign(pop_in_millions=gapminder['pop']/1e06).head(3)
       country  year   gdpPercap         pop  pop_in_millions
0  Afghanistan  1952  779.445314   8425333.0         8.425333
1  Afghanistan  1957  820.853030   9240934.0         9.240934
2  Afghanistan  1962  853.100710  10267083.0        10.267083
It returns a copy of the data frame as a new object with the new columns added; the original data frame is left unchanged unless you assign the result back to it. Remember that if you use the name of an existing column, that column will be overwritten in the returned data frame.
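The copy behaviour can be seen in a minimal sketch on a made-up data frame (the toy values and the variable names `result` and `overwritten` are assumptions for illustration).

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({"pop": [1_000_000.0, 2_500_000.0]})

# assign returns a new data frame; df itself is not modified.
result = df.assign(pop_in_millions=df["pop"] / 1e6)
print(list(df.columns))      # original columns only
print(list(result.columns))  # original columns plus the new one

# Reusing an existing column name replaces that column in the returned copy.
overwritten = df.assign(pop=df["pop"] / 1e6)
print(overwritten)

# To keep the new column, assign the result back to a name.
df = result
```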
With the assign method, we can also pass a function that computes the new column. The function receives the data frame being assigned to as its argument. Here we use a lambda function to create the new column with population in millions.
gapminder.assign(pop_in_millions=lambda x: x['pop']/1e06).head()
With Python 3.6+ (and pandas 0.23 or later), the keyword arguments to assign are evaluated in the order they are written, so one new column can use another column created earlier in the same assign statement. The later column must be given as a function, such as a lambda, so that it is computed on the data frame that already contains the earlier column.
For example, we can create two new variables such that the second new variable uses the first new column as shown below.
gapminder.assign(pop_in_millions=lambda x: x['pop']/1e6,
                 pop_in_billions=lambda x: x['pop_in_millions']/1e3).head()
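The sketch below, on a small made-up data frame (toy values, hypothetical variable names), illustrates why the second column should be a function: a lambda is evaluated after the first new column exists, whereas a plain expression on the original data frame is evaluated before assign runs and cannot see the new column.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({"pop": [1_000_000.0, 2_000_000_000.0]})

# Each lambda receives the data frame as built so far, so the second
# one can read the column created by the first.
result = df.assign(
    pop_in_millions=lambda x: x["pop"] / 1e6,
    pop_in_billions=lambda x: x["pop_in_millions"] / 1e3,
)
print(result)

# Without a lambda, df["pop_in_millions"] is looked up on the original
# data frame before assign is called, where the column does not exist.
try:
    df.assign(
        pop_in_millions=df["pop"] / 1e6,
        pop_in_billions=df["pop_in_millions"] / 1e3,
    )
    missing_column = False
except KeyError:
    missing_column = True
print("KeyError without lambda:", missing_column)
```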