Often while doing data analysis, one might want to add a new column, or multiple columns, to an existing data frame. In this post we will learn how to add one or more columns to a dataframe in R. The tibble package in the tidyverse has a lesser-known but powerful function, add_column(). We will learn 7 tips for using add_column() to add one or more columns in the right place while making sure we don’t overwrite an existing column.
Let us first load tidyverse and create a simple data frame using the tibble() function.
library(tidyverse)
df <- tibble(x=1:5,y=5:1)
Our simple dataframe contains two columns named x and y.
df
## # A tibble: 5 x 2
##       x     y
##   <int> <int>
## 1     1     5
## 2     2     4
## 3     3     3
## 4     4     2
## 5     5     1
1. How To Add a New Column?
We can add a new column to a dataframe with the add_column() function, passing the new column as a name-value argument. In this example, we add a new column named “z”. By default, the new column is placed after the last existing column.
df %>% add_column(z=-2:2)
## # A tibble: 5 x 3
##       x     y     z
##   <int> <int> <int>
## 1     1     5    -2
## 2     2     4    -1
## 3     3     3     0
## 4     4     2     1
## 5     5     1     2
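One detail worth seeing in code is that add_column() returns a new tibble rather than changing df in place, so the result has to be assigned if we want to keep it. Here is a minimal sketch of that, calling add_column() directly instead of through the pipe; the name df_z is just an illustrative choice.

```r
library(tibble)

df <- tibble(x = 1:5, y = 5:1)

# add_column() returns a new tibble; df itself is left unchanged
df_z <- add_column(df, z = -2:2)

names(df)    # columns of the original dataframe
names(df_z)  # columns of the new dataframe, including z
```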
2. How to Add a column before another column?
add_column() also lets us specify where to add the new column. For example, we can add the new column just before an existing column using the “.before” argument.
df %>% add_column(before_y=-2:2, .before="y")
## # A tibble: 5 x 3
##       x before_y     y
##   <int>    <int> <int>
## 1     1       -2     5
## 2     2       -1     4
## 3     3        0     3
## 4     4        1     2
## 5     5        2     1
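Besides a column name, the “.before” argument is documented as also accepting a one-based column index. A small sketch of that, using a made-up column called id to put a new column first in the dataframe:

```r
library(tibble)

df <- tibble(x = 1:5, y = 5:1)

# .before = 1 places the new column ahead of the current first column
# ("id" is a hypothetical column name used only for this illustration)
df_first <- add_column(df, id = 101:105, .before = 1)

names(df_first)
```

Using a position can be handy when we want the new column first regardless of what the existing columns are called.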
3. How to Add a column after another column?
Similar to the “.before” argument, add_column() also has an “.after” argument, which we can use to add a column after another specific column.
In this example, we add a new column after the “x” column by passing .after="x" to add_column().
df %>% add_column(after_x=-2:2, .after="x")
## # A tibble: 5 x 3
##       x after_x     y
##   <int>   <int> <int>
## 1     1      -2     5
## 2     2      -1     4
## 3     3       0     3
## 4     4       1     2
## 5     5       2     1
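In our two-column dataframe, “after x” and “before y” describe the same position, so the two arguments should give the same column order here. The sketch below checks that, and also probes what happens when both are supplied in one call. As far as we know tibble rejects that combination, but treat it as an assumption to verify on your installed version. The column name mid is made up for illustration.

```r
library(tibble)

df <- tibble(x = 1:5, y = 5:1)

via_after  <- add_column(df, mid = -2:2, .after = "x")
via_before <- add_column(df, mid = -2:2, .before = "y")

# Both calls should produce the same column order
identical(names(via_after), names(via_before))

# Supplying both arguments at once: record whether an error is raised
both_fail <- tryCatch(
  {
    add_column(df, mid = -2:2, .before = "y", .after = "x")
    FALSE
  },
  error = function(e) TRUE
)
both_fail
```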
4. How to Add a column with same values?
Often you might need to add a new column with the same value repeated for each row. With add_column() we add the column as in the previous examples, but this time we specify the value we would like to repeat just once. A single value is recycled to every row, so we don’t need to build a vector that repeats it.
Here, we add a new column called “batch_id” with the value “batch1” repeated for all the rows.
df %>% add_column(batch_id="batch1")
## # A tibble: 5 x 3
##       x     y batch_id
##   <int> <int> <chr>
## 1     1     5 batch1
## 2     2     4 batch1
## 3     3     3 batch1
## 4     4     2 batch1
## 5     5     1 batch1
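To make the recycling concrete, here is a short sketch comparing the single-value form with the long-hand version that builds the repeated vector using rep(). The names recycled and explicit are only illustrative.

```r
library(tibble)

df <- tibble(x = 1:5, y = 5:1)

# Single value, recycled by add_column()
recycled <- add_column(df, batch_id = "batch1")

# Same column built by hand with rep()
explicit <- add_column(df, batch_id = rep("batch1", nrow(df)))

# The resulting columns hold the same values
identical(recycled$batch_id, explicit$batch_id)
```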
5. How To Add multiple columns?
To add multiple columns, we specify each column that we would like to add, separated by commas, as shown below.
df %>% add_column(z=-2:2, batch_id="batch1")
We have added two columns with add_column() function.
## # A tibble: 5 x 4
##       x     y     z batch_id
##   <int> <int> <int> <chr>
## 1     1     5    -2 batch1
## 2     2     4    -1 batch1
## 3     3     3     0 batch1
## 4     4     2     1 batch1
## 5     5     1     2 batch1
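Adding several columns can be combined with the placement arguments from the earlier tips. In the sketch below, both new columns are inserted together, in the order given, right after “x”. The name df_multi is just an illustrative choice.

```r
library(tibble)

df <- tibble(x = 1:5, y = 5:1)

# Both new columns go in as a block immediately after x
df_multi <- add_column(df, z = -2:2, batch_id = "batch1", .after = "x")

names(df_multi)
```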
6. How to Avoid Adding Duplicate Columns?
One of the concerns while adding a new column is that we might overwrite an existing column with the same name. The add_column() function offers multiple options to deal with duplicate columns.
For example, suppose we try to add a column whose name already exists in the dataframe:
df %>% add_column(x=-2:2)
By default, we get an error telling us that the new column cannot be a duplicate. In this case, we already have a column named “x” and we are trying to add another column with the name “x”.
Error: Column name `x` must not be duplicated.
Run `rlang::last_error()` to see where the error occurred.
However, sometimes you might want to add the new column anyway and have the duplicate names handled for you. The add_column() function has a “.name_repair” argument for this, which can take the values “check_unique” (the default), “unique”, “universal” and “minimal”.
Here, when we specify .name_repair = "universal", add_column() changes the column names to make them distinct.
df %>% add_column(x=-2:2, .name_repair = "universal")
add_column() prints a message telling us that it is changing the column names.
## New names:
## * x -> x...1
## * x -> x...3
Now, we can see that the first column with name “x” is called “x...1” and the one we just added is named “x...3”.
## # A tibble: 5 x 3
##   x...1     y x...3
##   <int> <int> <int>
## 1     1     5    -2
## 2     2     4    -1
## 3     3     3     0
## 4     4     2     1
## 5     5     1     2
7. Dealing with more/less observations in the new column
Another useful functionality of add_column() is that it guards us against adding a new column whose length differs from the number of rows of the dataframe.
For example, suppose we try to add a column with 6 elements to a dataframe with 5 rows:
df %>% add_column(z=-2:3)
We get an error telling us that the sizes do not match:
Error: New columns must be compatible with `.data`.
x New column has 6 rows.
i `.data` has 5 rows.
Run `rlang::last_error()` to see where the error occurred.
Also, we will get a similar error if we try to add a column with fewer elements than the number of rows of the dataframe. The one exception is a single value, which is recycled to every row as in tip 4.
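The three cases can be checked side by side. In this sketch, tryCatch() turns a failed call into NULL so that the script keeps running; the names too_long, too_short and single are only illustrative.

```r
library(tibble)

df <- tibble(x = 1:5, y = 5:1)

# 6 values for 5 rows: the call fails, so we get NULL back
too_long <- tryCatch(add_column(df, z = -2:3), error = function(e) NULL)

# 2 values for 5 rows: also fails
too_short <- tryCatch(add_column(df, z = 1:2), error = function(e) NULL)

# A single value is recycled to all 5 rows
single <- add_column(df, z = 0L)

is.null(too_long)
is.null(too_short)
nrow(single)
```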