How To Collapse Multiple Text Columns in Dataframe Using Tidyverse?

Often you may have a data frame, where multiple columns are related and you may want to combine those related columns into a single column. In an earlier post, we saw how we can collapse a numerical data frame with related columns using Python.

In this post, we consider the problem of collapsing or combining multiple related text columns using tidyverse in R. A visual demonstration of the question showing a data frame before and after collapsing is depicted in the figure below.

How To Collapse Text Columns Tidyverse?
How To Collapse Text Columns Tidyverse?

Let us make a toy data frame with six columns and a column for IDs. The every other column is related, for example the columns S1_1 and S1_2 are related and S2_1 and S2_2 are related. Notice that each column has text/string. We would like to collapse or combine two similar columns. For example, we would like to collapse ‘A’ and ‘T’ in columns s_A and s_B into ‘AT’.

# load tidyverse
library(tidyverse)
# create row/sample IDs 
row_id <- c("s_A", "s_B", "s_C", "s_D", "s_E")
# create column variables
S1_1 <- c('A', 'T', 'G', 'C', 'A')
S1_2 <- c('G', 'C', 'A', 'T', 'A')
S2_1 <- c('A', 'T', 'G', 'C', 'A')
S2_2 <- c('G', 'T', 'G', 'T', 'A')
S3_1 <- c('G', 'C', 'A', 'T', 'A')
S3_2 <- c('A', 'C', 'G', 'C', 'A')

We have all the variables needed to create a dataframe. Now we can use data.frame function and
make a new data frame df.

# create data frame 
df <- data.frame(row_id, S1_1, S1_2, S2_1, S2_2, S3_1, S3_2,
                 stringsAsFactors=FALSE)
# view the data frame
df 

Let us use tidyverse, mainly functions from the packages tidyr and dplyr to collapse/combine multiple columns. At first we need to make our data frame tidy. When the data is “tidy”,

  • Each variable is in a column
  • Each observation is a row.
  • Each value is a cell.

We will start with the dataframe df and use magritter or pipe to give the dataframe content to tidyr’s gather function to make the data long from the wide form.

df %>%
  gather("col_ID", "Value",-row_id) %>%
  head(n=3)

Once we have the data in the long form, we can select certain columns and rename it for convenience. Here, we identify the group IDs of columns that are related using separate function in tidyr. We are splitting the column IDs by under score and keeping the ID representing the group.

df %>%
  gather("col_ID", "Value",-row_id) %>%
  separate(col_ID,c("col_group")) %>%
  head(n=3)

Now we can use group_by function on the group IDs, so that we combine with in each group.

df %>%
  gather("col_ID", "Value",-row_id) %>%
  separate(col_ID,c("col_group")) %>%
  group_by(row_id, col_group) %>%
  head(n=3)

Here we use summarize function to combine values from related columns using paste0.

df %>%
  gather("col_ID", "Value",-row_id) %>%
  separate(col_ID,c("col_group")) %>%
  group_by(row_id, col_group) %>%
  summarize(new_value=paste0(Value,collapse = "")) %>%
  head

Now we have the output we needed, but in long form, so we can use spread function from tidyr to reshape our data frame into wide form.

df %>%
  gather("col_ID", "Value",-row_id) %>%
  separate(col_ID,c("col_group")) %>%
  group_by(row_id, col_group) %>%
  summarize(new_value=paste0(Value,collapse = "")) %>%
  spread(col_group,new_value) %>%
  head

Voila.. we have the data frame we wanted! Tune in for another example with collapsing a data frame with numerical columns using tidyverse.