How to Split Text in a Column in Data Frame in R?

Very often you may have to manipulate a column of text in a data frame with R. You may want to separate a column in to multiple columns in a data frame or you may want to split a column of text and keep only a part of it.

tidyr’s separate function is the best option to separate a column or split a column of text the way you want. Let us see some simple examples of using tidyr’s separate function.

Let us first load the R packages needed to see the examples with separate function.

library(dplyr)
library(tidyr)

Let us create a small data frame with a column of text separated by underscore.

df <- data.frame(file_name = c("1_jan_2018.csv",
                       "2_feb_2018.csv", 
                       "3_mar_2018.csv"))

The data frame contains just single column of file names.

df
       file_name
1 1_jan_2018.csv
2 2_feb_2018.csv
3 3_mar_2018.csv

How to Split a Single Column into Multiple Columns with tidyr’ separate()?

Let us use separate function from tidyr to split the “file_name” column into multiple columns with specific column name. Here, we will specify the column names in a vector.

By default, separate uses regular expression that matches any sequence of non-alphanumeric values as delimiter to split.

df %>% separate(file_name, 
                c("ID", "Month","Year", "rest"))
  ID Month Year rest
1  1   jan 2018  csv
2  2   feb 2018  csv
3  3   mar 2018  csv

In this example, tidyr automatically found that the delimiters are underscore and dot and separted the single column to four columns with the names specified.

Often you want only part of text in a column. Let us see another example of a data frame with column containing text, but this time we specify only three columns for our output.

df %>% separate(file_name,
                c("ID", "Month","Year"))

Note that we provide just three columns in separate function.

 ID Month Year
1  1   jan 2018
2  2   feb 2018
3  3   mar 2018
Warning message:
Too many values at 3 locations: 1, 2, 3 

The output of separate() in this example contains only three column as we specified. And we also see a warning, since we left out the extra element present after separating the text.

We can use argument extra=’drop’ to specify separate to drop anything extra without warning us.

df %>% separate(file_name, 
               c("ID", "Month","Year"),
               extra='drop')
 ID Month Year
1  1   jan 2018
2  2   feb 2018
3  3   mar 2018

Similarly, if we want only the first element after splitting, we can just specify only one column for our output.

df %>% tidyr::separate(file_name, 
                       c("ID"), extra='drop')
  ID
1  1
2  2
3  3

If you want an element that is in the middle after separating with separate, we can use dplyr’s select function select the column needed. For example, if we need the second element ‘Month’, we can combine tidyr’s separate with dplyr’s select.

df %>% 
    tidyr::separate(file_name, 
                       c("ID", "Month","Year"),
                       extra='drop') %>%
    select('Month')
  Month
1   jan
2   feb
3   mar 

unite() to combine multiple columns to a single column

Sometimes you may want to do opposite ehat separate can do, i.e. combine multiple columns into a single column. You guessed it right, tidyr has a cool function to do that. tidyr’s unite() complements separate() and combine multiple columns into a single column.

Let us see an example of unite() combining two columns created by separate(). Here, we first separate a column into three columns and then use unite() to combine the first two columns into a single column.

df %>% 
    tidyr::separate(file_name,                      
                    c("ID", "Month","Year"),
                    extra='drop') %>%
    tidyr::unite('ID_Month',c('ID','Month'))

The output is a dataframe with two columns, where the first column is the result of unite().

  ID_Month Year
1    1_jan 2018
2    2_feb 2018
3    3_mar 2018