Python and R Tips

Learn Data Science with Python and R

How To Categorize Multiple Numerical Columns in R

April 2, 2021 by cmdlinetips

Recently I had to convert a numerical matrix into a categorical one based on some conditions. There are, of course, multiple ways to go about it. One of the key functions for categorizing a numerical vector in R is cut(), which lets you specify the intervals used to bin a numerical variable. Until now I was mainly using tidyr’s pivot_longer() and pivot_wider() together with cut() to categorize multiple numerical columns into categorical ones. Then I finally remembered dplyr’s across() function, which supports column-wise operations in dplyr, and used it to convert multiple numerical columns into categorical columns. Here is a quick post showing how to do that, before I forget, for my future self 🙂

Let us start with loading tidyverse.

library(tidyverse)

The cartoon below illustrates the gist of the problem: the starting point is a dataframe with multiple numerical columns, and the output is another dataframe in which the numerical columns are categorized based on some conditions.

Categorize Multiple Columns

Let us create a simple dataframe with three numerical columns.

set.seed(2021)
df <- tibble(id=paste0(rep(letters[1:5],5)),
             x1 = rnorm(25,mean=20,sd=10),
             x2 = rnorm(25,mean=15,sd=5),
             x3 = rnorm(25,mean=12,sd=6))
df

In this simple illustration we also have a character variable, id. Note that its values are not unique: the five letters a to e repeat down the rows.

df
## # A tibble: 25 x 4
##    id        x1    x2    x3
##    <chr>  <dbl> <dbl> <dbl>
##  1 a     18.8   15.5   9.00
##  2 b     25.5    7.72 -1.54
##  3 c     23.5   13.2  12.3 
##  4 d     23.6   14.5   9.79
##  5 e     29.0   20.5   6.24
##  6 a      0.774  5.18 12.6 
##  7 b     22.6    7.76 14.6 
##  8 c     29.2   20.1  11.0 
##  9 d     20.1    7.89  2.71
## 10 e     37.3   12.0   2.97
## # … with 15 more rows

Categorizing a numerical vector with cut()

We can categorize a single vector using the cut() function. The versatile cut() function takes a vector, a specification of how to categorize it, and optional labels to name the category levels. In this example, we categorize a single numerical vector into three categories: low, middle, and high. We specify the intervals for low, middle and high using the breaks argument. By default the intervals are closed on the right, so the breaks below give (-Inf,10], (10,20] and (20,Inf].

df %>%
  pull(x1) %>%
  cut(breaks=c(-Inf,10,20,Inf), 
      labels=c("low", "middle", "high"))

##  [1] middle high   high   high   high   low    high   high   high   high  
## [11] low    middle high   high   high   low    high   high   high   high  
## [21] middle middle low    high   low   
## Levels: low middle high
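To see how values that fall exactly on a break are handled, here is a minimal sketch in base R. The vector x is made up for illustration and is not part of the data above; the right argument of cut() decides which side of each interval is closed.

```r
# Hypothetical values chosen to sit on and around the break points
x <- c(5, 10, 15, 20, 25)
brks <- c(-Inf, 10, 20, Inf)
labs <- c("low", "middle", "high")

# Default (right = TRUE): intervals are (-Inf,10], (10,20], (20,Inf]
right_closed <- cut(x, breaks = brks, labels = labs)

# right = FALSE: intervals are [-Inf,10), [10,20), [20,Inf)
left_closed <- cut(x, breaks = brks, labels = labs, right = FALSE)

# Compare the two results side by side
print(data.frame(x, right_closed, left_closed))
```

With the default, 10 lands in low and 20 in middle; with right = FALSE they move up to middle and high respectively.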

Categorizing Multiple numerical columns with pivot_longer(), cut() and pivot_wider()

To convert multiple numerical columns with base R, we can use the apply() function to run cut() on each column. However, a disadvantage is that apply() works on a matrix: a dataframe is coerced to a matrix first, so a dataframe that also contains character columns does not work as is.
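A minimal sketch of the base R route, assuming the data is already a purely numerical matrix. The matrix m, its seed and its column names are made up for illustration. The result of cut() is converted to character inside the function, since apply() returns a plain matrix and cannot hold factors.

```r
set.seed(1)  # arbitrary seed, only for this sketch
# Hypothetical 4 x 3 numerical matrix
m <- matrix(rnorm(12, mean = 15, sd = 8), nrow = 4,
            dimnames = list(NULL, c("x1", "x2", "x3")))

# MARGIN = 2 applies the function to each column
cat_m <- apply(m, 2, function(col) {
  as.character(cut(col, breaks = c(-Inf, 10, 20, Inf),
                   labels = c("low", "middle", "high")))
})

cat_m
```

The output is a character matrix with the same dimensions and column names as m.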

With tidyverse, we can categorize multiple numerical columns in a dataframe that also contains other types of variables.

One way to do this is to reshape the dataframe into tidy (long) form with pivot_longer() first, then categorize the numerical values using cut(), and finally reshape the result back into the original wide form.

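Here is the first step, converting the wide dataframe into long form using pivot_longer(), available since tidyr 1.0.0. Note that the outputs from here on show 26 rows with a different letter in id for each row (hence 78 = 26 x 3 rows in long form), not the 25-row df printed above.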

df %>% 
  pivot_longer(-id,
               names_to = "vars", values_to="groups") 
## # A tibble: 78 x 3
##    id    vars  groups
##    <chr> <chr>  <dbl>
##  1 a     x1     18.8 
##  2 a     x2      7.72
##  3 a     x3     12.3 
##  4 b     x1     25.5 
##  5 b     x2     13.2 
##  6 b     x3      9.79
##  7 c     x1     23.5 
##  8 c     x2     14.5 
##  9 c     x3      6.24
## 10 d     x1     23.6 
## # … with 68 more rows

Next we use mutate() with cut() to convert all the numerical values into categories at once.

df %>% 
  pivot_longer(-id,
               names_to = "vars", values_to="groups") %>%
  mutate(groups=cut(groups,breaks=c(-Inf,10,20,Inf), 
      labels=c("low", "middle", "high")))
## # A tibble: 78 x 3
##    id    vars  groups
##    <chr> <chr> <fct> 
##  1 a     x1    middle
##  2 a     x2    low   
##  3 a     x3    middle
##  4 b     x1    high  
##  5 b     x2    middle
##  6 b     x3    low   
##  7 c     x1    high  
##  8 c     x2    middle
##  9 c     x3    low   
## 10 d     x1    high  
## # … with 68 more rows

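Finally, use pivot_wider() to reshape the tidy data back into the original wide form. This step relies on id identifying each row uniquely; with repeated ids, pivot_wider() warns and returns list-columns.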

df %>% 
  pivot_longer(-id,
               names_to = "vars", values_to="groups") %>%
  mutate(groups=cut(groups,breaks=c(-Inf,10,20,Inf), 
      labels=c("low", "middle", "high"))) %>%
  pivot_wider(names_from = vars, values_from = groups)
## # A tibble: 26 x 4
##    id    x1     x2     x3    
##    <chr> <fct>  <fct>  <fct> 
##  1 a     middle low    middle
##  2 b     high   middle low   
##  3 c     high   middle low   
##  4 d     high   high   middle
##  5 e     high   low    middle
##  6 f     low    low    middle
##  7 g     high   high   low   
##  8 h     high   low    low   
##  9 i     high   middle middle
## 10 j     high   low    middle
## # … with 16 more rows
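If the ids repeat, as with the five letters in the df created at the top, one possible workaround is to add a row index before reshaping and drop it afterwards. The sketch below uses a made-up tibble called toy and a helper column called row; both names are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in which the id values repeat
toy <- tibble(id = c("a", "b", "a", "b"),
              x1 = c(5, 15, 25, 8),
              x2 = c(12, 30, 9, 18))

toy_cat <- toy %>%
  mutate(row = row_number()) %>%          # helper key that is unique per row
  pivot_longer(-c(id, row),
               names_to = "vars", values_to = "groups") %>%
  mutate(groups = cut(groups, breaks = c(-Inf, 10, 20, Inf),
                      labels = c("low", "middle", "high"))) %>%
  pivot_wider(names_from = vars, values_from = groups) %>%
  arrange(row) %>%                        # keep the original row order
  select(-row)                            # drop the helper key again

toy_cat
```

Because id and row together identify each row, pivot_wider() can rebuild one output row per input row.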

How to Categorize Multiple numerical columns using column-wise function across()?

Starting from dplyr 1.0.0, we can easily perform column-wise operations using the across() function.

Here we first call across(), selecting the numerical columns with where(is.numeric), and then use the cut() function to categorize each column as before.

df %>%
  mutate(across(where(is.numeric), 
                ~ cut(.x, breaks=c(-Inf,10,20,Inf), 
                      labels=c("low", "middle", "high")))) 

With fewer lines of code we have categorized multiple numerical columns at once using across().

## # A tibble: 26 x 4
##    id    x1     x2     x3    
##    <chr> <fct>  <fct>  <fct> 
##  1 a     middle low    middle
##  2 b     high   middle low   
##  3 c     high   middle low   
##  4 d     high   high   middle
##  5 e     high   low    middle
##  6 f     low    low    middle
##  7 g     high   high   low   
##  8 h     high   low    low   
##  9 i     high   middle middle
## 10 j     high   low    middle
## # … with 16 more rows
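The call above overwrites the numerical columns. If you would rather keep the original values next to the categories, a minimal sketch is to use the .names argument of across(). The tibble toy and the "_cat" suffix are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with two numerical columns
toy <- tibble(id = c("a", "b", "c"),
              x1 = c(5, 15, 25),
              x2 = c(12, 8, 30))

toy_both <- toy %>%
  mutate(across(where(is.numeric),
                ~ cut(.x, breaks = c(-Inf, 10, 20, Inf),
                      labels = c("low", "middle", "high")),
                .names = "{.col}_cat"))   # write results to new columns

toy_both
```

Here x1 and x2 stay numerical, and the new factor columns x1_cat and x2_cat are appended at the end.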

As one might guess, using across() to categorize multiple columns also seems to be faster. A quick runtime estimate from 500 reps using Sys.time() shows that across() is faster for this small dataframe example.

Pseudo-code to compute the runtime of each method:

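run_time <- numeric(500)  # one timing (in seconds) per repetition
for (i in seq_along(run_time)) {
    start_time <- Sys.time()
    # code to categorize by method A
    end_time <- Sys.time()
    # fix the units so that timings are comparable across repetitions
    run_time[i] <- as.numeric(difftime(end_time, start_time, units = "secs"))
}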
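The loop above can be fleshed out into a runnable comparison roughly as follows. This is only a sketch under stated assumptions: the test data dat uses a unique letter per row so that the pivot_wider() step works, and by_pivot(), by_across() and time_reps() are hypothetical helper names introduced here. Timings depend on the machine, so no numbers are shown.

```r
library(dplyr)
library(tidyr)

set.seed(2021)
# Assumed test data: 26 rows with a unique id per row
dat <- tibble(id = letters,
              x1 = rnorm(26, mean = 20, sd = 10),
              x2 = rnorm(26, mean = 15, sd = 5),
              x3 = rnorm(26, mean = 12, sd = 6))

brks <- c(-Inf, 10, 20, Inf)
labs <- c("low", "middle", "high")

# Method A: reshape to long form, categorize, reshape back
by_pivot <- function(d) {
  d %>%
    pivot_longer(-id, names_to = "vars", values_to = "groups") %>%
    mutate(groups = cut(groups, breaks = brks, labels = labs)) %>%
    pivot_wider(names_from = vars, values_from = groups)
}

# Method B: column-wise cut() with across()
by_across <- function(d) {
  d %>%
    mutate(across(where(is.numeric),
                  ~ cut(.x, breaks = brks, labels = labs)))
}

# Hypothetical helper: elapsed seconds for each of `reps` calls of f(d)
time_reps <- function(f, d, reps = 500) {
  vapply(seq_len(reps), function(i) {
    start_time <- Sys.time()
    f(d)
    as.numeric(difftime(Sys.time(), start_time, units = "secs"))
  }, numeric(1))
}

timings <- tibble(pivot_secs  = time_reps(by_pivot, dat),
                  across_secs = time_reps(by_across, dat))

# Median runtime per method, in seconds
summarise(timings, across(everything(), median))
```

Sys.time() measures wall-clock time, so individual repetitions can be noisy; comparing medians over many reps is one simple way to dampen that.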

Runtime comparison for categorizing multiple columns

