
Python and R Tips

Learn Data Science with Python and R

How To Discretize/Bin a Variable in Python with NumPy and Pandas?

December 9, 2019 by cmdlinetips

Sometimes you have a quantitative variable in your data set and want to discretize it, that is, bin or categorize it based on its values. For example, say you have measurements of height and want to discretize them so that each value becomes 0 or 1 depending on whether the height is below or above a certain threshold.

We will see examples of discretizing or binning a quantitative variable in two ways. We will first use NumPy’s digitize() function to discretize a quantitative variable. Next, we will use Pandas’ cut() function to discretize the same variable.

Let us first load NumPy and Pandas.

# load numpy
import numpy as np
# load pandas
import pandas as pd

How to Discretize or Bin with Numpy’s digitize() function?

Let us create a NumPy array with 10 integers. We will use NumPy’s random module to generate random integers from 25 up to, but not including, 200. We will also set a random seed so that the random numbers can be reproduced.

# set a random seed to reproduce
np.random.seed(123)
# create 10 random integers  
x = np.random.randint(low=25, high=200, size=10)

Let us sort the numbers for convenience.

x = np.sort(x)

We can see that we generated 10 numbers for height, ranging from 42 to 151.

print(x)
[ 42  82  91 108 121 123 131 134 148 151]

We can use NumPy’s digitize() function to discretize the quantitative variable. Let us consider a simple binning, where we use 50 as the threshold to bin our data into two categories: values less than 50 go in category 0, and values of 50 or more go in category 1.

We pass the threshold to the bins argument as a list.

# digitize examples
np.digitize(x,bins=[50])

We can see that all values except the first are greater than 50 and therefore get 1.

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
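What happens to a value that sits exactly on the threshold depends on digitize's right argument. Here is a minimal sketch of that behavior; the three test values are made up for illustration and are not part of the height data.

```python
import numpy as np

# hypothetical values just below, exactly on, and just above the threshold
edge_cases = np.array([49, 50, 51])

# default (right=False): a value equal to the threshold goes to the upper category
default_bins = np.digitize(edge_cases, bins=[50])

# right=True: a value equal to the threshold stays in the lower category
right_closed_bins = np.digitize(edge_cases, bins=[50], right=True)

print(default_bins)
print(right_closed_bins)
```

With the default setting, 50 is counted as category 1; with right=True it is counted as category 0.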

The bins argument is a list, so we can specify multiple thresholds. In the example below, we bin the quantitative variable into three categories.

np.digitize(x,[50,100])

It gives us three categories as we wanted: category 0 with values less than 50, category 1 with values from 50 up to (but not including) 100, and category 2 with values of 100 or more.

array([0, 1, 1, 2, 2, 2, 2, 2, 2, 2])
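A quick way to check a binning like this is to count how many values landed in each category. The sketch below uses np.bincount for that; the array is typed in by hand from the sorted heights printed earlier, so it runs on its own.

```python
import numpy as np

# the sorted heights shown earlier in the post, entered by hand
x = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])

categories = np.digitize(x, bins=[50, 100])

# np.bincount tallies how often each non-negative integer occurs;
# minlength=3 keeps a slot for every category even if one is empty
counts = np.bincount(categories, minlength=3)
print(counts)
```

The three counts should add up to 10, the number of heights.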

We can also bin the variable into more categories. Here is an example with four categories using digitize. Since none of our values is below 25, category 0 stays empty here.

np.digitize(x,[25,50,100])
array([1, 2, 2, 3, 3, 3, 3, 3, 3, 3])
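Because digitize returns integer category indices, one possible way to turn them into readable names is to index into an array of labels. This is only a sketch: the label names are illustrative choices, and the heights are typed in by hand from the output above.

```python
import numpy as np

# the sorted heights shown earlier in the post, entered by hand
x = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])

# illustrative names, one per category index 0..3
label_names = np.array(["very short", "short", "medium", "tall"])

bin_index = np.digitize(x, bins=[25, 50, 100])

# fancy indexing picks the name that corresponds to each category index
named_bins = label_names[bin_index]
print(named_bins)
```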

How to Discretize or Bin with Pandas cut() function?

Now let us use Pandas cut function to discretize/categorize a quantitative variable and produce the same results as NumPy’s digitize function.

Pandas’ cut function is a powerful tool for categorizing a quantitative variable. It works a bit differently from NumPy’s digitize function.

Let us first make a Pandas data frame with a height variable using the random numbers we generated above.

df = pd.DataFrame({"height":x})
df.head()

   height
0      42
1      82
2      91
3     108
4     121

Let us categorize the height variable into four categories using Pandas’ cut function. It takes the variable that we want to bin as input. In addition, we need to specify bins such that height values between 0 and 25 are in one category, values between 25 and 50 are in the second category, and so on.

df['binned']=pd.cut(x=df['height'], bins=[0,25,50,100,200])

The code above saves the binned variable as a new column in the original data frame. By default, Pandas’ cut function returns the bins as intervals stored in a categorical variable. You can check the type of each column using df.dtypes.

Note how we specify the bins with Pandas cut: we need to give both the lower and upper edges of the bins. By default each interval excludes its left edge and includes its right edge, as in (25, 50].

df.head()
   height      binned
0      42    (25, 50]
1      82   (50, 100]
2      91   (50, 100]
3     108  (100, 200]
4     121  (100, 200]
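To see what cut actually returns, it can help to inspect the dtype and the categories of the result. The sketch below rebuilds the heights by hand as a Series so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# the sorted heights shown earlier in the post, entered by hand
heights = pd.Series([42, 82, 91, 108, 121, 123, 131, 134, 148, 151], name="height")
binned = pd.cut(x=heights, bins=[0, 25, 50, 100, 200])

# the result is categorical, and its categories are the four intervals
print(binned.dtype)
print(binned.cat.categories)

# counts per interval; intervals with no values still appear, with a count of 0
interval_counts = binned.value_counts(sort=False)
print(interval_counts)
```

Note that the (0, 25] interval is listed among the categories even though no height falls into it.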

Pandas Cut Example

Let us see another Pandas cut example, but this time let us specify labels for each category that Pandas cut creates. We can specify the labels, or the names of the categories, using the argument “labels”.

In this Pandas cut example, we provide the labels as integers. Since we want to have four bins or categories, we provide the bin labels as [0,1,2,3].

df['height_bin']=pd.cut(x = df['height'],
                        bins = [0,25,50,100,200], 
                        labels = [0, 1, 2,3])
df

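We save the new bins for height as a variable, and for our data it matches the NumPy digitize example above. Keep in mind that cut includes the right edge of each bin by default while digitize includes the left edge, so a value that falls exactly on a bin edge would be assigned differently by the two functions.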

   height      binned height_bin
0      42    (25, 50]          1
1      82   (50, 100]          2
2      91   (50, 100]          2
3     108  (100, 200]          3
4     121  (100, 200]          3
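The two functions agree on the height data, but they treat values that fall exactly on a bin edge differently. The sketch below compares them on a few made-up values that include the edges 25, 50 and 100, and shows that passing right=False to cut should reproduce digitize's default behavior for values inside the outer edges. The test values are hypothetical and not taken from the height data.

```python
import numpy as np
import pandas as pd

# hypothetical values, including some that sit exactly on a bin edge
values = np.array([10, 25, 42, 50, 99, 100, 151])
edges = [0, 25, 50, 100, 200]

# digitize with its default: each bin includes its left edge
digitized = np.digitize(values, bins=[25, 50, 100])

# cut with its default: each bin includes its right edge
# labels=False returns integer bin positions instead of intervals
cut_default = pd.cut(values, bins=edges, labels=False)

# cut with right=False: each bin includes its left edge, like digitize
cut_left_closed = pd.cut(values, bins=edges, labels=False, right=False)

print(digitized)
print(cut_default)
print(cut_left_closed)
```

The first and third results should be identical, while the second differs at 25, 50 and 100.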

In the above Pandas cut example, we used integers as labels. However, we can also use more descriptive categories, like this:

df['height_bin']=pd.cut(x=df['height'], bins=[0,25,50,100,200], 
                        labels=["very short", "short", "medium","tall"])
print(df.head())

   height      binned height_bin
0      42    (25, 50]      short
1      82   (50, 100]     medium
2      91   (50, 100]     medium
3     108  (100, 200]       tall
4     121  (100, 200]       tall
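The categories that cut produces are ordered by default, so the descriptive labels can be compared with each other. As a rough sketch of how that might be used, the code below keeps only the rows labelled "medium" or higher; the data frame is rebuilt by hand so that the example runs on its own.

```python
import pandas as pd

# the sorted heights shown earlier in the post, entered by hand
df = pd.DataFrame({"height": [42, 82, 91, 108, 121, 123, 131, 134, 148, 151]})
df["height_bin"] = pd.cut(x=df["height"], bins=[0, 25, 50, 100, 200],
                          labels=["very short", "short", "medium", "tall"])

# ordered categories support comparisons against one of their labels
at_least_medium = df["height_bin"] >= "medium"
print(df[at_least_medium])
```

Only the first row, with a height of 42, is labelled "short" and is therefore filtered out.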
