NumPy, or Numerical Python, is one of the packages in Python for all things numerical computing. Learning NumPy makes it much easier to compute with multi-dimensional arrays and matrices. A huge collection of very useful mathematical functions available to operate on these arrays makes it one of the powerful environments […]

## How To Reshape Pandas Dataframe with melt() and wide_to_long()?

Reshaping data frames into tidy format is probably one of the most frequent things you would do in data wrangling. A data frame is tidy when it satisfies the following rules. Each variable in the data set is placed in its own column. Each observation is placed in its own row. Each value is placed […]
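Here is a minimal sketch of the two reshaping functions named in the title. The tiny data frame, its column names (`country`, `lifeExp_1952`, `lifeExp_2007`) and the output names are made up for illustration and are not taken from the post.

```python
import pandas as pd

# Hypothetical wide-form data: one row per country, one column per year
wide = pd.DataFrame({
    "country": ["A", "B"],
    "lifeExp_1952": [30.0, 40.0],
    "lifeExp_2007": [45.0, 60.0],
})

# melt() stacks the year columns into (variable, value) pairs
tidy = pd.melt(wide, id_vars="country", var_name="variable", value_name="lifeExp")

# wide_to_long() additionally splits the stub name ("lifeExp")
# from the numeric suffix, which becomes the new "year" column
long_df = pd.wide_to_long(
    wide, stubnames="lifeExp", i="country", j="year", sep="_"
).reset_index()

print(tidy)
print(long_df)
```

With `melt()` the year stays embedded in the `variable` strings, while `wide_to_long()` pulls it out into its own column, which is usually closer to the tidy rules above.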

## Singular Value Decomposition (SVD) in Python

Matrix decomposition by Singular Value Decomposition (SVD) is one of the most widely used methods for dimensionality reduction. For example, Principal Component Analysis often uses SVD under the hood to compute principal components. In this post, we will work through an example of doing SVD in Python. We will use gapminder data in wide form to […]
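As a rough sketch of the idea, here is SVD with NumPy's `np.linalg.svd` on a small random matrix. The random data stands in for the gapminder data, and centering the columns first is an assumption made here so that the result lines up with PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # stand-in data: 6 observations, 4 variables
Xc = X - X.mean(axis=0)       # center each column, as PCA would

# Thin SVD: Xc = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt

# Keep only the top k components for a low-rank approximation
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Coordinates of each observation on the first k principal axes
scores = U[:, :k] * s[:k]
```

The `scores` array is the same thing you would get by projecting the centered data onto the first `k` rows of `Vt`.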

## How To Create a Column Using Condition on Another Column in Pandas?

Often while cleaning data, one might want to create a new variable or column based on the values of another column using conditions. In this post we will see two different ways to create a column based on values of another column using conditional statements. First we will use NumPy’s little-known function where to […]
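A small sketch of the first approach, using `np.where`. The data frame, the threshold of 50 and the labels are hypothetical and only serve to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical data: life expectancy values for four rows
df = pd.DataFrame({"lifeExp": [43.8, 76.4, 72.3, 39.6]})

# np.where(condition, value_if_true, value_if_false) works element-wise
df["lifeExp_group"] = np.where(df["lifeExp"] >= 50, "high", "low")

print(df)
```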

## Empirical cumulative distribution function (ECDF) in Python

Histograms are a great way to visualize a single variable. One of the problems with histograms is that one has to choose the bin size. With a wrong bin size your data distribution might look very different. In addition to bin size, histograms may not be a good option to visualize distributions of multiple variables […]
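One way to compute an ECDF by hand, sketched with NumPy only. The helper name `ecdf` is made up here and the post may well do it differently. The point is that no bin size is needed: each sorted value is paired with the fraction of observations less than or equal to it.

```python
import numpy as np

def ecdf(data):
    """Return sorted values and their cumulative proportions."""
    x = np.sort(np.asarray(data))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf([3.0, 1.0, 2.0, 2.0])
print(x)
print(y)
```

Plotting `y` against `x` as points or a step line gives the ECDF, and several variables can share one plot without any binning choice.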

## How To Randomly Add NaN to Pandas Dataframe?

Sometimes while testing a method, you might want to create a Pandas dataframe with NaNs randomly distributed. In this post we will see an example of how to introduce NaNs randomly in a data frame with Pandas. Let us load the packages we need. Let us use gapminder data in wide form to introduce NaNs […]
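A minimal sketch of one way to do this, assuming a random boolean mask combined with `DataFrame.mask`. The random data frame stands in for the gapminder data, and the 10% NaN fraction is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in data frame with no missing values
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))

# Boolean mask with the same shape as df; True in roughly 10% of cells
nan_mask = rng.random(df.shape) < 0.1

# mask() replaces values with NaN wherever the mask is True
df_nan = df.mask(nan_mask)

print(df_nan.isna().sum())
```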

## How to Implement Pandas Groupby operation with NumPy?

Pandas’ GroupBy function is the bread and butter for many data munging activities. Groupby enables one of the most widely used paradigms, “Split-Apply-Combine”, for doing data analysis. Sometimes you will be working with NumPy arrays and may still want to perform groupby operations on the array. I just recently wrote a blog post inspired by Jake’s post on […]
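As an illustration, here is one possible way to get grouped sums and means with plain NumPy, using `np.unique` and `np.bincount`. This is a sketch of the general idea, not necessarily the approach taken in the post, and the arrays are made up.

```python
import numpy as np

# Hypothetical group labels and values
keys = np.array(["a", "b", "a", "c", "b", "a"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Split: map each key to an integer group id
groups, inverse = np.unique(keys, return_inverse=True)

# Apply and combine: weighted bincount gives per-group sums,
# plain bincount gives per-group counts
sums = np.bincount(inverse, weights=values)
counts = np.bincount(inverse)
means = sums / counts

for g, total, avg in zip(groups, sums, means):
    print(g, total, avg)
```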

## How To Specify Colors to Scatter Plots in Python?

Scatter plots are extremely useful to analyze the relationship between two quantitative variables in a data set. Often datasets contain multiple quantitative and categorical variables, and you may be interested in the relationship between two quantitative variables with respect to a third categorical variable. And coloring scatter plots by the group/categorical variable will greatly enhance the scatter […]

## How To Select Columns by Data Type in Pandas?

Often when you are working with a bigger dataframe and doing some data cleaning or exploratory data analysis, you might want to select columns of a Pandas dataframe by their data types. For example, you might want to quickly select columns that are numerical in type and visualize their summary data. Or you might want to select […]
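A quick sketch with `DataFrame.select_dtypes`, which is one way to do this kind of selection. The small data frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with mixed column types
df = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 2007],
    "lifeExp": [30.0, 45.0],
})

# Keep only numeric columns (integers and floats)
numeric = df.select_dtypes(include="number")

# Keep everything that is not numeric
non_numeric = df.select_dtypes(exclude="number")

print(numeric.describe())
```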

## How To Make Scatter Plot in Python with Seaborn?

Scatter plots are a useful visualization when you have two quantitative variables and want to understand the relationship between them. In this post we will see examples of making scatter plots using Seaborn in Python. We will first make a simple scatter plot and improve it iteratively. Let us first load the packages we need […]