Histograms are a great way to visualize the distribution of a single variable. One problem with histograms is that one has to choose the bin size, and with a poorly chosen bin size the distribution of your data can look very different. Beyond bin size, histograms may not be a good option to visualize distributions of multiple variables […]
How To Randomly Add NaN to Pandas Dataframe?
In this post we will see an example of how to introduce missing values, i.e. NaNs, randomly in a data frame using Pandas. Sometimes while testing a method, you might want to create a Pandas dataframe with NaNs randomly distributed. Here we show how to do it. Let us load the packages we need. Let […]
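As a rough illustration of the idea (the post's own code is not shown in this excerpt), one possible way to scatter NaNs at random is to draw a Boolean mask and apply it with `DataFrame.mask`. The helper name `add_random_nans`, its `frac` and `seed` parameters, and the toy data frame are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def add_random_nans(df, frac=0.1, seed=0):
    """Return a copy of df where each cell is set to NaN with probability frac.

    Hypothetical helper for illustration; not taken from the original post.
    """
    rng = np.random.default_rng(seed)
    # One uniform draw per cell; True marks the cells to blank out.
    nan_mask = rng.random(df.shape) < frac
    # DataFrame.mask replaces values where the condition is True (NaN by default).
    return df.mask(nan_mask)


# Toy data frame (assumed for the example), float dtype so NaN fits without upcasting.
df = pd.DataFrame(np.arange(20, dtype=float).reshape(5, 4), columns=list("ABCD"))
df_nan = add_random_nans(df, frac=0.3, seed=42)
print(df_nan)
```

Because each cell is masked independently, the realized share of NaNs only approximates `frac`; sampling an exact number of cell positions would be an alternative if a fixed count is needed.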
How To Highlight Select Data Points with ggplot2 in R?
The power of ggplot2 lies in making it easy to create great plots and to tweak them into exactly what one wants. Sometimes, one might want to highlight certain data points in a plot in a different color. Here we will see an example of highlighting specific data points in a plot. Let us first load […]
How to Implement Pandas Groupby operation with NumPy?
Pandas’ GroupBy function is the bread and butter for many data munging activities. Groupby enables one of the most widely used paradigms, “Split-Apply-Combine”, for doing data analysis. Sometimes you will be working with NumPy arrays and may still want to perform groupby operations on the array. I recently wrote a blog post inspired by Jake’s post on […]
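To make the split-apply-combine idea concrete on plain arrays, here is a minimal sketch of a grouped mean using only NumPy. It relies on `np.unique(..., return_inverse=True)` and `np.bincount`, which is one common approach and not necessarily the one the post uses; the function name `groupby_mean` and the sample arrays are assumptions.

```python
import numpy as np


def groupby_mean(keys, values):
    """Group `values` by `keys` and return (unique_keys, per-group means).

    Hypothetical helper for illustration; assumes 1-D arrays of equal length.
    """
    # Split: map every key to the integer index of its group.
    unique_keys, group_idx = np.unique(keys, return_inverse=True)
    # Apply: accumulate the sum and the count of each group.
    sums = np.bincount(group_idx, weights=values)
    counts = np.bincount(group_idx)
    # Combine: one mean per unique key, in sorted key order.
    return unique_keys, sums / counts


# Sample data assumed for the example.
keys = np.array(["a", "b", "a", "c", "b", "a"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

groups, means = groupby_mean(keys, values)
for g, m in zip(groups, means):
    print(g, m)
```

The same integer group index can be reused for other reductions, for example `np.bincount(group_idx)` alone gives group sizes.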
Implementing K-means clustering in Python from Scratch
K-means clustering is one of the commonly used unsupervised techniques in machine learning. K-means clustering partitions data into K distinct clusters. In a typical setting, we provide input data and the number of clusters K, and the k-means clustering algorithm assigns each data point to a distinct cluster. In this post, we […]
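For readers who want a feel for the procedure before opening the post, below is a minimal sketch of the standard Lloyd iteration for k-means in NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustrative assumption, not the post's implementation; the function name `kmeans`, the random-point initialization, and the synthetic two-blob data are all made up for this example.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centroids).

    Illustrative sketch only: random-point initialization, Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Synthetic data assumed for the example: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Results depend on the initial centroids, so in practice the algorithm is often run several times with different seeds and the best partition kept.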



