20 Free Online Books to Learn R and Data Science

If you are interested in learning Data Science with R, but not interested in spending money on books, you are definitely in a good place. There are a number of fantastic books and resources available online for free from top creators and scientists. Here are 13 such free (so far) online data science books […]

How To Do PCA in tidyverse Framework?

In an earlier post, we saw a tutorial on how to do PCA in R using the gapminder data set. Another interesting way of doing PCA is to follow the tidyverse framework. In this post, we will see an example of doing PCA using gapminder data in a tidy framework. Being the first attempt to […]

How To Create a Column Using Condition on Another Column in Pandas?

Often while cleaning data, one might want to create a new variable or column based on the values of another column using conditions. In this post we will see two different ways to create a column based on values of another column using conditional statements. First we will use NumPy’s little-known function where to […]
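As a rough illustration of the np.where approach mentioned above, here is a minimal sketch. The toy data frame, the column names, and the threshold of 60 are assumptions made up for this example and are not taken from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({"country": ["A", "B", "C"],
                   "lifeExp": [48.5, 72.1, 60.0]})

# np.where(condition, value_if_true, value_if_false) is evaluated element-wise,
# so it yields one label per row of the data frame.
df["lifeExp_group"] = np.where(df["lifeExp"] >= 60, "high", "low")

print(df)
```

np.where handles a single two-way condition; more than two outcomes would need nesting or a different tool.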

Empirical cumulative distribution function (ECDF) in Python

Histograms are a great way to visualize a single variable. One of the problems with histograms is that one has to choose the bin size. With a wrong bin size your data distribution might look very different. In addition to bin size, histograms may not be a good option to visualize distributions of multiple variables […]
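One bin-free alternative is the ECDF named in the title. The sketch below shows one common way to compute it with NumPy; the helper name ecdf and the sample values are assumptions for illustration, not necessarily what the post uses.

```python
import numpy as np

def ecdf(values):
    """Return sorted values x and the fraction of observations <= each x."""
    x = np.sort(np.asarray(values, dtype=float))
    # The i-th smallest value (1-based) has i / n of the data at or below it.
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Hypothetical sample data.
x, y = ecdf([3.0, 1.0, 2.0, 2.0])
print(x)
print(y)
```

Plotting y against x as a step or scatter plot gives the ECDF, with no bin size to choose.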

How To Randomly Add NaN to Pandas Dataframe?

Sometimes while testing a method, you might want to create a Pandas dataframe with NaNs randomly distributed. In this post we will see an example of how to introduce NaNs randomly in a data frame with Pandas. Let us load the packages we need. Let us use gapminder data in wide form to introduce NaNs […]
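A minimal sketch of one way to do this follows. It uses a small random data frame rather than the gapminder data, and the 20% missing rate, the seed, and the variable names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical numeric data standing in for the real data set.
df = pd.DataFrame(rng.normal(size=(5, 4)), columns=list("abcd"))

# Boolean mask with the same shape as df; each cell is True with probability 0.2.
nan_mask = rng.random(df.shape) < 0.2

# DataFrame.mask replaces values with NaN wherever the mask is True.
df_nan = df.mask(nan_mask)

print(df_nan)
```

Because the mask is drawn independently per cell, the actual share of NaNs varies around the chosen rate from run to run unless the seed is fixed.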

How To Highlight Select Data Points with ggplot2 in R?

The power of ggplot2 lies in making it easy to make great plots and to tweak them into what one wants. Sometimes, one might want to highlight certain data points in a plot in a different color. Here we will see an example of highlighting specific data points in a plot. Let us first load […]

How to Implement Pandas Groupby operation with NumPy?

Pandas’ GroupBy function is the bread and butter for many data munging activities. Groupby enables one of the most widely used paradigms, “Split-Apply-Combine”, for doing data analysis. Sometimes you will be working with NumPy arrays and may still want to perform groupby operations on the array. I just recently wrote a blog post inspired by Jake’s post on […]
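To make the split-apply-combine idea concrete for plain arrays, here is one possible sketch using np.unique and np.bincount. It is not necessarily the approach the post takes; the keys and values are hypothetical.

```python
import numpy as np

# Hypothetical group keys and the values to aggregate.
keys = np.array(["a", "b", "a", "c", "b"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Split: sorted unique keys plus an integer group label for every row.
groups, labels = np.unique(keys, return_inverse=True)

# Apply and combine: per-group sums and counts, then the group means.
sums = np.bincount(labels, weights=values)
counts = np.bincount(labels)
means = sums / counts

for g, m in zip(groups, means):
    print(g, m)
```

np.bincount covers sums, counts, and means; other aggregations would need a different per-group step, such as sorting by label and slicing.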

K-means clustering in Python

K-means clustering is one of the commonly used unsupervised techniques in machine learning. K-means clustering partitions data into K distinct clusters. In a typical setting, we provide input data and the number of clusters K, and the k-means clustering algorithm assigns each data point to a distinct cluster. In this post, we […]
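The assignment step described above can be sketched with a bare-bones version of the standard Lloyd iteration in NumPy. This is only an illustration under simple assumptions (random initial centers drawn from the data, a fixed seed, toy two-dimensional points) and is not the implementation the post uses; the function name kmeans and its parameters are made up for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Assign each row of X to one of k clusters using Lloyd's iteration."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
print(labels)
```

On real data the result can depend on the initial centers, so library implementations usually run several initializations and keep the best one.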

R Graphics Cookbook Second Edition is Available for Free

Winston Chang from RStudio quietly announced last week that the second edition of his popular R Graphics Cookbook: Practical Recipes for Visualizing Data is available now to buy. Not just that, the book is also available online for free at https://r-graphics.org/. Winston Chang’s first edition of R Graphics Cookbook was the first R book I […]

PCA example using prcomp in R

Principal Component Analysis, aka PCA, is one of the commonly used approaches to unsupervised learning/dimensionality reduction. It is a fantastic tool to have in your data science/Machine Learning arsenal. You will be surprised how often PCA comes up when working with high-dimensional data. Let us see a step-by-step example […]

How To Specify Colors to Scatter Plots in Python?

Scatter plots are extremely useful for analyzing the relationship between two quantitative variables in a data set. Often datasets contain multiple quantitative and categorical variables, and one may be interested in the relationship between two quantitative variables with respect to a third categorical variable. And coloring scatter plots by the group/categorical variable will greatly enhance the scatter […]