Exploring your data while doing analysis is extremely important. skimr, an R package from rOpenSci, helps you get summary statistics in a nicely formatted way, so you can quickly skim a summary of your data and understand it better.
If you have not heard of rOpenSci, it is a non-profit initiative founded in 2011 by Karthik Ram, Scott Chamberlain, and Carl Boettiger with the goal of making scientific data reproducible. rOpenSci’s unconference is a pretty cool idea. Check out the rOpenSci website to learn more.
One of the best parts of skimr is the tiny histogram it displays for every numerical variable in the data. The other great thing about skimr is that it works smoothly at any stage of a tidyverse pipeline.
How to Install Skimr?
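You can install skimr from CRAN, or install the development version from GitHub with devtools.
install.packages("skimr")
# Or, for the development version from GitHub:
# install.packages("devtools")
# devtools::install_github("ropensci/skimr")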
Let us load the packages we will be using.
library(gapminder)
library(dplyr)
library(skimr)
Let us make a small data frame from two vectors and use skimr to check its summary.
> x = seq(1000)
> y = rnorm(1000, mean=3)
> df = data.frame(x, y)
> skim(df)
When you type skim(df), skimr gives you a quick summary of the data frame. It first reports basic information such as the number of variables, and then gives summary statistics grouped by variable type. In this case, we have just one integer variable and one numeric variable. For numerical variables, skimr prints a small histogram in the console in addition to the standard summary statistics. The histogram tells us that our x variable is uniformly distributed, while the y variable peaks in the middle and spreads out equally on both sides.
Now that we have seen the gist of what skimr can do for us, let us look at a more realistic data set and see how skim can be extremely useful in a data analysis pipeline.
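As a rough cross-check, the kind of numbers skim() reports can be reproduced with base R. This is only a sketch: the fixed seed and the choice of eight equal-width bins are assumptions made here for illustration, picked to resemble the small console histogram rather than to match it exactly.

```r
set.seed(42)  # assumption: fixed seed so the run is repeatable
x <- seq(1000)
y <- rnorm(1000, mean = 3)
df <- data.frame(x, y)

# Mean, standard deviation, and quartiles of y
y_stats <- c(
  mean = mean(df$y),
  sd = sd(df$y),
  quantile(df$y, c(0, 0.25, 0.5, 0.75, 1))
)
print(round(y_stats, 2))

# Bin counts behind a coarse, eight-bin histogram
x_counts <- table(cut(df$x, breaks = 8))
y_counts <- table(cut(df$y, breaks = 8))
print(x_counts)  # x: close to equal counts in every bin
print(y_counts)  # y: most values fall in the middle bins
```

Flat counts for x and a bulge in the middle for y are what the two inline histograms are summarizing.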
Skimr to select columns, just like dplyr’s select verb
skimr can work just like dplyr’s select verb: we can pick columns and look at the summary of just those. For example:
>skim(gapminder, lifeExp, gdpPercap)
Customize summary functions in skimr
Another cool thing about skimr is that you can customize the summary functions the way you want. For example, if you think the default summary that skimr offers is too much and you want only a few statistics of your own, you can easily do that.
For example, if you just want the min, max, and mean for every numerical variable, you can put those functions in a list and tell skimr to use them instead of the defaults with the “skim_with” command.
funs <- list(
  min = min,
  max = max,
  mean = mean
)
skim_with(numeric = funs, append = FALSE)
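A note on versions: the skim_with() call above follows the skimr 1.x interface, where it changes the defaults globally. In skimr 2.0 and later, skim_with() instead returns a new skimming function, and the summary functions are wrapped in sfl(). Here is a minimal sketch of the equivalent, assuming skimr 2.x is installed; the names my_skim and custom_summary are just illustrative choices.

```r
library(skimr)
library(gapminder)

# skimr >= 2.0 (assumed): skim_with() builds a new skim function
# rather than modifying the global defaults.
my_skim <- skim_with(
  numeric = sfl(min = min, max = max, mean = mean),
  append = FALSE  # replace the default numeric summaries instead of adding to them
)

# Use the custom function exactly like skim()
custom_summary <- my_skim(gapminder, lifeExp, gdpPercap)
print(custom_summary)
```

Because my_skim is an ordinary function, the default skim() is left untouched in that version, so there is nothing to reset afterwards.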
Skimr on a group_by object
Any call to skim() after this custom definition will use the customized summary statistics. You can use skim() at any stage of a tidyverse pipeline and get only the custom summaries. Yes, any stage: for example, skimr works on the grouped object returned by group_by() in a dplyr pipeline. Here is an example that applies the custom summary functions to a grouped object.
gapminder %>%
  filter(year == 2007) %>%
  select(continent, lifeExp, gdpPercap) %>%
  group_by(continent) %>%
  skim()
Here we call skim() on a group_by() object after a series of data manipulations on the gapminder data. The result is a tidy skimr object with the summary statistics we defined. And any time you want the default summary statistics back, you can restore them with
# Restore defaults
skim_with_defaults()
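If you want to double-check the grouped numbers from the pipeline above, the same min, max, and mean can be computed with dplyr alone. This is a sketch, assuming dplyr 1.0 or later for across(); the name by_continent is only illustrative.

```r
library(dplyr)
library(gapminder)

# Per-continent min, max, and mean for 2007, without skimr
by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    across(
      c(lifeExp, gdpPercap),
      list(min = min, max = max, mean = mean)
    )
  )

# One row per continent, columns named like lifeExp_min, gdpPercap_mean
print(by_continent)
```

Comparing this table with the grouped skim() output is a quick way to confirm that the custom summary functions were actually picked up.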
There are many more interesting uses of skimr that fit into a tidy analysis pipeline. So check out skimr from rOpenSci now 🙂