Data Visualization: A Practical Introduction
by Duke University Professor Kieran Healy is a great introduction to data visualization. If you have not heard of the book before, here is a little backstory. The author, Kieran Healy, developed the book using R Bookdown and made the whole book available online for free. Yes, it is still available for free online. Check out http://socviz.co/ and read the book for free.
The print version of the book became available in late 2018, and I finally got a chance to get a copy. What a difference the physical copy makes! In the two weeks I have had the book, I have gone back to pick it up more often than I ever did with the online version. Definitely old-school, or just old :)
Here are some thoughts on the book, or a review of sorts, mainly covering what the book covers, whom it is useful for, and what I liked and learned from each chapter.
TL;DR: Data Visualization: A Practical Introduction is an absolute must for beginners interested in data visualization using the R data science ecosystem.
Chapter 1: Data Visualization
The book contains eight chapters, and it starts with why one should look at the data. It illustrates the benefits of looking at the data with the classic Anscombe’s quartet, where you have four scatter plots with the same correlation but different underlying data. The book does not stop there. I really liked the 16 scatter plots with the same correlation but different underlying data, which hammer home the need to look at the data and how skipping that step can hurt us. It goes on to give great examples of why a figure looks bad and explains a bit of the theory and philosophy of good and bad data visualization.
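If you want to see the Anscombe point for yourself, here is a minimal sketch using the anscombe data frame that ships with base R. This is my own illustration, not code from the book, and the name quartet_cor is just something I made up for the example.

```r
# `anscombe` is a built-in data frame with columns x1..x4 and y1..y4,
# one x/y pair for each of the four data sets in the quartet.
quartet_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# The four correlations agree to two decimal places
print(round(quartet_cor, 2))

# Plot the four sets side by side to see how different they really are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)  # restore the previous plotting layout
```

The summary numbers look interchangeable, while the four panels clearly do not.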
Chapter 2: Getting Started
The second chapter is great for beginners who have not used R before, and it shows how to get started with R. I really liked the way the book introduces R to new users. It does not start with the traditional way of learning R. Instead, the author jumps right into recommending R Markdown, knitr, and working with Projects in RStudio from the start. At first this may look daunting, but it is the best way for beginners to start, and it sets them on the right path. From my experience of teaching R Markdown/knitr first to beginners, I am sure that it is not as scary as it may look.
And the biggest take-home message from this chapter for a beginner is, just as the author says, “Be patient with R and with yourself”.
Chapter 3. Make a Plot
In the third chapter, the book takes a deep dive into the core aspects of ggplot2 and shows how to make scatter plots with ggplot2 in R. It might take a little time to get the hang of ggplot’s logic in making plots. The chapter nicely explains the logic behind ggplot2 and goes on to use the gapminder dataset to illustrate making scatter plots with ggplot. If you are a regular reader of this blog, you might know our love of the gapminder dataset :).
What I really liked about the third chapter is that the author picked two variables of interest, GDP per capita and life expectancy, from the gapminder data and went on to show the power of ggplot2 with 14 different plots using the same two variables. It is a fantastic way to illustrate the process of making a good data visualization in general, not just the power of ggplot2. One typically starts with something simple and then goes on to tweak the code to make a better version of the plot. One day I am going to make all fourteen versions of the scatter plot in a blog post here :)
Just nitpicking here: I couldn’t get used to seeing ggplot2 with an uppercase G, like this: Ggplot :)
It takes about 50 pages before one actually writes some code to make a data visualization in this book. If you can’t wait to get your hands dirty making a plot or two with ggplot2 in R, here is a set of recommendations for the first three chapters.
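To give a flavor of that start-simple-then-tweak process, here is a small sketch of my own. It assumes the ggplot2 and gapminder packages from CRAN are installed. The particular tweaks are my choices and are not meant to reproduce any of the book's fourteen plots exactly.

```r
library(ggplot2)
library(gapminder)  # assumption: the CRAN 'gapminder' data package is installed

# Step 1: map the two variables to x and y. Nothing is drawn yet,
# because no geom has been added.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Step 2: the simplest possible scatter plot
p_basic <- p + geom_point()

# Step 3: tweak it with transparency, a linear fit, a log x-axis, and labels
p_refined <- p +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy (years)")
```

Typing p_basic or p_refined at the console draws the corresponding plot, so you can compare the versions as you go.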
Scenario 1: You are totally new to R and don’t have R set up
- Glance through Chapter 1.
- Use Chapter 2 to get your computer and yourself ready to actually write some code.
- Work through the examples of Chapter 3 by actually typing instead of copying and pasting. You don’t have to make all 14 plots. Just make as many as you need to get comfortable.
Scenario 2: You are not totally new to R and have R set up already
- Glance through Chapter 1 often
- Work through examples of Chapter 3, by actually typing instead of copying and pasting.
Scenario 3: You just want to learn to use ggplot2 and have R set up already
- Jump right into examples of Chapter 3, by actually typing instead of copying and pasting.
Chapter 4. Show the Right Numbers
The main goal of the fourth chapter is to improve your fluency in using ggplot2 to make a variety of different plots. Or, speaking the ggplot2 language, more “geom_”s. The earlier chapter already introduced geom_point() for making scatter plots. In this chapter, you will see examples of using geom_line() for line plots, geom_bar() for bar plots, and geom_histogram() for histograms. One of the most common tasks while making a visualization is to group, split, or transform the data in a way specific to the type of visualization. This chapter gives plenty of examples of doing that in ggplot2.
One of the things I like about this chapter is that it teaches you how things might go wrong and produce bizarre-looking plots (widely known as “accidental aRt”) if we are not specific about the visualization we want.
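Here is a small sketch of how that can happen with a line plot, and how being specific fixes it. It is my own example and assumes the ggplot2 and gapminder packages are installed.

```r
library(ggplot2)
library(gapminder)  # assumption: the CRAN 'gapminder' data package is installed

p <- ggplot(gapminder, aes(x = year, y = gdpPercap))

# Without telling ggplot2 about the country-level structure, every
# observation is joined into a single zig-zagging line.
p_jagged <- p + geom_line()

# Being specific: draw one line per country via the group aesthetic
p_grouped <- p + geom_line(aes(group = country))
```

The only difference between the two plots is the group aesthetic, which tells geom_line() which rows belong together.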
Another thing I really like is that this chapter introduces the use of stat functions in ggplot2. Often when we make plots with ggplot2, it transforms the data under the hood in a way specific to the type of plot we are making. For example, when we make a bar plot showing counts or frequencies, the original data frame does not actually have the count as a variable; ggplot2’s stat functions compute it on the fly for us. The beauty is that sometimes you may want to use those stat functions yourself to do something specific. When I started using ggplot2, how to use the stat functions was mysterious, and it took me a while to grasp the idea. This chapter gives a nice introduction to using stat functions to make plots.
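To make the stat idea concrete, here is a minimal sketch using the mpg data set that ships with ggplot2. The data set and the variable names are my own choices for illustration, not the book's example.

```r
library(ggplot2)

# `mpg` has one row per car model and no count column.
# geom_bar() uses stat_count() behind the scenes to tally rows per class.
p_count <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# layer_data() exposes what the stat computed, including `count`
computed <- layer_data(p_count)

# Asking the stat for something specific: proportions instead of counts.
# group = 1 makes the proportions relative to the whole data set.
p_prop <- ggplot(mpg, aes(x = class)) +
  geom_bar(aes(y = after_stat(prop), group = 1))
```

Note that after_stat() is the current way to refer to computed variables and needs a reasonably recent ggplot2 (version 3.3.0 or later).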
Chapter 6. Work with Models
Data visualization is one of the important components of data analysis and data science. Statistical modeling is another important dimension. Data visualization backed by statistical models can drastically improve how much we get out of visualizing data.
The sixth chapter stresses the importance of working with models while making visualizations. Not just that, it shows how one can work with models easily.
If you have ever worked with statistical models, even the simplest linear models, you know how painful it can be to extract the results from the model. To make it worse, getting results from one type of statistical model is often very different from getting them from another.
A couple of years ago David Robinson wrote a wonderful package called “broom”. As the name suggests, broom cleans up statistical models in R using the tidy approach. Broom makes it really easy to get the results from different statistical models. The sixth chapter jumps right into introducing broom to work with statistical models and shows examples of using the results from models in visualizations.
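As a quick illustration of why broom is handy, here is a sketch of my own using the built-in mtcars data. It assumes the broom and ggplot2 packages are installed; the model itself is just a toy example.

```r
library(broom)
library(ggplot2)

# A simple linear model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# tidy() returns one row per model term as a regular data frame,
# here with confidence intervals added
coefs <- tidy(fit, conf.int = TRUE)

# Because the results are a data frame, they plug straight into ggplot2:
# a coefficient plot with point estimates and confidence intervals
p_coef <- ggplot(subset(coefs, term != "(Intercept)"),
                 aes(x = term, y = estimate,
                     ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip()
```

Compare this with digging through summary(fit) by hand, and the appeal of the tidy approach is clear.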
Nobody ever builds just a single model :) Often one ends up building multiple models, and the chapter gives a great teaser on how to do that in a tidy way. The chapter stops short of a full introduction to purrr, one of the tidyverse packages, but shows the power of nesting and mapping while building multiple models.
Overall, it is a great introductory chapter for getting started with models and ggplot2. It is useful not just for new learners, but also for intermediate R users.
If you loved the chapter and want more, check out R for Data Science, especially the material on map functions in the chapter on iteration and the chapters on models.
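Here is a rough sketch of the nest-and-map idea, fitting one small model per group. It is my own example on mtcars and assumes the dplyr, tidyr, purrr, and broom packages are installed. I use purrr::map() for the mapping step, which may differ from the book's exact code.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

per_cyl <- mtcars %>%
  # Nest: one row per cylinder count, with that group's rows in `data`
  nest(data = -cyl) %>%
  # Map: fit the same model to each nested data frame, then tidy each fit
  mutate(model  = map(data, ~ lm(mpg ~ wt, data = .x)),
         tidied = map(model, tidy)) %>%
  # Unnest the tidied coefficients back into a flat data frame
  unnest(cols = tidied) %>%
  select(cyl, term, estimate, std.error)
```

The result has one row per group and model term, ready to be passed to ggplot2 like any other data frame.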
Chapter 7. Draw Maps
The seventh chapter is about working with geospatial data and making beautiful maps. This is probably the most useful chapter for me. Although I have coded in R for a few years, one thing I have never touched is maps. This chapter shows how one can make breathtaking maps using ggplot2. It starts with 2016 US presidential election data and shows how to make state-level and county-level maps. It is just fantastic, and I am looking forward to learning some mapping skills with this chapter. It is a perfect example of a chapter that is useful for beginners and advanced users alike.
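For a taste of what a state-level map looks like in ggplot2, here is a bare-bones sketch of my own. It assumes the ggplot2 and maps packages are installed. The fill here is just a placeholder coloring by state name; for a real election map you would join the election results onto the map data first.

```r
library(ggplot2)

# map_data() needs the 'maps' package (an assumption about your setup).
# It returns state outlines as a data frame of long/lat points.
us_states <- map_data("state")

p_map <- ggplot(us_states,
                aes(x = long, y = lat,
                    group = group,      # keeps each polygon separate
                    fill = region)) +   # placeholder fill: one color per state
  geom_polygon(color = "gray90") +
  coord_quickmap() +                    # quick approximation of a map projection
  theme(legend.position = "none")
```

The group aesthetic matters here for the same reason as in line plots: without it, the outlines of different states get joined together.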
Chapter 8. Refine Your Plots
The final chapter is all about how to refine your plot and make it better. Having a separate chapter on “refining the plot” at the end does not mean that the book skipped the basics of refining plots earlier. In the earlier chapters, the author mainly used ggplot’s core functionality to refine the plots. Remember the 14 different scatter plots I mentioned in the third chapter? This final chapter introduces how to enhance your plots using the right colors, text, and themes. You will learn about color palettes, adding text with geom_text_repel to highlight relevant aspects of a plot, and using different themes. The chapter has a nice case study on a new dataset where one can put all of these to use. In short, a great way to end the book.
By now the upshot of this review must be clear: go grab a copy of the book if you are interested in R, ggplot2, or data visualization, whether you are a beginner or an expert.
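Here is a small sketch of my own that combines those three ingredients: a color palette, repelled text labels, and a theme. It assumes the ggplot2 and ggrepel packages are installed and uses mtcars as a stand-in data set, not the book's case study.

```r
library(ggplot2)
library(ggrepel)  # assumption: the CRAN 'ggrepel' package is installed

# Prepare a small data frame with car names as a regular column
cars <- mtcars
cars$model <- rownames(mtcars)
cars$cyl <- factor(cars$cyl)

p_polished <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +
  # Label only the most fuel-efficient cars; ggrepel keeps labels from overlapping
  geom_text_repel(data = subset(cars, mpg > 30),
                  aes(label = model),
                  show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +   # a qualitative ColorBrewer palette
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  theme_minimal()
```

Each added line is a small refinement on top of a plain scatter plot, which is very much in the spirit of the chapter.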