Finally, a book review. It is long overdue. Over the last few years, I have used a lot of interesting books on Data Science and Machine Learning for learning and teaching. I hope to periodically add reviews of the useful Data Science books that I have used.
Here is the first book review on Python Data Science Handbook.
Python Data Science Handbook by Jake VanderPlas is one of the basic data science books that lets one get started with Data Science using Python.
The book assumes that the audience already knows Python, so it does not teach the basics of Python. It is meant for anyone who is interested in using Python for the most common data science tasks.
What does Python Data Science Handbook Cover?
One of the things I really like about Python Data Science Handbook is that it extensively covers the awesome stack of data science tools available in Python: IPython/Jupyter, NumPy, Pandas, Matplotlib, and Machine Learning with scikit-learn.
Chapter 1: IPython
The book starts off with an introduction to IPython and Jupyter notebooks and covers the basic benefits of using IPython/Jupyter. It covers special features available in IPython, like magic commands and so on. The commands discussed in the chapter are extremely useful; who can forget the pain of cutting and pasting chunks of Python code into the shell before IPython and its magic commands? Still, the first chapter is missing some of what we can do now. If you are starting now, you will be better off starting with JupyterLab (“jupyter lab”). The IPython and Jupyter teams are doing such a great job developing tools that make the whole Python experience wonderful.
The next two chapters of the book focus entirely on data, offering answers to the question: how can we load data into memory and work with it?
Chapter 2: NumPy
In the second chapter, the book covers one of the most important tools for dealing with data in Python: NumPy. Even though it is just a single chapter, it has everything that you need to get started with NumPy for data science.
In case you did not know, NumPy, short for Numerical Python, is a Python library for all things numerical. NumPy offers efficient ways to store data and operate on it. This chapter introduces NumPy and teaches you how to create NumPy arrays, how they compare with Python’s built-in list data structure, how to slice them, and how to use mathematical/arithmetic operations on them.
One of the most useful things I learned from this chapter was about ufuncs (universal functions). One of the reasons NumPy is fast is that it enables vectorized operations, which are implemented through ufuncs. The chapter contains lots of useful information about when NumPy can be fast or slow.
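To make the ufunc idea concrete, here is a minimal sketch of my own (not code from the book). It computes reciprocals twice, once with an explicit Python loop and once with a vectorized operation. The function name `reciprocals_loop` and the sample values are just illustrative assumptions.

```python
import numpy as np


def reciprocals_loop(values):
    """Compute 1/x one element at a time with an explicit Python loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


# Hypothetical sample values, chosen only for illustration
values = np.array([1.0, 2.0, 4.0, 8.0])

loop_result = reciprocals_loop(values)

# The / operator on an array dispatches to the np.divide ufunc,
# so the loop over elements runs in compiled code instead of Python.
ufunc_result = 1.0 / values

print(loop_result)
print(ufunc_result)
```

Both versions give the same numbers. On large arrays the vectorized form is typically much faster, which you can check yourself in IPython with `%timeit`.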
As Jake says in the book:
“NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.”
Chapter 3: Pandas
If you think you don’t need NumPy for now and just want to get started playing with real-world heterogeneous data, you can jump to the third chapter, which covers the fundamentals of Pandas for analyzing mixed data in a tabular format.
Pandas, originally developed by Wes McKinney, is one of the most useful Python packages for doing data science. It has democratized doing data science with Python. Check this article to learn more about the history of Pandas development.
Basically, Pandas is built on NumPy and offers the DataFrame data structure and methods for dealing with data in data frame format. The chapter on Pandas teaches the basics of using Pandas, starting with how to create a data frame, how to access elements, rows, or columns in a data frame, how to deal with time series data, and much more.
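As a quick taste of those basics, here is a small sketch of creating a data frame and pulling out a column, a row, and a single element. The toy data and variable names are my own assumptions, not an example from the book.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not taken from the book)
df = pd.DataFrame(
    {"fruit": ["apple", "banana", "cherry"], "count": [3, 5, 7]},
    index=["a", "b", "c"],
)

counts = df["count"]           # one column, returned as a Series
row_b = df.loc["b"]            # one row, selected by index label
first_row = df.iloc[0]         # one row, selected by integer position
single = df.loc["c", "fruit"]  # one element: row label, then column name

print(counts.tolist())
print(row_b["fruit"], first_row["fruit"], single)
```

The `loc` versus `iloc` distinction (labels versus positions) is the part that tends to trip up newcomers, so it is worth practicing early.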
One of the most useful concepts (in addition to time series data) to learn from this chapter is dealing with tidy data and using the split-apply-combine approach.
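If split-apply-combine is new to you, a minimal sketch with a Pandas `groupby` may help. It uses a tiny made-up data frame: the rows are split by key, a sum is applied to each group, and the results are combined into one Series.

```python
import pandas as pd

# Tiny illustrative data frame (the values are arbitrary)
df = pd.DataFrame(
    {"key": ["A", "B", "C", "A", "B", "C"], "data": range(6)}
)

# split: group rows by "key"; apply: sum "data" within each group;
# combine: the per-group sums come back as one Series indexed by key
totals = df.groupby("key")["data"].sum()

print(totals)
```

Swapping `sum()` for `mean()`, `agg(...)`, or `apply(...)` changes the apply step without touching the split or combine steps.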
Chapter 4: Matplotlib
Python’s data visualization landscape is complicated. Matplotlib is one of the oldest Python plotting libraries. It originally started with the goal of enabling MATLAB-style plotting. As Jake says, the newly available data visualization options make Matplotlib feel clunky and old-fashioned. Still, I’m of the opinion that we cannot ignore Matplotlib’s strength as a well-tested, cross-platform graphics engine.
The fourth chapter offers an in-depth view of getting started with and using Matplotlib. For new users, starting with Matplotlib can be painful, especially when you don’t know the basic tips for loading the library and making the plot you created visible on the screen. Yes, I have been burned by this many times 🙂 This chapter starts with useful Matplotlib tips so that you don’t suffer the initial pains.
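Here is a rough sketch of the kind of getting-started pattern I mean. It is my own minimal example, not the book’s code. I am assuming a non-interactive backend (`Agg`) and an in-memory buffer so that it runs anywhere; in a script you would normally call `plt.show()` instead, and in a notebook you would typically run `%matplotlib inline` first.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so no display is needed

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Create a figure and one set of axes, then draw two labeled curves
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.legend()

# Render to PNG in memory; with an interactive backend, plt.show()
# would open a window with the same figure instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

The conventional import line, `import matplotlib.pyplot as plt`, and knowing which of `plt.show()` or `%matplotlib inline` applies to your setup cover most of the initial confusion.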
Chapter 5: Machine Learning
The final chapter is on Machine Learning, one of the most useful data science tools. This chapter introduces the basic concepts of Machine Learning and gives a great introduction to Python’s popular Machine Learning library, scikit-learn.
The Machine Learning chapter covers the two big categories of Machine Learning: supervised learning and unsupervised learning. For supervised learning, the book goes over examples of classification and regression problems. For unsupervised learning, it covers clustering and dimensionality reduction techniques.
One of the good things about how Jake presents Machine Learning algorithms with Python in this book is that he starts off by showing how to use a specific ML approach on a realistic data set using the scikit-learn API, and then takes a deep dive into many common Machine Learning approaches with the goal of building an intuition for the methods and for when and how to use them.
For example, the final chapter covers in depth many ML topics like Naive Bayes classification, linear regression, and Principal Component Analysis.
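To give a flavor of the scikit-learn API, here is a minimal sketch of the usual pattern: choose a model, fit it, then predict. It uses linear regression on synthetic data. The data, the random seed, and the variable names are my own assumptions and only serve as an illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy line with slope 2 and intercept -1 (assumed values)
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

# scikit-learn expects a 2D features matrix of shape (n_samples, n_features)
X = x[:, np.newaxis]

# 1. choose a model and its hyperparameters
model = LinearRegression(fit_intercept=True)

# 2. fit the model to the data
model.fit(X, y)

# 3. predict on new inputs
x_new = np.linspace(-1, 11, 5)
y_pred = model.predict(x_new[:, np.newaxis])

print(model.coef_, model.intercept_)
```

The same fit and predict steps apply to most scikit-learn estimators, so swapping in a different model usually means changing only the import and the constructor line.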
Overall, Python Data Science Handbook by Jake VanderPlas is a fantastic book on Data Science in Python, and there is no need to hesitate to get a copy of it. BTW, the whole book is available online for free as well, if you don’t want to spend money on a physical copy.
Not just that, the @pydatasci Twitter account has some great tips for doing data science with Python. Here is one nice tip I learned:
A favorite Jupyter notebook trick: for cleaner presentation, suppress matplotlib's object repr output with a semicolon. pic.twitter.com/tC8lNvN75a
— Python Data Science (@pydatasci) June 1, 2017
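Since the picture from the tweet does not show up here, this is a small sketch of what the tip looks like in practice. It is my own example, with made-up numbers. The `Agg` backend line is only an assumption so that the snippet also runs as a plain script, where the semicolon has no visible effect.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# As the last line of a notebook cell, this would echo the returned list
# of Line2D objects (its repr) above the figure:
ax.plot([1, 2, 3], [1, 4, 9])

# With a trailing semicolon, IPython/Jupyter does not display the return
# value, so only the figure appears:
ax.plot([1, 2, 3], [1, 8, 27]);
```

The semicolon only affects the last expression in a cell, since that is the only value the notebook would display anyway.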