Altair 4.0 is here with a lot of new features. Altair is one of the newest data visualization libraries in Python, built on a grammar of interactive graphics, and it is one of my favorites. It was not long ago, but I still remember the first time I saw an Altair plot (a chart, in "Altair-speak") and was pretty impressed by how clean it looked. About 18 months ago, I tried out Altair version 2.0 and wrote an introductory post on Altair. Unfortunately, I have not had a chance to go back and use Altair more often.
If you are new to Altair, it is a data visualization package in Python: a wrapper around the Vega/Vega-Lite libraries for quickly making statistical visualizations. Altair is developed by Jake VanderPlas, the author of the Python Data Science Handbook, and Brian Granger, a core contributor to the IPython Notebook and a leader of the Project Jupyter team.
Altair 4.0 is released! https://t.co/PCyrIOTcvv
Try it with: pip install -U altair
The full list of changes is at https://t.co/roXmzcsT58 …read on for some highlights. pic.twitter.com/vWJ0ZveKbZ
— Jake VanderPlas (@jakevdp) December 11, 2019
In the last two years, Altair has moved on to version 4.0 with a lot of changes. Thanks to the new release and the holiday time, I was able to try out Altair 4.0. This post is a quick introduction to Altair that checks out some useful new features in Altair 4.0. As I said, it is another basic introduction to Altair; it does not even touch one of the highlights of Altair, i.e. how easy it makes creating interactive visualizations in Python.
In this post, we will see examples of three new features of Altair
* How to create a bar plot in Altair and increase the size of the plot with the new Altair feature?
* How to make a scatter plot in Altair and add different types of regression lines to the scatter plot?
* How to make boxplots with Altair?
For each example, we will start with a really basic plot and add new features to make the plot better and to understand Altair's functions.
Let us first install Altair 4.0. On a new MacBook Air the installation was a breeze with
pip install -U altair
Let us import packages we need, including Altair, Pandas and Numpy.
import altair as alt
import pandas as pd
import numpy as np
print(alt.__version__)
4.0.0
We will use the gapminder data to make plots with Altair.
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
We will transform one of the variables, gdpPercap, with log2 scaling to make the relationship with lifeExp linear. And the modified dataframe looks like this.
gapminder['log2_gdpPercap'] = np.log2(gapminder['gdpPercap'])
gapminder.head()
       country  year         pop continent  lifeExp   gdpPercap  log2_gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314        9.606304
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030        9.680980
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710        9.736572
Simple Barplot with Altair
Let us make a simple barplot with Altair using the gapminder data. Here we want to plot the number of country entries for each continent.
Let us use the Pandas groupby() function to count the number of country entries per continent.
df = gapminder.groupby("continent").count()['country']
df = df.to_frame().reset_index()
We now have a simple data frame with the number of country entries per continent.
df
  continent  country
0    Africa      624
1  Americas      300
2      Asia      396
3    Europe      360
4   Oceania       24
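Note that count() tallies rows, so the numbers above are country-year entries rather than distinct countries. A minimal sketch of the difference, using a tiny made-up data frame (the toy data and its values are assumptions for illustration, not taken from gapminder):

```python
import pandas as pd

# Toy stand-in for gapminder: two years of data for three countries.
toy = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Ghana", "Ghana", "Japan", "Japan"],
    "continent": ["Africa", "Africa", "Africa", "Africa", "Asia", "Asia"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
})

# count() tallies rows per continent (one row per country per year).
entries = toy.groupby("continent")["country"].count().to_frame().reset_index()

# nunique() tallies distinct countries per continent.
countries = toy.groupby("continent")["country"].nunique().to_frame().reset_index()

print(entries)
print(countries)
```

If distinct countries are what you want on the y-axis, swapping count() for nunique() in the groupby step should be enough.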
Let us make a barplot using Altair first. On the x-axis we will have continents and on the y-axis we will have the number of countries/entries for the continent.
We provide the dataframe df to the Chart function in Altair, add bars with the mark_bar() function, and specify the variables for the plot with the encode() function. At the most basic level, an Altair plot needs these three pieces of information: the data, the plot type, and the variables to be plotted.
We can save the Altair object as a variable.
# simple barplot with Altair
barplot_altair = alt.Chart(df).mark_bar().encode(
    x='continent',
    y='country'
)
# plot bar chart
barplot_altair
We get the most basic barplot from Altair. A quick look at it shows a number of things that can be improved.
First, the plot is pretty squished and we need to increase its width.
We can change the size of the plot quite easily using the width and height properties of the chart object as follows.
In this example, we have specified the height and width as arguments to the properties() function.
barplot_altair = alt.Chart(df).mark_bar().encode(
    x='continent',
    y='country'
).properties(height=300, width=450)
barplot_altair
Now the barplot looks better than the default barplot from Altair.
Still, the bars are a bit too thick. We can adjust the thickness using the size argument to mark_bar().
Here we have set the size to 50.
alt.Chart(df).mark_bar(size=50).encode(
    x='continent',
    y='country'
).properties(height=300, width=450)
Simple Scatter Plots with Altair
Adding a regression line to a scatter plot can be useful in understanding the relationship between two quantitative variables.
Let us first start with making a simple scatter plot using Altair. In this example, we will use the lifeExp and log2_gdpPercap variables from the gapminder data to make scatter plots.
scatter_plot1_altair = alt.Chart(gapminder).mark_point().encode(
    x='lifeExp',
    y='log2_gdpPercap'
)
scatter_plot1_altair
The default scatter plot made with Altair wastes a bit of real estate: the x and y axes cover ranges where there is no data.
Let us change the x-axis and y-axis ranges of the scatter plot. The way to change the axis ranges is to use alt.Scale inside each axis definition. For any change we want to make to an axis, we use alt.X or alt.Y, and inside it we use alt.Scale with the domain argument as below.
Scatter_Plot_Altair = alt.Chart(gapminder).mark_point().encode(
    x=alt.X('lifeExp', scale=alt.Scale(domain=(20, 90))),
    y=alt.Y('log2_gdpPercap', scale=alt.Scale(domain=(6, 18)))
)
Scatter_Plot_Altair
Now the scatter plot made with Altair looks much better. We can see a nice linear trend from the scatter plot.
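Rather than hard-coding the domain, one could derive it from the data. Here is a minimal sketch, assuming a hypothetical helper padded_domain() (my own name, not part of Altair) that pads the observed range by a few percent. The toy values are made up for illustration.

```python
import pandas as pd

def padded_domain(values, pad=0.05):
    """Return [low, high] covering the data plus a small margin on each side.

    Hypothetical helper for illustration; it is not an Altair function.
    """
    lo, hi = float(values.min()), float(values.max())
    margin = (hi - lo) * pad
    return [lo - margin, hi + margin]

# Made-up values standing in for a gapminder column.
toy = pd.DataFrame({"lifeExp": [28.8, 45.0, 82.6]})
domain = padded_domain(toy["lifeExp"])
print(domain)
```

The result could then be passed as alt.Scale(domain=padded_domain(gapminder['lifeExp'])). Another option is alt.Scale(zero=False), which stops Altair from forcing zero into the axis range.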
How To Add Regression Line to Scatter Plots?
Let us add a regression line to the scatter plot with Altair. Altair offers multiple options for fitting the data and adding a regression line. Altair's transform_regression() function fits regression models to smooth and predict data.
It can fit multiple models for input data (one per group) and generate new data objects that represent points for summary trend lines. Alternatively, this transform can be used to generate a set of objects containing regression model parameters, one per group.
This transform supports parametric models for the following functional forms:
linear (linear): y = a + b * x
logarithmic (log): y = a + b * log(x)
exponential (exp): y = a * e^(b * x)
power (pow): y = a * x^b
quadratic (quad): y = a + b * x + c * x^2
polynomial (poly): y = a + b * x + … + k * x^order
Let us fit a simple linear regression to our scatter plot. We add transform_regression() as an additional layer on top of the scatter plot object we created above. We need to provide the two variables for the regression and specify the regression method using the method argument.
In this example, we differentiate the linear regression line from the data points with a color.
Scatter_Plot_Altair + Scatter_Plot_Altair.transform_regression(
    'lifeExp', 'log2_gdpPercap', method="linear"
).mark_line(color="red")
We have fitted a linear regression to the data from the scatter plot and added the regression line in red, and it looks like this.
Looking at the scatter plot, we can see that the trend is not a simple linear one. We might try polynomial regression instead and add the regression line using method="poly" instead of method="linear".
Scatter_Plot_Altair + Scatter_Plot_Altair.transform_regression(
    'lifeExp', 'log2_gdpPercap', method="poly"
).mark_line(color="red")
The polynomial fit line on the scatter plot looks better for the data.
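To get a feel for why a polynomial can track curved data better than a straight line, here is a small sketch outside Altair using numpy.polyfit on synthetic data. The data are made up. The choice of order 3 is an assumption on my part, based on my understanding that method="poly" defaults to a cubic.

```python
import numpy as np

# Synthetic, gently curved data (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(20, 90, 200)
y = 6 + 0.002 * (x - 20) ** 2 + rng.normal(0, 0.3, x.size)

def rss(order):
    """Residual sum of squares of a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, order)
    fitted = np.polyval(coeffs, x)
    return float(np.sum((y - fitted) ** 2))

rss_linear = rss(1)  # straight line
rss_cubic = rss(3)   # cubic polynomial
print(rss_linear, rss_cubic)
```

A smaller residual sum of squares for the cubic means it follows the curvature more closely, though a higher order is not automatically a better model.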
How To Color Data Points by a variable in Altair?
Let us add colors to the data points of the scatter plot using a variable in the data set. To color data points by a variable, we assign the name of that variable to the color argument inside encode().
In this example, we want to color each data point based on the continent it corresponds to.
color_by_variable = alt.Chart(gapminder).mark_point().encode(
    x=alt.X('lifeExp', scale=alt.Scale(domain=(20, 90))),
    y=alt.Y('log2_gdpPercap', scale=alt.Scale(domain=(6, 18))),
    color='continent'
)
# Altair plot color by variable
color_by_variable
Now we have colored the points on the scatter plot by continent.
How To Add a Regression Line to Each Group in Altair?
Let us add a regression line to each group of data points in a scatter plot. In our example, we have colored the data points by continent. Now we want to add a separate regression line for each continent's data points.
We can do that using the transform_regression() function with the groupby argument, in addition to the variables that we want to use for regression modelling.
In our example, we want to add a linear regression line for each continent, so we specify groupby=['continent'].
We also specify the thickness of the regression lines.
The transform_regression() function performs a linear regression fit by default, so we do not specify the method here.
color_by_variable + color_by_variable.transform_regression(
    'lifeExp', 'log2_gdpPercap', groupby=['continent']
).mark_line(size=4)
Now we have a scatter plot with multiple regression lines, one for each continent.
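Conceptually, the groupby argument fits one model per group. A rough equivalent outside Altair, sketched with pandas and numpy.polyfit on a small made-up data frame (the values are illustrative, not from gapminder, and this is not how Altair computes the fit internally):

```python
import numpy as np
import pandas as pd

# Made-up data: two groups with different linear trends.
toy = pd.DataFrame({
    "continent": ["A"] * 4 + ["B"] * 4,
    "lifeExp": [40, 50, 60, 70, 40, 50, 60, 70],
    "log2_gdpPercap": [8.0, 9.0, 10.0, 11.0, 9.0, 9.5, 10.0, 10.5],
})

# Fit a separate straight line for each group.
rows = []
for name, group in toy.groupby("continent"):
    slope, intercept = np.polyfit(group["lifeExp"], group["log2_gdpPercap"], 1)
    rows.append({"continent": name, "slope": slope, "intercept": intercept})

fits = pd.DataFrame(rows).set_index("continent")
print(fits)
```

Each row of fits corresponds to one of the lines that a grouped regression would draw.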
Boxplots with Altair
The last time I tried Altair, version 2.0, it did not have a function to make boxplots, so I had to hack a way to make one.
Hurray, Altair 4.0 has a function to make boxplots. It is called mark_boxplot().
Let us make a boxplot with the mark_boxplot() function.
simple_boxplot = alt.Chart(gapminder).mark_boxplot().encode(
    x='continent:O',
    y='lifeExp'
)
simple_boxplot
Here is how the simple boxplot looks. The boxes are a bit tiny.
We can change the size of the boxes using the size argument inside mark_boxplot(). We also color the boxes by a variable from the gapminder dataset using color='continent'.
boxplot_altair = alt.Chart(gapminder).mark_boxplot(size=50).encode(
    x='continent:O',
    y=alt.Y('lifeExp', scale=alt.Scale(domain=(20, 90))),
    color='continent'
)
boxplot_altair
And here is a better looking boxplot made with Altair using gapminder data.
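For reference, the numbers a boxplot summarizes can be computed directly with pandas. Here is a minimal sketch on made-up values. My understanding is that mark_boxplot() by default draws whiskers to the most extreme points within 1.5 times the interquartile range; the code mirrors that as an assumption, and quartile conventions can differ slightly between libraries.

```python
import pandas as pd

# Made-up life expectancy values, including one extreme point.
values = pd.Series([30, 42, 45, 50, 52, 55, 58, 60, 63, 95], name="lifeExp")

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Tukey-style fences: points beyond them are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", inside.min(), inside.max())
print("outliers:", outliers.tolist())
```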
The boxplot definitely looks better. The next step is to add jittered data points over the boxplot. That is for next time.
To summarize, after a while away I tried the latest version, Altair 4.0.0, to make the three statistical data visualizations I use most often, and Altair did not disappoint me. There are a lot of good changes since version 2.0. I am hoping to use it more often than last time 🙂 .
Happy Holidays Everyone.