Altair 4.0 is here: Barplots, Scatter Plots with Regression Line and Boxplots

Altair 4.0 is here with a lot of new features. Altair is one of the newest data visualization libraries in Python, built on a grammar of interactive graphics, and it is one of my favorites. It was not long ago, but I still remember the first time I saw an Altair plot (a “chart” in Altair-speak) and being pretty impressed by how clean it looked. About 18 months ago, I tried out Altair version 2.0 and wrote an introductory post on Altair. Unfortunately, I have not had a chance to go back and use Altair more often.

If you are new to Altair, it is a data visualization package in Python, a kind of wrapper around the Vega/Vega-Lite library for quickly making statistical visualizations. Altair is developed by Jake VanderPlas, the author of the Python Data Science Handbook, and Brian Granger, a core contributor to the IPython Notebook and a leader of the Project Jupyter team.

In the last two years, Altair has moved on to version 4.0 with a lot of changes. Thanks to the new release and the holiday time, I was able to try out Altair 4.0. This post is a quick introduction that checks out some useful features of Altair 4.0. As I said, it is another basic introduction; it does not even touch one of the highlights of Altair, namely how easy it makes creating interactive visualizations in Python.

In this post, we will see examples of three features of Altair:
* How to create a bar plot with Altair and increase the size of the plot with the new Altair feature?
* How to make a scatter plot in Altair and add different types of regression lines to it?
* How to make boxplots with Altair?
For each example, we will start with a really basic plot and add new features to make the plot better and to understand Altair’s functions.

Let us first install Altair 4.0. On a new MacBook Air the installation was a breeze with

 
pip install -U altair

Let us import packages we need, including Altair, Pandas and Numpy.

 
import altair as alt
import pandas as pd
import numpy as np
print(alt.__version__)
4.0.0

We will use the gapminder data to make plots with Altair.

 
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)

We will transform one of the variables, gdpPercap, with log2 scaling to make the relationship with lifeExp linear. And the modified dataframe looks like this.

 
gapminder['log2_gdpPercap']=np.log2(gapminder['gdpPercap'])
gapminder.head()
	country	year	pop	continent	lifeExp	gdpPercap	log2_gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314	9.606304
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030	9.680980
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710	9.736572
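To see why a log2 scale is convenient here, note that every doubling of GDP per capita adds 1 on the log2 scale. Here is a minimal sketch using made-up values rather than the gapminder data (the `toy` dataframe is hypothetical):

```python
import numpy as np
import pandas as pd

# Illustrative values only (not taken from gapminder): each one doubles the previous
toy = pd.DataFrame({"gdpPercap": [500.0, 1000.0, 2000.0, 4000.0]})
toy["log2_gdpPercap"] = np.log2(toy["gdpPercap"])

# Differences between consecutive rows on the log2 scale
steps = toy["log2_gdpPercap"].diff().dropna()
print(steps.tolist())
```

Multiplicative differences in income become evenly spaced steps, which is what makes the relationship with lifeExp easier to see in a scatter plot.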

Simple Barplot with Altair

Let us make a simple barplot with Altair using the gapminder data. Here we want to plot the number of countries for each continent.

Let us use the Pandas groupby() function to count the number of countries per continent.

 
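# count the country entries in each continent
df = gapminder.groupby("continent").count()['country']
# convert the resulting Series back to a dataframe with a continent column
df = df.to_frame().reset_index()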

We have a simple data frame with the number of countries per continent.

 
df

	continent	country
0	Africa	624
1	Americas	300
2	Asia	396
3	Europe	360
4	Oceania	24
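One thing to keep in mind is that count() counts rows, so a country that appears in several years is counted once per year. If the goal is the number of distinct countries per continent, nunique() is one option. Here is a small sketch on made-up rows (the `toy` dataframe is hypothetical and only mimics the shape of gapminder):

```python
import pandas as pd

# Toy rows in the same shape as gapminder (values are illustrative only)
toy = pd.DataFrame({
    "country":   ["Kenya", "Kenya", "Ghana", "Japan", "Japan", "Japan"],
    "continent": ["Africa", "Africa", "Africa", "Asia", "Asia", "Asia"],
    "year":      [1952, 1957, 1952, 1952, 1957, 1962],
})

# count() returns the number of rows (country-year entries) per continent
rows_per_continent = toy.groupby("continent")["country"].count()

# nunique() returns the number of distinct countries per continent
countries_per_continent = toy.groupby("continent")["country"].nunique()

print(rows_per_continent.to_dict())
print(countries_per_continent.to_dict())
```

Either result can be turned into a plotting dataframe the same way, with to_frame().reset_index().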

Let us make a barplot using Altair first. On the x-axis we will have continents and on the y-axis we will have the number of countries/entries for the continent.

We provide the dataframe df to the Chart function in Altair, add bars with the mark_bar() function, and specify the variables for the plot with the encode() function. At the most basic level, an Altair plot needs these three pieces of information: the data, the plot type, and the variables to be plotted.

We can save the Altair object as a variable.

 
# simple barplot with Altair
barplot_altair = alt.Chart(df).mark_bar().encode(
    x='continent',
    y='country'
)
# plot bar chart
barplot_altair

We get the most basic barplot from Altair. A quick look at it shows a number of things that can be improved.

First, the plot is pretty squished and we need to increase the width of the plot.

Default Barplot with Altair

We can change the size of the plot quite easily using the width and height properties of the chart object as follows.

In this example, we have specified the height and width as arguments to the properties() function.

 
barplot_altair = alt.Chart(df).mark_bar().encode(
    x='continent',
    y='country'
).properties(height=300,width=450)
barplot_altair

Now the resized barplot looks better than the default barplot from Altair.

How To Change Size Barplot Altair?

Still, the thickness of the bars is a bit too large. We can adjust the thickness using the size argument to mark_bar().

Here we have set the size to 50.

 
alt.Chart(df).mark_bar(size=50).encode(
    x='continent',
    y='country'
).properties(height=300,width=450)

Simple Scatter Plots with Altair

Adding a regression line to a scatter plot can be useful in understanding the relationship between two quantitative variables.

Let us start by making a simple scatter plot using Altair. In this example, we will use the lifeExp and log2_gdpPercap variables from the gapminder data.

 
scatter_plot1_altair = alt.Chart(gapminder).mark_point().encode(
    x='lifeExp',
    y='log2_gdpPercap'
)
scatter_plot1_altair

The default scatter plot made with Altair wastes a bit of real estate: the axes cover x and y ranges where there is no data.

Basic Scatter Plot Altair

Let us change the x-axis and y-axis ranges of the scatter plot. For any change we want to make to an axis, we use the alt.X or alt.Y function. To change the range, we pass alt.Scale with the domain argument inside that function, as below.

 
Scatter_Plot_Altair = alt.Chart(gapminder).mark_point().encode(
    x=alt.X('lifeExp', scale=alt.Scale(domain=(20, 90))),
    y=alt.Y('log2_gdpPercap',scale=alt.Scale(domain=(6, 18)))
)
Scatter_Plot_Altair 

Now the scatter plot made with Altair looks much better. We can see a nice linear trend from the scatter plot.

Adjusting Ranges Scatter Plot Altair

How To Add Regression Line to Scatter Plots?

Let us add a regression line to the scatter plot with Altair. Altair offers multiple options for fitting the data and adding a regression line. Altair’s transform_regression() function fits regression models to smooth and predict data. According to the documentation, it can fit multiple models for input data (one per group) and generate new data objects that represent points for summary trend lines. Alternatively, this transform can be used to generate a set of objects containing regression model parameters, one per group.

This transform supports parametric models for the following functional forms:

linear (linear): y = a + b * x
logarithmic (log): y = a + b * log(x)
exponential (exp): y = a * e^(b * x)
power (pow): y = a * x^b
quadratic (quad): y = a + b * x + c * x^2
polynomial (poly): y = a + b * x + … + k * x^order

Let us fit a simple linear regression to our scatter plot. We add transform_regression() as an additional layer on top of the scatter plot object we created above. We need to provide the two variables for the regression and specify the regression method using the method argument.

In this example, we differentiate the linear regression line from the data points with a color.

 
# layer the fitted line on top of the scatter plot with +
Scatter_Plot_Altair + Scatter_Plot_Altair.transform_regression(
    'lifeExp', 'log2_gdpPercap', method="linear"
).mark_line(color="red")

We have fitted a linear regression to the data in the scatter plot and added the regression line in red. It looks like this:

Scatter Plot with Regression Line Altair
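For intuition about what a linear fit computes, the same kind of line can be reproduced outside Altair with NumPy. This is a rough cross-check on made-up points, assuming an ordinary least-squares fit of y = a + b * x. It is not Altair's own code path, and the arrays below are hypothetical:

```python
import numpy as np

# Toy points that lie roughly on a line (illustrative values only)
x = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
y = np.array([8.1, 8.9, 10.2, 10.8, 12.1, 12.9])

# Degree-1 least-squares fit; polyfit returns coefficients highest degree first
b, a = np.polyfit(x, y, deg=1)

# Predicted values along the fitted line y = a + b * x
y_hat = a + b * x
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```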

We can see from the scatter plot that the trend is not a simple linear one. We might try a polynomial regression and add its fitted line using method="poly" instead of method="linear".

 
# parentheses let the layered expression span several lines
(Scatter_Plot_Altair +
 Scatter_Plot_Altair.transform_regression(
     'lifeExp', 'log2_gdpPercap', method="poly"
 ).mark_line(color="red"))

The polynomial fit line on the scatter plot looks better for the data.

Scatter Plot with Polynomial Regression Line with Altair

How To Color Data Points by a variable in Altair?

Let us color the data points of the scatter plot using a variable in the data set. To color data points by a variable, we assign the variable name to the color argument inside encode().

In this example, we want to color each data point by the continent it belongs to.

 
color_by_variable = alt.Chart(gapminder).mark_point().encode(
    x=alt.X('lifeExp', scale=alt.Scale(domain=(20, 90))),
    y=alt.Y('log2_gdpPercap',scale=alt.Scale(domain=(6, 18))),
    color='continent'
)

# Altair plot color by variable
color_by_variable 

Now we have colored the points on the scatter plot by continent.

Scatter Plot Color By Variable: Altair

How To Add Regression Line Each Group in Altair?

Let us add a regression line to each group of data points in a scatter plot. In our example, we have colored the data points by continent. Now we want to add a separate regression line for each continent’s data points.

We can do that using the transform_regression() function with the groupby argument, in addition to the variables that we want to use for regression modelling.

In our example, we want to add a linear regression line for each continent, so we specify groupby=['continent'].

We also specify the thickness of the regression lines with size=4.

The transform_regression() function performs a linear regression fit by default, so we do not specify the method.

 
color_by_variable + color_by_variable.transform_regression('lifeExp', 'log2_gdpPercap', 
        groupby=['continent']).mark_line(size=4)

Now, we have scatter plot with multiple regression lines, one for each continent.

Scatter Plot Colored Regression Line per Group
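Conceptually, a grouped regression fits one model per group. The idea can be sketched in pandas and NumPy on made-up data, assuming a least-squares line per group (the `toy` dataframe and the `fit_line` helper are hypothetical, not part of Altair):

```python
import numpy as np
import pandas as pd

# Toy data with two groups (illustrative values only)
toy = pd.DataFrame({
    "continent": ["A"] * 4 + ["B"] * 4,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 40.0, 50.0, 60.0, 70.0],
    "log2_gdpPercap": [8.0, 9.0, 10.0, 11.0, 10.0, 10.5, 11.0, 11.5],
})

def fit_line(group):
    # Least-squares line for one group; polyfit returns (slope, intercept)
    slope, intercept = np.polyfit(group["lifeExp"], group["log2_gdpPercap"], deg=1)
    return {"intercept": float(intercept), "slope": float(slope)}

# One fitted line per continent, keyed by the group name
fits = {name: fit_line(group) for name, group in toy.groupby("continent")}
print(fits)
```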

Boxplots with Altair

The last time I tried Altair (version 2.0), it did not have a function to make boxplots, so I had to hack a way to make one.

Hurray, Altair 4.0 has a function to make boxplots. It is called mark_boxplot().

Let us make a boxplot with the mark_boxplot() function.

 
simple_boxplot = alt.Chart(gapminder).mark_boxplot().encode(
    x='continent:O',
    y='lifeExp')
simple_boxplot

Here is how the simple boxplot looks. The boxes are a bit tiny.

Simple Boxplot with Altair
 

We can change the size of the boxes using the size argument inside mark_boxplot(). We also color the boxes by a variable from the gapminder dataset using color='continent'.

boxplot_altair= alt.Chart(gapminder).mark_boxplot(size=50).encode(
    x='continent:O',
    y=alt.Y('lifeExp', scale=alt.Scale(domain=(20, 90))),
    color='continent'
)
boxplot_altair

And here is a better looking boxplot made with Altair using gapminder data.

Coloring Boxplots in Altair
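To read a boxplot, it helps to know which numbers it draws. The sketch below computes a box (quartiles), Tukey-style whiskers at 1.5 times the interquartile range, and the outliers for a made-up series. It assumes the 1.5 * IQR convention, which as far as I know is the default for mark_boxplot(). The values are hypothetical, and pandas may interpolate quartiles slightly differently from Vega-Lite:

```python
import pandas as pd

# Toy life-expectancy values for one group (illustrative only)
values = pd.Series([30.0, 52.0, 55.0, 58.0, 60.0, 63.0, 66.0, 90.0])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the furthest points that lie within 1.5 * IQR of the box
lower = values[values >= q1 - 1.5 * iqr].min()
upper = values[values <= q3 + 1.5 * iqr].max()

# Points beyond the whiskers are drawn individually as outliers
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(q1, median, q3, lower, upper, outliers.tolist())
```

mark_boxplot() also accepts an extent argument. For example, extent='min-max' draws the whiskers to the minimum and maximum instead.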

The boxplot definitely looks better. The next step is to add jittered data points over the boxplot. That is for next time.

To summarize, I tried the latest version of Altair, 4.0.0, after a while, to make the three statistical visualizations I use most often. And Altair did not disappoint me. There are a lot of good changes since version 2.0. I am hoping to use it more often than last time 🙂 .

Happy Holidays Everyone.
