Plotting#

As we discovered in the Introduction, HoloViews allows plotting a variety of data types. Here we will use the sample data module and load the pandas and dask hvPlot API:

import numpy as np
import hvplot.pandas  # noqa
import hvplot.dask  # noqa

As we learned, the hvPlot API closely mirrors the Pandas plotting API, but instead of generating static images when used in a notebook, it uses HoloViews to generate either static or dynamically streaming Bokeh plots. Static plots can be used in any context, while streaming plots require a live Jupyter notebook, a deployed Bokeh Server app, or a deployed Panel app.

HoloViews provides an extensive, very rich set of objects along with a powerful set of operations to apply, as you can find out in the HoloViews User Guide. But here we will focus on the most essential mechanisms needed to make your data visualizable, without having to worry about the mechanics going on behind the scenes.

We will be focusing on two different datasets:

  • A small CSV file of US crime data, broken down by state

  • A larger Parquet-format file of airline flight data

The hvplot.sample_data module makes these datasets available as an Intake data catalogue, which we can load either using pandas:

from hvplot.sample_data import us_crime, airline_flights

crime = us_crime.read()
print(type(crime))
crime.head()
<class 'pandas.core.frame.DataFrame'>
Year Population Violent crime total Murder and nonnegligent Manslaughter Legacy rape /1 Revised rape /2 Robbery Aggravated assault Property crime total Burglary ... Violent Crime rate Murder and nonnegligent manslaughter rate Legacy rape rate /1 Revised rape rate /2 Robbery rate Aggravated assault rate Property crime rate Burglary rate Larceny-theft rate Motor vehicle theft rate
0 1960 179323175 288460 9110 17190 NaN 107840 154320 3095700 912100 ... 160.9 5.1 9.6 NaN 60.1 86.1 1726.3 508.6 1034.7 183.0
1 1961 182992000 289390 8740 17220 NaN 106670 156760 3198600 949600 ... 158.1 4.8 9.4 NaN 58.3 85.7 1747.9 518.9 1045.4 183.6
2 1962 185771000 301510 8530 17550 NaN 110860 164570 3450700 994300 ... 162.3 4.6 9.4 NaN 59.7 88.6 1857.5 535.2 1124.8 197.4
3 1963 188483000 316970 8640 17650 NaN 116470 174210 3792500 1086400 ... 168.2 4.6 9.4 NaN 61.8 92.4 2012.1 576.4 1219.1 216.6
4 1964 191141000 364220 9360 21420 NaN 130390 203050 4200400 1213200 ... 190.6 4.9 11.2 NaN 68.2 106.2 2197.5 634.7 1315.5 247.4

5 rows × 22 columns

Or using dask as a dask.DataFrame:

flights = airline_flights.to_dask().persist()
print(type(flights))
flights.head()
<class 'dask.dataframe.core.DataFrame'>
year month day dayofweek dep_time crs_dep_time arr_time crs_arr_time carrier flight_num ... taxi_in taxi_out cancelled cancellation_code diverted carrier_delay weather_delay nas_delay security_delay late_aircraft_delay
0 2008 11 15 6 1411.0 1420 1535.0 1546 OO 4391 ... 5.0 11.0 0 None 0 NaN NaN NaN NaN NaN
1 2008 11 28 5 1222.0 1230 1345.0 1356 OO 4391 ... 5.0 15.0 0 None 0 NaN NaN NaN NaN NaN
2 2008 11 22 6 1414.0 1420 1540.0 1546 OO 4391 ... 5.0 10.0 0 None 0 NaN NaN NaN NaN NaN
3 2008 11 15 6 1304.0 1305 1507.0 1519 OO 4392 ... 10.0 9.0 0 None 0 NaN NaN NaN NaN NaN
4 2008 11 22 6 1323.0 1305 1536.0 1519 OO 4392 ... 5.0 21.0 0 None 0 0.0 0.0 0.0 0.0 17.0

5 rows × 29 columns

The plot interface#

The dask.dataframe.DataFrame.hvplot, pandas.DataFrame.hvplot and intake.DataSource.plot interfaces (and their Series equivalents) from hvPlot provide a powerful high-level API to generate complex plots. The .hvplot API can be called directly or used as a namespace to generate specific plot types.

The plot method#

The most explicit way to use the plotting API is to specify the names of columns to plot on the x- and y-axis respectively:

crime.hvplot.line(x='Year', y='Violent Crime rate')

As you’ll see in more detail below, you can choose which kind of plot you want to use for the data:

crime.hvplot(x='Year', y='Violent Crime rate', kind='scatter')

To group the data by one or more additional columns, specify an additional by variable. As an example here we will plot the departure delay (‘depdelay’) as a function of ‘distance’, grouping the data by the ‘carrier’. There are many available carriers, so we will select only two of them so that the plot is readable:

flight_subset = flights[flights.carrier.isin(['OH', 'F9'])]
flight_subset.hvplot(x='distance', y='depdelay', by='carrier', kind='scatter', alpha=0.2, persist=True)

Here we have specified the x axis explicitly, which can be omitted if the Pandas index column is already the desired x axis. Similarly, here we specified the y axis; by default all of the non-index columns would be plotted (which would be a lot of data in this case). If you don’t specify the ‘y’ axis, it will have a default label named ‘value’, but you can then provide a y axis label explicitly using the value_label option.

Putting all of this together we will plot violent crime, robbery, and burglary rates on the y-axis, specifying ‘Year’ as the x, and relabel the y-axis to display the ‘Rate’.

crime.hvplot(x='Year', y=['Violent Crime rate', 'Robbery rate', 'Burglary rate'],
             value_label='Rate (per 100k people)')
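One way to picture what happens when y is a list of columns is a wide-to-long reshape: each selected column becomes a group of rows tagged with its column name, and the shared value axis takes the label passed as value_label. The sketch below uses a small made-up frame rather than the real crime data, and pandas melt is used only as an analogy for the reshaping, not as a description of hvPlot's internals. The helper names to_long and plot_rates are hypothetical.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the crime dataset.
rates = pd.DataFrame({
    'Year': [2000, 2001, 2002],
    'Robbery rate': [145.0, 148.5, 146.1],
    'Burglary rate': [728.8, 741.8, 747.0],
})

def to_long(df, x, ys, value_label='value'):
    """Reshape the wide columns in `ys` into (x, Variable, value) rows."""
    return df.melt(id_vars=x, value_vars=ys,
                   var_name='Variable', value_name=value_label)

def plot_rates(df):
    """Plot both rate columns against Year on one shared, relabelled y-axis."""
    # Imported lazily so the reshaping above runs without hvPlot installed.
    import hvplot.pandas  # noqa
    return df.hvplot(x='Year', y=['Robbery rate', 'Burglary rate'],
                     value_label='Rate (per 100k people)')

long_rates = to_long(rates, 'Year', ['Robbery rate', 'Burglary rate'],
                     value_label='Rate')
print(long_rates)
```

Calling plot_rates(rates) in a notebook should then draw one line per column name against a single y-axis.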

The hvplot namespace#

Instead of using the kind argument to the plot call, we can use the hvplot namespace, which lets us easily discover the range of plot types that are supported. Use tab completion to explore the available plot types:

crime.hvplot.<TAB>

Plot types available include:

  • .area(): Plots an area chart, similar to a line chart except that it fills the area under the curve and can optionally be stacked

  • .bar(): Plots a bar chart that can be stacked or grouped

  • .bivariate(): Plots 2D density of a set of points

  • .box(): Plots a box-whisker chart comparing the distribution of one or more variables

  • .heatmap(): Plots a heatmap to visualize a variable across two independent dimensions

  • .hexbin(): Plots hex bins

  • .hist(): Plots the distribution of one or more variables as a set of bins

  • .kde(): Plots the kernel density estimate of one or more variables.

  • .line(): Plots a line chart (such as for a time series)

  • .scatter(): Plots a scatter chart comparing two variables

  • .step(): Plots a step chart akin to a line plot

  • .table(): Generates a SlickGrid DataTable

  • .violin(): Plots a violin plot comparing the distribution of one or more variables using the kernel density estimate

Area#

Like most other plot types, the area chart supports the three ways of defining a plot outlined above. An area chart is most useful when plotting multiple variables in a stacked chart. This can be achieved by specifying x, y, and by columns or using the columns and index/use_index (equivalent to x) options:

crime.hvplot.area(x='Year', y=['Robbery', 'Aggravated assault'])

We can also explicitly set stacked to False and define an alpha value to compare the values directly:

crime.hvplot.area(x='Year', y=['Aggravated assault', 'Robbery'], stacked=False, alpha=0.4)

Another use for an area plot is to visualize the spread of a value. For instance, using the flights dataset we may want to see the spread in mean delay values across carriers. For that purpose we compute the mean delay by day and carrier, and then the min/max of those means across all carriers. Since the output of hvplot is just a regular HoloViews object, we can use the overlay operator (*) to place the plots on top of each other.

delay_min_max = flights.groupby(['day', 'carrier'])['carrier_delay'].mean().groupby('day').agg({'min': np.min, 'max': np.max})
delay_mean = flights.groupby('day')['carrier_delay'].mean()

delay_min_max.hvplot.area(x='day', y='min', y2='max', alpha=0.2) * delay_mean.hvplot()
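To make the two-stage aggregation easier to follow, here is a minimal pandas-only sketch on a tiny made-up table (two carriers, two days). The data values and the helper names delay_spread and plot_spread are assumptions for illustration. It uses agg(['min', 'max']) to name the output columns, which may differ from how the dask version above is spelled.

```python
import pandas as pd

# Made-up delays for two carriers over two days (illustration only).
toy = pd.DataFrame({
    'day': [1, 1, 1, 1, 2, 2, 2, 2],
    'carrier': ['AA', 'AA', 'BB', 'BB', 'AA', 'AA', 'BB', 'BB'],
    'carrier_delay': [10.0, 20.0, 30.0, 50.0, 0.0, 10.0, 20.0, 40.0],
})

def delay_spread(df):
    """Min and max of the per-carrier mean delay, for each day."""
    # Stage 1: one mean per (day, carrier) pair.
    per_carrier = df.groupby(['day', 'carrier'])['carrier_delay'].mean()
    # Stage 2: across carriers, keep the smallest and largest mean per day.
    return per_carrier.groupby(level='day').agg(['min', 'max'])

def plot_spread(df):
    """Shaded min/max band overlaid with the overall daily mean."""
    # Imported lazily so delay_spread runs without hvPlot installed.
    import hvplot.pandas  # noqa
    band = delay_spread(df).hvplot.area(x='day', y='min', y2='max', alpha=0.2)
    mean = df.groupby('day')['carrier_delay'].mean().hvplot()
    return band * mean

spread = delay_spread(toy)
print(spread)
```

On day 1 the per-carrier means are 15 and 40, so the band for that day spans 15 to 40.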

Bars#

In the simplest case we can use .hvplot.bar to plot x against y. We’ll use rot=90 to rotate the tick labels on the x-axis making the years easier to read:

crime.hvplot.bar(x='Year', y='Violent Crime rate', rot=90)

If we want to compare multiple columns instead we can set y to a list of columns. Using the stacked option we can then compare the column values more easily:

crime.hvplot.bar(x='Year', y=['Violent crime total', 'Property crime total'],
                 stacked=True, rot=90, width=800, legend='top_left')

Scatter#

The scatter plot supports many of the same features as the other chart types we have seen so far but can also be colored by another variable using the c option.

crime.hvplot.scatter(x='Violent Crime rate', y='Burglary rate', c='Year')

Anytime that color is being used to represent a dimension, the cmap option can be used to control the colormap that is used to represent that dimension. Additionally, the colorbar can be disabled using colorbar=False.

Step#

A step chart is very similar to a line chart but instead of linearly interpolating between samples the step chart visualizes discrete steps. The point at which to step can be controlled via the where keyword allowing 'pre', 'mid' (default) and 'post' values:

crime.hvplot.step(x='Year', y=['Robbery', 'Aggravated assault'])
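The effect of the where keyword is easiest to see by asking at which x positions the curve jumps from one sample's value to the next. The sketch below follows the usual step-plot convention (the one Matplotlib documents for 'pre', 'mid' and 'post'); we assume hvPlot follows the same convention. The function step_change_points is a hypothetical helper, not part of hvPlot.

```python
import numpy as np

def step_change_points(x, where='mid'):
    """Return the x positions at which a step curve changes value."""
    x = np.asarray(x, dtype=float)
    if where == 'pre':
        # Each value extends to the left, so the jump sits at the earlier sample.
        return x[:-1]
    if where == 'post':
        # Each value is held until the next sample is reached.
        return x[1:]
    if where == 'mid':
        # The jump happens halfway between neighbouring samples.
        return (x[:-1] + x[1:]) / 2
    raise ValueError("where must be 'pre', 'mid' or 'post'")

for option in ('pre', 'mid', 'post'):
    print(option, step_change_points([0, 1, 2], option))
```

To try a non-default placement on the crime data, pass the keyword along with the other arguments, for example crime.hvplot.step(x='Year', y='Robbery', where='post').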

HexBins#

You can create hexagonal bin plots with the hexbin method. Hexbin plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually. Since these data are not regularly distributed, we’ll use the logz option to map the z-axis (color) to a log-scale colorbar.

flights.hvplot.hexbin(x='airtime', y='arrdelay', width=600, height=500, logz=True)

Bivariate#

You can create a 2D density plot with the bivariate method. Bivariate plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually.

crime.hvplot.bivariate(x='Violent Crime rate', y='Burglary rate', width=600, height=500)

HeatMap#

A HeatMap lets us view the relationship between three variables, so we specify the ‘x’ and ‘y’ variables and an additional ‘C’ variable. Additionally we can define a reduce_function that computes the values for each bin from the samples that fall into it. Here we plot the ‘depdelay’ (i.e. departure delay) for each day of the month and carrier in the dataset:

flights.compute().hvplot.heatmap(x='day', y='carrier', C='depdelay', reduce_function=np.mean, colorbar=True)
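A rough way to think about reduce_function is as the aggregation in a pivot table: every (x, y) cell collects the samples that fall into it and reduces them to one number. The pandas sketch below mirrors that idea on made-up data. It is an analogy rather than hvPlot's implementation, and heatmap_cells and plot_heatmap are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Made-up departure delays (illustration only).
toy = pd.DataFrame({
    'day': [1, 1, 1, 2, 2],
    'carrier': ['AA', 'AA', 'BB', 'AA', 'BB'],
    'depdelay': [10.0, 30.0, 5.0, 0.0, 15.0],
})

def heatmap_cells(df, x, y, C, aggfunc='mean'):
    """One aggregated value per (y, x) cell, laid out as a 2D grid."""
    return df.pivot_table(index=y, columns=x, values=C, aggfunc=aggfunc)

def plot_heatmap(df):
    """The corresponding hvPlot call, using the options shown above."""
    # Imported lazily so heatmap_cells runs without hvPlot installed.
    import hvplot.pandas  # noqa
    return df.hvplot.heatmap(x='day', y='carrier', C='depdelay',
                             reduce_function=np.mean, colorbar=True)

cells = heatmap_cells(toy, 'day', 'carrier', 'depdelay')
print(cells)
```

Here carrier 'AA' has two samples on day 1 (10 and 30), so its cell holds their mean, 20.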

Tables#

Unlike all other plot types, a table only supports one signature: either all columns are plotted, or a subset of columns can be selected by defining the columns explicitly:

crime.hvplot.table(columns=['Year', 'Population', 'Violent Crime rate'], width=400)

Distributions#

Plotting distributions differs slightly from other plots since they plot only one variable in the simple case rather than plotting two or more variables against each other. Therefore when plotting these plot types no index or x value needs to be supplied. Instead:

  1. Declare a single y variable, e.g. source.plot.hist(variable), or

  2. Declare a y variable and by variable, e.g. source.plot.hist(variable, by='Group'), or

  3. Declare columns or plot all columns, e.g. source.plot.hist() or source.plot.hist(columns=['A', 'B', 'C'])

Histogram#

The Histogram is the simplest example of a distribution; often we simply plot the distribution of a single variable, in this case the ‘Violent Crime rate’. Additionally we can define a range over which to compute the histogram and the number of bins using the bin_range and bins arguments respectively:

crime.hvplot.hist(y='Violent Crime rate')
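The bins and bin_range arguments mentioned above are not used in that call, so here is a minimal sketch of what they control. NumPy's histogram function stands in for the binning step; that is an analogy, not a statement about how hvPlot computes it. The data values are made up, and bin_counts and plot_hist are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Made-up rates (illustration only).
toy = pd.DataFrame({'rate': [160.0, 190.0, 250.0, 400.0, 480.0, 520.0, 730.0]})

def bin_counts(values, bins, bin_range):
    """Count samples in `bins` equal-width bins spanning `bin_range`."""
    counts, edges = np.histogram(values, bins=bins, range=bin_range)
    return counts, edges

def plot_hist(df):
    """The corresponding hvPlot call with explicit binning."""
    # Imported lazily so bin_counts runs without hvPlot installed.
    import hvplot.pandas  # noqa
    return df.hvplot.hist(y='rate', bins=4, bin_range=(0, 800))

counts, edges = bin_counts(toy['rate'], bins=4, bin_range=(0, 800))
print(counts, edges)
```

With four bins over the range 0 to 800, each bin is 200 wide, which makes it easy to check the counts by hand.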

Or we can plot the distribution of multiple columns:

columns = ['Violent Crime rate', 'Property crime rate', 'Burglary rate']
crime.hvplot.hist(y=columns, bins=50, alpha=0.5, legend='top', height=400)