BEP 3: Charts interface

BEP 3	Charts interface.
Authors	Bryan Van de Ven, Fabio Pliger and Damián Avila
Status	WIP
Discussion	https://github.com/bokeh/bokeh/issues/1373
Implementation	https://github.com/bokeh/bokeh/issues/1387

This BEP is for discussing and designing the next iteration of the charts interface.

Data Interface

Here I am only concerned with the data formats that each plot accepts. I am including the "facet" (and "group" and "stack") options to show what should be possible at some point (though not necessarily immediately), because they can relate to the structure of the data.

Histogram

generate a single histogram of a single series on one plot:

data = # 1d iterable of scalars
Histogram(data)

generate multiple histograms of multiple series overlaid on one plot (a new color for each series):

data = # list of lists, dict/ordered dict of lists, data frame
Histogram(data)

generate multiple histograms of multiple series on multiple separate plots:

data = # list of lists, dict/ordered dict of lists, data frame
Histogram(data, facet=True)

NOTE: the "single series" is the special case. It should promote itself to the "multiple series" case automatically and transparently.
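The promotion described in the note above could be sketched as follows. This is a minimal illustration, not the Bokeh implementation: `normalize_series` and the generated `series_N` names are hypothetical, and data frames are left out so the sketch has no dependencies.

```python
from collections import OrderedDict
from numbers import Number


def normalize_series(data):
    """Return an OrderedDict mapping series name -> list of values.

    Hypothetical helper: accepts a 1d iterable of scalars, a list of
    lists, or a dict/OrderedDict of lists.
    """
    if isinstance(data, dict):
        return OrderedDict((name, list(values)) for name, values in data.items())
    items = list(data)
    if items and all(isinstance(v, Number) for v in items):
        # Single-series special case: promote it to one named series.
        return OrderedDict([("series_0", items)])
    # Multiple unnamed series: generate positional names (an assumption).
    return OrderedDict(("series_%d" % i, list(s)) for i, s in enumerate(items))
```

With this in place, the chart code only ever has to handle the "multiple series" shape, whichever form the caller passed in.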

TimeSeries

For time series it is important to distinguish the index. The "column names" of the "data frame" are the names of the series, and all the points from each series are plotted.

There are two possibilities: one common index shared by all the series, or each series with a distinct index of its own.

common index

If all the series have the exact same "x" coordinates, here are some ideas:

generate a single time series on one plot:

index = # 1d iterable of any sort (of datetime values)
data = # 1d iterable of any sort
TimeSeries(index, data)

generate possibly multiple time series overlaid on one plot (a new color for each series):

index = # 1d iterable of datetime values
data = # list of lists, dict/ordered dict of lists 
TimeSeries(index, data)

# OR 

data = # dataframe
TimeSeries(data)

generate multiple time series on multiple separate plots:

index = # 1d iterable of datetime values
data = # list of lists, dict/ordered dict of lists
TimeSeries(index, data, facet=True)

# OR 

data = # dataframe
TimeSeries(data, facet=True)
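One way to handle the common-index forms is to pair the shared index with every series up front, so later code sees only (index, series) pairs. The sketch below assumes list-like inputs; `pairs_from_common_index` is a hypothetical name, and the data frame variant is omitted.

```python
def pairs_from_common_index(index, data):
    """Pair one shared index with every series in `data`.

    `data` may be a flat sequence of scalars (single series), a list of
    lists, or a dict of lists. Hypothetical helper, not a Bokeh function.
    """
    index = list(index)
    if isinstance(data, dict):
        series = [list(values) for values in data.values()]
    else:
        data = list(data)
        # A flat sequence of scalars is the single-series special case.
        nested = bool(data) and isinstance(data[0], (list, tuple))
        series = [list(s) for s in data] if nested else [data]
    for s in series:
        if len(s) != len(index):
            raise ValueError("every series must match the index length")
    return [(index, s) for s in series]
```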

distinct indices

By definition, this only has the "multiple series" case.

generate multiple time series overlaid on one plot (a new color for each series):

# here, index and series are both 1d iterables of scalars
data = # 1d iterable of (index, series) pairs
TimeSeries(data)

# OR

TimeSeries(index0, series0, index1, series1, ...)

# OR 

data = # 1d iterable of dataframes
TimeSeries(data)

# OR

TimeSeries(df0, df1, df2, ...)

generate multiple time series on multiple separate plots:

# here, index and series are both 1d iterables of scalars
data = # 1d iterable of (index, series) pairs
TimeSeries(data, facet=True)

# OR 

data = # 1d iterable of dataframes
TimeSeries(data, facet=True)

In the last case it's possible that there are fewer indices than series, because each data frame may have more than one non-index column. But conceptually this is no different than the general case.
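The non-dataframe forms for distinct indices can be reduced to the same list of (index, series) pairs. The following is a rough sketch under that assumption; both function names are hypothetical and the data frame forms are not covered.

```python
def pairs_from_flat_args(*args):
    """Interpret (index0, series0, index1, series1, ...) as pairs."""
    if len(args) % 2 != 0:
        raise ValueError("expected alternating index/series arguments")
    pairs = []
    for index, series in zip(args[0::2], args[1::2]):
        index, series = list(index), list(series)
        if len(index) != len(series):
            raise ValueError("index and series lengths differ")
        pairs.append((index, series))
    return pairs


def pairs_from_iterable(data):
    """Validate a 1d iterable of (index, series) pairs."""
    flat = [part for pair in data for part in pair]
    return pairs_from_flat_args(*flat)
```

Routing both call styles through one validator keeps the length check in a single place.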

Scatter

Here there is no need to distinguish the index; what is always needed is pairs of x/y sequences that have the same length.

generate a single scatter on one plot:

x = # 1d iterable of any sort 
y = # 1d iterable of any sort
Scatter(x, y)

generate possibly multiple scatters overlaid on one plot (a new color for each series):

# all vars = 1d iterable of scalars
Scatter((x0, y0), (x1, y1), (x2, y2))

# OR 

data = # groupby of a data frame (a new color for each group)
Scatter(data, x="x_column_name", y="y_column_name")

NOTE: also accepts facet=True to facet multiple scatters on different plots.

BoxPlot

Box plots have a categorical X-axis with a box summary for the series associated with each category. In this case, the "column names" of the "data frame" are the categories, and the data in each column is reduced to a few statistical measures that define the summary box dimensions.

generate a box plot with summary boxes for each category:

data = # list of lists, dict/ordered dict of lists, data frame, or groupby
BoxPlot(data)

NOTE: might also accept an order parameter to order the categorical axis

NOTE: can colormap by category
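As an illustration of reducing each column to "a few statistical measures", the sketch below computes a five-number summary per category. The choice of measures (min, quartiles, max), the inclusive quantile method, and the names `box_summary` and `box_summaries` are all assumptions; a real implementation might use whiskers based on the interquartile range instead. It relies on `statistics.quantiles`, available from Python 3.8.

```python
import statistics


def box_summary(values):
    """Reduce one series to five numbers that define a summary box.

    Needs at least two data points on most Python versions.
    """
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}


def box_summaries(data, order=None):
    """Summarize every category in a dict of lists.

    `order` optionally fixes the order of the categorical axis.
    """
    categories = list(order) if order is not None else list(data)
    return [(name, box_summary(data[name])) for name in categories]
```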

Violin

Violin inputs identical to BoxPlot (draws violin summaries instead)

Dot

Dot inputs identical to BoxPlot (draws a dot for each point in the series)

NOTE: options for jittering along the categorical dimension

Line

Acceptable inputs are basically identical to Scatter, except there is an additional version that auto-generates range(N) x-values:

generate a single line plot but implicitly use x=range(len(y))

y = # 1d iterable of scalars
Line(y)

generate multiple line plots but implicitly use x=range(len(y)) for each y series

# y0, y1, y2 = 1d iterables of scalars
Line(y0, y1, y2)

NOTE: important generalization: each positional argument is for a single line in all of these cases

NOTE: also accepts facet=True
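The "one positional argument per line" rule, together with the implicit x-values, might look like the sketch below. The helper name `line_pairs` is hypothetical, and the test for "is this an (x, y) pair or a bare y series" only recognizes numeric scalars, which is a simplification.

```python
from numbers import Number


def line_pairs(*args):
    """Turn each positional argument into one (x, y) pair.

    A flat sequence of scalars is treated as y values with an implicit
    x = range(len(y)); a 2-tuple of sequences is used as (x, y) directly.
    """
    pairs = []
    for arg in args:
        if len(arg) == 2 and not isinstance(arg[0], Number):
            x, y = list(arg[0]), list(arg[1])
        else:
            y = list(arg)
            x = list(range(len(y)))  # auto-generated x-values
        if len(x) != len(y):
            raise ValueError("x and y must have the same length")
        pairs.append((x, y))
    return pairs
```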

Step

With regard to data inputs, I believe this is largely identical to Line.

Area

With regard to data inputs, I believe this is largely similar to TimeSeries + Line.

NOTE: offers the option to stack. The hassle here will be computing the intermediate coordinates in the general case where there is more than one set of x coordinates (i.e. where it is more like Line with multiple lines, instead of like TimeSeries with a common index).
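One possible approach to the intermediate coordinates is to evaluate every series on the union of all x coordinates using linear interpolation, then accumulate. The sketch below assumes each series has sorted x values and contributes zero outside its own x range; both are assumptions, and the function names are hypothetical.

```python
from bisect import bisect_left


def interpolate(x, xs, ys):
    """Linearly interpolate y at x; points outside the series count as 0."""
    if x < xs[0] or x > xs[-1]:
        return 0.0
    i = bisect_left(xs, x)
    if xs[i] == x:
        return float(ys[i])
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def stack_areas(pairs):
    """Stack (xs, ys) series onto the union of all x coordinates.

    Returns the shared x grid and, per series, the cumulative upper
    boundary of its area on that grid.
    """
    grid = sorted({x for xs, _ in pairs for x in xs})
    running = [0.0] * len(grid)
    layers = []
    for xs, ys in pairs:
        running = [base + interpolate(x, xs, ys)
                   for base, x in zip(running, grid)]
        layers.append(running)
    return grid, layers
```

Because a sum of piecewise-linear series only changes slope at the union of their breakpoints, the union grid is enough to draw the stacked boundaries without further subdivision wherever the series overlap.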

HeatMap

NOTE: we should deprecate the CategoricalHeatmap option. There should just be one HeatMap and it should "do the right thing" with either categorical or scalar ranges.

This one is a little different: the main input conceptually needs to be a dense 2D array of data, but there are a couple of more general cases we should accept.

For the dense 2d array of data case: x and y ranges can either be lists of categories, OR scalar (numerical) bounds.

data = # dense 2D array of data 
x = # sequence of categories OR (start, end) range
y = # sequence of categories OR (start, end) range
HeatMap(data, x, y)

NOTE: data can either be scalar (numerical) data, or can be categorical data, examples:

(country * year * primary export [category])
(country * year * total rainfall [scalar])

This will really only affect the auto-choice of color mapper; from a data perspective they are conceptually identical.
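"Doing the right thing" could come down to two small checks: one on each range argument and one on the cell values. The sketch below is only an illustration; the function names and the returned labels are hypothetical, and treating any two-number sequence as (start, end) bounds is an assumption that would misread a pair of numeric categories.

```python
from numbers import Number


def classify_range(spec):
    """Return ("scalar", (start, end)) or ("categorical", [categories])."""
    values = list(spec)
    if len(values) == 2 and all(isinstance(v, Number) for v in values):
        return "scalar", (values[0], values[1])
    return "categorical", values


def choose_mapper_kind(data):
    """Pick "linear" for all-numeric cells and "categorical" otherwise."""
    cells = [cell for row in data for cell in row]
    if all(isinstance(cell, Number) for cell in cells):
        return "linear"
    return "categorical"
```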

Bar

[To be completed]

Because of all the stacking/grouping/faceting possibilities, and how these actually directly relate to the structure of the input data, Bar is actually one of the most complicated cases.

One important distinction to maintain is that the x-axis is categorical by nature.

Donut

NOTE: includes "pie" chart as a special case

I believe this is largely similar to Bar, but with somewhat fewer options for grouping.

Parameters

The basic and always available interface is the "pile of kwargs" interface:

Histogram(
    df, width=400, height=400, 
    title="some cool stuff", facet="wrap", 
    server="localhost", name="mychart", 
    notebook=True, file="mychart.html"
)

However!

Epiphany: The current "method chaining" can be kept, and used perfectly in service of context managers:

with Histogram(df) as chart:
    chart.width(400).height(400)
    chart.title("some cool stuff").facet("wrap")
    chart.server("localhost").name("mychart")
    chart.notebook()
    chart.file("mychart.html")
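To make the idea concrete, here is a minimal sketch of how chained setters and a context manager can coexist. `ChartSketch` is a hypothetical stand-in, not the Bokeh class, and rendering the chart when the block exits cleanly is an assumption about the intended behavior.

```python
class ChartSketch:
    """Hypothetical stand-in for a chart class; not the Bokeh API."""

    def __init__(self, data, **kwargs):
        self.data = data
        self.options = dict(kwargs)  # the "pile of kwargs" still works
        self.shown = False

    def _set(self, key, value):
        self.options[key] = value
        return self  # returning self is what enables chaining

    def width(self, value):
        return self._set("width", value)

    def height(self, value):
        return self._set("height", value)

    def title(self, value):
        return self._set("title", value)

    def show(self):
        # A real chart would render here; the sketch only records the call.
        self.shown = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Assumed behavior: render on a clean exit, propagate exceptions.
        if exc_type is None:
            self.show()
        return False
```

Both styles end up writing to the same options dictionary, so keyword arguments and chained calls can be mixed freely.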

Embedding

For static embedding cases, all charts will expose chart.plot, which is a bokeh.objects.Plot that can be passed to any of the static embedding functions:

components(chart.plot, INLINE)
file_html(chart.plot, CDN, "my plot")
autoload_static(chart.plot, CDN, "static/plots")

For server embedding, all charts will expose chart.script which has an implementation along the lines of:

    @property
    def script(self):
        if self.session is None: 
            return None
        return autoload_server(self.plot, self.session)
