Bokeh Days Working Document

Nick Roth edited this page Jun 21, 2015 · 38 revisions

Things to consider: dimensions, needs computation, splitting/reduction operations (facet, group, stack, overlay, aggregation, colormapping, marker_selection)

ALL CHARTS SUPPORT: [facet, overlay]

Charts API

Generic User-Facing API:

Chart( data, <form fields>, <surface fields>, <figure options> )

  • data: dataframe-like, labeled arrays, where arrays accessible with data['field']. Typically, data is 'taller' than it is 'wide' (de-aggregated), but this could vary based on the chart.
  • form: position (x, y, lat, lon, theta, etc.), shape, size, rotation, etc.
  • surface: color (hue, brightness, saturation), texture, blur, transparency
  • figure: valid kwargs to configure bokeh figure (title)

Chart takes a series of key-value pairs that specify which "dimensions"/"measures", identified by field name, map to which Chart attributes. The dimensions specified are used to split the data into multiple subsets, which are then used to visualize the various chart attributes.
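As a rough illustration of the splitting step, the sketch below groups a dict of labeled arrays by one field. The helper name `split_by` and the sample data are hypothetical and not part of Bokeh; a plain dict of lists stands in for a dataframe.

```python
from collections import defaultdict


def split_by(data, field):
    """Split a dict of equal-length columns into one subset per unique value of `field`."""
    subsets = defaultdict(lambda: {name: [] for name in data})
    for i, key in enumerate(data[field]):
        for name, column in data.items():
            subsets[key][name].append(column[i])
    return dict(subsets)


# Hypothetical data: the 'dept' dimension could drive color, marker, or facet.
data = {
    "year": [2014, 2014, 2015, 2015],
    "sales": [10, 20, 30, 40],
    "dept": ["toys", "books", "toys", "books"],
}
subsets = split_by(data, "dept")
print(subsets["toys"]["sales"])
```

Each subset keeps every column, so any chart attribute can still be read from it after the split.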

composition_func( <Chart(...)[1..*] or composition_func(...)>, <composition func options> )

Chart is designed to support a number of functions that compose charts together by embedding two together, placing charts on a surface, and/or faceting charts into a grid layout. This interface is designed to take one or more Charts, or another compositional function.

It is possible that a compositional function will require access to the data that is input into the Chart. The functions and Chart should support delaying execution and retrieving data as needed, so that iterative use of Charts and functions doesn't require fundamentally changing how Chart is called on its own.
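One way to picture delayed execution is a chart object that records its inputs and renders only when asked. The sketch below is an assumption for illustration: `DeferredChart`, `with_data`, `render`, and this `overlay` are hypothetical names, not the Bokeh API.

```python
class DeferredChart:
    """Hypothetical stand-in for a Chart that records its inputs instead of rendering."""

    def __init__(self, kind, data, **fields):
        self.kind = kind
        self.data = data      # kept so composition functions can reach the data later
        self.fields = fields  # e.g. x='col a', y='col b'

    def with_data(self, data):
        """Return a copy of this chart specification bound to a different data subset."""
        return DeferredChart(self.kind, data, **self.fields)

    def render(self):
        """Placeholder for real rendering: describe what would be drawn."""
        n_rows = len(next(iter(self.data.values())))
        return f"{self.kind}({self.fields}) over {n_rows} rows"


def overlay(*charts):
    """Hypothetical composition function: rendering happens only when it is called."""
    return [chart.render() for chart in charts]


df = {"a": [1, 2, 3], "b": [4, 5, 6]}
chart = DeferredChart("Scatter", df, x="a", y="b")  # nothing is drawn yet
rendered = overlay(chart, chart.with_data({"a": [1], "b": [4]}))
print(rendered)
```

Because the chart keeps a reference to its data and fields, a composition function can rebind it to a subset without the user changing the original call.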

Operations API

Overlay:

Overlay(Bar(df,...), Bar(df,...))

Facet

Facet handles the generation of subsets of data, where each subset is positioned at different plot coordinates. The traditional approach is to arrange the plots in a grid, where each plot receives a coordinate in (0..n-1, 0..m-1), and n and m are the numbers of unique values in the two dimensions being faceted.

However, you could also take the approach of faceting plots into a more abstract location, such as a tab in a user interface (1 dimensional: tab#), or a grid in a tab (3 dimensional: tab#, row, col), or a grid in a tab on a webpage (4 dimensional: page#, tab#, row, col).

facet(Scatter(df, 'col a', 'col b'), 'col c')

Facet Implementation Challenges

  • Delayed Execution - To use the approach shown above, the Scatter chart's rendering must be delayed, and faceting must have access to df. The reason for this approach is that interactive use is most likely to start with Scatter alone, with faceting added afterwards. The user should not have to significantly modify their use of Scatter to add faceting.
  • Generic Applicability - Most implementations of faceting center around the grid-based approach. To support grid, tab-based, or even geographic placement of charts, the approach should focus on yielding "coordinates" for a chart to be positioned into some abstract reference frame, whether the frame is cartesian, spherical, geographical, or abstract (web page).
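For the grid case, yielding coordinates could look like the following minimal sketch. The function name `facet_coordinates` and the sample columns are hypothetical; other reference frames (tabs, pages, geographic positions) would yield tuples of a different shape.

```python
def facet_coordinates(data, row_field, col_field):
    """Map each (row value, column value) pair to a (row, col) grid coordinate."""
    row_values = sorted(set(data[row_field]))
    col_values = sorted(set(data[col_field]))
    return {
        (r, c): (i, j)
        for i, r in enumerate(row_values)
        for j, c in enumerate(col_values)
    }


# Hypothetical columns used only for illustration.
data = {
    "region": ["east", "west", "east", "west"],
    "year": [2014, 2014, 2015, 2015],
}
coords = facet_coordinates(data, "region", "year")
print(coords[("west", 2015)])
```

Keeping the coordinate step separate from rendering is what would let the same facet call target a grid, a set of tabs, or another frame.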

Chart Specifications

Below are examples of different user-oriented applications of the core Chart types.

Point Plots

A type of plot that places a glyph directly into a coordinate system without any aggregation. Another variable can be mapped into attributes of the glyph, such as the color, rotation, transparency, etc.

Scatter

  • GoG Example: point(position(d*r))

  • Bokeh Equivalent: Scatter(d, r, glyph=circle) #circle=default

Properties:

  • Rectangular coordinates
  • Performs a cross between the variables
  • index vs index
  • no aggregation for positional placement of graphics, but possible for size, color, etc.
  • Options: color, marker, line, size

One Input: When provided a single input, either values or a categorical, a scatter will cross the variable with 1, yielding a special "None" axis that places all points at the same distance from the primary axis, with no label.

  • Scatter(values)

  • Scatter(cat)

  • Scatter(values, 'index') # 'index' => range(len(values))
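The one-input behavior can be sketched as a small helper that produces positions. This is an assumption for illustration only: `scatter_positions` is a hypothetical name, and which axis receives the index is a guess based on the argument order above.

```python
def scatter_positions(values, other=None):
    """Return (x, y) pairs for a scatter given one input, or one input plus 'index'."""
    if other == "index":
        # 'index' => range(len(values)): pair each value with its position.
        return list(zip(values, range(len(values))))
    # Single input: cross the variable with the constant 1, so every point
    # sits at the same offset on the unlabeled "None" axis.
    return [(v, 1) for v in values]


print(scatter_positions([5, 7]))
print(scatter_positions([5, 7], "index"))
```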

TimeSeries

A specific type of line plot that uses a time axis, and likely comes in the Stock style of table, but doesn't have to.

Line

  • GoG Example: line(position(d*r))

  • Bokeh Equivalent: Scatter(df, 'd', 'r', glyph=line)

index vs value, no computation, colormapping, marker_selection

Step (aka Stair)

Plots a variable against the index of each value, connected by a line.

  • GoG Example: N/A

  • ggplot2 Example: qplot(seq_along(x), x, geom="step")

  • Bokeh Equivalent: Scatter(df, 'd', 'r', glyph=line)

Geographical

Uses some kind of LLA, ECEF, ECER or other geographic coordinate system.

  • GoG Example:
ELEMENT: polygon(position(longitude*latitude))
ELEMENT: point(position(longitude*latitude))
  • ggplot2 Example: ggplot() +geom_polygon(data=counties, aes(x=long, y=lat, group=group))+ geom_point(data=mapdata, aes(x=x, y=y), color="red")

  • Bokeh Equivalent: Map(df, 'lon', 'lat', glyph=circle)

Aggregating

This group of charts requires some computation on groups of values associated with categories.

Bar/Donut/Dot

This category works by grouping by some categorical variable, then applying an aggregation function to each group. One special case is a single value for each unique category, where the value is plotted as-is.

Input Cases:

  • categorical
  • categorical vs values
  • categorical vs count/proportion(categorical)
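The group-then-aggregate step behind these input cases might look like the sketch below. The helper `aggregate` and the sample values are hypothetical; the `agg` parameter simply mirrors the idea of a pluggable aggregation function.

```python
from collections import defaultdict


def aggregate(cats, values, agg=sum):
    """Group `values` by the matching entry in `cats`, then reduce each group with `agg`."""
    groups = defaultdict(list)
    for cat, value in zip(cats, values):
        groups[cat].append(value)
    return {cat: agg(group) for cat, group in groups.items()}


cats = ["a", "b", "a", "c"]
values = [1, 2, 3, 4]
heights = aggregate(cats, values)           # categorical vs values, summed
counts = aggregate(cats, values, agg=len)   # categorical vs count
total = sum(counts.values())
proportions = {cat: n / total for cat, n in counts.items()}
print(heights, counts, proportions)
```

With one value per category, `sum` returns that value unchanged, which matches the special case of plotting the value directly.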

Options:

  • computation
  • grouping
  • stacking
  • colormapping

Standard

  • GoG Example: interval(position(d*r))
  • Bokeh Equivalent: Bar(df, 'd', 'r')

Grouped

  • GoG Example: interval.dodge(position(d*r), color(c))

Stacked

  • GoG Example: interval.stack(position(summary.proportion(r)), color(c))
  • GoG Example: interval.stack(position(summary.proportion(d*r)), color(c))

Bokeh Examples:

  • Bar(df, cat, values, grouped=True, agg='sum')

  • Bar(df, cat, values, grouped=False, stacked=True, agg='sum')

  • Bar(df, cat, values, grouped=True, stacked=True, agg='sum') ??

  • Bar(df, cat, values, grouped="A", stacked=False, agg='sum')

  • Bar(df, cat, values, stacked="B", agg='sum')

  • Bar(df, cat, values, grouped="A", stacked="B", agg='sum')

  • Bar(df, cat, values, grouped="A", stacked=["B","C"], agg='sum')
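To make the stacking option concrete, the sketch below computes where each stacked bar segment would start and end once the per-category aggregates are known. The function `stack` and its input layout are assumptions for illustration, not Bokeh behavior.

```python
def stack(heights_by_series):
    """Given {series: [height per category]}, return {series: [(bottom, top), ...]}."""
    n = len(next(iter(heights_by_series.values())))
    running = [0] * n  # current top of the stack for each category
    spans = {}
    for series, heights in heights_by_series.items():
        spans[series] = [(running[i], running[i] + h) for i, h in enumerate(heights)]
        running = [running[i] + h for i, h in enumerate(heights)]
    return spans


# Hypothetical aggregated heights for two stacked series over two categories.
spans = stack({"A": [1, 2], "B": [3, 4]})
print(spans)
```

Grouping (dodging) would instead offset the bars sideways within each category and leave every bottom at zero.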

df | year | sales | dept | region | revenue

  • Bar(df, year, (sales, revenue))

Pie

  • GoG Example:
COORD: polar.theta(dim(1))
ELEMENT: interval.stack(position(summary.proportion(r)), color(c))

Dot

Very similar to a bar chart, except that dots can be overlaid without overlap problems. Relevant operation: marker_selection.

Histogram

index vs value, computation (binning), color and marker just selected

Standard

  • GoG Example: interval(position(summary.count(bin.rect(y))))
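The binning computation can be sketched with equal-width bins as below. The helper `bin_counts` is hypothetical, and equal-width binning over [min, max] is an assumption; the document does not say which binning rule a Histogram chart would use.

```python
def bin_counts(values, n_bins):
    """Count values in `n_bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls in the last bin rather than opening a new one.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts


edges, counts = bin_counts([1, 2, 2, 3, 5, 9], n_bins=4)
print(edges, counts)
```

The resulting counts play the role of `summary.count(bin.rect(y))` in the GoG expression, with the edges defining the interval positions.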

KDE

  • GoG Example: area(position(smooth.density.kernel(y)))

Empirical CDF

Area

index vs value, no computation, colormapping, stack

Heatmap

2d-index vs value, computation, colormapping

Horizon

index vs value, computation, colormapping

Boxplot, Violin

index vs value, computation, grouping, colormapping

Interactive Dashboards with Charts

Chart Modeling

One opportunity with Charts is to specify additional metadata about the Chart, which can both reduce the edge cases the Chart must handle, and provide additional information for composing charts and controls for interactive applications. This enables another type of composition, view composition. Multiple Charts can be composed into a dashboard, to provide multiple views of the same data source.

For example, a Bar Chart implementor might decide that they only want to handle discrete data for the x axis, and continuous data for the y axis. They could handle the continuous data as the x-input in multiple ways:

  1. Check for the dtype of the array, and throw an error
  2. Use Chart modeling to automatically throw an error
  3. Use Chart modeling to automatically convert continuous data to discrete
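The first and third options can be sketched by hand as follows. All names here (`is_continuous`, `check_discrete`, `to_discrete`) are hypothetical, and treating every numeric column as continuous is a simplifying assumption.

```python
from numbers import Real


def is_continuous(column):
    """Assumption: a column counts as continuous when every value is a non-bool number."""
    return all(isinstance(v, Real) and not isinstance(v, bool) for v in column)


def check_discrete(column):
    """Option 1: reject continuous input outright."""
    if is_continuous(column):
        raise TypeError("x must be discrete, got continuous values")
    return column


def to_discrete(column, edges):
    """Option 3: convert continuous values to bin labels such as '[0, 10)'."""
    labels = []
    for v in column:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= v < hi:
                labels.append(f"[{lo}, {hi})")
                break
        else:
            raise ValueError(f"{v} falls outside the bin edges")
    return labels


print(to_discrete([1, 12, 25], [0, 10, 20, 30]))
```

Chart modeling would move these checks out of each chart implementation and into the declared field types.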

Chart Modeling Example

The approach taken is the same used for modeling Glyphs, Widgets, etc. in Bokeh.

class MyBar(Chart):
    x = Discrete(transform=True)
    y = Continuous(name=['y', 'height'])
    grouped = Discrete(required=False, transform=True)
    horizontal = Boolean(required=False)

    constraints = [Either('x', 'y')]

This modeling can enable generation of default selectors for an interactive application. Use of MyBar could be as follows:

interact(MyBar(df))

In this example, Bokeh can infer that the following should be generated, using a default layout:

<Row>
    <Col>
        <ColumnSelector name='x' />
        <ColumnSelector name='y' />
        <ColumnSelector name='grouped' />
        <Checkbox name='horizontal' />
    </Col>
    <Col>
        <MyBar />
    </Col>
</Row>

Because the constraints are specified, the interactive widget will not regenerate MyBar unless the 'x' or 'y' selector is set to a valid column.
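A minimal sketch of how such a constraint could gate regeneration is shown below. The lowercase `either` function and `should_regenerate` are hypothetical helpers written for this illustration; they are not the `Either` used in the class above, whose implementation is not specified here.

```python
def either(*names):
    """Hypothetical constraint: satisfied when at least one named selector holds a valid column."""
    def check(selections, columns):
        return any(selections.get(name) in columns for name in names)
    return check


constraints = [either("x", "y")]


def should_regenerate(selections, columns):
    """Regenerate the chart only when every constraint is satisfied."""
    return all(check(selections, columns) for check in constraints)


columns = {"year", "sales"}  # hypothetical column names
print(should_regenerate({"x": None, "y": None}, columns))
print(should_regenerate({"x": "year", "y": None}, columns))
```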

Note: The above example takes a React-like approach for demonstrating how Bokeh would interpret the Chart configuration, which could be used to provide a custom specification in a configuration file, with:

interact(MyBar(df), config='./custom_layout.dashboard')

Data Sources and Types

Below is some discussion on different types of data that might be input into Charts, which can provide multiple use cases to ensure that the Charts API can handle the most likely cases.

Types of Table-Like Data

During the Bokeh Days meetup, it was identified that the structure of the input data can force users to change how they identify the columns to use for plot aesthetics. The main difference is between de-aggregated (normalized) and aggregated data. Someone who uses data directly from a database might see normalized forms more often.

NOTE: The following labels are used when describing the input types in the chart specifications.

Wide

A table that spreads many measures over additional columns. It has very few index-like columns, with multiple columns containing similar types of measurements. A good indicator: if you need to think about iterating over columns, the table probably leans towards the wide category.

Sensor: | index | Measure1 | Measure2 | Measure3 |

A special case of a wide table is time series data, which has a single measure type (e.g. profits) rather than multiple measurements, and where each series represents one category of a categorical variable (e.g. Company Name). Technically, this is a form of pivoted data.

Stock: | Date | Series1 | Series2 | Series3 |

Tall

A table that tends to put multiple measures in a single column, then identify the measure type in a separate column.

The tall form of Sensor data is the following:

Stacked Sensor: | index | Measure Type | value |
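Converting the wide Sensor layout to the tall one can be sketched as below. The `melt` helper and sample readings are hypothetical and use a plain dict of lists; pandas offers a `melt` function for the same reshaping on dataframes.

```python
def melt(table, id_field, measure_fields):
    """Reshape wide columns into (id, 'Measure Type', 'value') rows."""
    tall = {id_field: [], "Measure Type": [], "value": []}
    for i, idx in enumerate(table[id_field]):
        for field in measure_fields:
            tall[id_field].append(idx)
            tall["Measure Type"].append(field)
            tall["value"].append(table[field][i])
    return tall


# Hypothetical sensor readings in the wide layout.
sensor = {"index": [0, 1], "Measure1": [1.5, 1.7], "Measure2": [20, 21]}
stacked = melt(sensor, "index", ["Measure1", "Measure2"])
print(stacked)
```

In the tall form, the measure type becomes an ordinary categorical column that a chart can map to color, marker, or facet.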

The tall data you are more likely to see in the real world has been joined, or de-normalized. In this case, multiple tables have been brought together into one tall table with a combination of categorical and numerical columns. This kind of table provides many opportunities to focus on different categorical columns while collapsing the numerical columns with some type of aggregation.

Business: | click id | visit id | visit type | user name | revenue | platform | device |

Pivoted

A table that has been pre-aggregated via a pivoting-like operation, which places discrete categories along the columns and rows, so that each cell holds a measure for the intersection of a row category and a column category.

Aggregated:

| product category | Month 1 | Month 2 | Month 3 | ... |
|:-----------------|:--------|:--------|:--------|:----|
| computers        | $100    | $250    | $225    | ... |
| mobile           | ...     | ...     | ...     | ... |
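A pivoted table like this could be produced from tall records with a sketch along these lines. The `pivot` helper and the individual records are hypothetical and invented for illustration; summing is assumed as the aggregation.

```python
from collections import defaultdict


def pivot(rows, row_field, col_field, value_field):
    """Sum `value_field` for every (row category, column category) intersection."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_field]][row[col_field]] += row[value_field]
    return {r: dict(cols) for r, cols in table.items()}


# Hypothetical tall records, one per sale.
rows = [
    {"product category": "computers", "month": "Month 1", "revenue": 60},
    {"product category": "computers", "month": "Month 1", "revenue": 40},
    {"product category": "computers", "month": "Month 2", "revenue": 250},
    {"product category": "mobile", "month": "Month 1", "revenue": 75},
]
aggregated = pivot(rows, "product category", "month", "revenue")
print(aggregated)
```

Once data arrives in this shape, the original records are gone, so a chart can plot the cells but cannot apply a different aggregation.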

Data Sources for Demos

Kaggle Competitions

Great source for data sets that go beyond the typical toy examples.

Data Sources Compilation on Github

FiveThirtyEight

Most of their data is pre-aggregated.

Concepts/Support Information

Data Dimensionality

A dataset contains potentially many variables (columns, labeled, or unlabeled data). Bokeh should provide efficient methods for separating a global dataset into smaller datasets, which are then visualized in a way that separates each subset. You might use a variable to separate the data using color, shape, orientation, frame (faceting), etc. With charts, you should identify which variable should be used for which visual aspect of the plot, or group of plots.

Variable Operations

  • Cross (*) - a product of two separate variables. Like an outer join.
  • Nest (/) - produces a decomposition of one categorical variable, into only the valid categories for another variable. You must know the domain metadata (valid categories) to perform nest operations. Can view this as the second variable conditioned on the first. (ex. nesting of sex={Male, Female}, Pregnant={True, False} would not produce (Male, True) in the result). Like a left or inner join.
  • Blend (+) - combines two variables under a single variable. The operation is like an append.
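The three operations can be sketched on small lists as follows. The variable names, the `valid_pregnant` domain metadata, and the sales values are assumptions made for illustration.

```python
from itertools import product

sex = ["Male", "Female"]
pregnant = [True, False]

# Cross (*): every combination of the two variables.
cross = list(product(sex, pregnant))

# Nest (/): only combinations that the domain metadata marks as valid.
valid_pregnant = {"Male": [False], "Female": [True, False]}  # assumed metadata
nest = [(s, p) for s in sex for p in valid_pregnant[s]]

# Blend (+): append the values of one variable to another under a single variable.
sales_2014 = [10, 20]
sales_2015 = [30]
blend = sales_2014 + sales_2015

print(len(cross), len(nest), blend)
```

Nest cannot be derived from the two value lists alone, which is why the domain metadata has to be supplied separately.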

Grammar of Graphics cross vs nest