Bokeh Days Working Document

Nick Roth edited this page Jun 15, 2015 · 38 revisions

Things to consider: dimensions, needs computation, splitting/reduction operations (facet, group, stack, overlay, aggregation, colormapping, marker_selection)

ALL CHARTS SUPPORT: [facet, overlay]

Charts API

Generic User-Facing API:

Chart( data, <form fields>, <surface fields>, <figure options> )

  • data: dataframe-like, labeled arrays, where arrays accessible with data['field']. Typically, data is 'taller' than it is 'wide' (de-aggregated), but this could vary based on the chart.
  • form: position (x, y, lat, lon, theta, etc.) shape, size, rotation, etc.
  • surface: color (hue, brightness, saturation), texture, blur, transparency
  • figure: valid kwargs to configure bokeh figure (title)

Chart takes a series of key-value pairs that specify which "dimensions"/"measures", identified by field name, map to which Chart attributes. The specified dimensions are used to split the data into multiple subsets, which are then used to visualize the various chart attributes.
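
As a rough illustration of the splitting step, the sketch below treats a plain dict of columns as the "dataframe-like" input and splits it on the field mapped to one attribute. The helper name `split_by` and the `mapping` dict are assumptions made for this example, not part of any Bokeh API.

```python
from collections import defaultdict


def split_by(data, field):
    """Split a dict-of-columns table into one sub-table per unique value of `field`."""
    n_rows = len(data[field])
    subsets = defaultdict(lambda: {name: [] for name in data})
    for i in range(n_rows):
        key = data[field][i]
        for name, column in data.items():
            subsets[key][name].append(column[i])
    return dict(subsets)


data = {
    "year": [2014, 2014, 2015, 2015],
    "dept": ["toys", "books", "toys", "books"],
    "sales": [10, 20, 30, 40],
}

# Hypothetical mapping of chart attributes to field names.
mapping = {"x": "year", "y": "sales", "color": "dept"}

# One subset per department; each subset would be drawn in its own color.
subsets = split_by(data, mapping["color"])
```

Each subset keeps every column, so the remaining attributes (x, y) can still be looked up by field name inside it.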

composition_func( <Chart(...)[1..*] or composition_func(...)>, <composition func options> )

Chart is designed to support a number of functions that compose charts together, whether by embedding two charts together, placing charts on a surface, or faceting charts into a grid layout. This interface is designed to take one or more Charts, or another compositional function.

It is possible that a compositional function will require access to the data that is input into the Chart. The functions and Chart should support delaying execution and retrieving data as needed, so that iterative use of Charts and functions doesn't require fundamentally changing how Chart is called on its own.
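
One way to picture delayed execution is a chart object that only records its inputs until something asks it to render. The sketch below is an assumption-laden illustration: `LazyChart`, `with_data`, `render`, and the `Overlay` class are hypothetical names, and `render` returns a description rather than drawing anything.

```python
class LazyChart:
    """Hypothetical chart spec: stores its inputs, draws nothing until render()."""

    def __init__(self, kind, data, **mapping):
        self.kind = kind
        self.data = data        # kept so composition functions can re-slice it
        self.mapping = mapping  # attribute name -> field name

    def with_data(self, data):
        # Same chart type and mapping, bound to a different subset of the data.
        return LazyChart(self.kind, data, **self.mapping)

    def render(self):
        # Stand-in for real drawing: report what would be drawn.
        n_rows = len(next(iter(self.data.values())))
        return (self.kind, self.mapping, n_rows)


class Overlay:
    """Hypothetical composition: accepts charts or other compositions."""

    def __init__(self, *children):
        self.children = children

    def render(self):
        return [child.render() for child in self.children]


df = {"cat": ["a", "b", "a"], "values": [1, 2, 3]}
bar = LazyChart("Bar", df, x="cat", y="values")  # nothing is drawn yet
subset = bar.with_data({"cat": ["a"], "values": [1]})
combined = Overlay(bar, Overlay(subset))
result = combined.render()                       # drawing happens here
```

Because the chart keeps a reference to its data and mapping, a composition function can rebind it to a subset without the user changing the original call.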

Operations API

Overlay:

Overlay(Bar(df,...), Bar(df,...))

Facet

Facet handles the generation of subsets of data, where each subset is positioned at different plot coordinates. The traditional approach is to arrange the plots in a grid, where each plot receives a coordinate of (0..n-1, 0..m-1), and n and m are the numbers of unique values in the two dimensions being faceted.

However, you could also facet plots into a more abstract location, such as a tab in a user interface (1 dimensional: tab#), a grid in a tab (3 dimensional: tab#, row, col), or a grid in a tab on a webpage (4 dimensional: page#, tab#, row, col).

facet(Scatter(df, 'col a', 'col b'), 'col c')

Facet Implementation Challenges

  1. Delayed Execution - To use the approach shown above, the Scatter chart's rendering must be delayed, and the chart must retain access to df. This approach is taken because interactive use is most likely to start with Scatter alone, with faceting added afterwards. The user should not have to significantly modify their use of Scatter to add faceting.

  2. Generic Applicability - Most implementations of faceting center around the grid-based approach. To support grid, tab-based, or even geographic placement of charts, the approach should focus on yielding "coordinates" for a chart to be positioned into some abstract reference frame, whether the frame is cartesian, spherical, geographical, or abstract (web page).
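
The "yield coordinates" idea in the second challenge can be sketched without committing to any layout. In the example below, `facet_coordinates` is a hypothetical helper that yields an integer coordinate tuple plus the matching row indices for each combination of facet values; whether a tuple means (row, col) or (tab,) is left to the consumer.

```python
from itertools import product


def facet_coordinates(data, fields):
    """Yield (coordinate, row_indices) for every combination of facet values.

    The coordinate holds one integer position per faceted field, so a consumer
    may read it as (row, col), (tab,), (page, tab, row, col), and so on.
    """
    levels = [sorted(set(data[f])) for f in fields]
    n_rows = len(data[fields[0]])
    for combo in product(*[range(len(level)) for level in levels]):
        values = tuple(levels[d][i] for d, i in enumerate(combo))
        rows = [
            r for r in range(n_rows)
            if all(data[f][r] == v for f, v in zip(fields, values))
        ]
        yield combo, rows


data = {"col c": ["a", "b", "a"], "col d": ["x", "x", "y"]}

one_dim = list(facet_coordinates(data, ["col c"]))           # e.g. one tab per value
two_dim = list(facet_coordinates(data, ["col c", "col d"]))  # e.g. a grid
```

Combinations with no matching rows still receive a coordinate (with an empty row list), which keeps a grid rectangular; a tab-based consumer could simply skip them.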

References

Chart Specifications

Below are examples of different user-oriented applications of the core Chart types.

Bar/Donut

index vs value, computation, grouping, stacking, colormapping

Bar(df, cat, values, grouped=True, agg='sum')

Bar(df, cat, values, grouped=False, stacked=True, agg='sum')

Bar(df, cat, values, grouped=True, stacked=True, agg='sum') ??

Bar(df, cat, values, grouped="A", stacked=False, agg='sum')

Bar(df, cat, values, stacked="B", agg='sum')

Bar(df, cat, values, grouped="A", stacked="B", agg='sum')

Bar(df, cat, values, grouped="A", stacked=["B","C"], agg='sum')

df | year | sales | dept | region | revenue

Bar(df, year, (sales, revenue))
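
The computation behind these Bar calls is an aggregation per category, optionally subdivided by a second field for grouping or stacking. The sketch below assumes a dict-of-columns table and a hypothetical `aggregate` helper; it shows only the data reduction, not how bars would be placed.

```python
from collections import defaultdict


def aggregate(data, cat, values, by=None, agg=sum):
    """Aggregate `values` per category, optionally subdivided by a second field."""
    buckets = defaultdict(list)
    for i, c in enumerate(data[cat]):
        key = (c, data[by][i]) if by is not None else (c,)
        buckets[key].append(data[values][i])
    return {key: agg(vals) for key, vals in sorted(buckets.items())}


df = {
    "year": [2014, 2014, 2014, 2015],
    "dept": ["toys", "toys", "books", "toys"],
    "sales": [1, 2, 3, 4],
}

totals = aggregate(df, "year", "sales")              # one bar per year
by_dept = aggregate(df, "year", "sales", by="dept")  # bars split by dept
```

The same `by_dept` result could feed either a grouped layout (bars side by side within a year) or a stacked one (bars accumulated within a year); only the positioning differs.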

Line, Step, TimeSeries

index vs value, no computation, colormapping, marker_selection

Scatter

index vs index, no computation, colormapping, marker_selection

Histogram

index vs value, computation (binning), color and marker just selected

Area

index vs value, no computation, colormapping, stack

Heatmap

2d-index vs value, computation, colormapping

Horizon

index vs value, computation, colormapping

Boxplot, Violin

index vs value, computation, grouping, colormapping

Dot

index vs value, computation, grouping, stacking, colormapping, marker_selection

Interactive Dashboards with Charts

Chart Modeling

One opportunity with Charts is to specify additional metadata about the Chart, which can both reduce the edge cases the Chart must handle, and provide additional information for composing charts and controls for interactive applications. This enables another type of composition, view composition. Multiple Charts can be composed into a dashboard, to provide multiple views of the same data source.

For example, a Bar Chart implementor might decide that they only want to handle discrete data for the x axis, and continuous data for the y axis. They could handle continuous data passed as the x input in several ways:

  1. Check for the dtype of the array, and throw an error
  2. Use Chart modeling to automatically throw an error
  3. Use Chart modeling to automatically convert continuous data to discrete
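
Options 2 and 3 can be sketched as a field type that either rejects or converts continuous input. The `Discrete` class below is a hypothetical stand-in written for this example (including the `bins` parameter and the equal-width binning rule); it is not Bokeh's property system.

```python
class Discrete:
    """Hypothetical field type: accepts categorical data, optionally binning numbers."""

    def __init__(self, transform=False, bins=3):
        self.transform = transform
        self.bins = bins

    def validate(self, values):
        # Strings are treated as already discrete.
        if all(isinstance(v, str) for v in values):
            return list(values)
        if not self.transform:
            # Option 2: the model raises the error on the implementor's behalf.
            raise TypeError("expected discrete (categorical) data")
        # Option 3: convert continuous values to equal-width bin labels.
        lo, hi = min(values), max(values)
        width = (hi - lo) / self.bins or 1  # avoid a zero width for constant data
        labels = []
        for v in values:
            idx = min(int((v - lo) / width), self.bins - 1)
            labels.append(f"bin{idx}")
        return labels


binned = Discrete(transform=True, bins=2).validate([0, 1, 2, 3])
```

With `transform=False`, the same call raises `TypeError`, so the chart implementor never needs to write the dtype check from option 1 by hand.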

Chart Modeling Example

The approach taken is the same used for modeling Glyphs, Widgets, etc. in Bokeh.

class MyBar(Chart):
	x = Discrete(transform=True)
	y = Continuous(name=['y', 'height'])
	grouped = Discrete(required=False, transform=True)
	horizontal = Boolean(required=False)
	
	constraints = [Either('x', 'y')]

This modeling can enable generation of default selectors for an interactive application. Use of MyBar could be as follows:

interact(MyBar(df))

In this example, Bokeh can infer that the following should be generated, using a default layout:

<Row>
	<Col>
		<ColumnSelector name='x' />
		<ColumnSelector name='y' />
		<ColumnSelector name='grouped' />
		<Checkbox name='horizontal' />
	</Col>
	<Col>
		<MyBar />
	</Col>
</Row>

Because the constraints are specified, the interactive widget will not regenerate MyBar unless the 'x' or 'y' selector is set to a valid column.

Note: The above example takes a React-like approach to demonstrate how Bokeh would interpret the Chart configuration. The same format could be used to provide a custom specification in a configuration file, with:

interact(MyBar(df), config='./custom_layout.dashboard')
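
The inference of default selectors and the constraint check described in this section could look roughly like the following. This is a minimal sketch under stated assumptions: the field declarations are reduced to a dict of type names, `default_widgets` and `either_satisfied` are hypothetical helpers, and `Either` is read as "at least one of the named fields is set to a valid column".

```python
def default_widgets(fields):
    """Map each declared field to a (widget type, field name) pair."""
    widgets = []
    for name, kind in fields.items():
        widget = "Checkbox" if kind == "Boolean" else "ColumnSelector"
        widgets.append((widget, name))
    return widgets


def either_satisfied(selection, names, columns):
    """True when at least one of `names` is set to a valid column."""
    return any(selection.get(n) in columns for n in names)


# Simplified stand-in for the MyBar declarations.
fields = {
    "x": "Discrete",
    "y": "Continuous",
    "grouped": "Discrete",
    "horizontal": "Boolean",
}
columns = ["year", "sales"]

widgets = default_widgets(fields)
ready = either_satisfied({"x": None, "y": "sales"}, ["x", "y"], columns)
not_ready = either_satisfied({}, ["x", "y"], columns)
```

An interactive wrapper would regenerate the chart only when the check returns true.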

References

Data Sources and Types

Below is some discussion on different types of data that might be input into Charts, which can provide multiple use cases to ensure that the Charts API can handle the most likely cases.

Types of Table-Like Data

During the Bokeh Days meetup, it was noted that the structure of the data can force users to change how they identify the columns to use for plot aesthetics. The main difference is between de-aggregated (normalized) and aggregated data. Someone who uses data directly from a database might see normalized forms more often.

  • Wide: A table that spreads many measures over additional columns.
  • Tall: A table that tends to put multiple measures in a single column, then identify the measure type in a separate column.
  • Pivoted: A table that has been pre-aggregated via a pivoting-like operation, which places discrete categories along the columns and rows, then the intersection is typically characterized by a measure for the intersection.
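
To make the three shapes concrete, the sketch below converts a small wide table to tall form and then pivots it back into a nested row/column structure. The helpers `wide_to_tall` and `pivot`, and the column names "measure" and "value", are assumptions for this example; rows are plain dicts rather than a dataframe.

```python
def wide_to_tall(rows, id_field, measures):
    """Move each measure column into (measure, value) pairs, one record per pair."""
    tall = []
    for row in rows:
        for m in measures:
            tall.append({id_field: row[id_field], "measure": m, "value": row[m]})
    return tall


def pivot(tall, row_field, col_field, value_field):
    """Sum values into a nested dict keyed by row category, then column category."""
    table = {}
    for rec in tall:
        cells = table.setdefault(rec[row_field], {})
        cells[rec[col_field]] = cells.get(rec[col_field], 0) + rec[value_field]
    return table


wide = [
    {"year": 2014, "sales": 10, "revenue": 100},
    {"year": 2015, "sales": 20, "revenue": 200},
]

tall = wide_to_tall(wide, "year", ["sales", "revenue"])
pivoted = pivot(tall, "year", "measure", "value")
```

In the tall form a single field ("measure") identifies what each value is, so it can be mapped to color or grouping like any other dimension.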

Data Sources for Demos

Kaggle Competitions

Great source for data sets that go beyond the typical toy examples.

Data Sources Compilation on GitHub

FiveThirtyEight

Most of their data is pre-aggregated.

Concepts/Support Information

Data Dimensionality

A dataset contains potentially many variables (columns, labeled, or unlabeled data). Bokeh should provide efficient methods for separating a global dataset into smaller datasets, which are then visualized in a way that separates each subset. You might use a variable to separate the data using color, shape, orientation, frame (faceting), etc. With charts, you should identify which variable should be used for which visual aspect of the plot, or group of plots.

Variable Operations

  • Cross (*) - a product of two separate variables. Like an outer join.
  • Nest (/) - produces a decomposition of one categorical variable, into only the valid categories for another variable. You must know the domain metadata (valid categories) to perform nest operations. Can view this as the second variable conditioned on the first. (ex. nesting of sex={Male, Female}, Pregnant={True, False} would not produce (Male, True) in the result). Like a left or inner join.
  • Blend (+) - combines two variables under a single variable. The operation is like an append.
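
The three operations can be illustrated on small lists. This is a simplified sketch: the function names are chosen for the example, and `nest` here infers the valid pairs from the observed data, whereas the description above notes that a real nest needs domain metadata.

```python
from itertools import product


def cross(a, b):
    """All combinations of the two domains, whether or not they occur in the data."""
    return sorted(product(set(a), set(b)))


def nest(a, b):
    """Only the (a, b) pairs that occur together (assumed to be the valid ones)."""
    return sorted(set(zip(a, b)))


def blend(a, b):
    """Append one variable's values to the other's under a single variable."""
    return list(a) + list(b)


sex = ["Male", "Female", "Female"]
pregnant = [False, True, False]

crossed = cross(sex, pregnant)
nested = nest(sex, pregnant)
blended = blend([1, 2], [3])
```

Here `crossed` contains ("Male", True) while `nested` does not, which is the distinction drawn in the Nest bullet.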

Grammar of Graphics cross vs nest