Bokeh Days Working Document
Things to consider: dimensions, needs computation, splitting/reduction operations (facet, group, stack, overlay, aggregation, colormapping, marker_selection)
ALL CHARTS SUPPORT: [facet, overlay]
Chart( data, <form fields>, <surface fields>, <figure options> )
- data: dataframe-like, labeled arrays, where arrays are accessible with data['field']. Typically, data is 'taller' than it is 'wide' (de-aggregated), but this could vary based on the chart.
- form: position (x, y, lat, lon, theta, etc.), shape, size, rotation, etc.
- surface: color (hue, brightness, saturation), texture, blur, transparency
- figure: valid kwargs to configure bokeh figure (title)
Chart takes a series of key-value pairs that specify which "dimensions"/"measures", identified by field name, map to which Chart attributes. The dimensions specified are used to split the data into multiple subsets, which are then used to visualize the various chart attributes.
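As a rough illustration of the splitting step described above, the sketch below divides a column-oriented dataset by the field mapped to an attribute and assigns one attribute value per subset. The helper names (split_by, assign_attribute), the palette, and the sample data are hypothetical and are not part of any Bokeh API.

```python
def split_by(data, field):
    """Split column-oriented data (dict of equal-length lists) into one
    subset per unique value of `field`, in first-seen order."""
    subsets = {}
    for i, key in enumerate(data[field]):
        subset = subsets.setdefault(key, {name: [] for name in data})
        for name in data:
            subset[name].append(data[name][i])
    return subsets


def assign_attribute(subsets, values):
    """Pair each subset key with an attribute value, cycling if needed."""
    return {key: values[i % len(values)] for i, key in enumerate(subsets)}


# Hypothetical data: 'dept' is the dimension mapped to color.
data = {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40], "dept": ["a", "b", "a", "c"]}
subsets = split_by(data, "dept")
colors = assign_attribute(subsets, ["red", "green"])
```

Each subset keeps every column, so a chart could map further attributes (marker, size) by splitting the subsets again.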
composition_func( <Chart(...)[1..*] or composition_func(...)>, <composition func options> )
Chart is designed to support a number of functions which compose charts together, whether by embedding two charts together, placing charts on a surface, and/or faceting charts into a grid layout. This interface is designed to take one or more Charts, or another compositional function.
It is possible that a compositional function will require access to the data that is input into the Chart. The functions and Chart should support delaying execution and retrieving data as needed, so that iterative use of Charts and functions doesn't require fundamentally changing how Chart is called on its own.
Overlay(Bar(df,...), Bar(df,...))
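One possible way to get the delayed execution described above is for a Chart call to return a lightweight specification that keeps a reference to its data, and for a composition function to accept either such specifications or other compositions. The sketch below assumes this design; ChartSpec, the charts() method, and the keyword field names are hypothetical.

```python
class ChartSpec:
    """Hypothetical deferred chart: records its inputs, renders nothing."""

    def __init__(self, kind, data, **fields):
        self.kind = kind
        self.data = data      # kept so compositions can reach the data later
        self.fields = fields

    def charts(self):
        # A single chart is a leaf of the composition tree.
        return [self]


class Overlay:
    """Hypothetical composition: accepts charts or other compositions."""

    def __init__(self, *children, **options):
        self.children = children
        self.options = options

    def charts(self):
        # Flatten nested compositions into the underlying chart specs.
        found = []
        for child in self.children:
            found.extend(child.charts())
        return found


def Bar(data, **fields):
    return ChartSpec("bar", data, **fields)


df = {"cat": ["a", "b"], "sales": [1, 2], "revenue": [3, 4]}
combined = Overlay(Bar(df, x="cat", y="sales"),
                   Overlay(Bar(df, x="cat", y="revenue")))
```

Because nothing is rendered at construction time, the Bar call looks the same whether it is used alone or nested inside a composition.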
Facet handles the generation of subsets of data, where each subset is positioned at a different plot coordinate. The traditional approach is to place each plot into a grid arrangement of plots, where each plot receives a coordinate of (0..n-1, 0..m-1), and n and m correspond to the number of unique values in the dimensions being faceted.
However, you could also facet plots into a more abstract location, such as a tab in a user interface (1 dimensional: tab#), a grid in a tab (3 dimensional: tab#, row, col), or a grid in a tab on a webpage (4 dimensional: page#, tab#, row, col).
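To make the coordinate idea concrete, here is a minimal sketch that assigns an integer coordinate to every combination of unique values in the faceted dimensions, with one axis per dimension. The function name facet_coordinates and the sample columns are assumptions made for illustration.

```python
from itertools import product


def facet_coordinates(data, dims):
    """Map each combination of unique values in `dims` to an integer
    coordinate with one axis per faceted dimension."""
    # Unique values per dimension, in first-seen order.
    levels = [list(dict.fromkeys(data[d])) for d in dims]
    coords = {}
    for combo in product(*levels):
        coords[combo] = tuple(lv.index(v) for lv, v in zip(levels, combo))
    return coords


data = {"region": ["east", "west", "east"], "year": [2014, 2014, 2015]}
grid = facet_coordinates(data, ["region", "year"])  # read as (row, col)
tabs = facet_coordinates(data, ["region"])          # read as (tab#,)
```

The same coordinates could be interpreted as grid cells, tabs, or pages; only the consumer of the coordinates needs to know which.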
facet(Scatter(df, 'col a', 'col b'), 'col c')
- Delayed Execution - To use the approach shown above, the Scatter chart's rendering must be delayed, and the chart must retain access to df. This approach is taken because interactive use of Scatter is most likely to start with Scatter alone, with faceting added afterwards. The user should not have to significantly modify their use of Scatter to add faceting.
- Generic Applicability - Most implementations of faceting center around the grid-based approach. To support grid, tab-based, or even geographic placement of charts, the approach should focus on yielding "coordinates" for a chart to be positioned into some abstract reference frame, whether the frame is cartesian, spherical, geographical, or abstract (web page).
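The two points above can be sketched together: a deferred Scatter that only stores its inputs, and a facet function that reads the stored data and returns one deferred chart per abstract coordinate. This is a sketch under the assumption that charts keep a reference to their data; the class and its points() method are hypothetical stand-ins for real rendering.

```python
class Scatter:
    """Hypothetical deferred scatter: stores data and field names only."""

    def __init__(self, data, x, y):
        self.data, self.x, self.y = data, x, y

    def points(self):
        # "Rendering" here just resolves the (x, y) pairs on demand.
        return list(zip(self.data[self.x], self.data[self.y]))


def facet(chart, field):
    """Return one deferred chart per unique value of `field`, keyed by an
    abstract 1-d coordinate. Relies on the chart still holding its data."""
    data = chart.data
    values = list(dict.fromkeys(data[field]))
    panels = {}
    for position, value in enumerate(values):
        keep = [i for i, v in enumerate(data[field]) if v == value]
        subset = {name: [col[i] for i in keep] for name, col in data.items()}
        panels[(position,)] = type(chart)(subset, chart.x, chart.y)
    return panels


df = {"col a": [1, 2, 3], "col b": [4, 5, 6], "col c": ["u", "v", "u"]}
plain = Scatter(df, "col a", "col b")                   # usable on its own
panels = facet(Scatter(df, "col a", "col b"), "col c")  # same call, wrapped
```

The Scatter call is identical in both uses, which is the property the Delayed Execution point asks for.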
- Faceting with Matplotlib
- facet_grid with ggplot2
- facet_wrap with ggplot2
- Faceting/Data Aware Grids with Seaborn
Below are examples of different user-oriented applications of the core Chart types.
index vs value, computation, grouping, stacking, colormapping
Bar(df, cat, values, grouped=True, agg='sum')
Bar(df, cat, values, grouped=False, stacked=True, agg='sum')
Bar(df, cat, values, grouped=True, stacked=True, agg='sum')
??
Bar(df, cat, values, grouped="A", stacked=False, agg='sum')
Bar(df, cat, values, stacked="B", agg='sum')
Bar(df, cat, values, grouped="A", stacked="B", agg='sum')
Bar(df, cat, values, grouped="A", stacked=["B","C"], agg='sum')
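As a hedged illustration of the computation that agg='sum' implies in the calls above, the sketch below sums a value column per category, optionally split by a second column such as "A". The function aggregate_sum and the sample data are hypothetical; how the resulting totals are drawn (side by side for grouped, accumulated for stacked) is left to the chart.

```python
from collections import defaultdict


def aggregate_sum(data, cat, values, group=None):
    """Sum `values` per category, optionally split by a `group` column.
    Returns {category: {group_value: total}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for i, category in enumerate(data[cat]):
        # Without a grouping column, all rows fall under the value field name.
        key = data[group][i] if group is not None else values
        totals[category][key] += data[values][i]
    return {c: dict(g) for c, g in totals.items()}


df = {
    "cat": ["x", "x", "y", "y", "y"],
    "A": ["p", "q", "p", "p", "q"],
    "values": [1, 2, 3, 4, 5],
}
ungrouped = aggregate_sum(df, "cat", "values")
by_a = aggregate_sum(df, "cat", "values", group="A")
```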
df | year | sales | dept | region | revenue
Bar(df, year, (sales, revenue))
index vs value, no computation, colormapping, marker_selection
index vs index, no computation, colormapping, marker_selection
index vs value, computation (binning), color and marker just selected
index vs value, no computation, colormapping, stack
2d-index vs value, computation, colormapping
index vs value, computation, colormapping
index vs value, computation, grouping, colormapping
index vs value, computation, grouping, stacking, colormapping, marker_selection
One opportunity with Charts is to specify additional metadata about the Chart, which can both reduce the edge cases the Chart must handle, and provide additional information for composing charts and controls for interactive applications. This enables another type of composition, view composition. Multiple Charts can be composed into a dashboard, to provide multiple views of the same data source.
For example, a Bar Chart implementor might decide that they only want to handle discrete data for the x axis, and continuous data for the y axis. They could handle the continuous data as the x-input in multiple ways.
- Check for the dtype of the array, and throw an error
- Use Chart modeling to automatically throw an error
- Use Chart modeling to automatically convert continuous data to discrete
The approach taken is the same used for modeling Glyphs, Widgets, etc. in Bokeh.
class MyBar(Chart):
    x = Discrete(transform=True)
    y = Continuous(name=['y', 'height'])
    grouped = Discrete(required=False, transform=True)
    horizontal = Boolean(required=False)
    constraints = [Either('x', 'y')]
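To show how a field type might cover the options listed earlier (raise an error, or convert continuous data to discrete), here is a minimal standalone sketch. It is not Bokeh's property system: this Discrete class, its bins argument, the equal-width binning, and the is_valid helper are all assumptions made for illustration.

```python
class Discrete:
    """Hypothetical field type: accepts string categories; with
    transform=True, numeric values are binned instead of rejected."""

    def __init__(self, required=True, transform=False, bins=2):
        self.required = required
        self.transform = transform
        self.bins = bins

    def resolve(self, values):
        if all(isinstance(v, str) for v in values):
            return list(values)
        if not self.transform:
            raise TypeError("expected discrete data")
        # Equal-width binning; fall back to width 1 when all values match.
        lo, hi = min(values), max(values)
        width = (hi - lo) / self.bins or 1
        return ["bin %d" % min(int((v - lo) / width), self.bins - 1)
                for v in values]


def is_valid(field, values):
    """Report whether `values` are acceptable for `field`."""
    try:
        field.resolve(values)
        return True
    except TypeError:
        return False


binned = Discrete(transform=True).resolve([0, 1, 9, 10])
```

With transform left at its default, the same numeric input is rejected, which corresponds to the "automatically throw an error" option.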
This modeling can enable generation of default selectors for an interactive application. Use of MyBar could be as follows:
interact(MyBar(df))
In this example, Bokeh can infer that the following should be generated, using a default layout:
<Row>
<Col>
<ColumnSelector name='x' />
<ColumnSelector name='y' />
<ColumnSelector name='grouped' />
<Checkbox name='horizontal' />
</Col>
<Col>
<MyBar />
</Col>
</Row>
Because the constraints are specified, the interactive widget will not regenerate MyBar unless the 'x' or 'y' selector is set to a valid column.
Note: The above example takes a React-like approach for demonstrating how Bokeh would interpret the Chart configuration, which could be used to provide a custom specification in a configuration file, with:
interact(MyBar(df), config='./custom_layout.dashboard')
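The constraint behaviour described earlier (no regeneration unless 'x' or 'y' refers to a valid column) could be checked along the following lines. This is a sketch only: either and should_regenerate are hypothetical helpers, and the selection is modeled as a plain dict of selector name to chosen column.

```python
def either(*names):
    """Hypothetical constraint: satisfied when at least one of the named
    selectors refers to an existing column."""
    def check(selection, columns):
        return any(selection.get(name) in columns for name in names)
    return check


def should_regenerate(selection, columns, constraints):
    """Regenerate the chart only when every constraint is satisfied."""
    return all(check(selection, columns) for check in constraints)


columns = ["year", "sales"]
constraints = [either("x", "y")]

unset = should_regenerate({"x": None, "y": None}, columns, constraints)
x_only = should_regenerate({"x": "year"}, columns, constraints)
bad_column = should_regenerate({"x": "missing"}, columns, constraints)
```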
Below is some discussion on different types of data that might be input into Charts, which can provide multiple use cases to ensure that the Charts API can handle the most likely cases.
During the Bokeh Days meetup, it was identified that the structure of the data can force users to change how they identify the columns to use for plot aesthetics. The main difference is between de-aggregated (normalized) and aggregated data. Someone who is using data directly from a database might see normalized forms more often.
- Wide: A table that spreads many measures over additional columns.
- Tall: A table that tends to put multiple measures in a single column, then identify the measure type in a separate column.
- Pivoted: A table that has been pre-aggregated via a pivoting-like operation, which places discrete categories along the columns and rows, then the intersection is typically characterized by a measure for the intersection.
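The relationship between the three layouts can be sketched with plain lists of row dicts: a wide table is unpivoted into a tall one, and the tall table is pivoted back into a category-by-category layout. The function names, the 'measure' and 'value' column names, and the sample rows are assumptions made for illustration.

```python
def wide_to_tall(rows, id_field, measure_fields):
    """Unpivot: one output row per (input row, measure) pair."""
    tall = []
    for row in rows:
        for measure in measure_fields:
            tall.append({id_field: row[id_field],
                         "measure": measure,
                         "value": row[measure]})
    return tall


def pivot(rows, row_field, col_field, value_field):
    """Pivot: categories along rows and columns, summed values in cells."""
    table = {}
    for row in rows:
        cells = table.setdefault(row[row_field], {})
        col = row[col_field]
        cells[col] = cells.get(col, 0) + row[value_field]
    return table


wide = [{"year": 2014, "sales": 10, "revenue": 7},
        {"year": 2015, "sales": 12, "revenue": 9}]
tall = wide_to_tall(wide, "year", ["sales", "revenue"])
pivoted = pivot(tall, "year", "measure", "value")
```

A Charts API that accepts the tall form can reach the other two through transformations like these, while the reverse direction loses row-level detail once values have been aggregated.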
The FiveThirtyEight data repository is a great source for data sets that go beyond the typical toy examples.
Most of its data is pre-aggregated.
- Drug Use by Age
- Bad Drivers by State
- [NCAA Modeling](https://github.com/fivethirtyeight/data/blob/master/historical-ncaa-forecasts/historical-538-ncaa-tournament-model-results.csv)
A dataset contains potentially many variables (columns, labeled, or unlabeled data). Bokeh should provide efficient methods for separating a global dataset into smaller datasets, which are then visualized in a way that separates each subset. You might use a variable to separate the data using color, shape, orientation, frame (faceting), etc. With charts, you should identify which variable should be used for which visual aspect of the plot, or group of plots.
- Cross (*) - a product of two separate variables. Like an outer join.
- Nest (/) - produces a decomposition of one categorical variable, into only the valid categories for another variable. You must know the domain metadata (valid categories) to perform nest operations. Can view this as the second variable conditioned on the first. (ex. nesting of sex={Male, Female}, Pregnant={True, False} would not produce (Male, True) in the result). Like a left or inner join.
- Blend (+) - combines two variables under a single variable. The operation is like an append.
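The three operators above can be sketched on plain lists of category values. This is an illustrative reading, not an existing Bokeh API: the function names are hypothetical, and nest takes the valid combinations as an explicit argument to stand in for the domain metadata mentioned above.

```python
from itertools import product


def cross(a_levels, b_levels):
    """Cross (*): every pairing of the two variables' categories."""
    return list(product(a_levels, b_levels))


def nest(a_levels, b_levels, valid):
    """Nest (/): only the pairings allowed by the domain metadata."""
    return [pair for pair in cross(a_levels, b_levels) if pair in valid]


def blend(a_values, b_values):
    """Blend (+): append one variable's values to the other's."""
    return list(a_values) + list(b_values)


sex = ["Male", "Female"]
pregnant = [True, False]
# Domain metadata: the combinations that can actually occur.
valid = {("Male", False), ("Female", True), ("Female", False)}

crossed = cross(sex, pregnant)
nested = nest(sex, pregnant, valid)
blended = blend([1, 2], [3, 4])
```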