Bokeh Days Working Document
Things to consider: dimensions, needs computation, splitting/reduction operations (facet, group, stack, overlay, aggregation, colormapping, marker_selection)
ALL CHARTS SUPPORT: [facet, overlay]
Chart( data, <form fields>, <surface fields>, <figure options> )
- data: dataframe-like, labeled arrays, where arrays accessible with data['field']. Typically, data is 'taller' than it is 'wide' (de-aggregated), but this could vary based on the chart.
- form: position (x, y, lat, lon, theta, etc.), shape, size, rotation, etc.
- surface: color (hue, brightness, saturation), texture, blur, transparency
- figure: valid kwargs to configure bokeh figure (title)
Chart takes a series of key-value pairs that specify which "dimensions"/"measures", identified by field name, map to which Chart attributes. The dimensions specified are used to split data into multiple subsets, which are then used to visualize the various chart attributes.
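As a minimal sketch of the splitting step described above, the following assumes the data is a plain dict of equal-length lists (one way to satisfy the data['field'] access pattern). The helper name split_by and the 'cyl' column are hypothetical and are not part of any Bokeh API.

```python
from collections import defaultdict


def split_by(data, field):
    """Split a dict-of-lists table into one sub-table per unique value of `field`."""
    # Each new key starts with an empty list for every column in the table.
    subsets = defaultdict(lambda: {name: [] for name in data})
    for i, key in enumerate(data[field]):
        for name, column in data.items():
            subsets[key][name].append(column[i])
    return dict(subsets)


# Illustrative data only; the values are made up.
data = {
    "mpg": [21.0, 22.8, 18.7, 14.3],
    "disp": [160.0, 108.0, 360.0, 360.0],
    "cyl": [6, 4, 8, 8],
}
subsets = split_by(data, "cyl")
```

Each subset keeps every column, so a chart could map one subset to one color, marker, or facet without further lookups.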
composition_func( <Chart(...)[1..*] or composition_func(...)>, <composition func options> )
Chart is designed to support a number of functions that compose charts together, whether by embedding two together, placing charts on a surface, and/or faceting charts into a grid layout. This interface is designed to take one or more Charts, or another compositional function.
It is possible that a compositional function will require access to the data that is input into the Chart. The functions and Chart should support delaying execution and retrieving data as needed, so that iterative use of Charts and functions doesn't require fundamentally changing how Chart is called on its own.
One opportunity with Charts is to build them on top of Blaze. What this would provide is a way to feed a Chart data directly from any abstract data source. This approach is something that will make Bokeh charts uniquely suited to building interactive dashboards, compared to existing capabilities (ggplot, matplotlib, etc.).
- Internal => Blaze
- Valid Inputs => Anything Blaze can convert to ColumnDataSource
Overlay(Bar(df,...), Bar(df,...))
Facet is handling the generation of subsets of data, where each subset is positioned into a different plot coordinates. The traditional approach to this is to place each plot into a grid arrangement of plots, where each plot receives a coordinate of (0..n-1, 0..m-1), where n and m correspond to the number of unique values in the dimensions being faceted.
However, you could also take the approach of faceting plots into a more abstract location, such as a tab in a user interface (1 dimensional: tab#), or a grid in a tab (2 dimensional: tab#, row, col), or a grid in a tab on a webpage (4 dimensional, page#, tab#, row, col).
facet(Scatter(df, 'col a', 'col b'), 'col c')
- Delayed Execution - To use the approach shown above, the Scatter chart's rendering must be delayed, and must have access to df. The reason for taking this approach is because the interactive use of Scatter is most likely to start with using Scatter, then adding in faceting afterwards. The user should not have to significantly modify their use of Scatter to add in faceting.
- Generic Applicability - Most implementations of faceting center around the grid-based approach. To support grid, tab-based, or even geographic placement of charts, the approach should focus on yielding "coordinates" for a chart to be positioned into some abstract reference frame, whether the frame is cartesian, spherical, geographical, or abstract (web page).
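The two requirements above can be sketched together. This is only an illustration under stated assumptions: LazyChart is a hypothetical stand-in for a Chart that stores its inputs instead of rendering immediately, rendering is simulated by returning the (x, y) pairs that would be drawn, and facet yields an abstract coordinate tuple per subset rather than placing anything in a grid.

```python
from itertools import product


class LazyChart:
    """Hypothetical chart that keeps its data and defers rendering."""

    def __init__(self, data, x, y):
        self.data, self.x, self.y = data, x, y

    def render(self, data=None):
        # Simulated rendering: return the (x, y) pairs that would be drawn.
        source = self.data if data is None else data
        return list(zip(source[self.x], source[self.y]))


def facet(chart, *fields):
    """Yield (coordinate, rendered subset) for each combination of facet values."""
    levels = [sorted(set(chart.data[f])) for f in fields]
    n_rows = len(chart.data[fields[0]])
    for coord in product(*(range(len(lv)) for lv in levels)):
        wanted = tuple(lv[i] for lv, i in zip(levels, coord))
        # Keep only the rows whose facet values match this coordinate.
        keep = [r for r in range(n_rows)
                if tuple(chart.data[f][r] for f in fields) == wanted]
        subset = {name: [col[r] for r in keep] for name, col in chart.data.items()}
        yield coord, chart.render(subset)


df = {"col a": [1, 2, 3, 4], "col b": [10, 20, 30, 40], "col c": ["u", "v", "u", "v"]}
panels = dict(facet(LazyChart(df, "col a", "col b"), "col c"))
```

Because the coordinate is just a tuple, the same generator could feed a grid (row, col), a tab index, or any other reference frame; the chart call itself is unchanged when faceting is added.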
- Faceting with Matplotlib
- facet_grid with ggplot2
- facet_wrap with ggplot2
- Faceting/Data Aware Grids with Seaborn
Below are examples of different user-oriented applications of the core Chart types.
A type of plot that places a glyph directly into a coordinate system without any aggregation. Another variable can be mapped into attributes of the glyph, such as the color, rotation, transparency, etc.
- GoG Example: point(position(d*r))
- Bokeh Equivalent: Scatter(d, r, glyph=circle) #circle=default
Properties:
- Rectangular coordinates
- Performs a cross between the variables - index vs index
- no aggregation for positional placement of graphics, but possible for size, color, etc.
- Options: color, marker, line, size
One Input: When provided a single input of either value/cat, a scatter will cross the variable with 1, yielding a special "None" axis, that places all points equally spaced from the primary axis, but with no label.
- Scatter(values)
- Scatter(cat)
- Scatter(values, 'index') # 'index' => range(len(values))
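One possible reading of the one-input cases, as a rough sketch: the function name scatter_points is hypothetical, and pairing each value with the constant 1 is an assumption about what crossing the variable with 1 would produce.

```python
def scatter_points(values, other=None):
    """Return the (x, y) pairs a one-input scatter might place."""
    if other == "index":
        # 'index' => range(len(values)): pair each value with its position.
        return list(zip(values, range(len(values))))
    # Cross with the constant 1: every point shares one unlabeled position
    # on the secondary axis.
    return [(v, 1) for v in values]


with_index = scatter_points([5, 7, 9], "index")
without_index = scatter_points([5, 7])
```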
A specific type of line plot that uses a time axis, and likely comes in the Stock style of table, but doesn't have to.
- GoG Example: line(position(d*r))
- Bokeh Equivalent: Scatter(df, 'd', 'r', glyph=line)
index vs value, no computation, colormapping, marker_selection
Plots a variable against the index of each value, connected by a line.
- GoG Example: N/A
- ggplot2 Example: qplot(seq_along(x), x, geom="step")
- Bokeh Equivalent: Step(df, 'd', start=<'h'/'v'>)
Uses some kind of LLA, ECEF, ECER or other geographic coordinate system.
- GoG Example:
ELEMENT: polygon(position(longitude*latitude))
ELEMENT: point(position(longitude*latitude))
- ggplot2 Example: ggplot() + geom_polygon(data=counties, aes(x=long, y=lat, group=group)) + geom_point(data=mapdata, aes(x=x, y=y), color="red")
- Bokeh Equivalent: Map(df, 'lon', 'lat', glyph=circle)
This group of charts requires some computation be processed on groups of values associated with categories.
This category works by grouping by some categorical variable, then performing some aggregation function. One special case could be when you have a single value for each unique cat, you just plot the value.
Input Cases:
- categorical
- categorical vs values
- categorical vs count/proportion(categorical)
Options:
- computation
- grouping
- stacking
- colormapping
- GoG Example: interval(position(d*r))
- Bokeh Equivalent: Bar(df, 'd', 'r')
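The group-then-aggregate step behind a bar chart can be sketched as follows, assuming a dict-of-lists table. The function names and the AGGREGATIONS lookup are hypothetical; they only illustrate how an agg option such as 'sum' might be resolved and how the count/proportion input case could be computed.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from an `agg` option to a reducing function.
AGGREGATIONS = {
    "sum": sum,
    "mean": lambda vals: sum(vals) / len(vals),
    "count": len,
}


def aggregate(data, cat, values=None, agg="sum"):
    """Group `values` by the categorical column `cat`, then reduce each group."""
    if values is None:
        # Categorical-only input: the bar height is the count per category.
        return dict(Counter(data[cat]))
    groups = defaultdict(list)
    for key, value in zip(data[cat], data[values]):
        groups[key].append(value)
    return {key: AGGREGATIONS[agg](vals) for key, vals in groups.items()}


def proportions(counts):
    """Convert per-category counts into proportions of the total."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Illustrative data only; the values are made up.
df = {"dept": ["a", "b", "a", "b", "a"], "sales": [1, 2, 3, 4, 5]}
totals = aggregate(df, "dept", "sales", agg="sum")
counts = aggregate(df, "dept")
```

When each category has exactly one value, the sum of a single-element group is that value, which covers the special case of plotting the value directly.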
The grammar of graphics term for the grouped bar is dodge.
- GoG Example: interval.dodge(position(d*r), color(c))
Bokeh Equivalent:
- Bar(df, cat, values, grouped=True, agg='sum')
- Bar(df, cat, values, grouped="A", stacked=False, agg='sum')
- GoG Example: interval.stack(position(summary.proportion(r)), color(c))
- GoG Example: interval.stack(position(summary.proportion(d*r)), color(c))
Bokeh Equivalent:
Bar(df, cat, values, grouped=False, stacked=True, agg='sum')
Bar(df, cat, values, stacked="B", agg='sum')
-
Bar(df, cat, values, grouped=True, stacked=True, agg='sum')
?? Bar(df, cat, values, grouped="A", stacked="B", agg='sum')
Bar(df, cat, values, grouped="A", stacked=["B","C"], agg='sum')
df | year | sales | dept | region | revenue
Bar(df, year, (sales, revenue))
- GoG Example:
COORD: polar.theta(dim(1))
ELEMENT: interval.stack(position(summary.proportion(r)), color(c))
Very similar to a bar chart, except it can be overlaid without an issue with overlap. marker_selection
index vs value, computation (binning), color and marker just selected
- GoG Example: interval(position(summary.count(bin.rect(y))))
- GoG Example: area(position(smooth.density.kernel(y)))
index vs value, no computation, colormapping, stack
2d-index vs value, computation, colormapping
index vs value, computation, colormapping
index vs value, computation, grouping, colormapping
One opportunity with Charts is to specify additional metadata about the Chart, which can both reduce the edge cases the Chart must handle, and provide additional information for composing charts and controls for interactive applications. This enables another type of composition, view composition. Multiple Charts can be composed into a dashboard, to provide multiple views of the same data source.
For example, a Bar Chart implementor might decide that they only want to handle discrete data for the x axis, and continuous data for the y axis. They could handle the continuous data as the x-input in multiple ways. 1. Check for the dtype of the array, and throw an error 2. Use Chart modeling to automatically throw an error 3. Use Chart modeling to automatically convert continuous data to discrete
The approach taken is the same used for modeling Glyphs, Widgets, etc. in Bokeh.
class MyBar(Chart):
x = Discrete(transform=True)
y = Continuous(name=['y', 'height'])
grouped = Discrete(required=False, transform=True)
horizontal = Boolean(required=False)
constraints = [Either('x', 'y')]
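To make the three handling options concrete, here is a self-contained sketch in plain Python. It does not use Bokeh's property system; Discrete, Continuous, and Chart below are simplified stand-ins written for this illustration, the name, required, and constraints features are omitted, and converting continuous data to discrete by calling str() is an assumption (a real implementation might bin the values instead).

```python
from numbers import Real


class Discrete:
    """Simplified field spec: accepts string columns, optionally converting others."""

    def __init__(self, transform=False):
        self.transform = transform

    def validate(self, name, column):
        if all(isinstance(v, str) for v in column):
            return list(column)
        if self.transform:
            # Option 3: automatically convert non-discrete data.
            return [str(v) for v in column]
        # Option 2: the model raises the error, not the chart author.
        raise TypeError(f"{name!r} expects discrete data")


class Continuous:
    """Simplified field spec: accepts numeric columns only."""

    def validate(self, name, column):
        if all(isinstance(v, Real) and not isinstance(v, bool) for v in column):
            return list(column)
        raise TypeError(f"{name!r} expects continuous data")


class Chart:
    def __init__(self, data, **mapping):
        # Validate each mapped column against the spec declared on the class.
        self.columns = {}
        for attr, field in mapping.items():
            spec = getattr(type(self), attr)
            self.columns[attr] = spec.validate(attr, data[field])


class MyBar(Chart):
    x = Discrete(transform=True)
    y = Continuous()


# Illustrative data only; the values are made up.
data = {"year": [2014, 2015], "sales": [1.5, 2.5], "dept": ["a", "b"]}
bar = MyBar(data, x="year", y="sales")
```

With this shape, the chart implementor declares expectations once on the class, and the base class decides whether to raise or convert.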
This modeling can enable generation of default selectors for an interactive application. Use of MyBar could be as follows:
interact(MyBar(df))
In this example, Bokeh can infer that the following should be generated, using a default layout:
<Row>
<Col>
<ColumnSelector name='x' />
<ColumnSelector name='y' />
<ColumnSelector name='grouped' />
<Checkbox name='horizontal' />
</Col>
<Col>
<MyBar />
</Col>
</Row>
Because the constraints are specified, the interactive widget will not regenerate MyBar unless the 'x' or 'y' selectors are set to a valid column.
Note: The above example takes a React-like approach for demonstrating how Bokeh would interpret the Chart configuration, which could be used to provide a custom specification in a configuration file, with:
interact(MyBar(df), config='./custom_layout.dashboard')
Below is some discussion on different types of data that might be input into Charts, which can provide multiple use cases to ensure that the Charts API can handle the most likely cases.
During the Bokeh Days meetup, it was identified that the data structures can cause the user to need to change how they would identify the columns to use for plot aesthetics. The main difference is between de-aggregated (normalized) and aggregated data. Someone that is using data directly from a database might see normalized forms more often.
NOTE: The following labels are used when describing the input types in the chart specifications.
A table that spreads many measures over additional columns. Very few index-like columns, with multiple columns containing similar types of measurements. A good identifier for wide table types is if you need to think about iterating over columns, it might lean more towards the wide category.
Sensor: | index | Measure1 | Measure2 | Measure3 |
A special case of a wide table is time series data, which has a single measure type (e.g. profits) rather than multiple measurements, and where each series represents a category of a categorical variable (e.g. Company Name). Technically, this is a form of pivoted data.
Stock: | Date | Series1 | Series2 | Series3 |
A table that tends to put multiple measures in a single column, then identify the measure type in a separate column.
The tall form of Sensor data is the following:
Stacked Sensor: | index | Measure Type | value |
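The conversion from the wide Sensor table to the tall Stacked Sensor form can be sketched as below, assuming a dict-of-lists table. The function name stack is hypothetical; dataframe libraries offer comparable reshaping operations.

```python
def stack(table, id_field, type_label="Measure Type", value_label="value"):
    """Turn a wide table into tall form: one row per (index, measure) pair."""
    measures = [name for name in table if name != id_field]
    tall = {id_field: [], type_label: [], value_label: []}
    for i, key in enumerate(table[id_field]):
        for name in measures:
            tall[id_field].append(key)
            tall[type_label].append(name)      # which measure this row holds
            tall[value_label].append(table[name][i])
    return tall


# Illustrative data only; the values are made up.
sensor = {"index": [0, 1], "Measure1": [1.0, 1.1], "Measure2": [2.0, 2.1]}
tall = stack(sensor, "index")
```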
The more likely tall type of data you will see in the real world is data that has been joined, or de-normalized. In this case, you might have multiple tables that have been brought together into a tall table, with a combination of multiple categorical and numerical columns. This kind of table provides many opportunities to focus on different categorical columns, while collapsing the numerical columns with some type of aggregation.
Business: | click id | visit id | visit type | user name | revenue | platform | device |
A table that has been pre-aggregated via a pivoting-like operation, which places discrete categories along the columns and rows, then the intersection is typically characterized by a measure for the intersection.
Aggregated:
| product category | Month 1 | Month 2 | Month 3 | ... |
|:-----------------|:--------|:--------|:--------|:----|
| computers | $100 | $250 | $225 | ... |
| mobile | ... | ... | ... | ... |
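As a rough sketch of how such an aggregated table arises, the following pivots a tall dict-of-lists table by summing a measure at each (row, column) intersection. The function name pivot, the 'month' and 'revenue' column names, and the individual values are made up for illustration; only the computers totals mirror the table above.

```python
from collections import defaultdict


def pivot(data, row_field, col_field, value_field):
    """Sum `value_field` at each (row category, column category) intersection."""
    table = defaultdict(lambda: defaultdict(float))
    for r, c, v in zip(data[row_field], data[col_field], data[value_field]):
        table[r][c] += v
    return {r: dict(cols) for r, cols in table.items()}


tall = {
    "product category": ["computers", "computers", "computers", "mobile"],
    "month": ["Month 1", "Month 1", "Month 2", "Month 1"],
    "revenue": [60.0, 40.0, 250.0, 75.0],
}
aggregated = pivot(tall, "product category", "month", "revenue")
```

A chart fed the aggregated form has to iterate over columns to recover the month variable, which is the difficulty with pre-aggregated inputs noted in this section.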
Great source for data sets that go beyond the typical toy examples.
Most of their data is pre-aggregated.
- Baseball Sqlite Database (suggested by blaze docs)
- Ergast Formula 1/E MySql Database - rothnic has a partial wrapper around the REST API with formulapy
A dataset contains potentially many variables (columns, labeled, or unlabeled data). Bokeh should provide efficient methods for separating a global dataset into smaller datasets, which are then visualized in a way that separates each subset. You might use a variable to separate the data using color, shape, orientation, frame (faceting), etc. With charts, you should identify which variable should be used for which visual aspect of the plot, or group of plots.
- Cross (*) - a product of two separate variables. Like an outer join.
- Nest (/) - produces a decomposition of one categorical variable, into only the valid categories for another variable. You must know the domain metadata (valid categories) to perform nest operations. Can view this as the second variable conditioned on the first. (ex. nesting of sex={Male, Female}, Pregnant={True, False} would not produce (Male, True) in the result). Like a left or inner join.
- Blend (+) - combines two variables under a single variable. The operation is like an append.
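The three operators can be illustrated on small category lists. This is a simplified sketch that works on the domains of the variables rather than on full datasets; the function names and the explicit `valid` set standing in for domain metadata are assumptions.

```python
from itertools import product


def cross(a, b):
    """Cross (*): every pairing of the two variables' categories."""
    return list(product(a, b))


def nest(a, b, valid):
    """Nest (/): only the pairings that the domain metadata marks as valid."""
    return [pair for pair in product(a, b) if pair in valid]


def blend(a, b):
    """Blend (+): append one variable's values to the other's."""
    return list(a) + list(b)


sex = ["Male", "Female"]
pregnant = [True, False]
# Domain metadata: (Male, True) is not a valid combination.
valid = {("Male", False), ("Female", True), ("Female", False)}

crossed = cross(sex, pregnant)
nested = nest(sex, pregnant, valid)
blended = blend([1, 2], [3])
```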
- Can we come up with a consistent approach for dodge and stack, as operations, at least internally?
  - Where does this apply, other than with Bar?
  - How does this relate to overlay?
- How can we handle facet-like layouts that stay within the same frame? A grouped bar chart is similar to faceting, except with a merged y axis and a dodged x axis.
- Prototype of a few Charts
- Scatter, Bar
- Create models for Chart input types
- Column (and specific column types), Option, etc.
- Prototype feeding chart directly with out of core source
Chart('./path/to/hdf5::auto', 'mpg', 'disp', ... )
Chart('sqlite://path/to/sqlite::auto', 'mpg', 'disp', ... )
- Reconcile concepts, where it makes sense, with Vega
- (Long Term) Prototype driving Chart with Polestar