Bokeh Days Working Document
- Grammar: a practical, pythonic implementation of the grammar of graphics
  - ggplot-like plotting of data frames
  - ease of use over perfect GoG implementation
- Data Agnostic: leverages Blaze internally to be data size and type agnostic
- Dashboard Building Block: model chart properties, in same way as glyphs, opening door for quick and simple dashboards
- Declarative: charts produce a specification of the collection of glyphs to be rendered (like Vega/Vega-lite), supporting ease of composing chart functions affecting dodging, stacking, faceting, etc.
  - i.e., tweaking a visualization by adding faceting should be a natural extension, without requiring rework of inputs into the existing visualization
User downloaded a csv file of vehicle data, not knowing column types, etc. This fake dataset demonstrates a typical denormalized or joined example, and includes cases that may be difficult for visualization.
make | cyl | mpg | displ | type | model | year | mt_coy | engine | style |
---|---|---|---|---|---|---|---|---|---|
Ford | 6 | 22 | 3.0 | Car | Taurus | 1999 | 0 | Duratec 30 | SE Sedan 4D |
- make: the car manufacturer, which is a string, considered a categorical column or dimension
- cyl: a numerical categorical column/dimension, representing the number of cylinders for the car
- mpg: a continuous measure of miles per gallon for the car
- displ: a continuous measure of the displacement of the engine
- type: the kind of vehicle, categorical variable as a string (car, truck, etc)
- model: string name of the specific car model, dimension and nested to make
- year: a datepart year, likely provided as integer, considered a dimension in this case
- mt_coy: a logical variable stored as 0/1 to indicate whether the car model was motor trend car of the year for that year in its vehicle type category
- engine: the engine model as a string, sometimes used across makes/models, a dimension with a likely large quantity of uniques
- style: string describing specifics about a model configuration, this separates the multiple entries for a single model on the same year, with potentially different engines
With this data, it is difficult to directly generate simple bar charts without some kind of aggregation, and there are many different ways that you could group and compare the two measures available.
# User imports charts
from bokeh.charts import facet, Bar, Data
# Utilize wrapper around blaze to assist in loading arbitrary data, interactive tabular display
d = Data('./mydata.csv')
d # shows interactive, sortable table
# generate bar chart
Bar(d, 'make', 'mpg', agg='avg')
# add faceting by cylinders
facet(Bar(d, 'make', 'mpg', agg='avg'), 'cyl')
# add grouping by vehicle type
facet(Bar(d, 'make', 'mpg', agg='avg', grouped='type'), 'cyl')
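The aggregation implied by agg='avg' in the calls above can be sketched without any charting library. This is a minimal illustration, not the proposed implementation: the helper name average_by and the sample rows are hypothetical, and a real version would delegate the grouping to Blaze.

```python
from collections import defaultdict

# Hypothetical rows shaped like the vehicle table (values invented for illustration).
rows = [
    {"make": "Ford", "cyl": 6, "mpg": 22},
    {"make": "Ford", "cyl": 4, "mpg": 30},
    {"make": "Toyota", "cyl": 4, "mpg": 32},
]

def average_by(rows, dimension, measure):
    """Group rows by `dimension` and return the mean of `measure` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {key: sum(values) / len(values) for key, values in groups.items()}

# One bar height per make.
avg_mpg = average_by(rows, "make", "mpg")
```

Each key of the result would become one bar, with the averaged measure as its height.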
Chart( data, <form fields>, <surface fields>, <figure options> )
- data: dataframe-like, labeled arrays, where arrays accessible with data['field']. Typically, data is 'taller' than it is 'wide' (de-aggregated), but this could vary based on the chart.
- form: position (x, y, lat, lon, theta, etc.) shape, size, rotation, etc.
- surface: color (hue, brightness, saturation), texture, blur, transparency
- figure: valid kwargs to configure bokeh figure (title, output file, etc.)
Chart takes a series of key value pairs that specifies which "dimensions"/"measures", identified by the field name, map to which Chart attributes. The dimensions specified are used to split data into multiple subsets, which then are used to visualize the various chart attributes
Chart is designed to support a number of functions which compose charts together, either by embedding two together, placing charts on a surface, and/or faceting charts into a grid layout. This interface is designed to take one or more Charts, or another compositional function.
composition_func( <Chart(...)[1..*] or composition_func(...)>, <composition func options> )
It is possible that a compositional function will require access to the data that is input into the Chart. The functions and Chart should support delaying execution and retrieving data as needed, so that iterative use of Charts and functions doesn't require fundamentally changing how Chart is called on its own.
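One way to picture delayed execution is a chart object that only records its inputs, so that a compositional function can still reach the data before anything is drawn. The sketch below is an assumption-laden illustration: ChartSpec, its render method, and this facet signature are hypothetical stand-ins, not Bokeh APIs.

```python
from dataclasses import dataclass, field

@dataclass
class ChartSpec:
    """Hypothetical deferred chart: records inputs, draws nothing until asked."""
    kind: str
    data: list
    x: str
    y: str
    options: dict = field(default_factory=dict)

    def render(self):
        # Stand-in for real rendering: return the (x, y) pairs that would be drawn.
        return [(row[self.x], row[self.y]) for row in self.data]

def facet(spec, column):
    """Split a deferred chart into one spec per distinct value of `column`."""
    values = sorted({row[column] for row in spec.data})
    return {
        value: ChartSpec(spec.kind,
                         [r for r in spec.data if r[column] == value],
                         spec.x, spec.y, dict(spec.options))
        for value in values
    }

data = [
    {"make": "Ford", "mpg": 22, "cyl": 6},
    {"make": "Ford", "mpg": 30, "cyl": 4},
    {"make": "Toyota", "mpg": 32, "cyl": 4},
]
bar = ChartSpec("bar", data, "make", "mpg")  # the same call with or without faceting
panels = facet(bar, "cyl")                   # composition reads bar.data lazily
```

Because the chart call itself is unchanged, faceting is added by wrapping rather than by rewriting the original expression.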
One opportunity with Charts is to build them on top of Blaze. What this would provide is a way to feed a Chart data directly from any abstract data source. This approach is something that will make Bokeh charts uniquely suited to building interactive dashboards, compared to existing capabilities (ggplot, matplotlib, etc.).
- Internal => Blaze
- Valid Inputs => A valid blaze resource with columnar-accessible fields. (dict of lists, pandas DataFrame, sqlite database table)
Reference to Blaze ColumnDataSource Conversion
What does this mean for using Charts? Any of the following are valid and handled identically:
Chart(df, 'mpg', 'displ', ... )
Chart('./path/to/hdf5::auto', 'mpg', 'displ', ... )
Chart('sqlite://path/to/sqlite::auto', 'mpg', 'displ', ... )
One implication with this approach is that you must think about what capabilities are available in Blaze, in regard to aggregation types. Additionally, you must both group and aggregate in the same operation with blaze, which is more limited as compared to Pandas. With Pandas, you can group, then iterate over groups to perform any custom operations. This isn't possible with Blaze.
One thing that would be useful is the ability to facilitate data joins across tables via an API and/or a web interface that Blaze would handle, à la Tableau.
Bokeh contains many low level glyphs, which can be customized to specific needs. It might make sense in some cases to implement the customized glyphs as higher-order glyphs, so they are available for reuse, and they can serve as a specification for the specific formatting that is being requested.
For example, the differences between a Step chart and a Line chart are minimal. If the specific formatting were contained in a glyph, the Line chart could be used directly and supplied with the special glyph. A thin wrapper could then create a Step chart that just provides the Step(...) interface.
Where the higher level glyphs would differ from Charts is that there would be a one-to-one relationship between creating a glyph and the objects created. There would be no handling of dataframes, or coloring by categories. Instead, they would take some semi-processed, more direct form of the data.
bar(data.mpg.sum())
dodge(bar(1), bar(2), bar(3))
Potential Higher Level Glyphs:
- step
- bar
- box
- area
These could also open the door for completely unique plots via the use of the overlay operation.
This section discusses functionality that would be utilized across many chart implementations.
Often, charts will take a categorical column(s), and interpret that to mean we should find the distinct items in the column(s), then assign unique attributes to the glyphs to be rendered to represent them.
Things like coloring and faceting will need to turn continuous data into discrete.
Not only do we need the discrete data, we need a "pretty" form of it for communicating what was performed.
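A minimal sketch of discretizing a continuous column while also producing readable labels is shown here. It assumes equal-width bins; the function name discretize and the label format are illustrative choices, not part of the proposal.

```python
def discretize(values, bins):
    """Assign each value to one of `bins` equal-width bins and label each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # Every value is identical: a single degenerate bin.
        return [0] * len(values), [f"[{lo:g}, {hi:g}]"]
    edges = [lo + i * width for i in range(bins + 1)]
    labels = [
        f"[{edges[i]:g}, {edges[i + 1]:g}" + ("]" if i == bins - 1 else ")")
        for i in range(bins)
    ]
    # The top edge is inclusive so the maximum lands in the last bin.
    indices = [min(int((v - lo) / width), bins - 1) for v in values]
    return indices, labels

mpg = [10, 15, 20, 25, 30]  # hypothetical measure values
indices, labels = discretize(mpg, bins=2)
```

The indices can drive coloring or faceting, while the labels give the legend or facet titles a "pretty" description of what was binned.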
A generic tool for producing unique visual aesthetics for unique values in one to many columns. This would likely provide a nice interface for utilizing itertools to generate the combination of aesthetics (color, marker, line style, etc.).
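The itertools idea can be sketched as follows. This is only an illustration under stated assumptions: assign_aesthetics and its keyword-argument palettes are hypothetical names, and the palettes shown are arbitrary.

```python
from itertools import cycle, product

def assign_aesthetics(rows, columns, **palettes):
    """Map each distinct combination of `columns` to a combination of aesthetics."""
    keys = sorted({tuple(row[c] for c in columns) for row in rows})
    names = list(palettes)
    # Cross every palette, then recycle if there are more keys than combinations.
    combos = cycle(product(*(palettes[n] for n in names)))
    return {key: dict(zip(names, combo)) for key, combo in zip(keys, combos)}

rows = [
    {"type": "Car", "cyl": 4},
    {"type": "Truck", "cyl": 6},
    {"type": "Car", "cyl": 6},
]
styles = assign_aesthetics(rows, ("type",),
                           color=["red", "blue"], marker=["circle"])
```

Passing several column names crosses them, so each unique pair of values receives its own color and marker combination.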
Quickly utilizing smart color palettes, without need to manually specify colors is essential for quick exploratory analysis. The proposal is that charts utilizing coloring inputs would allow:
Chart(..., color='cat 1')
Chart(..., color=['cat 1', 'cat 2'])
# could interpret as needing a sequential color map, but what about discrete?
Chart(..., color=['val 1'])
# using a color function that returns an object or iterable can support more specific coloring
Chart(..., color=color('val 1', bins=5))
Chart(..., color=color('cat 1', brightness='cat 2'))
Additional things this should address:
- hierarchical coloring of groups (nesting)
- varying shade of a single color
palettable looks like a potential option for quickly adding many popular colormaps
Similar issue as with color, except we may want to iterate over a combination of colors and markers
Chart operations that may or may not be user-facing, but should be optional methods implemented on the custom Chart implementation, if special behavior is required.
Operations should be lowercase in most cases, compared to the uppercase Chart names.
ALL CHARTS SUPPORT: [facet, overlay]
This concept is used to simply merge two plots that share the same axes. You may do this internally when grouping by a categorical variable, then coloring glyphs differently for the cats.
overlay(Bar(df,...), Bar(df,...))
Sometimes stack and dodge can produce a similar-appearing graphic. Stack cumulates on a scale (e.g., a stacked bar chart) while dodge piles things in open space (e.g., a tally or dot plot) (GoG p168).
The stack method cumulates elements in order of the values on a splitter. For example, we can make a stacked bar chart by having superimposed bars stack on their second categorical dimension. The standard stacking option is asymmetric; that is, the bottoms of stacks are anchored on a common position. The other option is symmetric; the centers of stacks are anchored on a common position. (GoG p168)
Produces a relative shift to the other elements within the deepest dimension, on the scale of the measure.
# the deepest dimension is stacked on
# produces single stacked bar of height 15
# 'a' is 5, 'b' is 10
stack(bar((1, 'a'), 5), bar((1, 'b'), 10))
This generates two rects:
r1 = rect(x=1, y=2.5, width=0.5, height=5)
r2 = rect(x=1, y=r1.y + (r1.height/2) + (10/2), width=0.5, height=10)
Note: stacking does not specify coloring, but it could be applied by default.
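The stacking arithmetic can be checked with a short sketch. It assumes rects are positioned by their center, as Bokeh's rect glyph is, and that the stack is anchored at a baseline of 0; stack_rects is a hypothetical helper.

```python
def stack_rects(x, heights, width=0.5):
    """Return center-anchored rects stacked bottom-up from a baseline of 0."""
    rects, bottom = [], 0.0
    for height in heights:
        # The center sits half a height above the running top of the stack.
        rects.append({"x": x, "y": bottom + height / 2,
                      "width": width, "height": height})
        bottom += height
    return rects

r1, r2 = stack_rects(1, [5, 10])
top_of_stack = r2["y"] + r2["height"] / 2
```

With heights 5 and 10 the second rect is centered at 10, so the stack spans 0 to 15.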
The dodge method does not cumulate. It simply moves objects around locally so they do not collide. (GoG p168)
The common purpose of dodge is to provide a way to group like elements around each other on a hierarchical scale (nesting), which is what a grouped bar chart is doing. The grouping is performed by assigning coordinates relative to the positions of the parent category.
For example, take the auto dataset described in the opening. The unique car makes and car types can be nested, where some types may not exist for some makes. You end up with a group for each make (assigned to position 1, 2, 3, ...), and each type receives a position within the bounds of its group: for the group at position 1, offsets of -0.25, 0, and +0.25 give 0.75, 1, and 1.25.
So, dodge is just calculating relative positioning for grouped elements.
# bar is special rect/interval fixed at min=0
# this is 1 main group, with an 'a', and 'b' subgroup
dodge(bar((1, 'a'), 5), bar((1, 'b'), 10))
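The relative positioning that dodge computes can be sketched as below. The helper dodge_positions and its group_width default are assumptions chosen to reproduce offsets of -0.25, 0, and +0.25 for three members.

```python
def dodge_positions(group_index, members, group_width=0.75):
    """Spread `members` evenly around `group_index`, within `group_width`."""
    slot = group_width / len(members)
    # Center of the first slot, measured from the left edge of the group.
    start = group_index - group_width / 2 + slot / 2
    return {member: start + i * slot for i, member in enumerate(members)}

positions = dodge_positions(1, ["car", "truck", "suv"])
```

Only the members present in a group are passed in, so a make with two vehicle types would get two wider slots rather than an empty gap.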
Used to make it easier to spot glyphs that might be lying directly on top of each other. This is common when using a scatter plot and having one or both dimensions as categorical variables.
Jitter method moves objects randomly in their local neighborhood. (GoG p168)
jitter(Chart(...), mean=<float>)
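A minimal sketch of jittering positions follows. It uses uniform noise within a fixed window, which is one possible choice rather than the proposed one; the width and seed parameters are hypothetical and differ from the mean argument in the signature above.

```python
import random

def jitter(positions, width=0.2, seed=None):
    """Shift each position by uniform noise in [-width/2, +width/2]."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    return [p + rng.uniform(-width / 2, width / 2) for p in positions]

# Four points, three of which overlap on the same categorical position.
original = [1, 1, 1, 2]
jittered = jitter(original, width=0.2, seed=0)
```

Keeping the window narrower than the spacing between categories ensures that jittered points never drift into a neighboring category.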
A grid already exists in Bokeh, need to look at how it fits in context of charts.
Horizontal layout already exists in Bokeh, need to look at how it fits in context of charts.
Vertical layout already exists in Bokeh, need to look at how it fits in context of charts.
Facet is handling the generation of subsets of data, where each subset is positioned into a different plot coordinates. The traditional approach to this is to place each plot into a grid arrangement of plots, where each plot receives a coordinate of (0..n-1, 0..m-1), where n and m correspond to the number of unique values in the dimensions being faceted.
However, you could also take the approach of faceting plots into a more abstract location, such as a tab in a user interface (1 dimensional: tab#), or a grid in a tab (2 dimensional: tab#, row, col), or a grid in a tab on a webpage (4 dimensional, page#, tab#, row, col).
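The idea of yielding coordinates for each subset can be sketched as follows. Both helpers are hypothetical; the first mirrors a two-dimensional grid and the second a wrapped one-dimensional layout, and a tab or page index could be added as a further tuple element in the same way.

```python
def facet_grid_coordinates(rows, row_dim, col_dim):
    """Map each (row value, column value) pair to a (row, col) grid cell."""
    row_values = sorted({r[row_dim] for r in rows})
    col_values = sorted({r[col_dim] for r in rows})
    return {(rv, cv): (i, j)
            for i, rv in enumerate(row_values)
            for j, cv in enumerate(col_values)}

def facet_wrap_coordinates(rows, dim, ncols):
    """Lay out one dimension's values left to right, wrapping after `ncols`."""
    values = sorted({r[dim] for r in rows})
    return {v: divmod(i, ncols) for i, v in enumerate(values)}

rows = [
    {"type": "Car", "cyl": 4},
    {"type": "Car", "cyl": 6},
    {"type": "Truck", "cyl": 8},
]
grid = facet_grid_coordinates(rows, "type", "cyl")
wrapped = facet_wrap_coordinates(rows, "cyl", ncols=2)
```

The grid includes cells for combinations that have no data, such as a four-cylinder truck here, which matches the usual behavior of a facet grid.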
# option 1
# better suits the functional/compositional style
facet(Scatter(df, 'col a', 'col b'), 'col c')
# option 2
# probably more natural, working better with tools, easier to add in one place (closing parens)
Scatter(df, 'col a', 'col b', facet=facet('col c'))
- Delayed Execution - To use the approach shown above, the Scatter chart's rendering must be delayed, and must have access to df. The reason for taking this approach is because the interactive use of Scatter is most likely to start with using Scatter, then adding in faceting afterwards. The user should not have to significantly modify their use of Scatter to add in faceting.
- Generic Applicability - Most implementations of faceting center around the grid-based approach. To support grid, tab-based, or even geographic placement of charts, the approach should focus on yielding "coordinates" for a chart to be positioned into some abstract reference frame, whether the frame is cartesian, spherical, geographical, or abstract (web page).
- facet_grid
- facet_wrap (special facet_grid)
- facet_tab
- facet_page
Faceting with vega is accomplished with row/column designation. By specifying "type":"O", the numerical cylinder column is interpreted as categorical/ordinal.
{
"marktype": "point",
"enc": {
"x": {"name":"Horse_Power", "type":"Q"},
"y": {"name":"Miles_per_Gallon", "type":"Q"},
"col": {"name":"Cylinders", "type":"O"}
}
}
- Faceting with Matplotlib
- facet_grid with ggplot2
- facet_wrap with ggplot2
- Faceting/Data Aware Grids with Seaborn
- open vis conference vega-lite
The main thing that interact does is look at the Charts it is given and help build an interactive context around one to many Charts. It works similarly to the IPython notebook's interact, except with more knowledge about the input types and the types of controls that should be generated.
Can be composed with inter-chart operations to create dashboards.
Two bar charts with shared controls. X control changes X input for both bar charts.
interact(df, link(hstack(Bar(...), Bar(...)), 'x'))
For completely exploratory analysis, you could input a database, instead of a table in a database, to also receive a selector for the table to use for populating the column selectors.
interact('sqlite://path/to/db', link(hstack(Bar(...), Bar(...)), 'x'))
By using the same ColumnDataSource, it is easy to enable linked selection/filtering. We may want to have some explicit operations for customizing the behavior of one or many charts. For example, you may not want filtering to apply to one of the charts.
By using ColumnDataSource internally, this is implied via interact:
filter(Chart(...), Chart(...), Chart(...))
But, you may not want to filter in certain circumstances:
def my_filter(data):
    # parenthesize each comparison: & binds more tightly than > and ==
    return (data['my_value'] > 0.5) & (data['my_cat'] == 'High')
interact(Chart(...), Chart(...), Chart(...), filter_cb=my_filter)
select(Chart(...), callback=<custom_func>)
Or, likely part of interact:
interact(Chart(...), select_cb=<custom_func>)
Use cases:
- column selection in one chart, is linked to column selection in another
- Hover over item in one chart causes highlight in another
Link two charts by the same column selector.
link(hstack(Bar(...), Bar(...)), 'x')
Or, build the interactive dashboard object, then link the two columns through the object.
dash = interact(hstack(Bar(...), Bar(...)))
link(dash.charts[0]['x'], dash.charts[1]['x'])
Below are examples of different user-oriented applications of the core Chart types.
A type of plot that places a glyph directly into a coordinate system without any aggregation. Another variable can be mapped into attributes of the glyph, such as the color, rotation, transparency, etc.
-
GoG Example:
point(position(d*r))
Vega-lite: documentation on the type system could not be found; it was seen only in examples and the vega-lite schema.
{
"marktype": "point",
"enc": {
"x": {"name":"Horse_Power", "type":"Q"},
"y": {"name":"Miles_per_Gallon", "type":"Q"},
"color": {"name":"Cylinders", "type":"O"}
}
}
Bokeh
Scatter(d, r, glyph=circle) #circle=default
Properties:
- Rectangular coordinates
- Performs a cross between the variables
- index vs index
- no aggregation for positional placement of graphics, but possible for size, color, etc.
- Options: color, marker, line, size
One Dimension: When provided a single input of either value/cat, a scatter will cross the variable with 1, yielding a special "None" axis that places all points equally spaced from the primary axis, but with no label.
Scatter(data, values)
Scatter(data, cat)
Scatter(data, values, 'index') # 'index' => range(len(values))
Two Dimensions
Scatter(data, 'cat', 'val')
Scatter(data, x='cat', y='val')
Scatter(data, y='val', x='cat')
Property Formatting
# unique marker for values associated with each cat in 'cat2'
Scatter(data, 'cat1', 'val', marker='cat2')
# cross color and marker column, each unique combination gets a unique color/marker combination
Scatter(data, 'cat1', 'val', color='cat3', marker='cat2')
# same col iterates over color and marker
Scatter(data, 'cat1', 'val', color='cat2', marker='cat2')
Advanced Coloring This strategy should be consistent across all charts, but scatter is a good test case. Generically, there are possible glyph attributes that can be colored, and here we just provide a column that should be discrete, or discretized, where each unique value is mapped to whatever color attribute. This is the same for rgb color, alpha, shade, etc.
There should probably be some thought put into a Charts coloring iterator, which can be used across all charts.
# someone more familiar with color theory should review this
# Option1: vary color by one category and shade/tint by another via color iterator
Scatter(data, 'cat1', 'val1', color=color('cat2', shade='cat3'))
# Option2: vary color by one category and shade/tint by another via nesting
Scatter(data, 'cat1', 'val1', color=nest('cat2', 'cat3'))
# Option3: vary color by one category and shade/tint by another via kwargs
Scatter(data, 'cat1', 'val1', color='cat2', shade='cat3')
# Using values for any option should discretize and work as normal
Scatter(data, 'cat1', 'val1', color='val2')
# might want to allow using the function used for previous example as input for more specific use cases where we control things a bit more
Scatter(data, 'cat1', 'val1', color=Discrete('val2', bins=5))
Wide Data
# color by series
Scatter(data, 'cat2', ['val1', 'val2', 'val3'])
# color by series explicit
Scatter(data, 'cat2', ['val1', 'val2', 'val3'], color='series')
# don't color by series
Scatter(data, 'cat2', ['val1', 'val2', 'val3'], color=None)
Pseudocode for what scatter's Chart model would look like.
# example in work
class ColumnSelection(HasProps):
    name = String
    min = Optional(Int, 1)
    max = Optional(Int, -1)

# encode reusable constraints into classes to be used during constraint checking
class ExclusiveMultiSelect(HasProps):
    '''A reusable constraint that only allows one column to have more than one selection.'''

class EitherSelect(HasProps):
    '''Either of the column names provided must have a selection.

    This would drive what options are provided in selection widgets during interactive use.
    '''
    error = String

class Scatter(Chart):
    data = Resource()
    # x and y both accept one or many columns
    x = ColumnSelection('x', min=0, max=-1)
    y = ColumnSelection('y', min=0, max=-1)
    color = Either(ColumnSelection(), Instance(ColorIterator), String)
    marker = Either(ColumnSelection(), Instance(MarkerIterator), String)
    constraints = [EitherSelect('x', 'y'), ExclusiveMultiSelect('x', 'y')]
Same as scatter, except the points are connected by a line. This is just a specific formatting mode of scatter. There may be special cases that make leveraging scatter difficult, so need to identify any here.
-
GoG Example:
line(position(d*r))
-
Bokeh Equivalent:
Scatter(data, 'd', 'r', glyph=line)
# or
Line(data, 'd', 'r')
# default
Line(data, 'd', 'r', marker=None)
# specify marker
Line(data, 'd', 'r', marker='+')
index vs value, no computation, colormapping, marker_selection
A specific type of line plot that uses a time axis, and likely comes in the Stock style of table, but doesn't have to. The main feature of a timeseries is that you have a specific scale that is used, and you'd typically want to sort, so that the line glyph doesn't produce nonsensical output.
- GoG Notes:
Time syntax: time(dim(), min(), max(), origin(), cycle())
-
Is there a difference between a timeseries plot and a line plot?
- Or just a specialized version with additional formatting options?
- Should a line plot automatically provide the features if datetime column is detected? (offer to change level of plotting at year, month, day, etc)
# We'll start by creating some nonsense data with dates
df <- data.frame(
date = seq(Sys.Date(), len=100, by="1 day")[sample(100, 50)],
price = runif(50)
)
df <- df[order(df$date), ]
dt <- qplot(date, price, data=df, geom="line") + theme(aspect.ratio = 1/4)
# We can control the format of the labels, and the frequency of
# the major and minor tickmarks. See ?format.Date and ?seq.Date
# for more details.
library(scales) # to access breaks/formatting functions
dt + scale_x_date()
Single Series
TimeSeries(data, <time_column>, <value_column>)
Multiple Series
- Tall
stack(TimeSeries(data, <time_column>, <value_column>), <cat_column>)
# or
TimeSeries(data, <time_column>, <value_column>, stack=<cat_column>)
- Wide
TimeSeries(data, <time_column>, (<value_column>, <value_column>, ...))
Plots a variable against the index of each value, connected by a line.
-
GoG Example:
N/A
-
ggplot2 Example:
qplot(seq_along(x), x, geom="step")
-
Bokeh Equivalent:
Step(df, 'd', start=<'h'/'v'>)
# or
Line(df, 'd', glyph='step', ...)
Uses some kind of LLA, ECEF, ECER or other geographic coordinate system.
- GoG Example:
ELEMENT: polygon(position(longitude*latitude))
ELEMENT: point(position(longitude*latitude))
-
ggplot2 Example:
ggplot() +geom_polygon(data=counties, aes(x=long, y=lat, group=group))+ geom_point(data=mapdata, aes(x=x, y=y), color="red")
-
Bokeh Equivalent:
Map(df, 'lon', 'lat', glyph=circle)
This group of charts requires some computation be processed on groups of values associated with categories.
This category works by grouping by some categorical variable, then performing some aggregation function. One special case could be when you have a single value for each unique cat, you just plot the value.
Input Cases:
- categorical
- categorical vs values
- categorical vs count/proportion(categorical)
Options:
- computation
- grouping
- stacking
- colormapping
-
GoG Example:
interval(position(d*r))
- Bokeh Equivalent:
Bar(df, 'd', 'r')
The grammar of graphics term for the grouped bar is dodge.
- GoG Example:
interval.dodge(position(d*r), color(c))
Bokeh Equivalent:
Bar(df, cat, values, grouped=True, agg='sum')
Bar(df, cat, values, grouped="A", stacked=False, agg='sum')
- GoG Example:
interval.stack(position(summary.proportion(r)), color(c))
interval.stack(position(summary.proportion(d*r)), color(c))
Bokeh Equivalent:
Bar(df, cat, values, grouped=False, stacked=True, agg='sum')
Bar(df, cat, values, stacked="B", agg='sum')
Bar(df, cat, values, grouped=True, stacked=True, agg='sum') ??
Bar(df, cat, values, grouped="A", stacked="B", agg='sum')
Bar(df, cat, values, grouped="A", stacked=["B","C"], agg='sum')
df | year | sales | dept | region | revenue
Bar(df, year, (sales, revenue))
- GoG Example:
COORD: polar.theta(dim(1))
ELEMENT: interval.stack(position(summary.proportion(r)), color(c))
Very similar to a bar chart, except it can be overlaid without an issue with overlap. marker_selection
Important to utilize optimal bin width calculation by default. Should this be in the core discretizing functionality instead of inside histogram?
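As one candidate for a default, the Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3). The sketch below is an assumption about which rule might be used, not a decision recorded in this document, and the function names are hypothetical.

```python
import math
import statistics

def freedman_diaconis_width(values):
    """Bin width from the Freedman-Diaconis rule: 2 * IQR / n ** (1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

def bin_count(values):
    """Number of bins needed to cover the data at the suggested width."""
    width = freedman_diaconis_width(values)
    if width == 0:
        return 1  # degenerate case: no spread in the middle half of the data
    return max(1, math.ceil((max(values) - min(values)) / width))

sample = list(range(1, 101))  # hypothetical measure values
width = freedman_diaconis_width(sample)
bins = bin_count(sample)
```

Placing a helper like this in the shared discretizing code would let coloring and faceting reuse the same default as the histogram.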
A histogram typically will use bars for the visualization, but doesn't have to. If we only want the outline, then it must be generic to the type of glyph used. If there were a higher level Bar glyph that automatically expands its width until it reaches the next bar (also defaulting to a bottom of 0), then you could just plot bar after bar at x=<bin center> and y=<bar height>.
So, maybe a bar glyph would look like this:
Bar(<x>, <y>, <(optional) margin>)
Then, a Step glyph could be used for the outline-only method. So, a histogram is defined by the values used, any categorical variable used to separate multiple histograms by color, the statistical function used, and the glyph used for representing the output of the statistical function.
GoG Example:
ELEMENT: interval(position(summary.count(bin.rect(y))))
Bokeh
Hist(data, 'val')
# specify stat to use (percent requires N samples to be meaningful)
Hist(data, 'val', stat='count')
Hist(data, 'val', stat='percent')
Hist(data, 'val', 'cat')
Hist(data, 'val', color='cat')
Hist(data, 'val', stat='count')
# no fill, black outline
Hist(data, 'val', color=None, line_color='black')
# no boxes, just height?
Hist(data, 'val', color=None, line_color='black', glyph='step')
Glyph used for displaying can vary.
GoG Examples
# rect
ELEMENT: polygon(position(bin.rect(rainfall*degdays)),
color.hue(summary.count()))
# hex
ELEMENT: polygon(position(bin.hex(rainfall*degdays)),
color.hue(summary.count()))
Bokeh
Instead of binning and summarizing, KDE estimates the distribution the variable was sampled from. This would require some bit of implementation or reuse of statistical algorithms, not available in Blaze.
-
GoG Example:
area(position(smooth.density.kernel(y)))
Bokeh
# filled by default
Hist(data, 'val', stat='kde')
Hist(data, 'val', 'cat', stat='kde')
Hist(data, 'val', stat='kde', color=None, line_color='black')
Handled the same as a histogram, but calculated differently.
Hist(data, 'val', 'cat', stat='cdf')
Box/Violin plots are similar to a histogram, except you are creating a very specific glyph that represents the key elements of the distribution. The benefit is that you can compare a larger quantity of distinct things than with a histogram. The violin uses a slightly different glyph, but the interface should be the same. The interface for box/violin should be near identical to histogram, except for some specific options.
GoG Single Box Plot
schema(position(bin.quantile.letter(hp)))
GoG Grouped Bar vs. Grouped Box Plot
# The color function produces the separate groups, interval produces the bars, dodge makes them not overlap
interval.dodge(position(summary.mean(gov*birth)), color(urban))
# Bar and Box treated in similar ways
schema.dodge(position(bin.quantile.letter(gov*birth)), color(urban))
Bokeh
# one box
Box(data, 'val')
# One Box per unique cat
Box(data, 'val', 'cat')
# Group the boxes similar to bar charts, different color for each cat in 'cat2'
Box(data, 'val', 'cat', grouped='cat2')
# Grouped by discretized values using default num bins
Box(data, 'val', 'cat', grouped='val')
# Treats cat1->cat2 as a hierarchy for coloring
Box(data, 'val', 'cat1', grouped='cat2', color='cat1')
# possible sort function to sort categorical by max value in each group's values
Box(data, 'val', sort('cat', 'val', 'max'))
index vs value, computation, grouping, colormapping
The area chart is similar to a bar chart, except the independent variable is continuous, instead of discrete. Ggplot treats the area chart as a special case of a ribbon plot, where the minimum value of the ribbon is fixed to 0, then the height is provided by a measure.
There is no aggregation of the measures plotted, but the stacking is provided by summing the measures for each group, for the given value the variable is plotted against.
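The stacking described here amounts to a running sum across groups at each x value. A minimal sketch, assuming every series is aligned on the same x values; stack_series and the sample numbers are hypothetical.

```python
def stack_series(series):
    """Return, for each series, the (lower, upper) band after stacking in order."""
    bands = {}
    baseline = None
    for name, values in series.items():
        lower = baseline if baseline is not None else [0] * len(values)
        # Each band starts where the previous one ended.
        upper = [low + v for low, v in zip(lower, values)]
        bands[name] = (lower, upper)
        baseline = upper
    return bands

# Hypothetical water levels for two lakes over three years.
levels = {"Huron": [1, 2, 3], "Erie": [4, 5, 6]}
bands = stack_series(levels)
```

Each band maps directly onto a ribbon with ymin=lower and ymax=upper, which is how the unstacked case reduces to a ribbon anchored at 0.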
ggplot2 example: h <- ggplot(huron, aes(x=year)) + geom_ribbon(aes(ymin=0, ymax=water_level))
bokeh
# equivalent to ggplot2 example
Area(data, x='year', y='water_level')
# stacked area
Area(data, x='year', y='water_level', color='lake_name')
# stacked area alternative
Area(data, x='year', y='water_level', stack='lake_name')
# stacks and colors by state, then lake_name
# might want this to use different shades of the same color per state
Area(data, x='year', y='water_level', stack='lake_name', color='state')
# stacked area by lake name, but user provides custom coloring function
Area(data, x='year', y='water_level', stack='lake_name', color=<my_coloring_func>)
index vs value, no computation, colormapping, stack
A 2D graphic of tiles, colored by a third variable. If the x/y coordinates are continuous (floats), then you must first bin them.
GoG
DATA: x = reshape.rect(x(1..62), "colname")
DATA: y = reshape.rect(x(1..62), "rowname")
DATA: d = reshape.rect(x(1..62), "value")
ELEMENT: polygon(position(bin.rect(x*y)), color.hue(d))
Bokeh It seems that this could be generalized to a scatter plot with rect glyphs, colored by some value.
Heatmap(data, 'cat1', 'cat2', color='value')
# binning required
Heatmap(data, 'val1', 'val2', color='val3')
This chart combines position and color to reduce vertical space. If you decrease the height of an Area chart, you will lose sight of the lower values. Here you allow them to become "clipped", at which point you go to a darker shade.
Sometimes you stack many (See tableau example), one for each cat.
Wide data would be handled in the same way as Line/Area.
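The clipping into progressively darker bands can be sketched as below. This is an illustrative assumption about how the layers might be computed: horizon_bands is a hypothetical helper that works on magnitudes and reports signs separately so that negative values can use a different color.

```python
def horizon_bands(values, band_height, n_bands):
    """Split each value's magnitude into layers of at most `band_height`."""
    layers = []
    for k in range(n_bands):
        floor = k * band_height
        # Portion of the magnitude that falls inside band k, clipped to the band.
        layers.append([min(max(abs(v) - floor, 0), band_height) for v in values])
    signs = [-1 if v < 0 else 1 for v in values]
    return layers, signs

layers, signs = horizon_bands([1, 5, -3], band_height=2, n_bands=3)
```

Drawing the layers on top of one another, each in a darker shade, reproduces the full range of the data within the height of a single band.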
Bokeh
# one plot using standard neg to pos colormap
Horizon(data, 'values')
# special colormap generated from single color (atypical)
Horizon(data, 'time/value', 'values', color='blue')
# special colormap
Horizon(data, 'time/value', 'values', color=<custom_colormap>)
# special colormap
Horizon(data, 'time/value', 'values', neg_color='red', pos_color='green')
# stacking many skinny plots on each other (tableau example)
# need to consider how to specify tightly stacked plots, versus stacked glyphs
Horizon(data, 'time/value', 'values', stack='cat')
References:
This chart is a hybrid of a timeseries chart and something like a box plot. Time is typically on the x axis, then for each discrete date, a span is created to represent the min and max on that date. This helps demonstrate volatility that might be hidden if you just summed all values per day up and plotted that point.
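The per-date span can be computed with a simple group-by, sketched here with a hypothetical range_by helper and invented sample data.

```python
from collections import defaultdict

# Hypothetical intraday observations (values invented for illustration).
ticks = [
    {"date": "2015-01-02", "price": 10.0},
    {"date": "2015-01-02", "price": 12.5},
    {"date": "2015-01-02", "price": 9.0},
    {"date": "2015-01-05", "price": 11.0},
    {"date": "2015-01-05", "price": 11.5},
]

def range_by(rows, key, value):
    """Return the (min, max) of `value` for each distinct `key`, sorted by key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: (min(v), max(v)) for k, v in sorted(groups.items())}

spans = range_by(ticks, "date", "price")
```

Each entry would be drawn as one vertical interval, so a wide span on a date shows volatility that a daily sum would hide.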
GoG Example:
SCALE: time(dim(1))
GUIDE: axis(dim(1), format("mm/dd/yy"))
ELEMENT: interval(position(region.spread.range(date*(high+low))))
Bokeh
# good example to use for composability
Use concepts defined in blaze/odo to model the data sources for charts. Beyond knowing the datashape, it is also important for charts to perform some inspection of the data to understand what it is. This can drive how charts will interpret and display the data.
For example, a case where just looking at the datatype isn't good enough is when you have a small number of integer ids. If you consider any numerical column to be continuous, you'd misinterpret the data type. Instead, this really is a type of categorical label.
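A rough heuristic for this kind of inspection is sketched below. The function infer_kind, its category names, and the max_categories threshold are all assumptions for illustration; a real implementation would start from the datashape and refine it.

```python
def infer_kind(values, max_categories=10):
    """Guess how a column should be treated, looking past its raw dtype."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    if distinct and distinct <= {0, 1}:
        return "logical"      # also covers True/False, since True == 1
    if all(isinstance(v, str) for v in present):
        return "categorical"
    if all(isinstance(v, int) for v in present) and len(distinct) <= max_categories:
        return "categorical"  # a few integer ids behave like labels
    return "continuous"

cyl_kind = infer_kind([4, 6, 8, 4, 6])
mpg_kind = infer_kind([22.0, 30.5, 18.2])
coy_kind = infer_kind([0, 1, 1, 0])
```

Applied to the vehicle table, the cylinder counts come out as categorical even though they are stored as integers.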
Resource: A bokeh model equivalent to odo's resource, used by Blaze. This is not necessarily a single table-like entity, but can be a database-like entity with many fields, where each is table-like.
Resource('/path/to/csv/file.csv')
Resource('/path/to/hdf5/file')
Resource('sql connection string')
Resource('/path/to/folder/of/csvs/')
Field: A type of labeled data. Can be a label for another Resource, or a Column.
ColumnDataSource: Exists today in bokeh. Represents a table-like entity
Column: A special type of field representing a column in a table-like entity. Columns are the main source of data for Bokeh charts.
Blaze utilizes datashape, which describes data in a type-oriented manner useful for data manipulation. Bokeh can benefit from mapping the data types into higher order types that are focused on attributes that affect the visual display. Providing these metatypes can limit the repeated work performed in general purpose chart building of inspecting, guarding, transforming, and providing feedback to the user.
This is especially important for interactive applications that you may want to fail gracefully. Bokeh can provide the feedback in a consistent way.
Visualizing categorical data is especially challenging, since the Resource typically does not communicate the possible values a variable can take, as opposed to the distinct values that actually exist in the data. For example, you may collect data from a column and only ever see True.
An attribute of the data in the column, independent of the type.
- Continuous: can be any value within a range
- Discrete: can only be specific values. A continuous column can be transformed into a discrete one.
- Constant: only a single value exists
- Numerical: integer, float, etc.
- Sparse: contains nulls
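The attributes above could be derived by inspecting a column's values. The sketch below is one possible reading, not an existing Bokeh API: `ColumnAttributes`, `describe_column`, and the `max_discrete` threshold are hypothetical names, and treating any column containing floats as continuous is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ColumnAttributes:
    numerical: bool
    sparse: bool
    constant: bool
    discrete: bool


def describe_column(values, max_discrete=20):
    """Derive type-independent attributes from a list of column values."""
    non_null = [v for v in values if v is not None]
    unique = set(non_null)
    numerical = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    has_float = any(isinstance(v, float) for v in non_null)
    # Assumption: non-numerical columns are always discrete; integer columns
    # are discrete only when they have few distinct values.
    discrete = not numerical or (not has_float and len(unique) <= max_discrete)
    return ColumnAttributes(
        numerical=numerical,
        sparse=len(non_null) < len(values),  # at least one null present
        constant=len(unique) == 1,
        discrete=discrete,
    )


mpg = describe_column([22.0, 18.5, None, 30.1])
make = describe_column(["Ford", "Ford"])
```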
- Measure: a continuous numerical value
- Categorical: id, string, e.g., Ford/Toyota/BMW, can be hierarchical in nature
- Ordinal: Low/Med/High, Star Rating (1, 2, 3, 4, 5)
- Logical: string, boolean, Yes/No, True/False, T/F, 0/1
- Date: Date, Time, DateTime
ToDo: look through GoG for chart-focused transformations. List them out here.
One opportunity with Charts is to specify additional metadata about the Chart, which can both reduce the edge cases the Chart must handle, and provide additional information for composing charts and controls for interactive applications. This enables another type of composition, view composition. Multiple Charts can be composed into a dashboard, to provide multiple views of the same data source.
For example, a Bar Chart implementor might decide that they only want to handle discrete data for the x axis, and continuous data for the y axis. They could handle continuous data as the x-input in multiple ways:
1. Check the dtype of the array, and throw an error
2. Use Chart modeling to automatically throw an error
3. Use Chart modeling to automatically convert continuous data to discrete
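A rough sketch of the difference between throwing an error and converting automatically, assuming equal-width binning as the conversion. The helpers `is_continuous`, `bin_index`, and `discrete_x` are hypothetical, and the binning strategy is only one of several reasonable choices:

```python
def is_continuous(values):
    """Assumption: a column containing any float is continuous."""
    return any(isinstance(v, float) for v in values)


def bin_index(values, bins=4):
    """Map each value to the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)
    width = (hi - lo) / bins
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def discrete_x(values, transform=False, bins=4):
    """Return discrete x values, binning or raising for continuous input."""
    if not is_continuous(values):
        return list(values)
    if not transform:
        raise TypeError("x must be discrete; pass transform=True to bin it")
    return bin_index(values, bins)


binned = discrete_x([10.0, 20.0, 30.0, 50.0], transform=True)
```

With `transform=False` the same call raises a `TypeError`, which corresponds to the error-throwing options.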
The approach taken is the same used for modeling Glyphs, Widgets, etc. in Bokeh.
class MyBar(Chart):
    x = Discrete(transform=True)
    y = Continuous(name=['y', 'height'])
    grouped = Discrete(required=False, transform=True)
    horizontal = Boolean(required=False)
    constraints = [Either('x', 'y')]
This modeling can enable generation of default selectors for an interactive application. Use of MyBar could be as follows:
interact(MyBar(df))
In this example, Bokeh can infer that the following should be generated, using a default layout:
<Row>
  <Col>
    <ColumnSelector name='x' />
    <ColumnSelector name='y' />
    <ColumnSelector name='grouped' />
    <Checkbox name='horizontal' />
  </Col>
  <Col>
    <MyBar />
  </Col>
</Row>
Because the constraints are specified, the interactive widget will not regenerate MyBar unless the 'x' or 'y' selector is set to a valid column.
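The check that gates regeneration might look like the following sketch. It assumes `Either` means "at least one of the named inputs is set to a valid column"; the class body and the `should_render` helper are hypothetical and only illustrate the idea:

```python
class Either:
    """Hypothetical constraint: at least one named input must be valid."""

    def __init__(self, *names):
        self.names = names

    def satisfied(self, selections, columns):
        # A selection counts only if it names an existing column.
        return any(selections.get(name) in columns for name in self.names)


def should_render(constraints, selections, columns):
    """Regenerate the chart only when every constraint is satisfied."""
    return all(c.satisfied(selections, columns) for c in constraints)


columns = ["make", "cyl", "mpg"]
constraints = [Either("x", "y")]
```

Here `should_render(constraints, {"x": "make"}, columns)` would allow the chart to be rebuilt, while an empty selection or an unknown column name would not.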
Note: The above example takes a React-like approach for demonstrating how Bokeh would interpret the Chart configuration, which could be used to provide a custom specification in a configuration file, with:
interact(MyBar(df), config='./custom_layout.dashboard')
Below is some discussion on different types of data that might be input into Charts, which can provide multiple use cases to ensure that the Charts API can handle the most likely cases.
During the Bokeh Days meetup, it was identified that the structure of the data can change how the user must identify the columns to use for plot aesthetics. The main difference is between de-aggregated (normalized) and aggregated data. Someone using data directly from a database might see normalized forms more often.
NOTE: The following labels are used when describing the input types in the chart specifications.
A table that spreads many measures over additional columns. It has very few index-like columns, with multiple columns containing similar types of measurements. A good indicator of a wide table: if you need to think about iterating over columns, the data probably leans towards the wide category.
Sensor: | index | Measure1 | Measure2 | Measure3 |
A special case of a wide table is time series data, which has a single measure type (e.g. profits) instead of multiple measurements, and each series represents a category of a categorical variable (e.g. Company Name). Technically, this is a form of pivoted data.
Stock: | Date | Series1 | Series2 | Series3 |
A table that tends to put multiple measures in a single column, then identify the measure type in a separate column.
The tall form of Sensor data is the following:
Stacked Sensor: | index | Measure Type | value |
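Converting the wide Sensor form into the stacked form can be sketched in plain Python as below. The `stack` helper and the sample readings are made up for illustration; with pandas, `pandas.melt` performs the same reshaping.

```python
def stack(rows, index, value_columns,
          type_name="Measure Type", value_name="value"):
    """Turn one wide row per index into one tall row per (index, measure)."""
    tall = []
    for row in rows:
        for column in value_columns:
            tall.append({
                index: row[index],
                type_name: column,       # which measure this row holds
                value_name: row[column],
            })
    return tall


# Made-up sensor readings in the wide form.
wide = [
    {"index": 0, "Measure1": 1.0, "Measure2": 10.0},
    {"index": 1, "Measure1": 2.0, "Measure2": 20.0},
]
tall = stack(wide, "index", ["Measure1", "Measure2"])
```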
The tall data you are more likely to see in the real world is data that has been joined, or de-normalized. In this case, you might have multiple tables that have been brought together into a tall table, with a combination of multiple categorical and numerical columns. This kind of table provides many opportunities to focus on different categorical columns, while collapsing the numerical columns with some type of aggregation.
Business: | click id | visit id | visit type | user name | revenue | platform | device |
A table that has been pre-aggregated via a pivoting-like operation, which places discrete categories along the columns and rows. Each cell then holds a measure for the intersection of its row and column categories.
Aggregated:
product category | Month 1 | Month 2 | Month 3 | ... |
---|---|---|---|---|
computers | $100 | $250 | $225 | ... |
mobile | ... | ... | ... | ... |
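A minimal sketch of the pivoting-like operation that produces such a table from tall data, assuming the measure is summed at each intersection. The `pivot_sum` helper and the input rows are made up for illustration:

```python
from collections import defaultdict


def pivot_sum(rows, row_key, column_key, value_key):
    """Sum value_key for every (row category, column category) pair."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_key]][row[column_key]] += row[value_key]
    # Convert nested defaultdicts to plain dicts for readability.
    return {r: dict(cols) for r, cols in table.items()}


# Made-up tall sales records, one row per sale.
sales = [
    {"product category": "computers", "month": "Month 1", "revenue": 60.0},
    {"product category": "computers", "month": "Month 1", "revenue": 40.0},
    {"product category": "computers", "month": "Month 2", "revenue": 250.0},
    {"product category": "mobile", "month": "Month 1", "revenue": 75.0},
]
aggregated = pivot_sum(sales, "product category", "month", "revenue")
```

Note that intersections with no records are simply absent from the result, which is one of the cases a chart consuming aggregated data would need to handle.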
Use of graphical functions results in applying data transforms, or they may be used directly.
ToDo: cross reference with Blaze, dask, numpy, pandas, etc.
Partitions and meshes for tiling and histograms.
- rect, tri, hex, quantile, boundary, voronoi, dot, stem
Basic statistics.
- count, proportion, sum, mean, median, mode, std, ste, range, leaf
Interval and region bounds.
- spread (std, ste, range), confidence (mean, std, smooth)
Regression, smoothing, interpolation, and density estimation. Can be further qualified by the type of method used (e.g. linear.ols).
- linear, quadratic, cubic, log, mean, median, mode, spline, density (normal, kernel)
- join, sequence, mst, delaunay, hull, tsp, complete, neighbor
Source: GoG pg 113.
Great source for data sets that go beyond the typical toy examples.
Most of their data is pre-aggregated.
- Baseball Sqlite Database (suggested by blaze docs)
- Ergast Formula 1/E MySql Database - rothnic has some of a wrapper around REST API with formulapy
- Northwind business database - contains an example of a typical business database with products, sales, sales team, etc.
A dataset contains potentially many variables (columns, labeled, or unlabeled data). Bokeh should provide efficient methods for separating a global dataset into smaller datasets, which are then visualized in a way that separates each subset. You might use a variable to separate the data using color, shape, orientation, frame (faceting), etc. With charts, you should identify which variable should be used for which visual aspect of the plot, or group of plots.
- Cross (*) - a product of two separate variables. Like an outer join.
- Nest (/) - produces a decomposition of one categorical variable, into only the valid categories for another variable. You must know the domain metadata (valid categories) to perform nest operations. Can view this as the second variable conditioned on the first. (ex. nesting of sex={Male, Female}, Pregnant={True, False} would not produce (Male, True) in the result). Like a left or inner join.
- Blend (+) - combines two variables under a single variable. The operation is like an append.
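The three operators can be sketched on small category lists as follows. The function names and the `valid` mapping, which stands in for the domain metadata that nest requires, are hypothetical:

```python
from itertools import product


def cross(a, b):
    """Cross (*): every pairing of the two domains."""
    return list(product(a, b))


def nest(outer, valid):
    """Nest (/): pair each outer category only with its valid inner ones."""
    return [(o, i) for o in outer for i in valid[o]]


def blend(a, b):
    """Blend (+): append one variable's values to the other's."""
    return list(a) + list(b)


sex = ["Male", "Female"]
pregnant = [True, False]
# Domain metadata: which Pregnant values are valid for each sex.
valid = {"Male": [False], "Female": [True, False]}

crossed = cross(sex, pregnant)
nested = nest(sex, valid)
```

`crossed` contains all four combinations, while `nested` omits `("Male", True)`, matching the example above.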
- Can we come up with a consistent approach for dodge and stack, as operations, at least internally?
  - Where does this apply, other than with Bar?
  - How does this relate to overlay?
- How can we handle facet-like layouts that stay within the same frame? A grouped bar chart is similar to faceting, except with a merged y axis and a dodged x axis.
  - Should there be a simple implementation of Bar that is only ever a single group, then dodging of multiple bar charts creates a grouped bar chart?
- Consider whether data could be input into the operation once, instead of having to put it in multiple charts. This could make things easier if you are composing multiple charts, sharing the same data. Instead of operation(Chart(df, ...), Chart(df, ...)), could use operation(df, Chart(...), Chart(...)).
- When in interactive mode, you may not want a static chart. The chart type may depend on the types of columns that are provided. For example, a Line chart might want to defer to a TimeSeries plot when certain conditions are encountered. This may be handled at a higher level than Chart, but should be considered.
- Should Bokeh Charts yield vega/vega-lite specifications where possible, then have the rendering portion separate? This would allow you to do initial pre-processing with Python, but eventually shift responsibility to JavaScript if possible, reducing the need for Bokeh server.
- List requirements, constraints, and options for each chart type
- One idea is to allow users to provide some kind of styling via YAML, or a custom function. The custom function would need to take some specific inputs, like a tuple identifying the unique subset of data, plus any attributes required to produce the glyph they must return.
Things to consider: dimensions, needs computation, splitting/reduction operations (facet, group, stack, overlay, aggregation, colormapping, marker_selection)
- Prototype of a few Charts
- Scatter, Bar
- Create models for Chart input types
- Column (and specific column types), Option, etc.
- Prototype feeding chart directly with out of core source
Chart('./path/to/hdf5::auto', 'mpg', 'displ', ... )
Chart('sqlite://path/to/sqlite::auto', 'mpg', 'displ', ... )
- Reconcile concepts, where it makes sense, with Vega
- Create table marking valid input types to chart types (enable suggesting plots to user, like polestar)
- (Long Term) Prototype driving Chart with Polestar