Filterable Data Source
Right now, Bokeh (version 0.11.1) requires glyphs to use full columns of data from a ColumnDataSource (CDS). This makes it difficult to link plots using row-wise subsets of data.
For example, consider the following use-case taken from a question on the mailing list:
I have a case where I show a full set of records on a plot, and then list a subset of those records, with additional details, in a datatable. I want users to be able to select a row of the datatable, and have the corresponding data point show as selected in the plot.
Because there is no way in Bokeh to specify subsets of a data source, a user has to use two data sources, one for the scatter plot and one for the data table, even though the underlying data for the two plots is the same. Additionally, users have to write a CustomJS callback and keep track of the indices themselves to produce the simple linked selection shown in the gif above. This is made more confusing by the unusual structure of the selected property on data sources, which has 0d, 1d, and 2d properties that in turn contain indices.
from bokeh.plotting import Figure, output_file, show
from bokeh.models import CustomJS
from bokeh.models.sources import ColumnDataSource
from bokeh.io import vform
from bokeh.models.widgets import DataTable, TableColumn, Button
from random import randint
import pandas as pd
output_file("subset_example.html")
data = dict(
    x=[i for i in range(10)],
    y=[randint(0, 100) for i in range(10)],
    z=['some other data'] * 10
)
df = pd.DataFrame(data)
# Filtering a DataFrame with pandas preserves the original index values
filtered_df = df[df.y < 50]
# Creating a CDS from each DataFrame adds an 'index' column holding those values
source1 = ColumnDataSource(df)
source2 = ColumnDataSource(filtered_df)
fig1 = Figure(plot_width=300, plot_height=300)
fig1.circle(x='x', y='y', size=10, source=source1)
columns = [
    TableColumn(field="y", title="Y"),
    TableColumn(field="z", title="Text"),
]
data_table = DataTable(source=source2, columns=columns, width=400, height=280)
button = Button(label="Select")
button.callback = CustomJS(args=dict(source1=source1, source2=source2), code="""
    // Rows currently selected in the data table (positions within source2)
    var inds_in_source2 = source2.get('selected')['1d'].indices;
    var d = source2.get('data');
    var inds = [];
    if (inds_in_source2.length == 0) { return; }
    // Translate each table row back to its index in the full data set
    for (var i = 0; i < inds_in_source2.length; i++) {
        var ind2 = inds_in_source2[i];
        inds.push(d['index'][ind2]);
    }
    source1.get('selected')['1d'].indices = inds;
    source1.trigger('change');
""")
show(vform(fig1, data_table, button))
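The index bookkeeping that the callback above performs can be illustrated without Bokeh at all. The following is a minimal pure-Python sketch, not Bokeh API: the helper name map_to_full_indices and the sample y values are hypothetical. It shows how row positions selected in the filtered table are translated back to row indices of the full data set through the preserved index column.

```python
def map_to_full_indices(index_column, selected_positions):
    """Translate positions in a filtered table to indices in the full data."""
    return [index_column[pos] for pos in selected_positions]


# Hypothetical full data set, filtered with the same y < 50 rule used above.
y = [72, 13, 88, 41, 5, 96, 27, 60, 34, 79]

# Row indices that survive the filter (pandas keeps these as the index).
index_column = [i for i, value in enumerate(y) if value < 50]

# The user selects the first and third rows of the filtered table.
table_selection = [0, 2]
plot_selection = map_to_full_indices(index_column, table_selection)
print(plot_selection)
```

Every application that links a subset view to a full plot currently has to repeat this translation by hand, which is the burden the proposal below aims to remove.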
My proposal is to add a filterable data source that keeps track of which rows to provide to each renderer that is associated with it. This would allow users to specify subsets of data (e.g. filtered by the value of some column) for individual glyphs. Applications with multiple plots, each using a subset of the same data, would share the data in a similar way to how Bokeh allows plots to share full CDSs now. Linked selection between these plots would be automatic, so that users don't have to write a CustomJS callback to get the functionality shown above.
We don't want to break any of the API on the CDS.
Instead of changing the CDS to make it filterable, we can introduce a new data source, potentially called TableDataSource.
The TableDataSource would keep track of filters and indices for each renderer that uses it. The filters could be (as suggested by @bryedev here) either None, a Seq(Int) which lists the subset indices, or a function that returns a Seq(Int).
The TableDataSource would also implement __getitem__, so that the following syntax would be possible:
tds = TableDataSource(df)
r = fig.circle(x='x', y='y', source=tds, filter=tds['weather'] == 'sunny')
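To make the three filter forms concrete, here is a rough pure-Python sketch of how they might be normalized into a list of row indices. Nothing here exists in Bokeh: the names Column, TableDataSourceSketch, and resolve_filter are hypothetical, and the sketch assumes that comparing a column returns the matching row indices rather than a boolean mask.

```python
class Column:
    """Hypothetical column wrapper whose comparisons return matching row indices."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return [i for i, v in enumerate(self.values) if v == other]

    def __lt__(self, other):
        return [i for i, v in enumerate(self.values) if v < other]


class TableDataSourceSketch:
    """Hypothetical stand-in for the proposed TableDataSource."""

    def __init__(self, data):
        self.data = {name: list(col) for name, col in data.items()}

    def __getitem__(self, name):
        return Column(self.data[name])

    def __len__(self):
        return len(next(iter(self.data.values()), []))


def resolve_filter(source, filter_spec):
    """Normalize None, a sequence of ints, or a callable into row indices."""
    if filter_spec is None:
        return list(range(len(source)))      # no filter: every row
    if callable(filter_spec):
        return list(filter_spec(source))     # function returning indices
    return list(filter_spec)                 # explicit index list


tds = TableDataSourceSketch({
    "x": [0, 1, 2, 3],
    "weather": ["sunny", "rainy", "sunny", "cloudy"],
})

all_rows = resolve_filter(tds, None)
explicit = resolve_filter(tds, [0, 1])
by_function = resolve_filter(tds, lambda s: s["weather"] == "sunny")
by_expression = resolve_filter(tds, tds["weather"] == "sunny")
```

Under these assumptions, the eager expression and the lambda yield the same indices. The difference is that a function could be re-evaluated when the underlying data changes, whereas an index list is fixed at creation time.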
The TableDataSource would subclass DataSource and inherit the selected property, which indexes the selection on the full dataset. With some work in the glyph renderer and selection manager (sort of like changes in this commit, though details would be different), linked selection will just work.
Instead of having a data property that contains the data itself, like the CDS, the TDS's data property would be a data source object that could be shared with multiple TDSs (this could even be a CDS). This would separate the data from its subsets and allow the subsets to be represented by the filters on that data.
cds = ColumnDataSource(df)
tds = TableDataSource(cds)
# If the tds is created from a cds, they are all automatically linked
r = fig0.circle(x='x', y='y', source=cds)
r = fig1.circle(x='x', y='y', source=tds, filter=lambda tds: tds['weather'] == 'sunny')
r = fig2.circle(x='x', y='y', source=tds, filter=[0, 1, 2])
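One way to picture why linked selection would follow from this design: if the selection is always stored as indices into the full dataset, each filtered view only needs to translate between its own row positions and those shared indices. The sketch below is plain Python with hypothetical names (SharedSelection, FilteredView); it is an illustration of the idea, not the glyph renderer or selection manager code.

```python
class SharedSelection:
    """Holds selected row indices relative to the full dataset."""

    def __init__(self):
        self.indices = []


class FilteredView:
    """Hypothetical per-renderer view: a filter plus the shared selection."""

    def __init__(self, selection, row_indices):
        self.selection = selection
        self.row_indices = list(row_indices)  # full-dataset rows this view shows

    def select_positions(self, positions):
        # A selection made in this view is stored as full-dataset indices.
        self.selection.indices = sorted(self.row_indices[p] for p in positions)

    def selected_positions(self):
        # Each view reports which of its own rows are currently selected.
        chosen = set(self.selection.indices)
        return [pos for pos, row in enumerate(self.row_indices) if row in chosen]


selection = SharedSelection()
full_view = FilteredView(selection, range(6))    # like source=cds
sunny_view = FilteredView(selection, [1, 3, 4])  # like a weather filter
first_three = FilteredView(selection, [0, 1, 2]) # like filter=[0, 1, 2]

# Selecting the first and third rows of the filtered view...
sunny_view.select_positions([0, 2])
# ...is visible in every other view without any callback.
print(selection.indices, full_view.selected_positions(), first_three.selected_positions())
```

Rows outside a view's filter are simply ignored by that view, so a selection made in one plot never produces out-of-range positions in another.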
No Button or CustomJS necessary!
from bokeh.plotting import Figure, output_file, show
from bokeh.models.layouts import VBox
from bokeh.models.sources import TableDataSource
from bokeh.models.widgets import DataTable, TableColumn
from random import randint
import pandas as pd
output_file("subset_example_better.html")
data = dict(
    x=[i for i in range(10)],
    y=[randint(0, 100) for i in range(10)],
    z=['some other data'] * 10
)
source = TableDataSource(data)
fig1 = Figure(plot_width=300, plot_height=300)
fig1.circle(x='x', y='y', size=10, source=source)
columns = [
    TableColumn(field="y", title="Y"),
    TableColumn(field="z", title="Text"),
]
data_table = DataTable(source=source, filter=lambda tds: tds['y'] < 50, columns=columns, width=400, height=280)
show(VBox(fig1, data_table))