Filterable Data Source

Claire Tang edited this page May 13, 2016 · 1 revision

Proposal for a new filterable data source

Current pain point

Right now, Bokeh (version 0.11.1) requires glyphs to use full columns of data from a ColumnDataSource (CDS). This makes it difficult to link plots using row-wise subsets of data.

For example, consider the following use-case taken from a question on the mailing list:

I have a case where I show a full set of records on a plot, and then list a subset of those records, with additional details, in a datatable. I want users to be able to select a row of the datatable, and have the corresponding data point show as selected in the plot.

[Animated GIF: example subset selection]

Because Bokeh has no way to specify subsets of a data source, a user has to use two data sources, one for the scatter plot and one for the data table, even though the underlying data for both is the same. Users also have to write a CustomJS callback and keep track of the indices themselves to produce the simple linked selection shown in the GIF above. The unusual structure of the selected property on data sources, which has 0d, 1d, and 2d entries that each contain indices, makes this more confusing.

from bokeh.plotting import Figure, output_file, show
from bokeh.models import CustomJS
from bokeh.models.sources import ColumnDataSource
from bokeh.io import vform
from bokeh.models.widgets import DataTable, TableColumn, Button

from random import randint
import pandas as pd

output_file("subset_example.html")

data = dict(
        x=[i for i in range(10)],
        y=[randint(0, 100) for i in range(10)],
        z=['some other data'] * 10
    )
df = pd.DataFrame(data)
# Filtering a DataFrame with pandas preserves the original index labels
filtered_df = df[df.y < 50]

# Creating a CDS from each DataFrame adds an 'index' column holding those labels
source1 = ColumnDataSource(df)
source2 = ColumnDataSource(filtered_df)

fig1 = Figure(plot_width=300, plot_height=300)
fig1.circle(x='x', y='y', size=10, source=source1)

columns = [
        TableColumn(field="y", title="Y"),
        TableColumn(field="z", title="Text"),
    ]
data_table = DataTable(source=source2, columns=columns, width=400, height=280)

button = Button(label="Select")
button.callback = CustomJS(args=dict(source1=source1, source2=source2), code="""
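        // Rows currently selected in the data table (positions within source2)
        var inds_in_source2 = source2.get('selected')['1d'].indices;
        var d = source2.get('data');
        var inds = [];

        if (inds_in_source2.length == 0) { return; }

        // Map each table row back to its row in the full data via the 'index' column
        for (var i = 0; i < inds_in_source2.length; i++) {
            var ind2 = inds_in_source2[i];
            inds.push(d['index'][ind2]);
        }

        source1.get('selected')['1d'].indices = inds;
        source1.trigger('change');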
    """)

show(vform(fig1, data_table, button))
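
The index bookkeeping that the callback above performs can be shown without Bokeh at all. The following is a minimal plain-Python sketch of that translation; the sample values and the helper name `to_full_indices` are made up for illustration and are not part of Bokeh.

```python
# Plain-Python illustration of the index translation done in the callback.
# The values are made up; only the mapping logic mirrors the example above.
full_y = [12, 80, 45, 67, 3, 91, 49, 50, 20, 75]

# Rows with y < 50, remembered by their position in the full data.
# This plays the role of the 'index' column in source2.
subset_index = [i for i, y in enumerate(full_y) if y < 50]


def to_full_indices(selected_in_subset, subset_index):
    """Translate positions in the subset into positions in the full data."""
    return [subset_index[i] for i in selected_in_subset]


# Selecting the first and third rows of the table ...
selected_in_table = [0, 2]
# ... corresponds to these rows of the full data set.
selected_in_plot = to_full_indices(selected_in_table, subset_index)
```

Every application that links a subset to its parent data has to repeat this mapping by hand, which is the bookkeeping the proposal aims to remove.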

Solution: a filterable data source

My proposal is to add a filterable data source that keeps track of which rows to provide to each renderer that is associated with it. This would allow users to specify subsets of data (e.g. filtered by the value of some column) for individual glyphs. Applications with multiple plots, each using a subset of the same data, would share the data in a similar way to how Bokeh allows plots to share full CDSs now. Linked selection between these plots would be automatic, so that users don't have to write a CustomJS callback to get the functionality shown above.

Constraints

We don't want to break any of the API on the CDS.

Proposed implementation: introducing TableDataSource

Instead of changing the CDS to make it filterable, we can introduce a new data source, potentially called TableDataSource.

The TableDataSource would keep track of filters and indices for each renderer that uses it. A filter could be one of the following (as suggested by @bryedev here): None, a Seq(Int) that lists the subset indices, or a function that returns a Seq(Int).
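
As a rough sketch of how the three filter forms could be normalized into a list of row indices, consider the plain-Python function below. The name `resolve_filter`, its signature, and the choice to pass the column data to a callable filter are assumptions made for illustration; the proposal does not specify them.

```python
def resolve_filter(filt, data):
    """Return the row indices a renderer should draw.

    `filt` is None (all rows), a sequence of ints, or a callable that
    takes the column data and returns a sequence of ints.
    """
    # All columns are assumed to have the same length.
    n_rows = len(next(iter(data.values()))) if data else 0
    if filt is None:
        return list(range(n_rows))
    if callable(filt):
        filt = filt(data)
    indices = list(filt)
    out_of_range = [i for i in indices if not 0 <= i < n_rows]
    if out_of_range:
        raise IndexError("filter indices out of range: %r" % out_of_range)
    return indices


# Hypothetical sample data for the usage examples below.
data = {'x': [0, 1, 2, 3], 'y': [10, 60, 20, 90]}

all_rows = resolve_filter(None, data)
explicit_rows = resolve_filter([0, 2], data)
computed_rows = resolve_filter(
    lambda d: [i for i, y in enumerate(d['y']) if y < 50], data)
```

Normalizing every filter to explicit indices would let the renderer handle all three forms in the same way.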

The TableDataSource would also implement __getitem__, so that the following syntax would be possible:

tds = TableDataSource(df)
r = fig.circle(x='x', y='y', source=tds, filter=tds['weather'] == 'sunny')
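
One way the `__getitem__` syntax could produce a filter is for column access to return an object whose comparison operators yield row indices, similar in spirit to how pandas comparisons yield boolean masks. The sketch below is plain Python; `Column` and `TableDataSourceSketch` are hypothetical names, and returning a list of indices from `==` is an assumption, not a decided design.

```python
class Column(object):
    """Hypothetical column wrapper whose comparisons return matching row indices."""

    def __init__(self, values):
        self.values = list(values)

    def _where(self, predicate):
        return [i for i, v in enumerate(self.values) if predicate(v)]

    def __eq__(self, other):
        return self._where(lambda v: v == other)

    def __lt__(self, other):
        return self._where(lambda v: v < other)

    def __gt__(self, other):
        return self._where(lambda v: v > other)


class TableDataSourceSketch(object):
    """Stand-in for the proposed TableDataSource; only __getitem__ is sketched."""

    def __init__(self, data):
        self.data = dict(data)

    def __getitem__(self, name):
        return Column(self.data[name])


tds = TableDataSourceSketch({
    'weather': ['sunny', 'rainy', 'sunny'],
    'temp': [25, 12, 30],
})

# The comparison yields the Seq(Int) form of a filter.
sunny_rows = tds['weather'] == 'sunny'
cold_rows = tds['temp'] < 20
```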

The TableDataSource would subclass DataSource and inherit the selected property, which indexes the selection against the full dataset. With some work in the glyph renderer and selection manager (similar to the changes in this commit, though the details would differ), linked selection would work without any user code.
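
To illustrate why storing the selection against the full dataset makes linking straightforward, here is a small plain-Python sketch. The function name `visible_selection` is hypothetical, and the actual renderer logic would differ; the point is only that each renderer can derive what to highlight from one shared selection and its own filter.

```python
def visible_selection(selected, filter_indices):
    """Return the selected rows that a given renderer actually draws.

    `selected` holds indices into the full dataset. `filter_indices` is
    the renderer's subset as full-dataset indices, or None for all rows.
    """
    if filter_indices is None:
        return list(selected)
    allowed = set(filter_indices)
    return [i for i in selected if i in allowed]


# One shared selection, expressed in full-dataset indices.
selected = [1, 4, 6]

# The scatter plot draws every row; the table draws only a subset.
scatter_highlight = visible_selection(selected, None)
table_highlight = visible_selection(selected, [0, 2, 4, 6, 8])
```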

Unlike the CDS, whose data property holds the data itself, the TDS's data property would be a data source object that could be shared among multiple TDSs (this could even be a CDS). This would separate the data from the subsets, which would be represented by filters on that data.

cds = ColumnDataSource(df)
tds = TableDataSource(cds)
# Because the TDS is created from the CDS, renderers on either source are automatically linked
r0 = fig0.circle(x='x', y='y', source=cds)
r1 = fig1.circle(x='x', y='y', source=tds, filter=lambda tds: tds['weather'] == 'sunny')
r2 = fig2.circle(x='x', y='y', source=tds, filter=[0, 1, 2])
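
The separation of data from filters described above can be sketched in plain Python as one shared data object plus lightweight views that hold only indices. `SharedData` and `FilteredView` are hypothetical names used for illustration; they do not correspond to existing Bokeh classes.

```python
class SharedData(object):
    """Holds the columns once; stands in for the shared data source."""

    def __init__(self, columns):
        self.columns = columns


class FilteredView(object):
    """Holds a reference to the shared data plus a filter, never a copy."""

    def __init__(self, shared, indices=None):
        self.shared = shared
        self.indices = indices  # None means "all rows"

    def column(self, name):
        values = self.shared.columns[name]
        if self.indices is None:
            return list(values)
        return [values[i] for i in self.indices]


shared = SharedData({'x': [0, 1, 2, 3], 'y': [10, 60, 20, 90]})
view_all = FilteredView(shared)
view_subset = FilteredView(shared, indices=[0, 2])

# An update to the shared data is visible through every view.
shared.columns['y'][0] = 99
y_all = view_all.column('y')
y_subset = view_subset.column('y')
```

Because the views store only indices, adding another subset costs little memory and cannot drift out of sync with the underlying data.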

End result

No Button or CustomJS necessary!

from bokeh.plotting import Figure, output_file, show
from bokeh.models.layouts import VBox
from bokeh.models.sources import TableDataSource
from bokeh.models.widgets import DataTable, TableColumn

from random import randint
import pandas as pd

output_file("subset_example_better.html")

data = dict(
        x=[i for i in range(10)],
        y=[randint(0, 100) for i in range(10)],
        z=['some other data'] * 10
    )

source = TableDataSource(data)

fig1 = Figure(plot_width=300, plot_height=300)
fig1.circle(x='x', y='y', size=10, source=source)

columns = [
        TableColumn(field="y", title="Y"),
        TableColumn(field="z", title="Text"),
    ]
data_table = DataTable(
        source=source,
        filter=lambda tds: tds['y'] < 50,
        columns=columns,
        width=400,
        height=280,
    )

show(VBox(fig1, data_table))