-
Notifications
You must be signed in to change notification settings - Fork 0
Working Document: Remote datasources
WD | Remote datasources |
---|---|
Authors | Hugo Shi |
Status | WIP |
Discussion | https://github.com/bokeh/bokeh/issues/1487 |
Implementation | Mostly implemented by #1713 |
This is a discussion of how Bokeh should integrate with remote data sources
- Hosted on a blaze server (possibly integrate one into the bokeh server via a flask blueprint)
- Served up via some AJAX endpoint
The scope of this dicussion is to support
- streaming pdates
- abstract rendering
- streaming updating plots without need for the Bokeh Server
- things like Ipython interact transforming blaze data sources feeding into abstract rendering
With special consideration for to implement
- linked brushing/selections for remote data sources
- linked brushing/selections for remote data sources, when abstract rendering is in play
- incremental updates for streaming data sets
-
It would be nice if all Bokeh APIs by default supported remote data sources as drop in replacements for ColumnDataSources
-
This is tricky, because some of the more sophisticated Bokeh APIs require pre-processing of the data in order to determine what plots to construct (Bokeh.charts) i.e. computing histograms, Facetting data to create grid plots, computing max/mins for determine various bounds
-
If the data volume is small, we can probably fetch a copy of the data, and process it. If the data is large, we probably require it to be in blaze, since that will be the strategy for AR going forward. Then we assume whatever computations are done on the dataset in the bokeh python code can be done in blaze
-
This also suggests that blaze should be a bokeh dependency - of course this gets a bit tricky because bokeh is pip installable, and blaze isn't really
-
What is the proper model for using a remote data source? The previous approach I advocated was subclassing DataSource, and adding a javascript update method which would query for the new data. However this gets tricky - In the bokeh pattern of sharing objects in order to express relationships, What if you created a line plot using abstract rendering, as well as a scatter plot using abstract rendering. The abstract rendering server would return a downsampled line for the line plot, and a downsampled image for the scatter plot. Sharing the same javascript object for this would result in both representations stomping on each other
-
The previous approach, was to detect a remote data source, and create a dummy ColumnDataSource for each renderer. This is fine but it means that all bokeh plotting APIs need to use this functionality
-
There is an incompatability with server data sources and Backbone REST API. Backbone assumes that the server is returning it's copy of the object whenever you do an update, so Backbone synchronization is really a state reconciliation. However for remote data sources, we don't actually persist the data. in the bokeh server, so whenever we go to save the datasource(for example, in response to a selection) the server sends it's copy of the data source(which contains no data) and that wipes out the copy we got from an ajax endpoint
- Bokeh supports linking selections across renderers by sharing the same data source object. Bokeh also supports (or will) an explicit link command, to link selections across different data sources. For ColumnDataSources, rendering selections is fairly straight forward, because we have access to the entire dataset in the clien. What should happen for remote data sources?
- We could filter the data out for remote data sources when it comes to selection
- We could continue to send the entire dataset, and use selections on the client side to render selection and non selection glyphs
- What about for abstract rendering - Does the abstract rendering function now need to understand how to render selection and non-selection glyphs?
- I think that Blaze should support some sort of streaming protocol (that other ajax sources could implement if they want
- I do not think we should try to do streaming with the column data source because we have to create copies of the bokeh object graph in order to create applications, or in order to publish the plot. And there is no way to update the disparate copies
- Possibly we do something in blaze where the blaze server (or some bokeh wrapper around the blaze server) returns a sequence number, and the datasource can be configured to poll (Sending the latest sequence number)
- Implement a multi user blaze server
- This server will have a single user and a multi user mode similar to how the bokeh server has both modes
- The server will also have global sources which are defined via a script which has no user affiliation and can only be modified by admin users
- Necessary endpoints
- datashape (from blaze)
- compute (from blaze)
- upload
- append (for selected file formats)
- In addition, the bokeh server will add one more endpoint for abstract rendering
- new APIs need to be added - a query request for row-based incremental updates
- as well as a data append operation
- We will be modifying ServerDataSource to extend ColumnDataSource. The understanding is that each distinct ServerDataSource object is it's own view on the data. One can link selections across them after creating them, but if you want different views of the same data set for different plots, then you need to create multiple ServerDataSource objects and point them at the same underlying dataset.
- Server Data sources will have a polling interval field, that can be used to poll for updates or incremental updates