Proposal: Factor out Spark #321
Challenge
Spark provides Vizier with significant value. org.apache.spark.types is a fantastic collection of types that is notably extensible.
On the other hand, Spark introduces several substantial pain points.
Simply put, Spark is a very heavyweight solution. We don't want to get rid of it, but it would be nice to give users the option of Spark or something else.
Proposal Summary
(i) Migrate to Substrait for data/query modeling, (ii) factor Spark out into a plugin, and (iii) implement a new plugin based on a simpler query engine to provide analogous functionality.
Checklist
Proposal
Substrait appears to provide us with most of the generalizability that Spark did:
Substrait does not provide a means of computation. However, Spark and DuckDB both support executing Substrait plans, and we could model that by having each engine provide an implementation of Iterable (and related interfaces) for a generic SubstraitRelation (i.e., SubstraitRelation does not need to define Iterable itself).
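One way the split between relation and engine might look is sketched below. Everything here is hypothetical and for illustration only: SubstraitRelation, RelationExecutor, InMemoryExecutor, and Row are assumed names, not existing Vizier or Substrait APIs, and the in-memory engine merely stands in for a Spark or DuckDB plugin so the sketch runs without either dependency.

```scala
// A minimal sketch under stated assumptions; all names are hypothetical.
object SubstraitPluginSketch {
  type Row = Seq[Any]

  // Engine-agnostic description of a relation. In a real design `plan`
  // would carry a serialized Substrait plan; here it is an opaque payload
  // and `label` is an illustrative handle used by the stand-in engine.
  final case class SubstraitRelation(plan: Array[Byte], label: String)

  // The plugin interface: each engine (Spark, DuckDB, ...) would supply
  // its own implementation that knows how to run a Substrait plan.
  trait RelationExecutor {
    def name: String
    def execute(relation: SubstraitRelation): Iterable[Row]
  }

  // Stand-in engine that looks rows up by label instead of running a plan.
  final class InMemoryExecutor(tables: Map[String, Seq[Row]]) extends RelationExecutor {
    val name: String = "in-memory"
    def execute(relation: SubstraitRelation): Iterable[Row] =
      tables.getOrElse(relation.label, Seq.empty)
  }

  // SubstraitRelation does not define Iterable itself; iteration becomes
  // available only when some engine's executor is in implicit scope.
  implicit class RelationOps(relation: SubstraitRelation) {
    def rows(implicit executor: RelationExecutor): Iterable[Row] =
      executor.execute(relation)
  }
}
```

With this shape, swapping engines would amount to bringing a different implicit RelationExecutor into scope, and code that consumes `relation.rows` would not need to change.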
Substrait is agnostic to scalability. If we do this right, there should be negligible overhead relative to the existing artifact model.