Proposal: Factor out Spark #321

okennedy · 2025-01-08T14:42:57Z

Challenge

Spark provides Vizier with significant value.

It abstracts data loading / export. We don't need to have loaders for each and every type of data, and moreover, Spark provides a standard for data loading that people have written extensions for.
It scales well to large data, and makes it easy to deploy workflows to clusters.
It abstracts provenance. It gives us a single model of relational algebra that we can base dependency analysis tools on.
It abstracts computation. We can define operators that transform dataframes logically... without having to dig deep into the mechanics.
It provides us with a standard data model: org.apache.spark.types is a Fantastic collection of types, that is notably extensible.

On the other hand, Spark introduces several substantial pain points:

It ties us to Scala 2.x, which does not appear to be under continual development.
It ties us to Java 11, which is getting increasingly difficult to ramp people up on. Moreover, it breaks more/less without warning on later java versions...
Migrating across versions of Spark tends to break provenance analysis, as well as related hacks to do things like provide consistent tuple IDs.
Providing consistent tuple IDs is a huge pain.
Spark is almost half a gig... and providing python support requires that users download the same half-gig a second time, because Spark packages the full java assembly with spark-python
Spark is aimed at big data. It has a very slow ramp-up time (Spark loading Hadoop alone takes ~2s to load on my box), and generally spends a lot more time optimizing queries (.5-1s for most of TPCH) than actually running them. For big datasets, this is a worthwhile investment... but it makes Vizier far less interactive.

Simply put, Spark is a very heavyweight solution. We don't want to get rid of it, but it would be nice to give users the option of Spark or something else.

Proposal Summary

(i) Migrate to substrait for data/query modeling, (ii) Factor Spark out into a plugin, (iii) Implement a new plugin based on a simpler query engine to provide analogous functionality.

Checklist

Migrate to a trait-based artifact model Proposal: Extensible Artifact Model #318
Modify all Vizier commands that rely on dataframes to use Substrait instead. Exceptions: SQL, Load, Publish
Migrate from Spark's typesystem to Substrait's typesystem (SparkSchema, SparkPrimitive). This will likely require providing a translation layer for serialized types.
Modify any other remaining Vizier internals that rely on Spark to rely on Traits (or Substrait) as appropriate.
Factor out any remaining components of Vizier that rely on Spark into a plugin
Implement a new plugin that re-implements those components for, e.g., Sqlite or DuckDB

Proposal

Substrait appears to provide us with most of the generalizability that spark did:

It abstracts provenance: We should be able to rewrite any fine-/coarse-grained code analysis over substrait instead.
It abstracts computation: Dataframe commands can transform substrait, which can then be deployed back into Spark.
It abstracts types: It provides an equally comprehensive collection of types.

Substrait does not provide a means of computation. However, Spark and DuckDB both provide support for executing substrait, and we can model that possibly by allowing both to provide an implementation of Iterable (and related interfaces) for a generic SubstraitRelation. (i.e., SubstraitRelation does not need to define Iterable itself).

Substrait is agnostic to scalability. If we do this right, there should be negligible overhead relative to the existing artifact model.

The text was updated successfully, but these errors were encountered:

okennedy added this to the Version 2.1 milestone Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Factor out Spark #321

Proposal: Factor out Spark #321

okennedy commented Jan 8, 2025

Proposal: Factor out Spark #321

Proposal: Factor out Spark #321

Comments

okennedy commented Jan 8, 2025

Challenge

Proposal Summary

Checklist

Proposal