Proposal: Factor out Spark #321

Open
6 tasks
okennedy opened this issue Jan 8, 2025 · 0 comments
Labels
enhancement New feature or request layer-api An issue involving the vizier API layer layer-mimir An issue involving caveats or lenses layer-python An issue involving the Python compatibility code layer-scala An issue involving Scala compatibility code layer-ui An issue involving the UI layer
okennedy commented Jan 8, 2025

Challenge

Spark provides Vizier with significant value.

  • It abstracts data loading / export. We don't need to have loaders for each and every type of data, and moreover, Spark provides a standard for data loading that people have written extensions for.
  • It scales well to large data, and makes it easy to deploy workflows to clusters.
  • It abstracts provenance. It gives us a single model of relational algebra that we can base dependency analysis tools on.
  • It abstracts computation. We can define operators that transform dataframes logically... without having to dig deep into the mechanics.
  • It provides us with a standard data model: org.apache.spark.sql.types is a fantastic collection of types that is notably extensible.

On the other hand, Spark introduces several substantial pain points:

  • It ties us to Scala 2.x, which does not appear to be under active development.
  • It ties us to Java 11, which is getting increasingly difficult to ramp people up on. Moreover, it breaks, more or less without warning, on later Java versions.
  • Migrating across versions of Spark tends to break provenance analysis, as well as related hacks to do things like provide consistent tuple IDs.
  • Providing consistent tuple IDs is a huge pain.
  • Spark is almost half a gigabyte, and providing Python support requires that users download the same half-gigabyte a second time, because Spark packages the full Java assembly with spark-python.
  • Spark is aimed at big data. It has a very slow ramp-up time (loading Hadoop alone takes ~2s on my box), and it generally spends a lot more time optimizing queries (0.5-1s for most of TPC-H) than actually running them. For big datasets this is a worthwhile investment, but it makes Vizier far less interactive.

Simply put, Spark is a very heavyweight solution. We don't want to get rid of it, but it would be nice to give users the option of Spark or something else.

Proposal Summary

(i) Migrate to Substrait for data/query modeling, (ii) factor Spark out into a plugin, and (iii) implement a new plugin based on a simpler query engine to provide analogous functionality.

Checklist

  • Migrate to a trait-based artifact model (Proposal: Extensible Artifact Model #318)
  • Modify all Vizier commands that rely on dataframes to use Substrait instead. Exceptions: SQL, Load, Publish
  • Migrate from Spark's type system to Substrait's type system (SparkSchema, SparkPrimitive). This will likely require providing a translation layer for serialized types.
  • Modify any other remaining Vizier internals that rely on Spark to rely on Traits (or Substrait) as appropriate.
  • Factor out any remaining components of Vizier that rely on Spark into a plugin
  • Implement a new plugin that re-implements those components for, e.g., SQLite or DuckDB
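
As a rough illustration of the translation layer mentioned in the checklist, the sketch below maps Spark SQL simple type names to Substrait simple type names. The table `SPARK_TO_SUBSTRAIT` and the helpers `translate_type` and `translate_schema` are hypothetical names, not existing Vizier code, and the mapping only covers a handful of primitive types; parameterized and nested types (decimals, timestamps, structs, arrays) would need more careful treatment.

```python
# Hypothetical translation table (an assumption for illustration, not Vizier code):
# Spark SQL simple type names on the left, Substrait simple type names on the right.
SPARK_TO_SUBSTRAIT = {
    "boolean": "boolean",
    "tinyint": "i8",
    "smallint": "i16",
    "int": "i32",
    "bigint": "i64",
    "float": "fp32",
    "double": "fp64",
    "string": "string",
    "binary": "binary",
    "date": "date",
}


def translate_type(spark_type: str) -> str:
    """Translate one serialized Spark type name into a Substrait type name."""
    key = spark_type.strip().lower()
    try:
        return SPARK_TO_SUBSTRAIT[key]
    except KeyError:
        # Fail loudly rather than guess: unknown types need an explicit mapping.
        raise ValueError(
            f"No Substrait equivalent registered for Spark type {spark_type!r}"
        ) from None


def translate_schema(fields):
    """Translate a list of (column name, Spark type name) pairs."""
    return [(name, translate_type(spark_type)) for name, spark_type in fields]


if __name__ == "__main__":
    print(translate_schema([("id", "bigint"), ("name", "string")]))
```

Failing on unmapped types, instead of falling back to a default, keeps previously serialized artifacts from silently changing meaning during the migration.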

Proposal

Substrait appears to provide us with most of the generality that Spark does:

  • It abstracts provenance: We should be able to rewrite any fine-/coarse-grained code analysis over Substrait instead.
  • It abstracts computation: Dataframe commands can transform Substrait plans, which can then be deployed back into Spark.
  • It abstracts types: It provides an equally comprehensive collection of types.
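
To make the provenance point concrete, here is a minimal sketch of a coarse-grained dependency analysis written against a plan tree rather than against Spark internals. The nested-dictionary layout loosely follows the shape of Substrait's JSON form (`read` with `namedTable.names`, `filter` with `input`, `join` with `left` and `right`), but it is a simplified assumption: real plans are protobuf messages with many more relation types, and `source_tables` is a hypothetical helper.

```python
def source_tables(rel: dict) -> set:
    """Collect the names of all tables read anywhere beneath `rel`.

    `rel` is a simplified, dictionary-based stand-in for a Substrait relation:
    a single-key dict such as {"read": {...}}, {"filter": {...}}, {"join": {...}}.
    """
    if "read" in rel:
        # Leaf relation: a named table identified by its (possibly qualified) name.
        return {".".join(rel["read"]["namedTable"]["names"])}

    found = set()
    for body in rel.values():
        if not isinstance(body, dict):
            continue
        # Unary relations keep their child under "input"; joins use "left"/"right".
        for key in ("input", "left", "right"):
            if key in body:
                found |= source_tables(body[key])
    return found


if __name__ == "__main__":
    plan = {
        "join": {
            "left": {
                "filter": {
                    "input": {"read": {"namedTable": {"names": ["orders"]}}},
                    "condition": "total > 100",
                }
            },
            "right": {"read": {"namedTable": {"names": ["customers"]}}},
        }
    }
    print(sorted(source_tables(plan)))
```

Because the walk depends only on the plan's structure, the same analysis would apply no matter which engine eventually executes the plan.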

Substrait does not itself provide a means of computation. However, Spark and DuckDB both support executing Substrait plans, so we could model execution by letting each engine provide its own implementation of Iterable (and related interfaces) for a generic SubstraitRelation; SubstraitRelation does not need to define Iterable itself.
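
One way this separation could look is sketched below: the relation is a plain value that carries a plan, and iteration only becomes available once the relation is paired with an engine. Everything here is an assumption for illustration. `Engine`, `InMemoryEngine`, and `BoundRelation` are hypothetical names, and `InMemoryEngine` is a stand-in that looks up canned results where a real Spark or DuckDB plugin would execute the plan.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Protocol, Tuple

Row = Tuple[Any, ...]


@dataclass(frozen=True)
class SubstraitRelation:
    """A plan and nothing else: deliberately defines no __iter__."""
    plan: bytes  # opaque serialized plan


class Engine(Protocol):
    """What a plugin has to provide in order to execute a relation."""

    def rows(self, relation: SubstraitRelation) -> Iterator[Row]:
        ...


class InMemoryEngine:
    """Stand-in for a real plugin; returns canned rows keyed by plan bytes."""

    def __init__(self, results: Dict[bytes, List[Row]]):
        self._results = results

    def rows(self, relation: SubstraitRelation) -> Iterator[Row]:
        return iter(self._results[relation.plan])


class BoundRelation:
    """A relation paired with an engine; only this pairing is iterable."""

    def __init__(self, relation: SubstraitRelation, engine: Engine):
        self._relation = relation
        self._engine = engine

    def __iter__(self) -> Iterator[Row]:
        return self._engine.rows(self._relation)


if __name__ == "__main__":
    relation = SubstraitRelation(plan=b"plan-1")
    engine = InMemoryEngine({b"plan-1": [(1, "a"), (2, "b")]})
    for row in BoundRelation(relation, engine):
        print(row)
```

In Scala, the same idea would more naturally be expressed as a trait or type class that each plugin implements for SubstraitRelation; the wrapper object here is just the closest Python analogue.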

Substrait is agnostic to scalability. If we do this right, there should be negligible overhead relative to the existing artifact model.

@okennedy okennedy added enhancement New feature or request layer-ui An issue involving the UI layer layer-mimir An issue involving caveats or lenses layer-api An issue involving the vizier API layer layer-python An issue involving the Python compatibility code layer-scala An issue involving Scala compatibility code labels Jan 8, 2025
@okennedy okennedy added this to the Version 2.1 milestone Jan 8, 2025