Proposal: Factor out Spark #321

@okennedy

Description

Challenge

Spark provides Vizier with significant value:

  • It abstracts data loading / export. We don't need loaders for every data format, and Spark's data source API is a de facto standard that third parties have written extensions for.
  • It scales well to large data and makes it easy to deploy workflows to clusters.
  • It abstracts provenance. It gives us a single relational algebra model on which to base dependency analysis tools.
  • It abstracts computation. We can define operators that transform dataframes logically, without digging into execution mechanics.
  • It provides us with a standard data model: org.apache.spark.sql.types is a fantastic, notably extensible collection of types.

On the other hand, Spark introduces several substantial pain points:

  • It ties us to Scala 2.x, which no longer appears to be under active development.
  • It ties us to Java 11, which is getting increasingly difficult to ramp people up on. Moreover, Spark breaks more or less without warning on later Java versions.
  • Migrating across versions of Spark tends to break provenance analysis, as well as related hacks like providing consistent tuple IDs.
  • Providing consistent tuple IDs is a huge pain.
  • Spark is almost half a gigabyte... and providing Python support requires users to download the same half-gigabyte a second time, because Spark packages the full Java assembly with spark-python.
  • Spark is aimed at big data. It has a very slow ramp-up time (loading Hadoop alone takes ~2s on my box), and it generally spends more time optimizing queries (0.5-1s for most of TPC-H) than actually running them. For big datasets this is a worthwhile investment, but it makes Vizier far less interactive.

Simply put, Spark is a very heavyweight solution. We don't want to get rid of it, but it would be nice to give users the choice between Spark and something lighter.

Proposal Summary

(i) Migrate to Substrait for data/query modeling, (ii) factor Spark out into a plugin, and (iii) implement a new plugin, based on a simpler query engine, that provides analogous functionality.
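To make (ii) and (iii) concrete, here is one possible shape for the plugin boundary. This is a hedged sketch only: ExecutionBackend, SubstraitPlan, and Row are hypothetical names invented for illustration, not existing Vizier interfaces.

```scala
/** Placeholder for a query plan serialized in Substrait form. */
case class SubstraitPlan(serialized: Array[Byte])

/** Engine-agnostic row: a flat sequence of column values. */
case class Row(values: Seq[Any])

/** The surface a backend plugin would implement.  Spark becomes one
  * implementation; a lighter engine like DuckDB becomes another. */
trait ExecutionBackend {
  /** Identifier used to select the backend, e.g. "spark" or "duckdb". */
  def name: String

  /** Load a dataset and describe it as a Substrait read relation. */
  def load(url: String, format: String): SubstraitPlan

  /** Execute a Substrait plan and stream results back to Vizier. */
  def execute(plan: SubstraitPlan): Iterator[Row]
}
```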

Checklist

Proposal

Substrait appears to provide us with most of the generality that Spark does:

  • It abstracts provenance: We should be able to rewrite any fine- or coarse-grained code analysis over Substrait instead (see the sketch after this list).
  • It abstracts computation: Dataframe commands can transform Substrait plans, which can then be handed back to Spark for execution.
  • It abstracts types: It provides an equally comprehensive collection of types.
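
As a rough illustration of the first two points, the toy sketch below mirrors a handful of Substrait relation types and shows a coarse-grained dependency analysis written once against that algebra rather than against Spark's LogicalPlan. The case classes are simplified stand-ins, not Substrait's actual schema.

```scala
// Simplified stand-ins for a few Substrait relation types.
sealed trait Rel
case class Read(table: String)                       extends Rel
case class Filter(input: Rel, predicate: String)     extends Rel
case class Project(input: Rel, columns: Seq[String]) extends Rel
case class Join(left: Rel, right: Rel, cond: String) extends Rel

object Provenance {
  /** Which base tables does a plan read? (coarse-grained provenance) */
  def dependencies(rel: Rel): Set[String] =
    rel match {
      case Read(table)       => Set(table)
      case Filter(input, _)  => dependencies(input)
      case Project(input, _) => dependencies(input)
      case Join(l, r, _)     => dependencies(l) ++ dependencies(r)
    }
}

// Provenance.dependencies(
//   Join(Read("r"), Filter(Read("s"), "x > 2"), "r.id = s.id")
// ) == Set("r", "s")
```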

Substrait does not itself provide a means of computation. However, Spark and DuckDB both support executing Substrait plans, and we can model this by letting each engine provide an implementation of Iterable (and related interfaces) for a generic SubstraitRelation; that is, SubstraitRelation does not need to define Iterable itself.
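
A minimal sketch of that pattern, reusing the hypothetical SubstraitPlan and Row placeholders from the earlier sketch: the relation carries only the plan, and each engine supplies the evaluation.

```scala
/** Pure plan description; deliberately does NOT extend Iterable itself. */
case class SubstraitRelation(plan: SubstraitPlan)

/** Evidence that some engine can evaluate a SubstraitRelation. */
trait Evaluator {
  def iterable(rel: SubstraitRelation): Iterable[Row]
}

/** Stand-in for a Spark-backed evaluator; a real implementation would hand
  * the plan to Spark's Substrait consumer and wrap the resulting DataFrame. */
object SparkEvaluator extends Evaluator {
  def iterable(rel: SubstraitRelation): Iterable[Row] = ???
}

/** Stand-in for a DuckDB-backed evaluator; a real implementation would pass
  * the serialized plan to DuckDB's substrait extension over JDBC. */
object DuckDBEvaluator extends Evaluator {
  def iterable(rel: SubstraitRelation): Iterable[Row] = ???
}
```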

Substrait is agnostic to scalability. If we do this right, there should be negligible overhead relative to the existing artifact model.
