Proposal: Factor out Spark #321

@okennedy

Description

Challenge

Spark provides Vizier with significant value:

  • It abstracts data loading / export. We don't need loaders for every data format, and Spark's data source API is a de facto standard that third parties have written extensions for.
  • It scales well to large data and makes it easy to deploy workflows to clusters.
  • It abstracts provenance. It gives us a single relational algebra model on which to base dependency analysis tools.
  • It abstracts computation. We can define operators that transform dataframes logically, without digging into execution mechanics.
  • It provides us with a standard data model: org.apache.spark.sql.types is a fantastic, notably extensible collection of types.

On the other hand, Spark introduces several substantial pain points:

  • It ties us to Scala 2.x, which no longer appears to be under active development.
  • It ties us to Java 11, which is getting increasingly difficult to ramp people up on. Moreover, Spark breaks more or less without warning on later Java versions.
  • Migrating across versions of Spark tends to break provenance analysis, as well as related hacks like providing consistent tuple IDs.
  • Providing consistent tuple IDs is a huge pain.
  • Spark is almost half a gigabyte... and providing Python support requires users to download the same half-gigabyte a second time, because Spark packages the full Java assembly with spark-python.
  • Spark is aimed at big data. It has a very slow ramp-up time (loading Hadoop alone takes ~2s on my box), and it generally spends more time optimizing queries (0.5-1s for most of TPC-H) than actually running them. For big datasets this is a worthwhile investment, but it makes Vizier far less interactive.

Simply put, Spark is a very heavyweight solution. We don't want to get rid of it, but it would be nice to give users the choice between Spark and something lighter.

Proposal Summary

(i) Migrate to Substrait for data/query modeling, (ii) factor Spark out into a plugin, and (iii) implement a new plugin, based on a simpler query engine, that provides analogous functionality.
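To make (ii) and (iii) concrete, here is one possible shape for the plugin boundary. This is a hedged sketch only: ExecutionBackend, SubstraitPlan, and Row are hypothetical names invented for illustration, not existing Vizier interfaces.

```scala
/** Placeholder for a query plan serialized in Substrait form. */
case class SubstraitPlan(serialized: Array[Byte])

/** Engine-agnostic row: a flat sequence of column values. */
case class Row(values: Seq[Any])

/** The surface a backend plugin would implement.  Spark becomes one
  * implementation; a lighter engine like DuckDB becomes another. */
trait ExecutionBackend {
  /** Identifier used to select the backend, e.g. "spark" or "duckdb". */
  def name: String

  /** Load a dataset and describe it as a Substrait read relation. */
  def load(url: String, format: String): SubstraitPlan

  /** Execute a Substrait plan and stream results back to Vizier. */
  def execute(plan: SubstraitPlan): Iterator[Row]
}
```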

Checklist

Proposal

Substrait appears to provide us with most of the generality that Spark does:

  • It abstracts provenance: We should be able to rewrite any fine- or coarse-grained code analysis over Substrait instead (see the sketch after this list).
  • It abstracts computation: Dataframe commands can transform Substrait plans, which can then be handed back to Spark for execution.
  • It abstracts types: It provides an equally comprehensive collection of types.
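
As a rough illustration of the first two points, the toy sketch below mirrors a handful of Substrait relation types and shows a coarse-grained dependency analysis written once against that algebra rather than against Spark's LogicalPlan. The case classes are simplified stand-ins, not Substrait's actual schema.

```scala
// Simplified stand-ins for a few Substrait relation types.
sealed trait Rel
case class Read(table: String)                       extends Rel
case class Filter(input: Rel, predicate: String)     extends Rel
case class Project(input: Rel, columns: Seq[String]) extends Rel
case class Join(left: Rel, right: Rel, cond: String) extends Rel

object Provenance {
  /** Which base tables does a plan read? (coarse-grained provenance) */
  def dependencies(rel: Rel): Set[String] =
    rel match {
      case Read(table)       => Set(table)
      case Filter(input, _)  => dependencies(input)
      case Project(input, _) => dependencies(input)
      case Join(l, r, _)     => dependencies(l) ++ dependencies(r)
    }
}

// Provenance.dependencies(
//   Join(Read("r"), Filter(Read("s"), "x > 2"), "r.id = s.id")
// ) == Set("r", "s")
```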

Substrait does not itself provide a means of computation. However, Spark and DuckDB both support executing Substrait plans, and we can model this by letting each engine provide an implementation of Iterable (and related interfaces) for a generic SubstraitRelation; that is, SubstraitRelation does not need to define Iterable itself.
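
A minimal sketch of that pattern, reusing the hypothetical SubstraitPlan and Row placeholders from the earlier sketch: the relation carries only the plan, and each engine supplies the evaluation.

```scala
/** Pure plan description; deliberately does NOT extend Iterable itself. */
case class SubstraitRelation(plan: SubstraitPlan)

/** Evidence that some engine can evaluate a SubstraitRelation. */
trait Evaluator {
  def iterable(rel: SubstraitRelation): Iterable[Row]
}

/** Stand-in for a Spark-backed evaluator; a real implementation would hand
  * the plan to Spark's Substrait consumer and wrap the resulting DataFrame. */
object SparkEvaluator extends Evaluator {
  def iterable(rel: SubstraitRelation): Iterable[Row] = ???
}

/** Stand-in for a DuckDB-backed evaluator; a real implementation would pass
  * the serialized plan to DuckDB's substrait extension over JDBC. */
object DuckDBEvaluator extends Evaluator {
  def iterable(rel: SubstraitRelation): Iterable[Row] = ???
}
```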

Substrait is agnostic to scalability. If we do this right, there should be negligible overhead relative to the existing artifact model.
