To support more modern usage patterns and more complex processing, it would be good to support Spark.
Long term, this should probably integrate with the Archives Unleashed Toolkit, but at the moment it is not easy for us to transition to using that. This is mostly due to how the Toolkit handles record contents, which get embedded in the data frames, leading to heavy memory pressure (notes TBA).
The current hadoop3-branch WarcLoader provides an initial implementation. It works by building an RDD of WARC records, and also supports running the analyser on that stream; the analyser can work on the full 'local' byte streams as long as no re-partitioning has happened. The output is a stream of objects that contain the extracted metadata fields and are no longer tied to the original WARC input streams. That stream can then be turned into a DataFrame, queried with SQL, and exported as Parquet etc.
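Roughly, that flow could look like the sketch below. This is only illustrative: the `WarcLoader.loadAndAnalyse` entry point and the `Memento` field/column names are assumptions about the eventual API, not the current hadoop3-branch code.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WarcToParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("warc-to-parquet").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Build an RDD of analysed records: the analyser runs while the
        // per-file byte streams are still 'local' (i.e. before any shuffle)
        // and emits plain Memento POJOs detached from the WARC input streams.
        // (Hypothetical signature, for illustration only.)
        JavaRDD<Memento> mementos =
                WarcLoader.loadAndAnalyse("/data/warcs/*.warc.gz", jsc);

        // Once we have plain POJOs, the usual Spark SQL machinery applies.
        Dataset<Row> df = spark.createDataFrame(mementos, Memento.class);
        df.createOrReplaceTempView("mementos");
        spark.sql("SELECT contentType, COUNT(*) FROM mementos GROUP BY contentType").show();

        // ...and the results can be exported, e.g. as Parquet.
        df.write().mode("overwrite").parquet("/data/output/mementos.parquet");
    }
}
```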
- Convert from input to a POJO (a POJO allows mapping a JavaRDD to a DataFrame): a Memento (see the sketch after this list) with:
  - Named fields for core properties, with naming consistent with CDX etc.
  - The source file name and offset etc.
  - A `@transient` reference to the underlying `WritableArchiveRecord`, wrapped as HashCached etc.
  - A `HashMap<String,String>` for arbitrary extracted metadata fields.
- Add a mechanism for configuring the analyser process behind `loadAndAnalyse` and `createDataFrame` (short term).
- Reconsider Review configuration mechanism and consider a 'fluent' API #107, and maybe separate out the different analysis steps.
- Possibly an 'enrich' convention, a bit like ArchiveSpark, with the WARC Indexer being wrapped to create an enriched Memento from a basic one (see the sketch after this list).
  - e.g. `rdd.mapPartitions(EnricherFunction)` wrapped as...
  - `JavaRDD rdd = WebArchiveLoader.load("paths", JavaSparkContext).enrich(WarcIndexerEnricherFunction);`
- Use Spark SQL:
  - Register the DataFrame as a temp table: `df.createOrReplaceTempView("mementos")`
  - Use Spark SQL + Iceberg to take the temp table and `MERGE INTO` a destination MegaTable.
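For reference, a minimal sketch of what the Memento POJO could look like; all field names here are placeholders that would need aligning with the CDX/Solr naming, and the getters/setters Spark's bean encoder needs are mostly omitted:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class Memento implements Serializable {
    // Core properties, named consistently with CDX etc. (placeholder names).
    private String url;
    private String timestamp;
    private String contentType;
    private int statusCode;
    private String digest;

    // Provenance: where the record came from.
    private String sourceFilePath;
    private long sourceFileOffset;

    // Reference to the project's existing record wrapper; transient so it is
    // never serialised into the DataFrame, and only valid before re-partitioning.
    private transient WritableArchiveRecord record;

    // Arbitrary extracted metadata fields.
    private Map<String, String> metadata = new HashMap<>();

    // Getters/setters omitted for brevity, e.g.:
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
```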
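Similarly, a sketch of the fluent 'enrich' style and the Iceberg step. `WebArchiveLoader`, `WarcIndexerEnricherFunction` and the `archive.mega_table` destination are illustrative names, and the `MERGE INTO` assumes the Iceberg runtime and SQL extensions have been configured on the Spark session:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EnrichAndMergeSketch {
    public static void run(SparkSession spark, JavaSparkContext jsc) {
        // Fluent enrichment: load(...) would return a thin wrapper whose
        // enrich(...) applies the function via rdd.mapPartitions(...) and
        // hands back plain Memento POJOs.
        JavaRDD<Memento> enriched = WebArchiveLoader
                .load("/data/warcs/*.warc.gz", jsc)
                .enrich(new WarcIndexerEnricherFunction());

        // Register the batch as a temp view...
        Dataset<Row> df = spark.createDataFrame(enriched, Memento.class);
        df.createOrReplaceTempView("mementos");

        // ...and upsert it into a long-lived destination table via Iceberg.
        spark.sql(
            "MERGE INTO archive.mega_table t " +
            "USING mementos s " +
            "ON t.url = s.url AND t.timestamp = s.timestamp " +
            "WHEN MATCHED THEN UPDATE SET * " +
            "WHEN NOT MATCHED THEN INSERT *");
    }
}
```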
The POJO approach does mean we end up with a very wide schema with a lot of nulls if not much analysis has been run. Supporting a more dynamic schema would be nice, but then again a fixed schema aligns well with the Solr schema.