
Support Spark including Spark SQL #297

@anjackson


To support more modern usage patterns and more complex processing, it would be good to support Spark.

Long term, this should likely integrate with the Archives Unleashed Toolkit, but transitioning to it is not easy for us at the moment. This is mostly because of how it handles record contents, which get embedded in the data frames and lead to heavy memory pressure (notes TBA).

The WarcLoader on the current hadoop3 branch provides an initial implementation. It builds an RDD of WARC records and supports running the analyser on that stream; the analyser can work on the full 'local' byte streams as long as no re-partitioning has happened. It outputs a stream of objects that hold the extracted metadata fields and are no longer tied to the original WARC input streams. That output can then be turned into a DataFrame, queried with SQL, and exported as Parquet etc.
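
A minimal sketch of that flow, assuming a static WarcLoader.loadAndAnalyse helper and a Memento POJO along the lines described below (the exact names, signatures and paths are illustrative and may not match the hadoop3 branch; the Spark calls themselves are standard):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WarcToDataFrameSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("warc-loader-sketch")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Hypothetical call: build the RDD of WARC records and run the analyser
        // while the record byte streams are still 'local', emitting detached POJOs.
        JavaRDD<Memento> mementos =
                WarcLoader.loadAndAnalyse("/data/warcs/*.warc.gz", jsc);

        // Once detached from the WARC input streams, the POJOs can back a DataFrame.
        Dataset<Row> df = spark.createDataFrame(mementos, Memento.class);

        // Run SQL over the extracted metadata.
        df.createOrReplaceTempView("mementos");
        spark.sql("SELECT contentType, COUNT(*) AS n FROM mementos GROUP BY contentType").show();

        // Export for downstream use.
        df.write().parquet("/data/output/mementos.parquet");

        spark.stop();
    }
}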

  • Convert from input to POJO (a POJO allows mapping a JavaRDD to a DataFrame): a Memento (sketched at the end of this issue) with:
    • Named fields for core properties, naming consistent with CDX etc.
    • The source file name and offset etc.
    • A @transient reference to the underlying WritableArchiveRecord wrapped as HashCached etc.
    • A HashMap<String,String> for arbitrary extracted metadata fields.
  • Add a mechanism for configuring the analyser process behind loadAndAnalyse and createDataFrame (short term).
  • Reconsider the configuration mechanism (see "Review configuration mechanism and consider a 'fluent' API" #107) and maybe separate out the different analysis steps.
  • Possibly an 'enrich' convention, a bit like ArchiveSpark, with the WARC Indexer being wrapped to create an enriched Memento from a basic one.
    • e.g. rdd.mapPartitions(EnricherFunction) wrapped as...
    • JavaRDD rdd = WebArchiveLoader.load("paths", JavaSparkContext).enrich(WarcIndexerEnricherFunction);
  • Use Spark SQL:
    • Register the DataFrame as a temp table: df.createOrReplaceTempView("mementos")
  • Use Spark SQL + Iceberg to take the temp table and MERGE INTO a destination MegaTable (see the sketch after this list).
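
Continuing from the sketch above (re-using the spark session and the mementos RDD), those last two steps could look roughly like this; the Iceberg catalog name, destination table and key columns are placeholders, not settled names:

// Register the per-batch DataFrame of extracted metadata as a temp view...
Dataset<Row> df = spark.createDataFrame(mementos, Memento.class);
df.createOrReplaceTempView("mementos");

// ...then upsert it into an Iceberg table, keyed on source file + offset.
// Assumes a Spark catalog called 'iceberg' has been configured for Iceberg.
spark.sql(
    "MERGE INTO iceberg.db.mega_table AS t " +
    "USING mementos AS s " +
    "ON t.sourceFile = s.sourceFile AND t.sourceOffset = s.sourceOffset " +
    "WHEN MATCHED THEN UPDATE SET * " +
    "WHEN NOT MATCHED THEN INSERT *");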

The POJO approach does mean we end up with a very wide schema with a lot of nulls if there has not been much analysis. Supporting a more dynamic schema would be nice, but then again a fixed schema aligns with the Solr schema.
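
For illustration, a minimal sketch of what such a Memento bean might look like; the field names are placeholders loosely following the CDX/Solr naming, and the import for the project's own WritableArchiveRecord is left out:

import java.io.Serializable;
import java.util.HashMap;

// Illustrative only: a plain serialisable bean maps straight onto a DataFrame
// via spark.createDataFrame(rdd, Memento.class).
public class Memento implements Serializable {

    // Core properties, named consistently with CDX / the Solr schema.
    private String url;
    private String timestamp;
    private String contentType;
    private int statusCode;
    private String digest;

    // Provenance: where the record came from.
    private String sourceFile;
    private long sourceOffset;

    // Not part of the DataFrame schema: the underlying record, only usable
    // while processing stays on the original partition.
    private transient WritableArchiveRecord record;

    // Arbitrary extracted metadata fields.
    private HashMap<String, String> metadata = new HashMap<>();

    // Getters and setters omitted for brevity; Spark's bean encoder relies on them.
}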
