To support more modern usage patterns and more complex processing, it would be good to support Spark.
Long term, this should probably integrate with the Archives Unleashed Toolkit, but at the moment it is not easy for us to transition to using that. This is mostly due to how the Toolkit handles record contents, which get embedded in the data frames, leading to heavy memory pressure (notes TBA).
The current hadoop3-branch WarcLoader provides an initial implementation. It works by building an RDD of WARC records, and also supports running the analyser on that stream; the analyser can work on the full 'local' byte streams as long as no re-partitioning has happened. The output is a stream of objects that contain the extracted metadata fields and are no longer tied to the original WARC input streams. That stream can then be turned into a DataFrame, queried with SQL, and exported as Parquet etc.
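Roughly, that flow could look like the sketch below. This is only illustrative: the `WarcLoader.loadAndAnalyse` entry point and the `Memento` field/column names are assumptions about the eventual API, not the current hadoop3-branch code.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WarcToParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("warc-to-parquet").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Build an RDD of analysed records: the analyser runs while the
        // per-file byte streams are still 'local' (i.e. before any shuffle)
        // and emits plain Memento POJOs detached from the WARC input streams.
        // (Hypothetical signature, for illustration only.)
        JavaRDD<Memento> mementos =
                WarcLoader.loadAndAnalyse("/data/warcs/*.warc.gz", jsc);

        // Once we have plain POJOs, the usual Spark SQL machinery applies.
        Dataset<Row> df = spark.createDataFrame(mementos, Memento.class);
        df.createOrReplaceTempView("mementos");
        spark.sql("SELECT contentType, COUNT(*) FROM mementos GROUP BY contentType").show();

        // ...and the results can be exported, e.g. as Parquet.
        df.write().mode("overwrite").parquet("/data/output/mementos.parquet");
    }
}
```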
- Convert from input to a POJO (a POJO allows mapping a JavaRDD to a DataFrame): a Memento (see the sketch after this list) with:
  - Named fields for core properties, with naming consistent with CDX etc.
  - The source file name and offset etc.
  - A `@transient` reference to the underlying `WritableArchiveRecord`, wrapped as HashCached etc.
  - A `HashMap<String,String>` for arbitrary extracted metadata fields.
- Add a mechanism for configuring the analyser process behind `loadAndAnalyse` and `createDataFrame` (short term).
- Reconsider Review configuration mechanism and consider a 'fluent' API #107, and maybe separate out the different analysis steps.
- Possibly an 'enrich' convention, a bit like ArchiveSpark, with the WARC Indexer being wrapped to create an enriched Memento from a basic one (see the sketch after this list).
  - e.g. `rdd.mapPartitions(EnricherFunction)` wrapped as...
  - `JavaRDD rdd = WebArchiveLoader.load("paths", JavaSparkContext).enrich(WarcIndexerEnricherFunction);`
- Use Spark SQL:
  - Register the DataFrame as a temp table: `df.createOrReplaceTempView("mementos")`
  - Use Spark SQL + Iceberg to take the temp table and `MERGE INTO` a destination MegaTable.
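For reference, a minimal sketch of what the Memento POJO could look like; all field names here are placeholders that would need aligning with the CDX/Solr naming, and the getters/setters Spark's bean encoder needs are mostly omitted:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class Memento implements Serializable {
    // Core properties, named consistently with CDX etc. (placeholder names).
    private String url;
    private String timestamp;
    private String contentType;
    private int statusCode;
    private String digest;

    // Provenance: where the record came from.
    private String sourceFilePath;
    private long sourceFileOffset;

    // Reference to the project's existing record wrapper; transient so it is
    // never serialised into the DataFrame, and only valid before re-partitioning.
    private transient WritableArchiveRecord record;

    // Arbitrary extracted metadata fields.
    private Map<String, String> metadata = new HashMap<>();

    // Getters/setters omitted for brevity, e.g.:
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
```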
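Similarly, a sketch of the fluent 'enrich' style and the Iceberg step. `WebArchiveLoader`, `WarcIndexerEnricherFunction` and the `archive.mega_table` destination are illustrative names, and the `MERGE INTO` assumes the Iceberg runtime and SQL extensions have been configured on the Spark session:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EnrichAndMergeSketch {
    public static void run(SparkSession spark, JavaSparkContext jsc) {
        // Fluent enrichment: load(...) would return a thin wrapper whose
        // enrich(...) applies the function via rdd.mapPartitions(...) and
        // hands back plain Memento POJOs.
        JavaRDD<Memento> enriched = WebArchiveLoader
                .load("/data/warcs/*.warc.gz", jsc)
                .enrich(new WarcIndexerEnricherFunction());

        // Register the batch as a temp view...
        Dataset<Row> df = spark.createDataFrame(enriched, Memento.class);
        df.createOrReplaceTempView("mementos");

        // ...and upsert it into a long-lived destination table via Iceberg.
        spark.sql(
            "MERGE INTO archive.mega_table t " +
            "USING mementos s " +
            "ON t.url = s.url AND t.timestamp = s.timestamp " +
            "WHEN MATCHED THEN UPDATE SET * " +
            "WHEN NOT MATCHED THEN INSERT *");
    }
}
```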
The POJO approach does mean we end up with a very wide schema with a lot of nulls if not much analysis has been run. Supporting a more dynamic schema would be nice, but then again a fixed schema aligns well with the Solr schema.