Motivation: Many users are primarily interested in listing and processing the captured resources within a WARC file. The current WarcReader API makes that a bit tedious, as you have to deal with the differences between resource, response and revisit records. If you need details from the request or metadata records, you need to manually associate them.
Use cases:
- indexing (CDX, full text)
- exporting resources
- batch processing for analysis and statistics
Idea: Provide a high-level reader that is based around capture events. The response/resource/revisit record is considered the primary record of the event, and other concurrent records (typically request and metadata) are secondary records that are automatically grouped together with it. Secondary records are typically small, so we buffer them in memory.
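To make the grouping rule concrete, here is a minimal sketch of how sequential records could be folded into capture events. It does not use jwarc: `Rec`, `Capture` and `group` are hypothetical stand-ins invented for illustration, and it assumes each secondary record follows its primary record and points back at it through a concurrent-to ID.

```java
import java.util.ArrayList;
import java.util.List;

class CaptureGroupingSketch {
    // Hypothetical stand-in for a parsed WARC record header; not a jwarc type.
    record Rec(String type, String id, String concurrentTo) {
        boolean isPrimary() {
            return type.equals("response") || type.equals("resource") || type.equals("revisit");
        }
    }

    // One capture event: the primary record plus its buffered secondary records.
    record Capture(Rec primary, List<Rec> secondary) {}

    // Single sequential pass: a primary record starts a new capture, and later
    // records that are concurrent to it are attached as secondary records.
    static List<Capture> group(List<Rec> records) {
        List<Capture> captures = new ArrayList<>();
        Capture current = null;
        for (Rec rec : records) {
            if (rec.isPrimary()) {
                current = new Capture(rec, new ArrayList<>());
                captures.add(current);
            } else if (current != null && current.primary().id().equals(rec.concurrentTo())) {
                current.secondary().add(rec);
            }
            // Anything else (warcinfo, records concurrent to some other record)
            // is skipped in this sketch.
        }
        return captures;
    }
}
```

Because only the current capture is tracked, a secondary record that refers to an earlier primary record is dropped. That is the non-sequential limitation listed under the compromises.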
Sketch of possible design:
try (WarcCaptureReader reader = new WarcCaptureReader(Path.of("example.warc.gz"))) {
    for (WarcCapture capture : reader) {
        capture.target(); // String
        capture.date(); // Instant

        // do something with the payload
        // tempted to name open* to encourage closing them
        // and I'm thinking of reusing an interface subset of the WarcCapture API
        // for accessing remote indexed collections, in which case this may actually open a new connection
        try (InputStream stream = capture.openStream()) {
        }

        // convenience methods for commonly used information
        capture.contentType(); // Optional<MediaType> (change name?)
        capture.status(); // Optional<Integer>
        capture.method(); // Optional<String>

        // access to underlying records
        capture.records(); // List<WarcCaptureRecord>
        capture.record(); // WarcCaptureRecord (response/resource/revisit record) (change name?)
        capture.request(); // Optional<WarcRequest>
        capture.metadata(); // Optional<WarcMetadata> (first warc-fields metadata record)
    }
}
Compromises/alternatives:
- Optional
  - I somewhat regret using it in jwarc; in practice I've found it more often awkward than helpful
  - Dropping Optional would be pretty inconsistent with the rest of jwarc
  - Eventually Java's going to get ? and ! types, but that's still years away
  - Alternative: JSpecify annotations?
- Won't handle non-sequential concurrent records
  - Solving this in the general case would need two passes or building some kind of index, which doesn't seem worthwhile given that all popular WARC producers output them sequentially.
- Buffers secondary records in memory
  - Typically they're small, although Heritrix metadata records could have a lot of outlinks
  - Maybe have a limit: buffer the first part and, if the channel is seekable, backtrack to read the rest?
- Ignores conversion records for now
  - I haven't seen these used in the wild. Probably wouldn't be sequential anyway?
- Reading secondary records may invalidate the primary record's payload
  - The primary payload can be large, so we shouldn't autobuffer it
  - Maybe have an API to allow intentional buffering
- Iterable API
  - it's very convenient and readable, but it has to wrap IOException in a runtime exception
  - an iterator is supposed to start from the beginning
  - alternatives?
    - for (WarcCapture capture = reader.read(); capture != null; capture = reader.read()) {
    - for (WarcCapture capture; (capture = reader.next().orElse(null)) != null;) {
    - Stream? But the exception situation there is even worse
    - forEach(capture -> { ... }); not too bad, but what if the user wants to throw their own checked exception?
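On the last question, one possible direction is a forEach whose callback type is generic in the exception it throws, so a caller's checked exception propagates without wrapping. The following is only a sketch under assumptions: `CaptureConsumer` and `Reader` are hypothetical names, and the reader iterates an in-memory list rather than a WARC file.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

class CheckedForEachSketch {
    // Hypothetical callback type: like Consumer, but may throw a checked exception E.
    @FunctionalInterface
    interface CaptureConsumer<T, E extends Exception> {
        void accept(T capture) throws E;
    }

    // Hypothetical reader over an in-memory list, standing in for a capture reader.
    static class Reader<T> {
        private final Iterator<T> source;

        Reader(List<T> items) {
            this.source = items.iterator();
        }

        // Returns null at end of input, like the read() loop alternative.
        T read() throws IOException {
            return source.hasNext() ? source.next() : null;
        }

        // Propagates both the reader's IOException and the caller's own exception type.
        <E extends Exception> void forEach(CaptureConsumer<? super T, E> action)
                throws IOException, E {
            for (T item = read(); item != null; item = read()) {
                action.accept(item);
            }
        }
    }
}
```

When the lambda throws nothing checked, the compiler infers E as RuntimeException, so the caller only has to handle IOException. The cost is a non-standard functional interface, and a callback that throws two unrelated checked exception types would still need a common supertype.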