This repository was archived by the owner on Nov 20, 2025. It is now read-only.

WARC fields not populated in Kafka crawl log #90

@anjackson

In #89 it was noted that the warc_filename and warc_offset fields appear to be null when they should not be.

Note the fields are fine in the actual file-based crawl log:

2023-08-29T09:45:48.367Z   301          0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}

So this is something to do with the Kafka version. We use

    protected byte[] buildMessage(CrawlURI curi) {
        JSONObject jo = CrawlLogJsonBuilder.buildJson(curi, getExtraFields(), getServerCache());
        try {
            return jo.toString().getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

Which calls

https://github.com/internetarchive/heritrix3/blob/8563f491a5b355c39a89f51b17c76aaa84752a8a/contrib/src/main/java/org/archive/modules/postprocessor/CrawlLogJsonBuilder.java#L15

So it should be working, but perhaps this is just an order-of-operations problem?

Yes, it looks like the Kafka crawl log message is written before the WARC record, for some complicated reasons that need picking apart.

https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
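To illustrate the suspected order-of-operations problem, here is a minimal, self-contained sketch. It does not use the real Heritrix or Kafka classes: `ChainOrderSketch`, `WARC_WRITER`, `kafkaLogger` and `runChain` are hypothetical stand-ins, and the "URI" is just a map of extra fields. It only shows that a logging step which runs before the step that sets the WARC fields will serialise them as null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical model of a processor chain; not the real Heritrix API.
class ChainOrderSketch {

    // Stand-in for the WARC writer: records where the record was stored.
    // The file name and offset are made-up illustration values.
    static final Consumer<Map<String, Object>> WARC_WRITER = uri -> {
        uri.put("warcFilename", "example.warc.gz");
        uri.put("warcFileOffset", 1234L);
    };

    // Stand-in for the Kafka crawl logger: snapshots the fields it can see
    // at the moment it runs and appends the "message" to the given list.
    static Consumer<Map<String, Object>> kafkaLogger(List<String> sent) {
        return uri -> sent.add(
                "warc_filename=" + uri.get("warcFilename")
                + ", warc_offset=" + uri.get("warcFileOffset"));
    }

    // Runs the two steps in the requested order and returns the message
    // that the logger step produced.
    static String runChain(boolean loggerFirst) {
        Map<String, Object> uri = new HashMap<>();
        List<String> sent = new ArrayList<>();
        List<Consumer<Map<String, Object>>> chain = loggerFirst
                ? List.of(kafkaLogger(sent), WARC_WRITER)
                : List.of(WARC_WRITER, kafkaLogger(sent));
        for (Consumer<Map<String, Object>> step : chain) {
            step.accept(uri);
        }
        return sent.get(0);
    }

    public static void main(String[] args) {
        System.out.println(runChain(true));   // logger before writer
        System.out.println(runChain(false));  // writer before logger
    }
}
```

With the logger first, the message contains null for both fields; with the writer first, both are populated. If the real chain behaves the same way, the fix would be a matter of where the Kafka logger sits relative to the WARC writer, though the reasons for the current ordering still need checking.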
