This repository was archived by the owner on Nov 20, 2025. It is now read-only.
WARC fields not populated in Kafka crawl log #90
Open
Description
In #89 it was noted that the warc_filename and warc_offset fields appear to be null when they should not be.
Note the fields are fine in the actual file-based crawl log:
2023-08-29T09:45:48.367Z 301 0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}
So this is something to do with the Kafka version. We use ukwa-heritrix/src/main/java/uk/bl/wap/crawler/postprocessor/KafkaKeyedCrawlLogFeed.java (lines 141 to 148 in 0c21b27):

    protected byte[] buildMessage(CrawlURI curi) {
        JSONObject jo = CrawlLogJsonBuilder.buildJson(curi, getExtraFields(), getServerCache());
        try {
            return jo.toString().getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

Which calls CrawlLogJsonBuilder.buildJson.
So should be working, but perhaps this is just an order-of-operations problem?
Yes, it looks like the Kafka crawl log message is written before the WARC record, for some complicated reasons that need picking apart.
https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
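To make the order-of-operations hypothesis concrete, here is a minimal, self-contained sketch. It does not use Heritrix or Kafka; `FakeCuri`, `FakeWarcWriter`, `FakeLogFeed` and `Step` are hypothetical stand-ins, and the assumption is only that the log feed serialises whatever WARC fields are present on the URI at the moment it runs. If the feed runs before the writer, the fields come out as null.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProcessorOrderSketch {

    // Hypothetical stand-in for a CrawlURI: a URL plus a bag of extra info.
    static class FakeCuri {
        final String url;
        final Map<String, Object> extraInfo = new LinkedHashMap<>();
        FakeCuri(String url) { this.url = url; }
    }

    // Hypothetical stand-in for one processor in a chain.
    interface Step { void process(FakeCuri curi); }

    // Stand-in for the WARC writer: records where the response was stored.
    static class FakeWarcWriter implements Step {
        public void process(FakeCuri curi) {
            curi.extraInfo.put("warcFilename", "example.warc.gz");
            curi.extraInfo.put("warcFileOffset", 12345L);
        }
    }

    // Stand-in for the crawl log feed: snapshots the fields at the time it runs.
    static class FakeLogFeed implements Step {
        String lastMessage;
        public void process(FakeCuri curi) {
            lastMessage = "{\"url\":\"" + curi.url + "\""
                    + ",\"warc_filename\":" + quote(curi.extraInfo.get("warcFilename"))
                    + ",\"warc_offset\":" + curi.extraInfo.get("warcFileOffset")
                    + "}";
        }
    }

    // Render a value as a JSON string, or the literal null when it is absent.
    static String quote(Object o) {
        return o == null ? "null" : "\"" + o + "\"";
    }

    // Run both steps in the chosen order and return the message the feed built.
    static String run(boolean feedBeforeWarcWriter) {
        FakeCuri curi = new FakeCuri("https://example.org/");
        FakeWarcWriter writer = new FakeWarcWriter();
        FakeLogFeed feed = new FakeLogFeed();
        List<Step> chain = feedBeforeWarcWriter
                ? Arrays.<Step>asList(feed, writer)
                : Arrays.<Step>asList(writer, feed);
        for (Step step : chain) {
            step.process(curi);
        }
        return feed.lastMessage;
    }

    public static void main(String[] args) {
        System.out.println("feed first:   " + run(true));
        System.out.println("writer first: " + run(false));
    }
}
```

If the real cause matches this sketch, the fix would be a matter of where the Kafka feed sits relative to the WARC writer in crawler-beans.cxml, though the "complicated reasons" for the current order would need checking first.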