WARC fields not populated in Kafka crawl log

In #89 it was noted that the `warc_filename` and `warc_offset` fields appear to be `null` when they should not be.

Note the fields are fine in the actual file-based crawl log:

```
2023-08-29T09:45:48.367Z   301          0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}
```
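For checking other lines in the file-based log, a rough sketch like the following could pull the two WARC fields out of the trailing JSON. The class and method names are made up for illustration, and it uses plain regular expressions rather than a real JSON parser, so treat it as a quick diagnostic only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ukwa-heritrix or Heritrix3.
public class CrawlLogWarcFields {
    // Field names as they appear in the extra-info JSON of the log line above.
    private static final Pattern WARC_FILENAME =
            Pattern.compile("\"warcFilename\":\"([^\"]*)\"");
    private static final Pattern WARC_OFFSET =
            Pattern.compile("\"warcFileOffset\":(\\d+)");

    /** Returns the WARC filename if the log line carries one. */
    static Optional<String> warcFilename(String logLine) {
        Matcher m = WARC_FILENAME.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns the WARC offset if the log line carries one. */
    static Optional<Long> warcOffset(String logLine) {
        Matcher m = WARC_OFFSET.matcher(logLine);
        return m.find() ? Optional.of(Long.parseLong(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened, made-up log line in the same shape as the example above.
        String line = "2023-08-29T09:45:48.367Z 301 0 https://example.org/ - "
                + "{\"contentSize\":308,\"warcFilename\":\"example.warc.gz\","
                + "\"warcFileOffset\":918370261}";
        System.out.println(warcFilename(line).orElse("(missing)"));
        System.out.println(warcOffset(line).map(String::valueOf).orElse("(missing)"));
    }
}
```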

So this is something specific to the Kafka version of the crawl log. We use

https://github.com/ukwa/ukwa-heritrix/blob/0c21b2756c823697839013254a66f06f80cfea3b/src/main/java/uk/bl/wap/crawler/postprocessor/KafkaKeyedCrawlLogFeed.java#L141-L148

Which calls 

https://github.com/internetarchive/heritrix3/blob/8563f491a5b355c39a89f51b17c76aaa84752a8a/contrib/src/main/java/org/archive/modules/postprocessor/CrawlLogJsonBuilder.java#L15

So it should be working, but perhaps this is just an order-of-operations problem?

Yes, it looks like the Kafka crawl log message is written before the WARC record, for some complicated reasons that need picking apart.

https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
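The suspected ordering problem can be illustrated with a small self-contained sketch. Nothing here is Heritrix code: the class, the map standing in for the per-URI extra info, and the field values are all assumptions made for illustration. It only shows that a log feed which runs before the WARC writer has nothing to read yet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a processor chain; none of these names are Heritrix APIs.
public class ProcessorOrderSketch {

    /** Stand-in for the WARC writer: records where the response was stored. */
    static final Consumer<Map<String, Object>> WARC_WRITER = extraInfo -> {
        extraInfo.put("warcFilename", "example.warc.gz"); // made-up value
        extraInfo.put("warcFileOffset", 12345L);          // made-up value
    };

    /** Runs the two processors in the given order and returns the emitted message. */
    static String runChain(boolean feedBeforeWriter) {
        Map<String, Object> extraInfo = new LinkedHashMap<>();
        List<String> emitted = new ArrayList<>();
        // Stand-in for the Kafka feed: snapshots the fields at the moment it runs.
        Consumer<Map<String, Object>> logFeed = info -> emitted.add(
                "warc_filename=" + info.get("warcFilename")
                + " warc_offset=" + info.get("warcFileOffset"));
        List<Consumer<Map<String, Object>>> chain = feedBeforeWriter
                ? List.of(logFeed, WARC_WRITER)
                : List.of(WARC_WRITER, logFeed);
        for (Consumer<Map<String, Object>> processor : chain) {
            processor.accept(extraInfo);
        }
        return emitted.get(0);
    }

    public static void main(String[] args) {
        System.out.println(runChain(true));  // feed first: fields are still unset
        System.out.println(runChain(false)); // writer first: fields are populated
    }
}
```

If the real chain behaves like the first case, moving the feed after the WARC writer (or deferring the send) would be the obvious thing to try, subject to whatever the "complicated reasons" for the current order turn out to be.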



	protected byte[] buildMessage(CrawlURI curi) {
		JSONObject jo = CrawlLogJsonBuilder.buildJson(curi, getExtraFields(), getServerCache());
		try {
			return jo.toString().getBytes("UTF-8");
		} catch (UnsupportedEncodingException e) {
			throw new RuntimeException(e);
		}
	}

WARC fields not populated in Kafka crawl log #90

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

WARC fields not populated in Kafka crawl log #90

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions