WARC Parsing sometimes results in truncated records. #34

@ikreymer

The WARC parsing sometimes results in records being truncated.

This might be because the parser keeps looking for newlines and reads one line at a time even while parsing the content body, so the truncation might happen whenever a \r\n\r\n sequence occurs in the body of a record.
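To illustrate the suspected failure mode, here is a minimal sketch on a synthetic record. It is not node-warc's actual code: the record contents and the two helper functions (`readBodyByDelimiter`, `readBodyByContentLength`) are hypothetical and only contrast scanning for a \r\n\r\n delimiter with slicing the body by the declared Content-Length.

```javascript
// Hypothetical illustration only; this is not node-warc's implementation.
const CRLF2 = Buffer.from('\r\n\r\n');

// Synthetic record whose body itself contains a \r\n\r\n sequence.
const body = Buffer.from('first part\r\n\r\nsecond part');
const header = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: resource\r\n' +
  `Content-Length: ${body.length}\r\n` +
  '\r\n'
);
// A WARC record is followed by two CRLFs.
const record = Buffer.concat([header, body, CRLF2]);

// Delimiter-based read: assumes the next \r\n\r\n after the headers ends the record.
function readBodyByDelimiter(buf) {
  const headerEnd = buf.indexOf(CRLF2) + CRLF2.length;
  const bodyEnd = buf.indexOf(CRLF2, headerEnd);
  return buf.subarray(headerEnd, bodyEnd);
}

// Length-based read: takes exactly Content-Length bytes after the headers.
function readBodyByContentLength(buf) {
  const headerEnd = buf.indexOf(CRLF2) + CRLF2.length;
  const headerText = buf.subarray(0, headerEnd).toString('utf-8');
  const match = /^Content-Length:\s*(\d+)\s*$/im.exec(headerText);
  if (!match) throw new Error('missing Content-Length header');
  return buf.subarray(headerEnd, headerEnd + Number(match[1]));
}

// The delimiter-based read stops at the \r\n\r\n inside the body.
console.log(JSON.stringify(readBodyByDelimiter(record).toString('utf-8')));
// The length-based read returns the whole body.
console.log(JSON.stringify(readBodyByContentLength(record).toString('utf-8')));
```

If the parser behaves like the first helper, any body containing a blank line would be cut short, which would be consistent with the truncation described here.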

The issue can be seen by running:

const { AutoWARCParser } = require('node-warc');

(async () => {
  // Iterate over every record in the archive and print its content body.
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();

With these example files:
test1.warc.gz
(the last couple of bytes are cut off)

test2.warc.gz
(most of the file is cut off after the initial comment)

For comparison, the warcio version prints the full record:

from warcio import ArchiveIterator

# Open the archive in binary mode and print each record's full content.
with open('./test1.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        print(record.content_stream().read().decode('utf-8'))
