Description
WARC parsing sometimes returns truncated records.
This might be because the parser keeps looking for newlines (reading one line at a time) even while it is parsing the content body, so the truncation might be triggered when a \r\n\r\n sequence occurs inside the body of the record.
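To make the suspected failure mode concrete, here is a minimal sketch that does not use node-warc at all. It builds a synthetic record in memory and contrasts two hypothetical readers: one that treats the first \r\n\r\n after the header as the end of the record, and one that reads exactly Content-Length bytes. The helper names (`buildRecord`, `splitHeader`, `readBodyByDelimiter`, `readBodyByContentLength`) are made up for illustration and are not taken from the node-warc source, so this only shows how such a bug could arise, not how the parser is actually implemented.

```javascript
// Hypothetical illustration only: this is NOT node-warc's actual code.
const CRLF2 = Buffer.from('\r\n\r\n');

// Build a minimal synthetic WARC record (header, blank line, block, CRLF CRLF).
function buildRecord(body) {
  const block = Buffer.from(body, 'utf-8');
  const header = Buffer.from(
    'WARC/1.0\r\n' +
      'WARC-Type: resource\r\n' +
      `Content-Length: ${block.length}\r\n` +
      '\r\n',
    'utf-8'
  );
  return Buffer.concat([header, block, CRLF2]);
}

// Split the WARC header from everything that follows the first blank line.
function splitHeader(raw) {
  const headerEnd = raw.indexOf(CRLF2);
  return {
    header: raw.subarray(0, headerEnd).toString('utf-8'),
    rest: raw.subarray(headerEnd + CRLF2.length),
  };
}

// Delimiter-based reader: stops at the first \r\n\r\n it sees in the body.
function readBodyByDelimiter(raw) {
  const { rest } = splitHeader(raw);
  const end = rest.indexOf(CRLF2);
  return rest.subarray(0, end === -1 ? rest.length : end);
}

// Length-based reader: trusts the Content-Length header instead.
function readBodyByContentLength(raw) {
  const { header, rest } = splitHeader(raw);
  const match = /^Content-Length:\s*(\d+)\s*$/im.exec(header);
  if (!match) throw new Error('missing Content-Length header');
  return rest.subarray(0, Number(match[1]));
}

// A body that legitimately contains a blank line.
const body = 'first paragraph\r\n\r\nsecond paragraph';
const raw = buildRecord(body);

const truncated = readBodyByDelimiter(raw).toString('utf-8');
const full = readBodyByContentLength(raw).toString('utf-8');

console.log(JSON.stringify(truncated));
console.log(JSON.stringify(full));
```

In this sketch the delimiter-based reader returns only the text before the blank line, while the length-based reader returns the whole body, which matches the kind of truncation described above.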
The issue can be seen by running:
const { AutoWARCParser } = require('node-warc');

(async () => {
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();
With these example files:
test1.warc.gz (the last couple of bytes are cut off)
test2.warc.gz (most of the file is cut off after the initial comment)
For comparison, the warcio version prints the full record:
from warcio import ArchiveIterator

for record in ArchiveIterator(open('./test1.warc.gz', 'rb')):
    print(record.content_stream().read().decode('utf-8'))