WARC Parsing sometimes results in truncated records.

Parsing a WARC file with `AutoWARCParser` sometimes yields records whose content is truncated.

This might be because the parser keeps reading one line at a time and looking for newlines even while it is consuming the content body. If so, a `\r\n\r\n` sequence inside the body of a record could be mistaken for the end of that record.
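To illustrate the suspected failure mode, here is a minimal sketch of delimiting a record body by its `Content-Length` header rather than by scanning for a blank line. It is not node-warc's implementation: `readRecord` is a hypothetical helper, and it assumes a single uncompressed record held entirely in a `Buffer`.

```javascript
// Hypothetical helper, not part of node-warc: reads one uncompressed WARC
// record from a Buffer and uses Content-Length to find the end of the body.
function readRecord(buf, offset = 0) {
  // The first blank line ends the WARC header block.
  const headerEnd = buf.indexOf('\r\n\r\n', offset);
  if (headerEnd === -1) return null;

  const headerText = buf.toString('utf-8', offset, headerEnd);
  const match = /^Content-Length:[ \t]*(\d+)/im.exec(headerText);
  if (!match) throw new Error('record is missing Content-Length');

  // From here on, count bytes instead of searching for newlines, so a
  // "\r\n\r\n" inside the body cannot end the record early.
  const bodyStart = headerEnd + 4;
  const bodyEnd = bodyStart + parseInt(match[1], 10);
  if (bodyEnd > buf.length) throw new Error('record body is truncated');

  return {
    headers: headerText,
    content: buf.subarray(bodyStart, bodyEnd),
    next: bodyEnd + 4 // skip the "\r\n\r\n" that follows every record
  };
}

// A record whose body itself contains a blank line.
const body = 'line one\r\n\r\nline two';
const record = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: resource\r\n' +
  `Content-Length: ${Buffer.byteLength(body)}\r\n` +
  '\r\n' +
  body +
  '\r\n\r\n'
);

console.log(JSON.stringify(readRecord(record).content.toString('utf-8')));
```

A parser that instead stopped at the next `\r\n\r\n` would return only `line one` for this record, which matches the kind of truncation described above.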


The issue can be seen by running:

```
const { AutoWARCParser } = require('node-warc');

(async () => {
  // Print the content of every record; truncated records show up as cut-off output.
  for await (const record of new AutoWARCParser('test1.warc.gz')) {
    console.log(record.content.toString('utf-8'));
  }
})();
```

With these example files:

[test1.warc.gz](https://github.com/N0taN3rd/node-warc/files/3418015/test1.warc.gz)
(the last couple of bytes are cut off)

[test2.warc.gz](https://github.com/N0taN3rd/node-warc/files/3418010/test2.warc.gz)
(most of the file is cut off after the initial comment)

For comparison, the warcio version prints the full record:

```
from warcio import ArchiveIterator

# Open in binary mode; ArchiveIterator handles the gzip decompression.
with open('./test1.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        print(record.content_stream().read().decode('utf-8'))
```


