Conversion error for ARC files with just 1 http header #96

@thomasegense


When the filtering tool is run on ARC files, the output is a WARC file. For ARC records with just one HTTP header line, however, the conversion fails to produce a valid WARC record: the output contains only the new WARC header, and the HTTP header is missing.

This is the start of an ARC record with only a single HTTP header line:

http://wolfgrass.myetang.com:80/ 202.96.96.20 20010926085547 no-type 5913
<SCRIPT LANGUAGE="JavaScript" SRC="/-fs0/sys/pop-up.js"></SCRIPT>

<html>

And this is the start of the record in the converted WARC file (note that there is no HTTP header):

WARC/1.1
Content-Length: 37069
Content-Type: application/http;msgtype=response
WARC-Date: 2001-09-28T13:38:05Z
WARC-IP-Address: 202.96.96.20
WARC-Target-URI: http://bigmouthnet.myetang.com:80/
WARC-Type: response

<SCRIPT LANGUAGE="JavaScript" SRC="/-fs0/sys/pop-up.js"></SCRIPT>

<html>

I can give you an ARC file so you can reproduce it (/netarkivet/042g/fildir/DENMARK-EXTRACTED-2001-part-00001117.arc.gz).
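
To make the failing case easier to pin down, here is a minimal sketch of how such records could be detected before conversion. It is not the filtering tool's code. The function names `starts_with_http_status_line` and `suggested_warc_type` are hypothetical, and falling back to a WARC `resource` record when no HTTP status line is present is only one possible handling, assumed here for illustration.

```python
import re

# Hypothetical helpers for illustration only; not part of the filtering tool.
# An HTTP response is expected to begin with a status line such as "HTTP/1.1 200 OK".
_STATUS_LINE = re.compile(rb"HTTP/\d(\.\d)? \d{3}")


def starts_with_http_status_line(payload: bytes) -> bool:
    """Return True if the ARC record body begins with an HTTP status line."""
    return _STATUS_LINE.match(payload) is not None


def suggested_warc_type(payload: bytes) -> str:
    """Suggest a WARC-Type for the converted record.

    Assumption: a body without an HTTP status line is better written as a
    'resource' record than as a 'response' record that claims to carry an
    application/http message.
    """
    if starts_with_http_status_line(payload):
        return "response"
    return "resource"


if __name__ == "__main__":
    # Body taken from the ARC record above: it starts directly with markup.
    body = b'<SCRIPT LANGUAGE="JavaScript" SRC="/-fs0/sys/pop-up.js"></SCRIPT>\n\n<html>'
    print(starts_with_http_status_line(body))
    print(suggested_warc_type(body))
```

With a check like this, the converter could either write a different record type or log the record for inspection, instead of emitting a `response` record whose body is not an HTTP message.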
