Parse error for HTTP content type when Content-Type header contains quotes #94

@gijshendriksen

Hi, we are using jwarc for parsing WARC files from the Common Crawl. For one of the response records, response.http().contentType() fails with the following error message:

java.lang.IllegalArgumentException: parse error at position 12: "application<-- HERE -->/octet-stream"
at org.netpreserve.jwarc.MediaType.parse(MediaType.java:399) ~[__app__.jar:3.5.5]
at org.netpreserve.jwarc.MediaType.parseLeniently(MediaType.java:271) ~[__app__.jar:3.5.5]
at java.util.Optional.map(Optional.java:265) ~[?:?]
at org.netpreserve.jwarc.Message.contentType(Message.java:61) ~[__app__.jar:3.5.5]
    ...

The record for which this happens can be found in the CC-MAIN-2021-17 crawl, in the file CC-MAIN-2021-17/segments/1618038056869.3/warc/CC-MAIN-20210410105831-20210410135831-00606.warc.gz and with WARC-Record-ID <urn:uuid:9427fcbc-ced9-406e-aebd-6420f4995cea>. The record looks like this:

WARC/1.0
WARC-Type: response
WARC-Date: 2021-04-10T12:40:15Z
WARC-Record-ID: <urn:uuid:9427fcbc-ced9-406e-aebd-6420f4995cea>
Content-Length: 106769
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:8714f168-71f6-481a-8291-f4c61cb4cbfd>
WARC-Concurrent-To: <urn:uuid:788ca445-2455-4fa5-acec-bfae5ad2d7a3>
WARC-IP-Address: 116.203.241.55
WARC-Target-URI: https://www.euzatebe.rs/project-tenders/file/45
WARC-Payload-Digest: sha1:2ZX7RK45JQQM4H2E5MASGOFN37F6EEJH
WARC-Block-Digest: sha1:V7XVIFRYLEZIGDKRA4DGR7HJ6UNHDTZY
WARC-Identified-Payload-Type: application/x-tika-ooxml

HTTP/1.1 200 OK
Date: Sat, 10 Apr 2021 12:40:14 GMT
Server: Apache/2.4.29 (Ubuntu)
Set-Cookie: ci_session=p7dod0t3n99fktqe5mi3tdb6nkfiko3d; expires=Sat, 10-Apr-2021 13:40:14 GMT; Max-Age=3600; path=/; HttpOnly
Expires: 0
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Content-Disposition: attachment; filename="Ad Junior NKE Higher Education.docx"
Content-Transfer-Encoding: binary
Content-Length: 106142
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET,POST,OPTIONS,DELETE,PUT
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: "application/octet-stream"

...

I suspect the error occurs because the value of the HTTP Content-Type header is wrapped in quotes, which MediaType::parseLeniently does not seem to handle. This is the only record in this WARC file for which parsing fails, and also the only one with a quoted Content-Type header. From what I can find, quoted values seem to be valid in HTTP headers, though I'm not sure. Could you check whether this is indeed the cause and, if so, how it could be fixed? Thanks!
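
In the meantime, a possible workaround on the caller's side is to strip the surrounding quotes from the raw header value before parsing it. The following is only a minimal sketch under that assumption: ContentTypeWorkaround and stripSurroundingQuotes are hypothetical names, not part of jwarc, and the helper only handles the case where the whole value is wrapped in a single pair of quotes.

```java
// Hypothetical helper, not part of jwarc: normalizes a raw Content-Type
// header value whose entire content is wrapped in one pair of double quotes.
final class ContentTypeWorkaround {
    private ContentTypeWorkaround() {}

    static String stripSurroundingQuotes(String value) {
        String trimmed = value.trim();
        if (trimmed.length() >= 2 && trimmed.startsWith("\"") && trimmed.endsWith("\"")) {
            String inner = trimmed.substring(1, trimmed.length() - 1);
            // Only unwrap when the quotes enclose the whole value. A value such as
            // "text/html"; charset="utf-8" has inner quotes and is left untouched.
            if (inner.indexOf('"') < 0) {
                return inner.trim();
            }
        }
        return trimmed;
    }
}
```

The cleaned string could then be passed to MediaType.parse instead of calling contentType() directly. I assume the raw value can be read from the HTTP message's headers, but I have not checked that against 3.5.5.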
