Description
Hi, we are using jwarc for parsing WARC files from the Common Crawl. For one of the response records, response.http().contentType() fails with the following error message:
java.lang.IllegalArgumentException: parse error at position 12: "application<-- HERE -->/octet-stream"
at org.netpreserve.jwarc.MediaType.parse(MediaType.java:399) ~[__app__.jar:3.5.5]
at org.netpreserve.jwarc.MediaType.parseLeniently(MediaType.java:271) ~[__app__.jar:3.5.5]
at java.util.Optional.map(Optional.java:265) ~[?:?]
at org.netpreserve.jwarc.Message.contentType(Message.java:61) ~[__app__.jar:3.5.5]
...
The record for which this happens can be found in the CC-MAIN-2021-17 crawl, in the file CC-MAIN-2021-17/segments/1618038056869.3/warc/CC-MAIN-20210410105831-20210410135831-00606.warc.gz and with WARC-Record-ID <urn:uuid:9427fcbc-ced9-406e-aebd-6420f4995cea>. The record looks like this:
WARC/1.0
WARC-Type: response
WARC-Date: 2021-04-10T12:40:15Z
WARC-Record-ID: <urn:uuid:9427fcbc-ced9-406e-aebd-6420f4995cea>
Content-Length: 106769
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:8714f168-71f6-481a-8291-f4c61cb4cbfd>
WARC-Concurrent-To: <urn:uuid:788ca445-2455-4fa5-acec-bfae5ad2d7a3>
WARC-IP-Address: 116.203.241.55
WARC-Target-URI: https://www.euzatebe.rs/project-tenders/file/45
WARC-Payload-Digest: sha1:2ZX7RK45JQQM4H2E5MASGOFN37F6EEJH
WARC-Block-Digest: sha1:V7XVIFRYLEZIGDKRA4DGR7HJ6UNHDTZY
WARC-Identified-Payload-Type: application/x-tika-ooxml

HTTP/1.1 200 OK
Date: Sat, 10 Apr 2021 12:40:14 GMT
Server: Apache/2.4.29 (Ubuntu)
Set-Cookie: ci_session=p7dod0t3n99fktqe5mi3tdb6nkfiko3d; expires=Sat, 10-Apr-2021 13:40:14 GMT; Max-Age=3600; path=/; HttpOnly
Expires: 0
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Content-Disposition: attachment; filename="Ad Junior NKE Higher Education.docx"
Content-Transfer-Encoding: binary
Content-Length: 106142
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET,POST,OPTIONS,DELETE,PUT
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: "application/octet-stream"
...
I suspect the error occurs because the value of the HTTP Content-Type header is wrapped in double quotes, which MediaType::parseLeniently does not seem to handle. This is the only record in this WARC file that triggers the error, and also the only one with a quoted Content-Type header. From what I can find, quoted values appear to be valid in HTTP headers, though I'm not sure. Could you check whether this is indeed the cause and, if so, how it could be fixed? Thanks!
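In case it helps with reproducing or working around this, here is a minimal sketch of a pre-processing step that removes one pair of enclosing double quotes from a raw header value before it is handed to a media type parser. The class QuotedContentType and the method stripOuterQuotes are hypothetical names made up for this example and are not part of jwarc. I also have not verified that feeding the cleaned value to MediaType.parseLeniently avoids the exception, so treat that as an assumption.

```java
// Hypothetical helper, not part of jwarc: strips one pair of double quotes
// that encloses an entire header value, e.g. "application/octet-stream".
final class QuotedContentType {
    private QuotedContentType() {}

    static String stripOuterQuotes(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        String v = headerValue.trim();
        // Strip only when the quote at index 0 is closed by the very last
        // character, so values with quoted parameters such as
        // text/html; charset="utf-8" are left untouched.
        boolean wrapped = v.length() >= 2
                && v.charAt(0) == '"'
                && v.indexOf('"', 1) == v.length() - 1;
        return wrapped ? v.substring(1, v.length() - 1).trim() : v;
    }

    public static void main(String[] args) {
        // The header value from the record above.
        System.out.println(stripOuterQuotes("\"application/octet-stream\""));
        // prints: application/octet-stream

        // A value whose only quotes belong to a parameter is returned as-is.
        System.out.println(stripOuterQuotes("text/html; charset=\"utf-8\""));
        // prints: text/html; charset="utf-8"
    }
}
```

The cleaned string could then be passed to the parser in place of calling contentType() directly. Whether the library itself should accept such values leniently is a separate question.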