Using the advice in Issue #453, I successfully excluded unwanted PDF documents from being fetched and written to the WARC. However, this method seems to produce misleading reports and stats.
mimetype-report
The report lists PDF and ZIP files with URL counts and bytes, although both types are excluded:
[#urls] [#bytes] [mime-types]
6556 234271851 text/html
4193 8659344 application/pdf
42 1829002 image/jpeg
26 508206 text/css
23 239633 image/png
15 811627 application/javascript
14 1462995 application/vnd.openxmlformats-officedocument.wordprocessingml.document
9 18531 application/zip
7 1149664 image/svg+xml
4 49430 image/gif
2 97178 application/font-woff2
2 241457 application/vnd.ms-fontobject
2 240859 application/x-font-ttf
2 124253 application/x-font-woff
2 20934 text/xml
2 4400 unknown
1 212071 application/vnd.ms-excel
1 56 text/dns
1 2419 text/plain
Count of Content-Type from WARC file
If I grep and count the Content-Type fields in the WARC, this is what I get. No PDF and no ZIP:
6702 Content-Type: application/warc-fields
6701 Content-Type: application/http; msgtype=response
6701 Content-Type: application/http; msgtype=request
6190 Content-Type: text/html;charset=UTF-8
356 Content-Type: text/html; charset=iso-8859-1
42 Content-Type: image/jpeg;charset=UTF-8
26 Content-Type: text/css;charset=UTF-8
23 Content-Type: image/png;charset=UTF-8
15 Content-Type: application/javascript;charset=UTF-8
14 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
10 Content-Type: text/html
7 Content-Type: image/svg+xml;charset=UTF-8
4 Content-Type: image/gif;charset=UTF-8
2 Content-Type: text/xml;charset=UTF-8
2 Content-Type: application/x-font-woff;charset=UTF-8
2 Content-Type: application/x-font-ttf;charset=UTF-8
2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
2 Content-Type: application/font-woff2;charset=UTF-8
1 Content-Type: text/plain
1 Content-Type: text/dns
1 Content-Type: application/vnd.ms-excel;charset=UTF-8
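For anyone who wants to reproduce a count like the one above without grep, here is a minimal Python sketch. It assumes a gzip-compressed WARC; the function names and the file path in the usage note are hypothetical. Like grep, it counts every line that starts with `Content-Type:`, so it mixes WARC record headers and HTTP headers and could also match such lines inside a payload.

```python
import gzip
import re
from collections import Counter

# A header line starting with "Content-Type:"; trailing CR/LF is stripped.
CONTENT_TYPE = re.compile(rb"^Content-Type:[ \t]*(.*?)\s*$", re.IGNORECASE)


def count_content_types(lines):
    """Count Content-Type header values in an iterable of byte lines."""
    counts = Counter()
    for line in lines:
        match = CONTENT_TYPE.match(line)
        if match:
            counts[match.group(1).decode("latin-1")] += 1
    return counts


def count_in_warc(path):
    """Count Content-Type lines in a gzip-compressed WARC file."""
    # gzip.open reads concatenated gzip members, which is the usual
    # layout of a .warc.gz file (one member per record).
    with gzip.open(path, "rb") as fh:
        return count_content_types(fh)
```

Usage would look like `count_in_warc("crawl.warc.gz").most_common()`, where `crawl.warc.gz` stands in for the real file name.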
Crawled Bytes
- Total crawled bytes according to the crawl summary are: 249943910 (238 MiB)
- Size of the zipped WARC file is 70 MB, unzipped 333 MB
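As a quick cross-check, the byte column of the mimetype-report can be summed and compared with the crawl summary total. The sketch below only uses the figures pasted above. That the summary is derived from the same counters as the report is my assumption, not something I have confirmed in the code.

```python
# Byte counts copied from the mimetype-report above.
report_bytes = {
    "text/html": 234271851,
    "application/pdf": 8659344,
    "image/jpeg": 1829002,
    "text/css": 508206,
    "image/png": 239633,
    "application/javascript": 811627,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": 1462995,
    "application/zip": 18531,
    "image/svg+xml": 1149664,
    "image/gif": 49430,
    "application/font-woff2": 97178,
    "application/vnd.ms-fontobject": 241457,
    "application/x-font-ttf": 240859,
    "application/x-font-woff": 124253,
    "text/xml": 20934,
    "unknown": 4400,
    "application/vnd.ms-excel": 212071,
    "text/dns": 56,
    "text/plain": 2419,
}

# Total over all MIME types, including the excluded ones.
total = sum(report_bytes.values())

# Bytes attributed to the two types that are absent from the WARC.
excluded = report_bytes["application/pdf"] + report_bytes["application/zip"]

print(total)             # 249943910, identical to the crawl summary total
print(excluded)          # 8677875
print(total - excluded)  # 241266035
```

The sum matches the crawl summary exactly, so the total crawled bytes appear to include the PDF and ZIP entries that were never written to the WARC.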
Problem
We use the reports and logs in our archive for an overview of the content. In this case that is dangerous, because the reports list content that is not in the WARC. Is there an explanation and maybe a fix for the problem?