Crawl job stats and reports misleading when excluding PDF files (follow-up to issue #453) #455

Using the advice in issue #453, I successfully excluded unwanted PDF documents from being fetched and written to the WARC. However, this method seems to produce misleading reports and stats.

## mimetype-report
The report still shows PDF and ZIP files with URL counts and byte totals, although both types are excluded:
```txt
[#urls] [#bytes] [mime-types]
6556 234271851 text/html
4193 8659344 application/pdf
42 1829002 image/jpeg
26 508206 text/css
23 239633 image/png
15 811627 application/javascript
14 1462995 application/vnd.openxmlformats-officedocument.wordprocessingml.document
9 18531 application/zip
7 1149664 image/svg+xml
4 49430 image/gif
2 97178 application/font-woff2
2 241457 application/vnd.ms-fontobject
2 240859 application/x-font-ttf
2 124253 application/x-font-woff
2 20934 text/xml
2 4400 unknown
1 212071 application/vnd.ms-excel
1 56 text/dns
1 2419 text/plain
```
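To put a number on how much of the report refers to excluded content, the rows can be parsed and summed. The following is only a sketch: `parse_mimetype_report`, `REPORT_EXCERPT` and `EXCLUDED` are illustrative names, not part of the crawler, and the excerpt contains just three rows copied from the report above.

```python
def parse_mimetype_report(text):
    """Return {mime_type: (url_count, byte_count)} from mimetype-report text."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not "<urls> <bytes> <mime>".
        if len(parts) != 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        rows[parts[2]] = (int(parts[0]), int(parts[1]))
    return rows


# Three rows copied from the report above (illustrative excerpt only).
REPORT_EXCERPT = """\
[#urls] [#bytes] [mime-types]
6556 234271851 text/html
4193 8659344 application/pdf
9 18531 application/zip
"""

# MIME types that were excluded from fetching in this crawl.
EXCLUDED = ("application/pdf", "application/zip")

rows = parse_mimetype_report(REPORT_EXCERPT)
excluded_urls = sum(rows[m][0] for m in EXCLUDED)
excluded_bytes = sum(rows[m][1] for m in EXCLUDED)
print(excluded_urls, excluded_bytes)  # 4202 8677875
```

So the report attributes 4202 URLs and 8677875 bytes to types that should not have been fetched at all.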

## count of content-type from WARC-file
If I grep for the `Content-Type` fields in the WARC and count them, this is what I get. There are no PDF or ZIP entries:
```txt
6702 Content-Type: application/warc-fields
6701 Content-Type: application/http; msgtype=response
6701 Content-Type: application/http; msgtype=request
6190 Content-Type: text/html;charset=UTF-8
 356 Content-Type: text/html; charset=iso-8859-1
  42 Content-Type: image/jpeg;charset=UTF-8
  26 Content-Type: text/css;charset=UTF-8
  23 Content-Type: image/png;charset=UTF-8
  15 Content-Type: application/javascript;charset=UTF-8
  14 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
  10 Content-Type: text/html
   7 Content-Type: image/svg+xml;charset=UTF-8
   4 Content-Type: image/gif;charset=UTF-8
   2 Content-Type: text/xml;charset=UTF-8
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: text/dns
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
```
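For comparison with the mimetype-report, the same count can be reproduced with the `charset` parameters stripped, so that all three `text/html` variants fall into one bucket. This is a minimal sketch of a line scan, not a real WARC parser: like grep, it would also count a payload line that happens to start with `Content-Type:`, and it merges the request and response `application/http` records into one entry. The function names are hypothetical.

```python
import gzip
from collections import Counter


def count_media_types(lines):
    """Count Content-Type header lines, ignoring parameters such as charset."""
    counts = Counter()
    for raw in lines:
        if not raw.lower().startswith(b"content-type:"):
            continue
        value = raw.split(b":", 1)[1].decode("latin-1")
        # "text/html;charset=UTF-8" and "text/html; charset=iso-8859-1"
        # both become "text/html".
        media_type = value.split(";", 1)[0].strip().lower()
        counts[media_type] += 1
    return counts


def count_media_types_in_warc(path):
    """Scan a gzip-compressed WARC file line by line (path is supplied by the caller)."""
    # gzip.open reads through all concatenated gzip members of a .warc.gz file.
    with gzip.open(path, "rb") as fh:
        return count_media_types(fh)


# In-memory demonstration with header lines in the style of the output above.
sample = [
    b"Content-Type: text/html;charset=UTF-8\r\n",
    b"Content-Type: text/html; charset=iso-8859-1\r\n",
    b"Content-Type: image/png;charset=UTF-8\r\n",
    b"<html>\r\n",
]
for media_type, n in count_media_types(sample).most_common():
    print(n, media_type)
```

With the parameters stripped, the three `text/html` lines above add up to 6190 + 356 + 10 = 6556, which matches the `text/html` row of the mimetype-report.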

## Crawled Bytes
* Total crawled bytes according to the crawl summary: 249943910 (238 MiB)
* Size of the compressed WARC file: 70 MB; uncompressed: 333 MB
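The total from the crawl summary can be checked against the byte column of the mimetype-report. The short sketch below does only that arithmetic. The values are copied from the report above and the variable names are illustrative.

```python
# Byte column of the mimetype-report, in report order.
REPORT_BYTES = [
    234271851, 8659344, 1829002, 508206, 239633, 811627, 1462995,
    18531, 1149664, 49430, 97178, 241457, 240859, 124253, 20934,
    4400, 212071, 56, 2419,
]

SUMMARY_TOTAL = 249943910          # "Total crawled bytes" from the crawl summary
EXCLUDED_BYTES = 8659344 + 18531   # application/pdf + application/zip rows

report_total = sum(REPORT_BYTES)
print(report_total == SUMMARY_TOTAL)   # True
print(SUMMARY_TOTAL - EXCLUDED_BYTES)  # 241266035
```

The report rows add up exactly to the summary total, so the crawl summary also counts the 8677875 bytes attributed to the excluded PDF and ZIP URLs.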

## Problem
We use the reports and logs in our archive to get an overview of the archived content. In this case that is dangerous, because the reports list content that is not in the WARC. Is there an explanation for this, and maybe a fix?

