
Crawl job stats and reports misleading when excluding PDF-Files (follow up to issue #453) #455

@oschihin

Following the advice in issue #453, I successfully excluded unwanted PDF documents from being fetched and written to the WARC file. However, this method seems to produce misleading reports and stats.

mimetype-report

The mimetype report still lists PDF and ZIP files with URL counts and byte totals, although both types were excluded:

[#urls] [#bytes] [mime-types]
6556 234271851 text/html
4193 8659344 application/pdf
42 1829002 image/jpeg
26 508206 text/css
23 239633 image/png
15 811627 application/javascript
14 1462995 application/vnd.openxmlformats-officedocument.wordprocessingml.document
9 18531 application/zip
7 1149664 image/svg+xml
4 49430 image/gif
2 97178 application/font-woff2
2 241457 application/vnd.ms-fontobject
2 240859 application/x-font-ttf
2 124253 application/x-font-woff
2 20934 text/xml
2 4400 unknown
1 212071 application/vnd.ms-excel
1 56 text/dns
1 2419 text/plain

count of content-type from WARC-file

If I grep the WARC file for Content-Type fields and count them, this is what I get. There are no PDF or ZIP entries:

6702 Content-Type: application/warc-fields
6701 Content-Type: application/http; msgtype=response
6701 Content-Type: application/http; msgtype=request
6190 Content-Type: text/html;charset=UTF-8
 356 Content-Type: text/html; charset=iso-8859-1
  42 Content-Type: image/jpeg;charset=UTF-8
  26 Content-Type: text/css;charset=UTF-8
  23 Content-Type: image/png;charset=UTF-8
  15 Content-Type: application/javascript;charset=UTF-8
  14 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
  10 Content-Type: text/html
   7 Content-Type: image/svg+xml;charset=UTF-8
   4 Content-Type: image/gif;charset=UTF-8
   2 Content-Type: text/xml;charset=UTF-8
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: text/dns
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
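For anyone who wants to reproduce this count without grep, here is a minimal Python sketch of the same idea. It assumes a gzip-compressed WARC file. The function names and the example path are hypothetical, not part of any crawler tooling. Like grep, it matches every line that starts with `Content-Type:`, so it counts WARC record headers and HTTP headers alike, and it could also pick up matching lines inside payload bodies.

```python
import gzip
from collections import Counter


def count_content_types(lines):
    """Count lines that start with 'Content-Type:' (case-insensitive).

    `lines` is any iterable of bytes objects, e.g. an open binary file.
    """
    counts = Counter()
    for raw in lines:
        if raw[:13].lower() == b"content-type:":
            # latin-1 never fails to decode, which is convenient for raw headers
            counts[raw.decode("latin-1").strip()] += 1
    return counts


def count_in_warc(path):
    """Count Content-Type lines in a gzip-compressed WARC file."""
    # gzip.open reads files made of concatenated gzip members transparently
    with gzip.open(path, "rb") as fh:
        return count_content_types(fh)


# Example usage (hypothetical path):
# for line, n in count_in_warc("crawl.warc.gz").most_common():
#     print(f"{n:7d} {line}")
```

The output is sorted by count in descending order, which mirrors the listing above.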

Crawled Bytes

  • Total crawled bytes according to the crawl summary are: 249943910 (238 MiB)
  • The compressed WARC file is 70 MB; uncompressed, it is 333 MB
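As a cross-check, the byte column of the mimetype report can be summed and compared with the crawl summary total. The sketch below uses only the numbers quoted in this issue. The variable names are illustrative, and it assumes that the report rows shown above are complete and that `application/pdf` and `application/zip` are the only excluded types.

```python
# Rows copied from the mimetype report above: [#urls] [#bytes] [mime-types]
REPORT = """\
6556 234271851 text/html
4193 8659344 application/pdf
42 1829002 image/jpeg
26 508206 text/css
23 239633 image/png
15 811627 application/javascript
14 1462995 application/vnd.openxmlformats-officedocument.wordprocessingml.document
9 18531 application/zip
7 1149664 image/svg+xml
4 49430 image/gif
2 97178 application/font-woff2
2 241457 application/vnd.ms-fontobject
2 240859 application/x-font-ttf
2 124253 application/x-font-woff
2 20934 text/xml
2 4400 unknown
1 212071 application/vnd.ms-excel
1 56 text/dns
1 2419 text/plain
"""

SUMMARY_TOTAL = 249943910  # total crawled bytes from the crawl summary
EXCLUDED = {"application/pdf", "application/zip"}  # assumed excluded types


def parse_report(text):
    """Return a list of (urls, size_in_bytes, mime_type) tuples."""
    rows = []
    for line in text.splitlines():
        urls, size, mime = line.split(maxsplit=2)
        rows.append((int(urls), int(size), mime))
    return rows


rows = parse_report(REPORT)
total_bytes = sum(size for _, size, _ in rows)
excluded_bytes = sum(size for _, size, mime in rows if mime in EXCLUDED)

print(total_bytes == SUMMARY_TOTAL)   # True
print(excluded_bytes)                 # 8677875
print(total_bytes - excluded_bytes)   # 241266035
```

The report rows add up to exactly the crawl summary total, so the summary appears to include the 8677875 bytes attributed to the excluded PDF and ZIP URLs.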

Problem

We rely on the reports and logs in our archive to get an overview of the archived content, so reports that list content that is not in the WARC file are dangerous for us. Is there an explanation for this, and maybe a fix?
