WARC-Resource-Type field possibilities (feedback wanted) #96

@ikreymer

Browsers report the 'resource type' of a fetched resource in different ways. With browser-based crawling, this resource type is often easy to access and could be stored in a custom WARC header.

One option is to introduce a WARC-Resource-Type header to store this type. Unfortunately, there is no single standard vocabulary of resource types, and different browser APIs expose different variations.

If a resource type is written to a WARC header, is there a way to make it future-proof so that it can support different vocabularies?

Some possibilities include:

  • Chrome DevTools Protocol (CDP) resource type - this is the easiest option for Chromium-based crawling, as these fields are directly accessible, but it is not especially well standardized and could change at any time.

  • Fetch Request.destination - this is a well-standardized vocabulary, but it is not a one-to-one mapping with the other vocabularies and may not be accessible for non-Fetch data.

  • Extension API webRequest.resourceType - better standardized and supported for browser extensions by all the major browsers, with some differences between them. Not quite one-to-one with CDP types.

One approach to make this more future-proof might be to prefix the resource type with a namespace that identifies where the data comes from and which vocabulary is used.

For example: cdp:Document or cdp:Image when using CDP; webRequest:sub_frame or webRequest:image when using webRequest; destination:image or destination:document when using destination; and so on.

This allows for expanding into other vocabularies in the future, but may make the values harder to parse.
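
To give a sense of the parsing cost, here is a minimal sketch of splitting a namespaced value, assuming the namespace is everything before the first colon. The function name `parse_resource_type` and the handling of unprefixed values are assumptions for illustration, not part of any spec.

```python
from typing import Optional, Tuple


def parse_resource_type(header_value: str) -> Tuple[Optional[str], str]:
    """Split a WARC-Resource-Type value into (namespace, type).

    Assumption: the namespace is the text before the first ':'.
    Values without a non-empty prefix are returned with namespace None.
    """
    raw = header_value.strip()
    namespace, sep, value = raw.partition(":")
    if sep and namespace:
        return namespace, value
    return None, raw


print(parse_resource_type("cdp:Document"))
print(parse_resource_type("webRequest:sub_frame"))
print(parse_resource_type("image"))
```

A reader that does not recognize the namespace could still keep the raw value, so unknown vocabularies would degrade gracefully rather than fail.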

Alternatively, there could be a fixed, allowed vocabulary that is a common subset of at least two of the above, which might be:
document, image, media, script, stylesheet, font, ping, websocket, fetch, and a catch-all other.

(In this case, we should specify how the more specific values are recorded, e.g. main_frame / sub_frame would be recorded as document.)
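
As a rough sketch of what such a specification might look like, the mapping below folds a few namespaced values into the fixed vocabulary. Only the main_frame / sub_frame to document rule comes from the text above; every other table entry, the `normalize` name, and the choice to send unknown values to other are assumptions for illustration, and the tables are deliberately incomplete.

```python
# The fixed vocabulary proposed above.
FIXED_VOCABULARY = {
    "document", "image", "media", "script", "stylesheet",
    "font", "ping", "websocket", "fetch", "other",
}

# Illustrative, partial mappings (assumptions, not an agreed spec).
TO_FIXED = {
    "cdp": {
        "Document": "document", "Image": "image", "Media": "media",
        "Script": "script", "Stylesheet": "stylesheet", "Font": "font",
        "Ping": "ping", "WebSocket": "websocket", "Fetch": "fetch",
    },
    "webRequest": {
        "main_frame": "document", "sub_frame": "document",
        "image": "image", "media": "media", "script": "script",
        "stylesheet": "stylesheet", "font": "font", "ping": "ping",
        "websocket": "websocket",
    },
    "destination": {
        "document": "document", "iframe": "document", "image": "image",
        "audio": "media", "video": "media", "script": "script",
        "style": "stylesheet", "font": "font",
    },
}


def normalize(namespaced: str) -> str:
    """Map a 'namespace:value' string to the fixed vocabulary.

    Anything unprefixed or unrecognized falls back to the catch-all 'other'.
    """
    namespace, sep, value = namespaced.strip().partition(":")
    if not sep:
        return "other"
    return TO_FIXED.get(namespace, {}).get(value, "other")


print(normalize("webRequest:main_frame"))
print(normalize("destination:style"))
```

The trade-off is that the mapping is lossy: once main_frame and sub_frame are both written as document, the distinction cannot be recovered from the WARC record.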

Other thoughts / suggestions welcome!
