Description
Browsers have different ways of reporting the 'resource type' for any resource that's being fetched. When using browser-based crawling, it is often easy to access this 'resource type' and store it in a custom WARC header.
It is possible to introduce a WARC-Resource-Type header to store this type. Unfortunately, there isn't a single standard of 'resource types' and various browser APIs expose different variations on this.
If a resource type is written to a WARC header, is there a way to make it future-proof so that it can support different vocabularies?
Some possibilities include:
- Chrome DevTools Protocol (CDP) resource type - this is the easiest option for Chromium-based crawling, as these fields are directly accessible, but it is not especially well standardized and could change at any time.
- Fetch Request.destination - this is a well-standardized vocabulary, but it is not a one-to-one mapping and may not be accessible for non-Fetch data.
- Extension API webRequest.resourceType - better standardized and supported by all the major browsers for browser extensions, with some differences between them. Not quite one-to-one with CDP types.
One approach to make this more future-proof might be to prefix the resource type with a namespace that identifies where the data comes from and which vocabulary is used.
For example, with CDP the value might be cdp:Document or cdp:Image; with webRequest, webRequest:sub_frame or webRequest:image; with Request.destination, destination:image or destination:document, and so on.
This allows for expanding into other vocabularies in the future, but may be harder to parse.
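To get a feel for the parsing cost, here is a minimal sketch of writing and reading a namespaced value. The names `KNOWN_VOCABULARIES`, `ResourceType`, `format_resource_type` and `parse_resource_type` are hypothetical and not part of any WARC specification or library; the sketch also assumes that a value without a recognized prefix should be kept as-is rather than rejected.

```python
from typing import NamedTuple, Optional

# Hypothetical set of vocabulary prefixes, taken from the examples above.
KNOWN_VOCABULARIES = {"cdp", "webRequest", "destination"}


class ResourceType(NamedTuple):
    vocabulary: Optional[str]  # None when the value carries no recognized prefix
    value: str


def format_resource_type(vocabulary: str, value: str) -> str:
    """Build a namespaced header value such as 'cdp:Document'."""
    return f"{vocabulary}:{value}"


def parse_resource_type(header_value: str) -> ResourceType:
    """Split a header value into (vocabulary, value).

    Values without a recognized prefix are returned unchanged with
    vocabulary=None, so a reader can still handle unknown vocabularies.
    """
    cleaned = header_value.strip()
    prefix, sep, rest = cleaned.partition(":")
    if sep and rest and prefix in KNOWN_VOCABULARIES:
        return ResourceType(prefix, rest)
    return ResourceType(None, cleaned)
```

With this shape, a reader that only understands one vocabulary can skip records whose prefix it does not recognize, which is most of what the namespace buys.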
Alternatively, the header could be restricted to a fixed vocabulary made up of the values shared by at least two of the vocabularies above, which might be:
document, image, media, script, stylesheet, font, ping, websocket, fetch and a catch-all other.
(In this case, we should specify how the more specific values are recorded, e.g. main_frame / sub_frame would be recorded as document.)
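As a rough illustration of the fixed-vocabulary option, the sketch below normalizes a source-specific value to the proposed list. The function name `normalize` and the table `ILLUSTRATIVE_MAPPING` are hypothetical, and apart from main_frame / sub_frame the individual mappings are assumptions for discussion, not an agreed specification.

```python
# The fixed vocabulary proposed above.
FIXED_VOCABULARY = {
    "document", "image", "media", "script", "stylesheet",
    "font", "ping", "websocket", "fetch", "other",
}

# Illustrative only: how more specific values might collapse into the
# fixed vocabulary. Only main_frame / sub_frame come from the proposal;
# the remaining entries are assumptions.
ILLUSTRATIVE_MAPPING = {
    ("webRequest", "main_frame"): "document",
    ("webRequest", "sub_frame"): "document",
    ("webRequest", "xmlhttprequest"): "fetch",
    ("cdp", "XHR"): "fetch",
    ("destination", "iframe"): "document",
    ("destination", "style"): "stylesheet",
    ("destination", "audio"): "media",
    ("destination", "video"): "media",
}


def normalize(vocabulary: str, value: str) -> str:
    """Map a source-specific resource type to the fixed vocabulary."""
    mapped = ILLUSTRATIVE_MAPPING.get((vocabulary, value))
    if mapped is not None:
        return mapped
    # Values such as cdp 'Image' differ from the fixed term only by case.
    lowered = value.lower()
    return lowered if lowered in FIXED_VOCABULARY else "other"
```

Anything that has no explicit entry and no case-insensitive match falls through to other, so the table of exceptions is the part that would need to be written down in a spec.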
Other thoughts / suggestions welcome!