The CCF host-level web graphs (since 2017) were created with the leading www. stripped from the host name. Unlike the SURT normalization, the prefix is only stripped under these conditions:
- at least two dot-separated segments are preserved (www.com is kept intact),
- the prefix is exactly www. (no www1. prefixes are stripped).
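For illustration, the stripping rule described above could be sketched as follows. This is a minimal sketch, not the code used to build the graphs: the function name `strip_leading_www` is hypothetical, and it assumes the host name is already lowercased and that the prefix is stripped at most once.

```python
def strip_leading_www(host: str) -> str:
    """Strip a leading "www." if at least two dot-separated segments remain.

    Hypothetical helper; assumes `host` is already lowercased.
    """
    prefix = "www."
    if host.startswith(prefix):
        rest = host[len(prefix):]
        # Count the non-empty segments left after removing the prefix.
        segments = [s for s in rest.split(".") if s]
        if len(segments) >= 2:
            return rest
    # "www.com", "www1.example.com" and hosts without the prefix are unchanged.
    return host


for name in ["www.example.com", "www.com", "www1.example.com", "example.com"]:
    print(name, "->", strip_leading_www(name))
```

Prefixes such as www1. never match because the check is an exact comparison against the string "www.".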
The reason for the stripping is the reduced size of the host-level web graphs:
- Back in 2017 this made the graphs approx. 10% smaller.
- This number has gone down over the years, and the storage savings are now only about 5%.
While the storage benefit has become smaller, the stripping has a couple of disadvantages:
- Joining web graph data with other host-level data is more difficult.
- Fetching the homepage of a stripped host name requires following a redirect, and the request may even fail or return a different result.
- Extra documentation is required, in addition to the reverse domain name notation.
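To make the joining problem concrete, here is a small sketch of the key that external host-level data would need in order to match the stripped graphs. The names `to_reversed_host` and `graph_join_key` are hypothetical, the input is assumed to be lowercased, and the reversal is only a plain segment-wise reversal of the host name.

```python
def to_reversed_host(host: str) -> str:
    """Reverse the dot-separated segments, e.g. "example.com" -> "com.example"."""
    return ".".join(reversed(host.split(".")))


def graph_join_key(host: str) -> str:
    """Build a join key for a graph with stripped www. prefixes (hypothetical).

    Applies the stripping rule first (exact "www." prefix, at least two
    segments must remain), then the reverse domain name notation.
    """
    prefix = "www."
    if host.startswith(prefix):
        rest = host[len(prefix):]
        if len([s for s in rest.split(".") if s]) >= 2:
            host = rest
    return to_reversed_host(host)


# Two distinct hosts in the external data end up on the same graph node.
print(graph_join_key("www.example.com"))
print(graph_join_key("example.com"))
```

Both calls produce the same key, so per-host data for www.example.com and example.com has to be merged or disambiguated before the join. With the www. preserved, the segment reversal alone would be enough.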
Starting with the first web graph in 2026 (cc-main-2025-26-nov-dec-jan-host), the leading www. in a host name will be preserved.