Host-link extraction: preserve www. prefix #56

@sebastian-nagel

The CCF host-level web graphs (since 2017) were created with the leading www. stripped from the host name. Unlike in SURT normalization, the prefix is stripped only

  • if at least two dot-separated segments remain after stripping (www.com is kept intact)
  • if the prefix is exactly www. (numbered prefixes such as www1. are not stripped).
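
The stripping rule described above can be sketched as follows. This is a minimal illustration based only on the two conditions listed here, not the actual web graph code; the function name `strip_www` is hypothetical.

```python
def strip_www(host: str) -> str:
    """Strip a leading "www." if at least two segments remain afterwards.

    Hypothetical helper illustrating the rule described in this issue;
    it is not taken from the web graph implementation.
    """
    prefix = "www."
    if not host.startswith(prefix):
        # Covers "www1." and similar prefixes, which are never stripped.
        return host
    rest = host[len(prefix):]
    # Count the non-empty dot-separated segments left after stripping.
    segments = [s for s in rest.split(".") if s]
    if len(segments) < 2:
        # E.g. "www.com" would become "com", so it is kept intact.
        return host
    return rest


for name in ["www.example.com", "www.com", "www1.example.com", "example.com"]:
    print(name, "->", strip_www(name))
```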

The reason for the stripping is the reduced size of the host-level web graphs:

  • Back in 2017 this made the graphs approx. 10% smaller.
  • This number went down over the years and the storage savings are now only about 5%.

While the storage benefit has become smaller, the stripping has a couple of disadvantages:

  1. Joining web graph data with other host-level data is more difficult.
  2. Fetching the homepage of a stripped host name requires following a redirect, or may even fail or return a different result.
  3. Extra documentation is required, in addition to the reverse domain name notation.
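
To illustrate the first point: when joining against a graph with stripped names, a node may stand for the host with or without the www. prefix, so both candidates have to be tried. The sketch below assumes node names are stored in reverse domain name notation (e.g. `com.example`); the function names are hypothetical and only serve as an example.

```python
def unreverse_host(reversed_host: str) -> str:
    """Turn reverse domain name notation ("com.example") into "example.com"."""
    return ".".join(reversed(reversed_host.split(".")))


def join_candidates(reversed_host: str) -> list[str]:
    """Return the host names a stripped graph node may correspond to.

    Hypothetical helper: with stripping, the original host could have been
    the name itself or the same name with a leading "www.".
    """
    host = unreverse_host(reversed_host)
    return [host, "www." + host]


print(join_candidates("com.example"))
```

Once the prefix is preserved, each node maps to exactly one host name and the second candidate is no longer needed.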

Starting with the first web graph in 2026 (cc-main-2025-26-nov-dec-jan-host), the leading www. in a host name will be preserved.
