
Computing weighted links for subgraphs of cc-webgraph#58

Open
PeterCarragher wants to merge 4 commits into commoncrawl:main from PeterCarragher:main

Conversation

@PeterCarragher

Closes: cc-webgraph issue

Description

This PR introduces a new spark job, "count_domain_links", using the cc-pyspark framework.
It takes as input a list of target domains, fetches all the WARC files for those domains in the specified crawl, and then parses those WARC files.
While parsing the WARC files, it counts all links between domains in the target domain list.
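The per-record counting logic (independent of the Spark plumbing) can be sketched as follows. This is an illustrative sketch, not the PR's actual code: `count_links` and the naive domain heuristic are made-up names, and a real job should use the public suffix list rather than the last-two-labels shortcut shown here.

```python
from collections import Counter
from urllib.parse import urlparse

def registered_domain(url):
    """Naive registered-domain heuristic: keep the last two host labels.
    (Fails for suffixes like .co.uk -- use the public suffix list in practice.)"""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def count_links(source_url, hrefs, target_domains):
    """Count links from a page's domain to each target domain,
    keeping multiplicity (two links to the same domain count twice)."""
    src = registered_domain(source_url)
    edges = Counter()
    for href in hrefs:
        dst = registered_domain(href)
        if dst in target_domains:
            edges[(src, dst)] += 1
    return edges
```

In a Spark job, per-page counters like these would typically be emitted as ((src, dst), count) pairs and summed across the cluster.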

Finally, I have added a script that launches spark jobs using AWS EMR serverless.
This is simply a helper script that could be used with any of the other jobs in cc-pyspark, although I have only tested count_domain_links. If preferred, I can pull out that script as part of a separate PR.

While the issue and discussion for this happened over on the cc-webgraph repository, the spark job seems to belong here (please correct me if wrong).

Design Considerations

The approach described above was one choice of many. Here are some alternative approaches, and why they were not chosen:

  • Use extracted link metadata in WAT format instead of parsing WARC files: This would not work because link extraction in this format does not count multiple links to the same domain occurring on a single webpage. In addition, the WAT metadata is not accessible through the columnar index, which is required given that we are dealing with small subgraphs (and need random access).
  • Use FastWARC instead of WARC: This only works for a job that parses the entire crawl. Here, we are only concerned with a subset of domains in the webgraph, and so instead use the columnar index to look them up. However, if in the future, weights are being computed over the entire webgraph, then a separate job should be created that uses FastWARC.
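The random access mentioned above works by looking up a record's location in the columnar index and issuing an HTTP range request for just that record (a gzip member). A minimal sketch of the request construction, with the fetch itself omitted; the field names follow the cc-index table schema, and the helper name is hypothetical:

```python
def record_request(warc_filename, offset, length,
                   prefix="https://data.commoncrawl.org/"):
    """Build the URL and Range header for fetching a single WARC record
    given (warc_filename, warc_record_offset, warc_record_length)
    from the columnar index. HTTP byte ranges are inclusive."""
    url = prefix + warc_filename
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers
```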

Performance

The logistics and costs of running this spark job on AWS EMR for a target_domain list of ~1400 popular domains are as follows:

  • ~3M WARC files parsed
  • ~3 hours runtime with 64vCPU limit
  • 204.477 vCPU-hours billed
  • 891.162 memoryGB-hours billed
  • 1022.383 storageGB-hours (not billed, WARC files are already on AWS)
  • Total AWS bill: ~$5

Testing & Analysis

So as not to add any further bloat to this PR, I have kept the code used to validate the results of the job in a separate repository:

  • The analysis notebook on the cc-webgraph-weighted repo shows that the computed link weights follow expected power law distributions, with an alpha of ~2.0 (within the expected range for webgraphs).
  • Navigational links (which are categorized separately from other link types) emerge around news sites that are co-owned, which is expected. This suggests the navigational and body link categories are classified correctly by the stack-based system when parsing WARC files.
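As a sanity check on the power-law claim, the tail exponent of a weight distribution can be estimated with the standard Hill / maximum-likelihood estimator, alpha = 1 + n / Σ ln(x_i / x_min). This is a generic sketch, not the notebook's code:

```python
import math

def powerlaw_alpha(weights, xmin=1.0):
    """MLE (Hill) estimate of alpha for p(x) ~ x^(-alpha),
    using only observations >= xmin."""
    tail = [w for w in weights if w >= xmin]
    return 1.0 + len(tail) / sum(math.log(w / xmin) for w in tail)
```

An alpha around 2.0 for the tail of the link-weight distribution is consistent with typical webgraph degree distributions.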

Comments

The degree to which this "closes" the original issue is debatable, as the result of this job is link weights for subgraphs. Given the academic use case, it was not computationally feasible or affordable to run this job across the entire webgraph. A true solution would likely rely on FastWARC.

Finally, the count_domain_links spark job serves a single use case: computing weights between a list of target domains. For use cases that require snowball sampling from a list of known domains, to compute link weights for outlinking and backlinking domains as well, I have a separate script that relies on the pyccwebgraph library. That library is my own work, based on the jshell interactive demo. Where it should live, or whether it should exist at all, is up for debate in this issue.

Another approach for those interested in expanding seed domain lists before computing webgraph weights would be to use this quick interactive webapp to discover backlinking / outlinking domains via snowball sampling.

Feedback

I hope all this is deemed useful! As this is my first PR for this project, I am not sure what the best way to contribute is. Please don't hesitate to point out any and all issues!

Contributor

@sebastian-nagel left a comment


Thanks @PeterCarragher, the PR is very appreciated.

I've run the job successfully on Spark locally over just a very small test set of 10 WARC records from a couple of domains:

  • The first run produced an empty graph as output. Ok, that was because intra-domain links were ignored and also the predefined set of domains was too small.
  • Most of the links were classified as "other", few as "navigation", only a single one as "body" (a "powered by" link in the footer). Looks like the classification works for modern and well-structured HTML5-based layouts, but not for those relying on div elements and CSS.

Use extracted link metadata in WAT format instead of parsing WARC files: This would not work because extracted metadata in this format does not count when multiple links to the same domain occur on a single webpage.

The WAT files should include all links, no deduplication is performed.

However, the "ExtractHostLinksJob" deduplicates the links by aggregating them in a set.

But for your use case: the WAT only marks the links as A@/href or IMG@/src, not sufficient to classify links into navigation, etc.

Use FastWARC instead of WARC: This only works for a job that parses the entire crawl.

FastWARC can also be used to parse single WARC records. But I'd expect only a very marginal speed improvement using FastWARC on single records only.

The analysis notebook on the cc-webgraph-weighted repo shows that

Looks like the notebook does not include the output.

The degree to which this "closes" the original issue is debatable, as the result of this job is link weights for subgraphs.

We'd rather need a general solution. The predefined set of domains is one limitation. Here are a few other points:

  • Intra-domain links should be included. For some use cases, e.g. spam detection, the ratio between internal and external links might be a good indicator. Without internal links, it would be difficult to normalize the weights in a meaningful way.
  • The Common Crawl web graphs include all kinds of links, not only <a href=...>: img, video, link, etc.
  • Of course, we'd start with a host-level graph, folding it to the domain-level in a second step.
  • Because the graphs are constructed from three consecutive crawls, we should think about how to treat revisits within this timespan. If they are not handled, links are counted multiple times. But maybe that's a negligible problem.
  • Any link classification needs to work for all HTML layouts and all languages (including class and id names). It's a hard problem, so rather, don't do it.
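The internal/external normalization point above can be illustrated with plain counts. This is a hedged sketch with hypothetical names, not code from the PR:

```python
def external_link_shares(edges):
    """Given {(src, dst): weight} including self-loops (src == dst),
    return each external edge's weight normalized by the source's
    total outgoing weight (internal + external). Without the
    self-loops, these shares cannot be computed meaningfully."""
    totals = {}
    for (src, _dst), w in edges.items():
        totals[src] = totals.get(src, 0) + w
    return {
        (src, dst): w / totals[src]
        for (src, dst), w in edges.items()
        if src != dst
    }
```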

body_count -- links inside elements whose class/id suggests article body / prose content
related_count -- links inside elements whose class/id suggests related-article widgets or recommendation carousels
other_count -- everything else: sidebars, bylines, share buttons, ...
Contributor


For sites which rely on <div> elements in combination with CSS to structure their pages, all links might end up in "other". That's not uncommon.

Author

@PeterCarragher Mar 11, 2026


At least for my test runs, this ended up as the rarest category. I will just leave a note here for now so others are aware if they run into this case.
For sites that do have this structure, other than summing all categories and just looking at overall uncategorized link counts, I am not sure what the simplest way to adjust this classification method would be.

For the largest set of domains I ran tests on (1.4k), here were the categories parsed:

Link volume by category:
nav_count 3,586,660 (90.4%)
body_count 365,924 (9.2%)
related_count 813 (0.0%)
other_count 15,265 (0.4%)
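The quoted percentages can be reproduced directly from the raw counts (a quick arithmetic check, not code from the PR):

```python
counts = {
    "nav_count": 3_586_660,
    "body_count": 365_924,
    "related_count": 813,
    "other_count": 15_265,
}
total = sum(counts.values())  # 3,968,662 links in total
# share of each category, rounded to one decimal place
shares = {k: round(100 * v / k.__class__ and 100 * v / total, 1) if False else round(100 * v / total, 1)
          for k, v in counts.items()}
# nav ~90.4%, body ~9.2%, related ~0.0%, other ~0.4%
```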

@PeterCarragher
Author

Thanks for the quick review!

Looks like the notebook does not include the output.

My bad, here's a link to the notebook with output.

Intra-domain links should be included.

The decision to run on the domain graph rather than the host graph is mainly for my use case. However, domain level self-links are now included based on the last commit, so at least there is some signal on internal links now. Thanks for catching that!

Of course, we'd start with a host-level graph, folding it to the domain-level in a second step.

This makes a lot of sense! Certainly this job would need to be tweaked in multiple places to generate host graphs. The code changes would be relatively straightforward. The main issue would be identifying the subdomains to add for the subgraph approach used here ("target" domain filtering).
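The host-to-domain fold can be sketched as below; the last-two-labels heuristic is a deliberate simplification for illustration (the real cc-webgraph tooling relies on the public suffix list), and the function names are hypothetical:

```python
from collections import defaultdict

def host_to_domain(host):
    """Simplistic fold: keep the last two labels. Fails for suffixes
    like .co.uk -- use the public suffix list in practice."""
    parts = host.lower().split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def fold_edges(host_edges):
    """Aggregate host-level edge weights {(src_host, dst_host): w}
    into domain-level weights, summing weights of folded edges."""
    domain_edges = defaultdict(int)
    for (src, dst), w in host_edges.items():
        domain_edges[(host_to_domain(src), host_to_domain(dst))] += w
    return dict(domain_edges)
```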

Because the graphs are constructed from three consecutive crawls, should think how to treat revisits within this timespan.

I did not realize the WARC files hadn't been deduplicated to remove revisits. It seems there would be no guarantee that the duplicated WARC records would end up in the same partition, so I'm not sure there is an efficient way to fix this.
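A partition-local mitigation (a sketch only, since as noted duplicates can land in different partitions) would be to keep the first capture per URL/content pair, keyed by the record's digest; the record keys here are hypothetical:

```python
def dedup_records(records):
    """Yield only the first capture per (url, content digest).
    `records` is an iterable of dicts with assumed 'url' and
    'digest' keys (e.g. taken from WARC record headers)."""
    seen = set()
    for rec in records:
        key = (rec["url"], rec["digest"])
        if key not in seen:
            seen.add(key)
            yield rec
```

A global fix would instead need a shuffle keyed on (url, digest) before counting, at additional cost.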

Any link classification needs to work for all HTML layouts and all languages (including class and id names). It's a hard problem, so rather, don't do it.

For my use case, the navigational links need to be separated at the very least, and I am mainly interested in the "body" link category. For news sites (which is what this job was tested on), the hope is that this counts most of the links that appear in articles as sources.

If the categorization is not accurate for other use cases, I can remove it here and keep my own forked version. But in the worst case, the link categories can always be summed, so I'm hoping it's okay as is.

Let me know if any of these remaining issues are blocking from closing the PR!



Development

Successfully merging this pull request may close these issues.

Constructing a Weighted Webgraph

2 participants