Computing weighted links for subgraphs of cc-webgraph #58
PeterCarragher wants to merge 4 commits into commoncrawl:main
Conversation
sebastian-nagel
left a comment
Thanks @PeterCarragher, the PR is very appreciated.
I've run the job successfully on Spark locally over just a very small test set of 10 WARC records from a couple of domains:
- The first run produced an empty graph as output. Ok, that was because intra-domain links were ignored and also the predefined set of domains was too small.
- Most of the links were classified as "other", few as "navigation", only a single one as "body" (a "powered by" link in the footer). Looks like the classification works for modern and well-structured HTML5-based layouts, but not for those relying on div elements and CSS.
Use extracted link metadata in WAT format instead of parsing WARC files: This would not work because metadata in this format does not count multiple links to the same domain occurring on a single webpage.
The WAT files should include all links, no deduplication is performed.
However, the "ExtractHostLinksJob" deduplicates the links by aggregating them in a set.
But for your use case: the WAT only marks links as A@/href or IMG@/src, which is not sufficient to classify them into navigation, etc.
Use FastWARC instead of WARC: This only works for a job that parses the entire crawl.
FastWARC can also be used to parse single WARC records. But I'd expect only a very marginal speed improvement using FastWARC on single records only.
The analysis notebook on the cc-webgraph-weighted repo shows that
Looks like the notebook does not include the output.
The degree to which this "closes" the original issue is debatable, as the result of this job is link weights for subgraphs.
We'd rather need a general solution. The predefined set of domains is one limitation. Here are a few other points:
- Intra-domain links should be included. For some use cases, e.g. spam detection, the ratio between internal and external links might be a good indicator. Without internal links, it would be difficult to normalize the weights in a meaningful way.
- The Common Crawl web graphs include all kinds of links, not only <a href=...>: img, video, link, etc.
- Of course, we'd start with a host-level graph, folding it to the domain-level in a second step.
- Because the graphs are constructed from three consecutive crawls, we should think about how to treat revisits within this timespan. Otherwise, links are counted multiple times. But maybe that's a negligible problem.
- Any link classification needs to work for all HTML layouts and all languages (including class and id names). It's a hard problem, so rather, don't do it.
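To make the normalization point above concrete, here is a minimal sketch; the function name and data shapes are hypothetical, not taken from the PR:

```python
def normalize_outlinks(counts: dict) -> dict:
    """counts maps target domain -> raw link count for one source domain,
    with the source's own internal links included as a self-entry.
    Names and shapes are illustrative, not from the PR."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}

# "a.com" is the source domain's internal (self-link) entry:
w = normalize_outlinks({"a.com": 30, "b.com": 10})
# internal ratio w["a.com"] == 0.75; without the self-entry,
# b.com's weight would be 1.0 regardless of page volume
```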
| class/id suggests article body / prose content
| related_count -- links inside elements whose class/id suggests related-article widgets or recommendation carousels
| other_count -- everything else: sidebars, bylines, share buttons,
For sites which rely on <div> elements in combination with CSS to structure their pages, all links might end up in "other". That's not uncommon.
At least for my test runs, this ended up as the rarest category. I will just leave a note here for now so others are aware if they run into this case.
For sites that do have this structure, I am not sure what the simplest way to adjust this classification method would be, other than summing all categories and just looking at overall uncategorized link counts.
For the largest set of domains I ran tests on (1.4k), here is how the parsed links were categorized:
Link volume by category:
nav_count 3,586,660 (90.4%)
body_count 365,924 (9.2%)
related_count 813 (0.0%)
other_count 15,265 (0.4%)
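For readers curious what such a class/id heuristic can look like in code, here is a self-contained sketch using only the standard library. The hint lists and bucket names are illustrative assumptions, not the rules this PR actually implements:

```python
from html.parser import HTMLParser

# Illustrative hint lists -- NOT the PR's actual classification rules.
NAV_HINTS = ("nav", "menu", "header", "footer")
RELATED_HINTS = ("related", "recommend", "carousel")
BODY_HINTS = ("article", "content", "post", "prose")

class LinkClassifier(HTMLParser):
    """Bucket each <a href> by the tag/class/id hints of its open
    ancestors. Substring matching is deliberately crude (e.g. 'post'
    also matches 'postscript'); this is a sketch, not production code."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tag/class/id tokens of currently open elements
        self.counts = {"nav": 0, "body": 0, "related": 0, "other": 0}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        tokens = " ".join([tag, a.get("class") or "", a.get("id") or ""]).lower()
        self.stack.append(tokens)
        if tag == "a" and "href" in a:
            context = " ".join(self.stack)
            if any(h in context for h in NAV_HINTS):
                self.counts["nav"] += 1
            elif any(h in context for h in RELATED_HINTS):
                self.counts["related"] += 1
            elif any(h in context for h in BODY_HINTS):
                self.counts["body"] += 1
            else:
                self.counts["other"] += 1

    def handle_endtag(self, tag):
        # Imprecise for unclosed tags -- acceptable for a sketch.
        if self.stack:
            self.stack.pop()
```

On a page built entirely from bare `<div>` elements with no hinting class/id names, none of the hints fire and every link lands in "other", which is exactly the failure mode discussed in this thread.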
Thanks for the quick review!
My bad, here's a link to the notebook with output.
The decision to run on the domain graph rather than the host graph is mainly for my use case. However, domain-level self-links are now included based on the last commit, so at least there is some signal on internal links now. Thanks for catching that!
This makes a lot of sense! Certainly this job would need to be tweaked in multiple places to generate host graphs. The code changes would be relatively straightforward. The main issue would be identifying the subdomains to add for the subgraph approach used here ("target" domain filtering).
I did not realize the WARC files hadn't been deduplicated to remove revisits. It seems there would be no guarantee that the duplicated WARC records would end up in the same partition, so I'm not sure there is an efficient way to fix this.
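One way to make the revisit problem concrete (plain Python standing in for the Spark job; the record shape is illustrative, not the PR's). In PySpark the equivalent would be a global `.distinct()` on (page, source, target) triples before counting, i.e. exactly the full shuffle being worried about here:

```python
from collections import Counter

def count_links_dedup(records):
    """records: iterable of (page_url, source_domain, target_domain)
    triples, possibly containing revisits of the same page. Each
    (page, edge) combination is counted once. Names are illustrative."""
    unique = set(records)  # in Spark: rdd.distinct(), a global shuffle
    return Counter((src, dst) for _url, src, dst in unique)

records = [
    ("http://a.com/x", "a.com", "b.com"),
    ("http://a.com/x", "a.com", "b.com"),  # revisit: counted once
    ("http://a.com/y", "a.com", "b.com"),
]
# count_links_dedup(records)[("a.com", "b.com")] == 2
```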
For my use case, the navigational links need to be separated at the very least, and I am mainly interested in the "body" link category. For news sites (which is what this job was tested on), the hope is that this counts most of the links that appear in articles as sources. If the categorization is not accurate for other use cases, I can remove it here and keep my own forked version. But in the worst case, the link categories can always be summed, so I'm hoping it's okay as is. Let me know if any of these remaining issues are blocking for closing the PR!
Closes: cc-webgraph issue
Description
This PR introduces a new Spark job, "count_domain_links", using the cc-pyspark framework.
It takes as input a list of target domains, fetches all the WARC files for the specified crawl for those domains, and then parses those WARC files.
While parsing the WARC files, it counts all links between domains in the target domain list.
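The per-record step of a job like this can be sketched as a plain function that would run inside a Spark map over WARC records and be followed by a reduceByKey/countByValue. Everything here (the regex, the host-as-domain simplification, the function name) is an illustrative assumption, not the PR's code:

```python
import re
from urllib.parse import urlparse

# Crude href extractor for illustration; a real job would use an HTML parser.
HREF_RE = re.compile(rb'href=["\']?(https?://[^"\'\s>]+)', re.I)

def domain_link_pairs(warc_url: str, payload: bytes, targets: set):
    """Yield (source_domain, target_domain) pairs for one WARC record,
    restricted to the target domain list. Uses the URL host as the
    'domain' for simplicity (real code would fold subdomains to
    registered domains). Intra-domain links appear as self-pairs."""
    src = urlparse(warc_url).hostname or ""
    if src not in targets:
        return
    for m in HREF_RE.finditer(payload):
        dst = urlparse(m.group(1).decode("ascii", "ignore")).hostname or ""
        if dst in targets:
            yield (src, dst)
```

In cc-pyspark terms, the pairs emitted by this map would then be counted with something like `rdd.map(lambda pair: (pair, 1)).reduceByKey(add)` to produce the weighted edges.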
Finally, I have added a script that launches Spark jobs using AWS EMR Serverless.
This is simply a helper script that could be used with any of the other jobs in cc-pyspark, although I have only tested count_domain_links. If preferred, I can pull out that script as part of a separate PR.
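For context on what such a launcher does, the core of it is building the `jobDriver` payload for the EMR Serverless `StartJobRun` API (boto3 `emr-serverless` client). The paths, arguments, and Spark conf below are illustrative assumptions, not the PR's actual script:

```python
def spark_submit_params(entry_point: str, crawl: str, domains_file: str) -> dict:
    # Shape of the jobDriver argument accepted by EMR Serverless
    # StartJobRun. All values here are illustrative placeholders.
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "entryPointArguments": ["--crawl", crawl, "--domains", domains_file],
            "sparkSubmitParameters": "--conf spark.executor.memory=8g",
        }
    }

# Launching (requires boto3, AWS credentials, and an existing
# EMR Serverless application; not executed here):
# client = boto3.client("emr-serverless")
# client.start_job_run(applicationId=app_id, executionRoleArn=role_arn,
#                      jobDriver=spark_submit_params("s3://bucket/job.py",
#                                                    "CC-MAIN-2023-50",
#                                                    "s3://bucket/domains.txt"))
```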
While the issue and discussion for this happened over on the cc-webgraph repository, the spark job seems to belong here (please correct me if wrong).
Design Considerations
The approach described above was one choice of many. Here are some alternative approaches, and why they were not pursued:
Performance
The logistics and costs of running this Spark job on AWS EMR for a target_domain list of ~1400 popular domains are as follows:
Testing & Analysis
So as not to add any further bloat to this PR, I have kept the code used to validate the results of the job in a separate repository:
Comments
The degree to which this "closes" the original issue is debatable, as the result of this job is link weights for subgraphs. Given the academic use case, it was not computationally feasible or affordable to run this job across the entire webgraph. A true solution would likely rely on FastWARC.
Finally, the count_domain_links Spark job serves a single use case: computing weights between a list of target domains. For use cases that require snowball sampling from a list of known domains, to compute link weights for outlinking and backlinking domains as well, I have a separate script that relies on the pyccwebgraph library. That library is my own work, based on the jshell interactive demo. Where this library should live, or whether it should exist at all, is up for debate in this issue.
Another approach for those interested in expanding seed domain lists before computing webgraph weights would be to use this quick interactive webapp to discover backlinking / outlinking domains via snowball sampling.
Feedback
I hope all this is deemed useful! As this is my first PR for this project, I am not sure what the best way to contribute is. Please don't hesitate to point out any and all issues!