I feel like there must be a more generic solution here. Presumably this problem has been solved a bunch of times, and we can just use a library or similar to parse this stuff out?
We're using this on ads, and it's pretty effective: https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extraction
Originally posted by @ericholscher in #12757 (comment)
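
For context, here is a minimal sketch of what calling the linked extraction function might look like, assuming `trafilatura` is installed. `extract_main_text` is a hypothetical helper name used only for illustration, not existing code in this project, and the options we would actually want are still to be decided.

```python
from typing import Optional


def extract_main_text(html: str) -> Optional[str]:
    """Return the main text of an HTML page, or None if nothing is extracted.

    Hypothetical helper for illustration only.
    """
    if not html or not html.strip():
        # Nothing to parse, so skip the library call entirely.
        return None

    # Imported lazily so this module still loads where trafilatura is missing.
    import trafilatura

    # extract() returns the main content as a string, or None if it finds none.
    # include_comments=False drops comment sections from the extracted text.
    return trafilatura.extract(html, include_comments=False)
```

Something like `extract_main_text(response_body)` could then stand in for hand-written parsing, with a `None` result treated as "no usable content".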