Description
Describe the bug
Some websites occasionally don't return the full HTML, but rather an HTML page consisting mostly of script elements. I first noticed this while working with BoersenZeitung.
How to reproduce
from logging import DEBUG

from fundus import PublisherCollection, Crawler
from fundus.logging import set_log_level

publisher = PublisherCollection.de.BoersenZeitung
crawler = Crawler(publisher)
set_log_level(DEBUG)

for article in crawler.crawl(max_articles=50, only_complete=False, error_handling="suppress"):
    print(article.html.responded_url)
    print(article.title)
    print("--------------------------------")
Expected behavior
I would expect a title to be parsed and printed consistently for every article.
Logs and Stack traces
No response
Screenshots
Logs in the first iteration: [screenshot]
Logs in the second iteration: [screenshot]
Additional Context
Here is an example of an incomplete HTML file: test.zip
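To make the "mostly script elements" symptom easier to spot while debugging, here is a rough sketch of a heuristic check. It is not part of fundus: `looks_incomplete` and `max_script_ratio` are hypothetical names, the 0.8 threshold is an arbitrary guess, and it only uses the standard library `html.parser`.

```python
from html.parser import HTMLParser


class _ScriptRatioParser(HTMLParser):
    """Counts characters inside <script> elements versus all other text."""

    def __init__(self) -> None:
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        length = len(data.strip())
        if self._in_script:
            self.script_chars += length
        else:
            self.text_chars += length


def looks_incomplete(html: str, max_script_ratio: float = 0.8) -> bool:
    """Hypothetical helper: True if the page is empty or dominated by script content.

    The threshold is an assumption and would need tuning per publisher.
    """
    parser = _ScriptRatioParser()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    if total == 0:
        return True
    return parser.script_chars / total > max_script_ratio
```

Applied to the raw HTML string of each crawled article in the loop above, this could flag the responses that come back without the real article markup. It is only a diagnostic aid and does not explain why the server returns the reduced page.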
Environment
python==3.9
aiohttp==3.8.6
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.2.0
black==23.1.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cssselect==1.2.0
decorator==5.1.1
dict2xml==1.7.6
dill==0.3.8
exceptiongroup==1.2.0
FastWARC==0.14.5
feedparser==6.0.11
frozenlist==1.4.1
-e git+https://github.com/flairNLP/fundus.git@05cc97dd8be59ac05d89456ac0db39cddce74e02#egg=fundus
idna==3.6
iniconfig==2.0.0
isort==5.12.0
langdetect==1.0.9
lxml==4.9.4
more-itertools==9.1.0
multidict==6.0.4
mypy==1.9.0
mypy-extensions==1.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
platformdirs==4.1.0
pluggy==1.4.0
pytest==7.2.2
python-dateutil==2.8.2
pytz==2024.1
requests==2.31.0
robotspy==0.10.0
sgmllib3k==1.0.0
six==1.16.0
tomli==2.0.1
tqdm==4.66.1
types-colorama==0.4.15.20240106
types-lxml==2023.2.11
types-python-dateutil==2.8.19.20240106
types-requests==2.28.11.17
types-urllib3==1.26.25.14
typing_extensions==4.9.0
tzdata==2024.1
urllib3==2.2.0
validators==0.28.0
xmltodict==0.14.1
yarl==1.9.4