[Bug]: Reliable Parsing of Dynamic Websites  #644

Open
@addie9800

Describe the bug

Some websites occasionally do not return the full HTML; instead, they return a page that consists mostly of script elements. I first noticed this while working with BoersenZeitung.

How to reproduce

from fundus import PublisherCollection, Crawler
from fundus.logging import set_log_level
from logging import DEBUG

publisher = PublisherCollection.de.BoersenZeitung
crawler = Crawler(publisher)
set_log_level(DEBUG)
for article in crawler.crawl(max_articles=50, only_complete=False, error_handling="suppress"):
    print(article.html.responded_url)
    print(article.title)
    print("--------------------------------")

Expected behavior.

I would expect a title to be parsed and printed consistently for every article.
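
To make "consistently" measurable, the crawl results can be split into URLs with and without a parsed title. This is only a minimal sketch: `split_by_title` is a hypothetical helper, not part of fundus, and it assumes a missing title shows up as `None` or as an empty or whitespace-only string.

```python
from typing import Iterable, List, Optional, Tuple


def split_by_title(
    pairs: Iterable[Tuple[str, Optional[str]]]
) -> Tuple[List[str], List[str]]:
    """Split (url, title) pairs into URLs with a title and URLs without one.

    Hypothetical helper for this report; not a fundus API.
    """
    with_title: List[str] = []
    without_title: List[str] = []
    for url, title in pairs:
        # Assumption: None and whitespace-only titles both count as missing.
        if title and title.strip():
            with_title.append(url)
        else:
            without_title.append(url)
    return with_title, without_title
```

With the reproduction script above, the pairs could be built as `(article.html.responded_url, article.title)` for each article yielded by `crawler.crawl(...)`. Comparing the two lists across several runs should show whether the same URLs fail each time or whether the failures are intermittent.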

Logs and Stack traces

No response

Screenshots

Logs from the first iteration:

image

Logs from the second iteration:

image

Additional Context

Here is an example of an incomplete HTML file: test.zip
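
A response of this kind could perhaps be flagged with a simple heuristic: the page contains script elements but almost no visible text. The sketch below uses only the standard library. `looks_script_only` and the 200-character threshold are assumptions made for illustration; they are not part of fundus and have not been tuned against the attached file.

```python
from html.parser import HTMLParser


class _TagStats(HTMLParser):
    """Counts <script> tags and visible text characters outside script/style."""

    def __init__(self) -> None:
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_script_only(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: scripts are present, but visible text is below the threshold.

    The default threshold is an arbitrary assumption, not a measured value.
    """
    stats = _TagStats()
    stats.feed(html)
    stats.close()
    return stats.script_tags > 0 and stats.text_chars < min_text_chars
```

Running this check on `article.html` content before parsing might help separate "the server sent a script-only page" from "the parser failed on a complete page", which are two different failure modes.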

Environment

python==3.9

aiohttp==3.8.6
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.2.0
black==23.1.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cssselect==1.2.0
decorator==5.1.1
dict2xml==1.7.6
dill==0.3.8
exceptiongroup==1.2.0
FastWARC==0.14.5
feedparser==6.0.11
frozenlist==1.4.1
-e git+https://github.com/flairNLP/fundus.git@05cc97dd8be59ac05d89456ac0db39cddce74e02#egg=fundus
idna==3.6
iniconfig==2.0.0
isort==5.12.0
langdetect==1.0.9
lxml==4.9.4
more-itertools==9.1.0
multidict==6.0.4
mypy==1.9.0
mypy-extensions==1.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
platformdirs==4.1.0
pluggy==1.4.0
pytest==7.2.2
python-dateutil==2.8.2
pytz==2024.1
requests==2.31.0
robotspy==0.10.0
sgmllib3k==1.0.0
six==1.16.0
tomli==2.0.1
tqdm==4.66.1
types-colorama==0.4.15.20240106
types-lxml==2023.2.11
types-python-dateutil==2.8.19.20240106
types-requests==2.28.11.17
types-urllib3==1.26.25.14
typing_extensions==4.9.0
tzdata==2024.1
urllib3==2.2.0
validators==0.28.0
xmltodict==0.14.1
yarl==1.9.4
