Open
Labels
bug: Something isn't working
Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Problem
After running 1_crawl_site --url https://thealliance.ai/ and then 2_process_files, the company names in the processed markdown are merged into one long line. There is no way to tell where one name ends and the next begins.
How to reproduce
- Run 1_crawl_site --url https://thealliance.ai/
- Run 2_process_files
- Check workspace/processed/thealliance.ai__aia-members.md
- Find that all organization names are concatenated together
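The check in the last two steps can be automated. The following is a minimal sketch, assuming each organization name sits in its own non-nested `<span>` as in the excerpt under "What happens"; `SpanCollector` and `merged_names` are hypothetical helpers, not part of the project, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of every <span> (assumes spans are not nested)."""

    def __init__(self):
        super().__init__()
        self._in_span = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_span and text:
            self.names.append(text)


def merged_names(html: str, markdown: str) -> list[str]:
    """Return span texts that do NOT appear on a line of their own in the markdown."""
    collector = SpanCollector()
    collector.feed(html)
    # Normalise each markdown line: drop list bullets and surrounding spaces.
    own_lines = {line.strip().lstrip("-* ").strip() for line in markdown.splitlines()}
    return [name for name in collector.names if name not in own_lines]
```

To run it against the real files, read workspace/crawled/thealliance.ai__aia-members.html and workspace/processed/thealliance.ai__aia-members.md and pass their contents to `merged_names`. A non-empty result means some names were merged into a longer line. Other spans on the page (navigation, buttons) would also be reported, so the collector would likely need narrowing to the member section.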
What happens
The source HTML has each company in its own element:
<span>AMD</span>
<span>IBM</span>
<span>Meta</span>
<span>Oracle</span>
After docling processes it, they run together:
AMD IBM Meta Oracle Sony Uber Dell Technologies ServiceNow Snowflake Databricks Cornell University...
Impact on entity extraction
These get extracted:
- Acceleration Consortium (starts the line)
- Carnegie Mellon University (has "University")
- Meta (short and clear)
- Red Hat (distinctive)
These get lost:
- AMD (buried in "...AI AMD Anaconda...")
- IBM (buried in "...Jerusalem IBM Imperial...")
- Intel (buried in "...Technology Intel The...")
- Sony (buried in "...Snowflake Sony St Johns...")
Files affected
- Source HTML: workspace/crawled/thealliance.ai__aia-members.html
- Processed markdown: workspace/processed/thealliance.ai__aia-members.md
- Processing script: 2_process_files.py (uses docling conversion)
Folder structure
workspace/
├── crawled/ # Raw HTML files from website crawl
└── processed/ # Docling-converted markdown files
Fix options
- Change docling settings to preserve HTML structure
- Parse HTML pages separately before docling conversion
- Post-process concatenated text to split entity names properly
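As one possible shape for the second option, here is a rough sketch that rewrites runs of adjacent `<span>` elements into a `<ul>` list before the HTML is handed to docling, so each name becomes a block-level item. It assumes the names are plain sibling spans with no nested tags, as in the excerpt above; the real page markup may differ, and `spans_to_list` is a hypothetical helper, not an existing function in 2_process_files.py or in docling.

```python
import re

# One <span>...</span> with no nested tags; captures the inner text.
SPAN = re.compile(r"<span[^>]*>([^<]*)</span>")
# A run of two or more such spans separated only by whitespace.
SPAN_RUN = re.compile(r"(?:<span[^>]*>[^<]*</span>\s*){2,}")


def spans_to_list(html: str) -> str:
    """Replace each run of adjacent spans with an unordered list."""

    def replace_run(match: re.Match) -> str:
        names = [name.strip() for name in SPAN.findall(match.group(0))]
        items = "".join(f"<li>{name}</li>" for name in names if name)
        return f"<ul>{items}</ul>\n"

    return SPAN_RUN.sub(replace_run, html)
```

A single isolated span is left untouched, which limits the effect on inline spans inside ordinary paragraphs. Whether docling then emits one markdown list item per name still has to be confirmed on the actual page.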
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- OS:
- Browser:
Anything else?
No response
Projects
Status
Todo