
[BUG] Docling HTML to Markdown conversion loses structure on members page #50

@niloydebbarma-code

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Problem

After running 1_crawl_site --url https://thealliance.ai/ followed by 2_process_files, the organization names on the members page are merged into one long line in the processed markdown. Nothing separates them, so there is no way to tell where one name ends and the next begins.

How to reproduce

  1. Run 1_crawl_site --url https://thealliance.ai/
  2. Run 2_process_files
  3. Check workspace/processed/thealliance.ai__aia-members.md
  4. Observe that all organization names are concatenated on a single line
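Step 4 can be checked mechanically instead of by eye. The following is a minimal sketch and not part of the repository: the function name lines_with_multiple_names and the hand-supplied expected_names list are assumptions made for illustration. It reports every markdown line that contains more than one known organization name.

```python
import re


def lines_with_multiple_names(markdown_text, expected_names):
    """Return (line_number, names_found) for lines holding more than one name.

    expected_names is a hand-written list of organizations that should each
    appear on their own line (an assumption of this sketch).
    """
    report = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        found = [
            name
            for name in expected_names
            # Match whole names only, so "Meta" does not match "Metadata".
            if re.search(rf"(?<!\w){re.escape(name)}(?!\w)", line)
        ]
        if len(found) > 1:
            report.append((number, found))
    return report
```

To run it against the real output, read workspace/processed/thealliance.ai__aia-members.md into a string and pass it as markdown_text. An empty result means no line contains two or more of the listed names.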

What happens

The source HTML has each company in its own element:

<span>AMD</span>
<span>IBM</span>  
<span>Meta</span>
<span>Oracle</span>

After docling converts the page, the names run together on one line:

AMD IBM Meta Oracle Sony Uber Dell Technologies ServiceNow Snowflake Databricks Cornell University...

Impact on entity extraction

These names are still extracted:

  • Acceleration Consortium (starts the line)
  • Carnegie Mellon University (has "University")
  • Meta (short and clear)
  • Red Hat (distinctive)

These names are lost:

  • AMD (buried in "...AI AMD Anaconda...")
  • IBM (buried in "...Jerusalem IBM Imperial...")
  • Intel (buried in "...Technology Intel The...")
  • Sony (buried in "...Snowflake Sony St Johns...")

Files affected

  • Source HTML: workspace/crawled/thealliance.ai__aia-members.html
  • Processed markdown: workspace/processed/thealliance.ai__aia-members.md
  • Processing script: 2_process_files.py (uses docling conversion)

Folder structure

workspace/
├── crawled/           # Raw HTML files from website crawl
└── processed/         # Docling-converted markdown files

Fix options

  1. Change docling settings to preserve HTML structure
  2. Parse HTML pages separately before docling conversion
  3. Post-process concatenated text to split entity names properly
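As an illustration of option 2, here is a minimal sketch that pulls the names out of the HTML with Python's standard-library html.parser and writes them as a markdown list, one name per line. It assumes every organization name sits in its own <span>, as in the snippet above. The real members page almost certainly needs a narrower selector, and SpanTextCollector and spans_to_markdown_list are hypothetical names, not existing code in 2_process_files.py.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of each outermost <span>, one entry per span."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <span> nesting depth
        self.buffer = []  # text pieces of the span being read
        self.names = []   # finished span texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse internal whitespace and skip empty spans.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.names.append(text)
                self.buffer = []

    def handle_data(self, data):
        # Text outside any <span> is ignored.
        if self.depth > 0:
            self.buffer.append(data)


def spans_to_markdown_list(html_text):
    """Return a markdown bullet list with one span text per line."""
    collector = SpanTextCollector()
    collector.feed(html_text)
    collector.close()
    return "\n".join(f"- {name}" for name in collector.names)
```

The resulting list could be written next to, or in place of, the docling output for pages known to be flat lists of names. How to detect such pages is left open here.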

