[Bug]: URL ingestion of page links to sub-pages (crawl depth) needs reconciling with product requirements #1644

@mpawlow

Description

OpenRAG Version

0.5.0

Deployment Method

Local development (make dev)

Operating System

Ubuntu 24.04.4 LTS

Python Version

3.13.13

Affected Area

Ingestion (document processing, upload, Docling)

Bug Description

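When a URL is ingested through Chat, the agent uses a crawl depth of 1, so links to sub-pages are not followed. This differs from the expected default crawl depth of 2 and needs to be reconciled with product requirements.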

Steps to Reproduce

  1. Go to Chat
  2. Enter prompt: "Ingest this URL: https://crawler-test.com/"
    • URL is ingested successfully
    • Crawl depth used by the agent is 1 (instead of 2)
    • BUG(?): No content from sub-pages can be found
    • See screenshot below

Expected Behavior

  • Verify: Only pages up to the configured crawl depth (default 2) are ingested; no runaway crawl

Actual Behavior

  • Crawl depth used by the agent is 1 (instead of 2), so no sub-page content is ingested

Relevant Logs

N/A

Screenshots

Image

Additional Context

ℹ️ Feedback from @lucaseduoli

  • Crawl depth should be based on the length of the page (rather than sub-pages)
  • Should delegate to the agent and let it decide which sub-pages (if any) to crawl
  • No known competitor RAG tools automatically crawl to a depth of 2 (sub-pages); they crawl only to a depth of 1 (same page)
  • Default crawl depth should be 1 (which is the current behavior)
  • Should consult with Product team to verify
  • This test scenario is really valid
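
To make the two depth values in this discussion concrete, here is a minimal sketch of a depth-limited crawl, assuming the convention used above (depth 1 = the submitted page only, depth 2 = that page plus the pages it links to directly). It is not OpenRAG's implementation: the `crawl` function, the `links` mapping, and the `https://example.test/` URLs are hypothetical, and the link graph is held in memory so no network access is involved.

```python
from collections import deque


def crawl(start_url, links, max_depth=2):
    """Return the URLs visited by a breadth-first crawl limited to max_depth.

    Hypothetical helper for illustration only. Depth 1 means the start
    page alone; depth 2 adds the pages it links to directly.
    `links` maps each URL to the list of URLs found on that page.
    """
    seen = {start_url}          # guards against cycles and duplicate visits
    visited = []
    queue = deque([(start_url, 1)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:  # depth limit reached: do not follow links
            continue
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return visited


# Hypothetical site: the root links to two sub-pages, one of which links deeper.
ROOT = "https://example.test/"
SITE = {
    ROOT: [ROOT + "a", ROOT + "b"],
    ROOT + "a": [ROOT + "a/deep"],
}

print(crawl(ROOT, SITE, max_depth=1))
print(crawl(ROOT, SITE, max_depth=2))
```

Under this convention, a depth of 1 never reaches sub-page content, which matches the behavior reported above, while the depth limit together with the `seen` set is what prevents a runaway crawl at depth 2.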

Checklist

  • I have searched existing issues to ensure this bug hasn't been reported before.
  • I have provided all the requested information.

Metadata

Labels

bug (Something isn't working)
