[agent] collection_stats reports 30 articles but collection has 25,313 chunks — likely undercounting documents #50

@acertainKnight


What happened

collection_stats returned the following:

  • Total documents (chunks): 25,313
  • Document Types → article: 30

The user confirmed that their collection holds significantly more than 30 papers. If only 30 articles accounted for all 25,313 chunks, each paper would average about 844 chunks (25,313 / 30 ≈ 843.8), which is implausibly high. The article count therefore appears to severely undercount the documents in the collection.
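The arithmetic behind that estimate can be sketched as below. The function names and the `max_plausible` ceiling are hypothetical, introduced only for illustration; the issue does not state what a normal chunks-per-paper ratio is for this collection.

```python
def chunks_per_document(total_chunks: int, document_count: int) -> float:
    """Average number of chunks per document."""
    if document_count <= 0:
        raise ValueError("document_count must be positive")
    return total_chunks / document_count


def looks_undercounted(total_chunks: int, document_count: int,
                       max_plausible: float) -> bool:
    """True when the average exceeds a caller-chosen plausibility ceiling.

    max_plausible is an assumption supplied by the caller, not a value
    taken from the project.
    """
    return chunks_per_document(total_chunks, document_count) > max_plausible


# Figures reported by collection_stats in this issue.
ratio = chunks_per_document(25_313, 30)
print(f"{ratio:.1f} chunks per article")  # 843.8 chunks per article
```

A check like this could run alongside the stats output so that an extreme ratio is flagged rather than reported silently.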

Expected behavior

The article count under "Document Types" (currently article: 30) should reflect the true number of distinct papers/documents in the collection and be consistent with the chunk count.

Possible causes

  • The article count may be reading from a metadata table or index that isn't being updated when new documents are ingested (e.g., PDFs processed by the monitor service may create chunks but not register as articles in the stats source)
  • The count might only reflect articles added through a specific pathway (e.g., manual upload, discovery sources) and miss others (e.g., PDF monitor, direct ingestion)
  • There may be a distinction between "articles" (as a document_type) and other ingested documents that aren't being surfaced in the stats
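One way to test the first two hypotheses is to compare the source documents referenced by chunks against the documents the stats source knows about. The sketch below works on plain Python data and assumes each chunk record carries a `document_id` field; that field name, the function name, and the sample IDs are all hypothetical and not taken from the project's schema.

```python
from collections import Counter
from typing import Iterable, Mapping


def unregistered_documents(chunks: Iterable[Mapping[str, str]],
                           registered_ids: set[str]) -> dict[str, int]:
    """Map each document that has chunks but no article record to its chunk count.

    Assumes every chunk has a "document_id" key (hypothetical field name).
    """
    chunks_per_source = Counter(chunk["document_id"] for chunk in chunks)
    return {
        doc_id: count
        for doc_id, count in chunks_per_source.items()
        if doc_id not in registered_ids
    }


# Toy data: three source documents have chunks, only one is registered.
sample_chunks = [
    {"document_id": "paper-a"}, {"document_id": "paper-a"},
    {"document_id": "paper-b"},
    {"document_id": "paper-c"}, {"document_id": "paper-c"},
    {"document_id": "paper-c"},
]
missing = unregistered_documents(sample_chunks, registered_ids={"paper-a"})
print(missing)  # {'paper-b': 1, 'paper-c': 3}
```

If most chunks turn out to belong to unregistered documents, an ingestion pathway that skips article registration becomes the more likely cause. If every document is registered but under a different type, the third hypothesis fits better.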

Context

  • Skill active: onboarding (using collection_stats and search_articles)
  • User action: First interaction / collection deep dive
  • search_articles with various queries consistently returned results drawn from the same ~30 articles, suggesting that the article-level index, rather than the chunk store, is the limiting factor
  • The chunk collection (document_chunks via pgvector) appears to have the full corpus at 25,313 chunks
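The search observation above can be made measurable by counting the distinct articles returned across many unrelated queries. This is a minimal sketch on plain data: the shape of `results_by_query` (query string mapped to a list of article IDs) is an assumption, and the real search_articles response format is not shown in this issue.

```python
def distinct_articles(results_by_query: dict[str, list[str]]) -> set[str]:
    """Union of article IDs returned across all queries."""
    seen: set[str] = set()
    for article_ids in results_by_query.values():
        seen.update(article_ids)
    return seen


# Toy data standing in for several search_articles calls.
results = {
    "query one": ["a1", "a2", "a3"],
    "query two": ["a2", "a3"],
    "query three": ["a1", "a3"],
}
print(len(distinct_articles(results)))  # 3
```

If the union stops growing at roughly the reported article count no matter how varied the queries are, that supports the idea that search only sees the documents registered in the article-level index.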

Steps to reproduce

  1. Call collection_stats
  2. Note the article: 30 count under Document Types
  3. Compare against the actual number of distinct papers the user has ingested
  4. Observe the discrepancy
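The comparison in steps 2 to 4 could be automated roughly as follows. The dictionary layout used for the stats payload is a guess based on the output quoted in this issue, not the tool's documented response format, and `known_minimum` stands for whatever lower bound the user can confirm.

```python
def article_count_is_too_low(stats: dict, known_minimum: int) -> bool:
    """True when the reported article count is below a confirmed lower bound.

    Assumes stats looks like {"document_types": {"article": <int>}}
    (hypothetical layout).
    """
    reported = stats.get("document_types", {}).get("article", 0)
    return reported < known_minimum


# Values reported in this issue, arranged in the assumed layout.
stats = {"total_chunks": 25_313, "document_types": {"article": 30}}

# The user reports significantly more than 30 papers, so 31 is the
# weakest bound consistent with that statement.
print(article_count_is_too_low(stats, known_minimum=31))  # True
```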
