What happened
`collection_stats` returned the following:
- Total documents (chunks): 25,313
- Document Types → article: 30
The user confirmed they have significantly more than 30 papers in their collection. 25,313 chunks from only 30 articles would mean ~844 chunks per paper on average, which is implausibly high. The article count appears to be severely undercounting the actual number of documents in the collection.
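The implausibility can be checked with simple arithmetic. The chunk and article figures below come from the report above; the plausible-range bound is an assumption for illustration only:

```python
# Figures reported by collection_stats (from this issue).
total_chunks = 25_313
reported_articles = 30

chunks_per_article = total_chunks / reported_articles
print(f"{chunks_per_article:.0f} chunks per article")  # -> 844

# A typical paper yields on the order of tens of chunks, not hundreds
# (assumed bound, for illustration); ~844 strongly suggests the article
# count, not the chunk count, is the wrong number.
PLAUSIBLE_MAX_CHUNKS_PER_ARTICLE = 200  # assumption
implied_min_articles = total_chunks // PLAUSIBLE_MAX_CHUNKS_PER_ARTICLE
print(f"at least ~{implied_min_articles} distinct articles implied")
```

Even under a generous chunks-per-paper assumption, the chunk total implies far more than 30 distinct documents.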
Expected behavior
The `article: 30` count under "Document Types" should reflect the true number of distinct papers/documents in the collection, consistent with the chunk count.
Possible causes
- The article count may be reading from a metadata table or index that isn't being updated when new documents are ingested (e.g., PDFs processed by the monitor service may create chunks but not register as articles in the stats source)
- The count might only reflect articles added through a specific pathway (e.g., manual upload, discovery sources) and miss others (e.g., PDF monitor, direct ingestion)
- There may be a distinction between "articles" (as a document_type) and other ingested documents that aren't being surfaced in the stats
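The first suspected cause can be sketched with a minimal in-memory model. All names here (`articles` registry, `document_chunks` store, `doc_id`, the two ingestion functions) are hypothetical, not the project's actual schema: if the stats source counts rows in a separate article registry while some ingestion paths only write chunks, the two counts diverge, and the distinct source ids in the chunk store recover the true document count.

```python
# Hypothetical in-memory model of the suspected failure mode.
articles: set[str] = set()        # registry read by collection_stats (assumed)
document_chunks: list[dict] = []  # chunk store (a pgvector table in reality)

def ingest_via_upload(doc_id: str, n_chunks: int) -> None:
    """Pathway that registers the article AND writes its chunks."""
    articles.add(doc_id)
    document_chunks.extend({"doc_id": doc_id, "chunk": i} for i in range(n_chunks))

def ingest_via_pdf_monitor(doc_id: str, n_chunks: int) -> None:
    """Suspected buggy pathway: writes chunks but never registers the article."""
    document_chunks.extend({"doc_id": doc_id, "chunk": i} for i in range(n_chunks))

for i in range(30):
    ingest_via_upload(f"uploaded-{i}", 40)
for i in range(170):
    ingest_via_pdf_monitor(f"monitored-{i}", 142)

reported = len(articles)                              # what the stats show
actual = len({c["doc_id"] for c in document_chunks})  # ground truth
print(reported, actual)  # prints "30 200": the registry undercounts
```

Under this model, deriving the article count from `COUNT(DISTINCT doc_id)` over the chunk store (rather than from the registry) would make the stats self-consistent.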
Context
- Skill active: onboarding (using `collection_stats` and `search_articles`)
- User action: first interaction / collection deep dive
- `search_articles` with various queries consistently returned results capped at the same ~30 articles, suggesting the article-level index (not the chunk store) is the bottleneck
- The chunk collection (`document_chunks` via pgvector) appears to have the full corpus at 25,313 chunks
Steps to reproduce
- Call `collection_stats`
- Note the `article: 30` count under Document Types
- Compare against the actual number of distinct papers the user has ingested
- Observe the discrepancy
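If the underlying database can be inspected, the comparison in step 3 can be made directly. A sketch using an in-memory SQLite stand-in (the table names `articles` and `document_chunks` and the `doc_id` column are assumptions; the real store is Postgres with pgvector, where the same two queries apply):

```python
import sqlite3

# Stand-in schema; the real deployment is Postgres/pgvector.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles (id TEXT PRIMARY KEY);
    CREATE TABLE document_chunks (doc_id TEXT, chunk_index INTEGER);
""")

# Simulate the reported state: 30 registered articles, but chunks
# originating from far more distinct documents.
con.executemany("INSERT INTO articles VALUES (?)",
                [(f"a{i}",) for i in range(30)])
con.executemany("INSERT INTO document_chunks VALUES (?, ?)",
                [(f"d{i}", j) for i in range(200) for j in range(5)])

article_count = con.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
distinct_docs = con.execute(
    "SELECT COUNT(DISTINCT doc_id) FROM document_chunks"
).fetchone()[0]
print(article_count, distinct_docs)  # prints "30 200"
```

A mismatch between these two queries on the live database would confirm that the stats source, not the chunk store, is the undercounting component.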