quickwit: add tag_fields on CounterID, drop positions on raw text#877

Open
alexey-milovidov wants to merge 1 commit into add-quickwit-entry from quickwit-tag-fields-record-basic
@alexey-milovidov
Member

Summary

Two index-level changes to quickwit/index_config.yaml, keeping the rest of the benchmark setup identical.

  • tag_fields: [CounterID] — Q37-Q43 all filter CounterID = 62. Tagging it writes the per-split CounterID values into the metastore so the searcher can prune whole splits before opening them. This is the closest analogue we get to Elasticsearch's index.sort early-termination on the same column. Quickwit/Tantivy has no real multi-column doc-sort to match the full ES sort.field: [CounterID, EventDate, UserID, EventTime, WatchID], so this picks up just the CounterID dimension.
  • record: basic on every text field that uses tokenizer: raw (28 fields). Tantivy defaults text postings to WithFreqsAndPositions, but raw-tokenized fields only ever hold one term per document. Phrase queries can't run against them, so the stored freqs and positions are dead weight in the index.
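
For readers who have not opened the diff, the two settings might look roughly like the fragment below. This is a sketch of the shape only, not the actual contents of quickwit/index_config.yaml: the `URL` field is a hypothetical stand-in for one of the 28 raw text fields, and the `CounterID` type is an assumption.

```yaml
# Illustrative fragment, not the benchmark's real mapping.
doc_mapping:
  field_mappings:
    - name: CounterID
      type: i64            # assumed type; check the real config
    - name: URL            # hypothetical example of a raw text field
      type: text
      tokenizer: raw       # whole value indexed as a single term
      record: basic        # postings keep doc IDs only, no freqs or positions
  # Per-split CounterID values are recorded in the metastore so that
  # splits without a matching value can be skipped at query time.
  tag_fields: [CounterID]
```

Note that `tag_fields` sits under `doc_mapping`, next to `field_mappings`, while `record` is set per field.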

Validated against the running v0.9.0-nightly server (the same image benchmark.sh uses): the tag_fields and record: basic settings round-trip cleanly through the index-create API.

Test plan

  • Re-run bash benchmark.sh end-to-end on a fresh machine
  • Compare cold + warm timings against the previous results, especially Q37–Q43 (CounterID filter) for the tag_fields benefit
  • Confirm load time and on-disk size — both should stay flat or shrink slightly thanks to record: basic

🤖 Generated with Claude Code

tag_fields: [CounterID] writes per-split CounterID values into the
metastore so the searcher can prune whole splits before opening them
for queries 37-43, which all filter CounterID = 62 — the closest
analogue to Elasticsearch's index.sort early-termination here.

record: basic on every tokenizer: raw text field skips storing freqs
and positions in the postings; phrase queries can never run against
single-term raw fields, so the data was dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>