
Commit 1ff255c

ruoliu2, claude, and fulmicoton-dd authored
feat(es-compat): add index_filter support for field capabilities API (#6102)
* feat(es-compat): add index_filter support for field capabilities API

  Implements index_filter parameter support for the ES-compatible _field_caps endpoint, allowing users to filter field capabilities based on document queries.

  Changes:
  - Add query_ast field to ListFieldsRequest and LeafListFieldsRequest protos
  - Parse index_filter from ES Query DSL and convert to QueryAst
  - Pass query_ast through to leaf nodes for future filtering support
  - Add unit tests for index_filter parsing
  - Add REST API integration tests

  Note: This implementation accepts and parses the index_filter parameter for API compatibility. Full split-level document filtering will be added as a follow-up enhancement.

  Closes #5693

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  Signed-off-by: ruo <ruoliu.dev@gmail.com>

* feat(es-compat): implement split-level filtering for field_caps index_filter

  Address PR review comments for index_filter support in _field_caps API:
  - Extract `parse_index_filter_to_query_ast()` function with clean prototype
  - Implement split-level filtering via `split_matches_query()` using lightweight `query.count()` execution (no document materialization)
  - Add proper async handling with ByteRangeCache, warmup(), and run_cpu_intensive() for Quickwit's async-only storage
  - Add metastore-level pruning:
    - Tag extraction via `extract_tags_from_query()`
    - Time range extraction via `refine_start_end_timestamp_from_ast()`
  - Build DocMapper only when query_ast is provided (no overhead for common path without index_filter)
  - Fix REST API tests: use `json:` key (not `json_body:`), use lowercase term values to match tokenizer behavior
  - Update tests to run against both quickwit and elasticsearch engines

  Two-level filtering now implemented:
  1. Metastore level: tags + time range from query AST
  2. Split level: lightweight query execution for accurate filtering

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  Signed-off-by: ruo <ruoliu.dev@gmail.com>

* refactor(es-compat): use best-effort metadata filtering for index_filter

  Remove heavy split-level query execution for field_caps index_filter. The implementation now aligns with ES's "best-effort" approach that uses metadata-level filtering only (time range, tags) instead of opening splits and executing queries.

  Changes:
  - Remove split_matches_query function (no longer opens splits)
  - Remove query_ast and doc_mapper from LeafListFieldsRequest proto
  - Keep metadata-level filtering in root_list_fields:
    - Time range extraction from query AST
    - Tag-based split pruning
  - Simplify leaf_list_fields to just return fields from all splits

  This matches ES semantics: "filtering is done on a best-effort basis... this API may return an index even if the provided filter matches no document."

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  Signed-off-by: ruo <ruoliu.dev@gmail.com>

* fix(es-compat): reject empty index_filter to match ES behavior

  - Remove empty object {} handling in parse_index_filter_to_query_ast
  - ES rejects empty index_filter with 400, now QW does too
  - Add tag_fields config and tag-based index_filter test
  - Update unit and integration tests accordingly

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  Signed-off-by: ruo <ruoliu.dev@gmail.com>

* Added ref doc for the new functionality.

* cargo fmt fix

---------

Signed-off-by: ruo <ruoliu.dev@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: fulmicoton <paul.masurel@datadoghq.com>
1 parent 92bcc15 commit 1ff255c


8 files changed: +496 -46 lines changed


docs/reference/es_compatible_api.md

Lines changed: 73 additions & 0 deletions
@@ -365,6 +365,79 @@ Example response:

 [HTTP accept header]: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

+
+### `_field_caps` &nbsp; Field capabilities API
+
+```
+GET api/v1/_elastic/<index>/_field_caps
+```
+```
+POST api/v1/_elastic/<index>/_field_caps
+```
+```
+GET api/v1/_elastic/_field_caps
+```
+```
+POST api/v1/_elastic/_field_caps
+```
+
+The [field capabilities API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html) returns information about the capabilities of fields among multiple indices.
+
+#### Supported Query string parameters
+
+| Variable | Type | Description | Default value |
+| --------------------- | ---------- | ------------------------------------------------------------------------------ | ------------- |
+| `fields` | `String` | Comma-separated list of fields to retrieve capabilities for. Supports wildcards (`*`). | (Optional) |
+| `allow_no_indices` | `Boolean` | If `true`, missing or closed indices are not an error. | (Optional) |
+| `expand_wildcards` | `String` | Controls what kind of indices that wildcard patterns can match. | (Optional) |
+| `ignore_unavailable` | `Boolean` | If `true`, unavailable indices are ignored. | (Optional) |
+| `start_timestamp` | `Integer` | *(Quickwit-specific)* If set, restricts splits to documents with a timestamp range start >= `start_timestamp` (seconds since epoch). | (Optional) |
+| `end_timestamp` | `Integer` | *(Quickwit-specific)* If set, restricts splits to documents with a timestamp range end < `end_timestamp` (seconds since epoch). | (Optional) |
+
+#### Supported Request Body parameters
+
+| Variable | Type | Description | Default value |
+| ------------------ | ------------- | --------------------------------------------------------------------------- | ------------- |
+| `index_filter` | `Json object` | A query to filter indices. If provided, only fields from indices that can potentially match the filter are returned. See [index_filter](#index_filter). | (Optional) |
+| `runtime_mappings` | `Json object` | Accepted but not supported. | (Optional) |
+
+#### `index_filter`
+
+The `index_filter` parameter allows you to filter which indices contribute to the field capabilities response. When provided, Quickwit uses the filter query to prune indices (splits) that cannot match the filter, and only returns field capabilities for the remaining ones.
+
+Like Elasticsearch, this is a **best-effort** approach: Quickwit may return field capabilities from indices that do not actually contain any matching documents. In Quickwit, the filtering is limited to the existing split-pruning based on metadata:
+
+- **Time pruning**: Range queries on the timestamp field can eliminate splits whose time range does not overlap with the filter.
+- **Tag pruning**: Term queries on [tag fields](../configuration/index-config.md#tag-fields) can eliminate splits that do not contain the requested tag value.
+
+Other filter types (e.g. full-text queries or term queries on non-tag fields) are accepted but will not prune any splits — all indices will be returned as if no filter was specified. In particular, Quickwit does not check whether terms are present in the term dictionary.
+
+#### Request Body example
+
+```json
+{
+  "index_filter": {
+    "range": {
+      "timestamp": {
+        "gte": "2024-01-01T00:00:00Z",
+        "lt": "2024-02-01T00:00:00Z"
+      }
+    }
+  }
+}
+```
+
+```json
+{
+  "index_filter": {
+    "term": {
+      "status": "active"
+    }
+  }
+}
+```
+
+
 ## Query DSL

 [Elasticsearch Query DSL reference](https://www.elastic.co/guide/en/elasticsearch/reference/8.8/query-dsl.html).
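
The time and tag pruning rules documented above can be illustrated with a minimal sketch. This is not Quickwit's implementation: `SplitMeta`, `MetadataFilter`, and `split_may_match` are hypothetical names, tags are modelled as plain `"field:value"` strings, and the filter is assumed to have already been reduced to an optional half-open time range plus one optional required tag.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of the split metadata used for pruning.
struct SplitMeta {
    /// Inclusive time range covered by the split, in seconds since epoch.
    /// `None` models a split without timestamp information.
    time_range: Option<(i64, i64)>,
    /// Tags recorded for the split, written here as "field:value" (an assumption).
    tags: HashSet<String>,
}

/// Hypothetical filter extracted from an `index_filter` query:
/// an optional half-open time range [start, end) and an optional required tag.
struct MetadataFilter {
    start_timestamp: Option<i64>,
    end_timestamp: Option<i64>,
    required_tag: Option<String>,
}

/// Returns `true` if the split may contain matching documents.
/// A `true` result is only best-effort: the split is never opened.
fn split_may_match(split: &SplitMeta, filter: &MetadataFilter) -> bool {
    // Time pruning: drop the split only if its range cannot overlap the filter.
    if let Some((split_start, split_end)) = split.time_range {
        if let Some(start) = filter.start_timestamp {
            if split_end < start {
                return false;
            }
        }
        if let Some(end) = filter.end_timestamp {
            if split_start >= end {
                return false;
            }
        }
    }
    // Tag pruning: drop the split if it does not carry the requested tag.
    if let Some(tag) = &filter.required_tag {
        if !split.tags.contains(tag) {
            return false;
        }
    }
    true
}
```

A filter with neither a time range nor a required tag keeps every split, which mirrors the documented behavior for filter types that cannot be used for pruning.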

quickwit/quickwit-proto/protos/quickwit/search.proto

Lines changed: 4 additions & 1 deletion
@@ -125,6 +125,10 @@ message ListFieldsRequest {
   optional int64 start_timestamp = 3;
   optional int64 end_timestamp = 4;

+  // JSON-serialized QueryAst for index_filter support.
+  // When provided, only fields from documents matching this query are returned.
+  optional string query_ast = 5;
+
   // Control if the request will fail if split_ids contains a split that does not exist.
   // optional bool fail_on_missing_index = 6;
 }
@@ -141,7 +145,6 @@ message LeafListFieldsRequest {
   // Optional limit query to a list of fields
   // Wildcard expressions are supported.
   repeated string fields = 4;
-
 }

 message ListFieldsResponse {

quickwit/quickwit-proto/src/codegen/quickwit/quickwit.search.rs

Lines changed: 4 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

quickwit/quickwit-search/src/list_fields.rs

Lines changed: 66 additions & 21 deletions
@@ -24,13 +24,16 @@ use itertools::Itertools;
 use quickwit_common::rate_limited_warn;
 use quickwit_common::shared_consts::{FIELD_PRESENCE_FIELD_NAME, SPLIT_FIELDS_FILE_NAME};
 use quickwit_common::uri::Uri;
+use quickwit_config::build_doc_mapper;
+use quickwit_doc_mapper::tag_pruning::extract_tags_from_query;
 use quickwit_metastore::SplitMetadata;
 use quickwit_proto::metastore::MetastoreServiceClient;
 use quickwit_proto::search::{
     LeafListFieldsRequest, ListFields, ListFieldsEntryResponse, ListFieldsRequest,
     ListFieldsResponse, SplitIdAndFooterOffsets, deserialize_split_fields,
 };
 use quickwit_proto::types::{IndexId, IndexUid};
+use quickwit_query::query_ast::QueryAst;
 use quickwit_storage::Storage;

 use crate::leaf::open_split_bundle;
@@ -310,6 +313,8 @@ impl FieldPattern {
 }

 /// `leaf` step of list fields.
+///
+/// Returns field metadata from the assigned splits.
 pub async fn leaf_list_fields(
     index_id: IndexId,
     index_storage: Arc<dyn Storage>,
@@ -322,6 +327,12 @@ pub async fn leaf_list_fields(
         .map(|pattern_str| FieldPattern::from_str(pattern_str))
         .collect::<crate::Result<_>>()?;

+    // If no splits, return empty response
+    if split_ids.is_empty() {
+        return Ok(ListFieldsResponse { fields: Vec::new() });
+    }
+
+    // Get fields from all splits
     let single_split_list_fields_futures: Vec<_> = split_ids
         .iter()
         .map(|split_id| {
@@ -375,7 +386,7 @@ pub async fn leaf_list_fields(
     Ok(ListFieldsResponse { fields })
 }

-/// Index metas needed for executing a leaf search request.
+/// Index metas needed for executing a leaf list fields request.
 #[derive(Clone, Debug)]
 pub struct IndexMetasForLeafSearch {
     /// Index id.
@@ -399,29 +410,63 @@ pub async fn root_list_fields(
     if indexes_metadata.is_empty() {
         return Ok(ListFieldsResponse { fields: Vec::new() });
     }
-    let index_uid_to_index_meta: HashMap<IndexUid, IndexMetasForLeafSearch> = indexes_metadata
-        .iter()
-        .map(|index_metadata| {
-            let index_metadata_for_leaf_search = IndexMetasForLeafSearch {
-                index_uri: index_metadata.index_uri().clone(),
-                index_id: index_metadata.index_config.index_id.to_string(),
-            };
-
-            (
-                index_metadata.index_uid.clone(),
-                index_metadata_for_leaf_search,
+
+    // Build index metadata map and extract timestamp field for time range refinement
+    let mut index_uid_to_index_meta: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::new();
+    let mut index_uids: Vec<IndexUid> = Vec::new();
+    let mut timestamp_field_opt: Option<String> = None;
+
+    for index_metadata in indexes_metadata {
+        // Extract timestamp field for time range refinement (use first index's field)
+        if timestamp_field_opt.is_none()
+            && list_fields_req.query_ast.is_some()
+            && let Ok(doc_mapper) = build_doc_mapper(
+                &index_metadata.index_config.doc_mapping,
+                &index_metadata.index_config.search_settings,
             )
-        })
-        .collect();
-    let index_uids: Vec<IndexUid> = indexes_metadata
-        .into_iter()
-        .map(|index_metadata| index_metadata.index_uid)
-        .collect();
+        {
+            timestamp_field_opt = doc_mapper.timestamp_field_name().map(|s| s.to_string());
+        }
+
+        let index_metadata_for_leaf_search = IndexMetasForLeafSearch {
+            index_uri: index_metadata.index_uri().clone(),
+            index_id: index_metadata.index_config.index_id.to_string(),
+        };
+
+        index_uids.push(index_metadata.index_uid.clone());
+        index_uid_to_index_meta.insert(
+            index_metadata.index_uid.clone(),
+            index_metadata_for_leaf_search,
+        );
+    }
+
+    // Extract tags and refine time range from query_ast for split pruning
+    let mut start_timestamp = list_fields_req.start_timestamp;
+    let mut end_timestamp = list_fields_req.end_timestamp;
+    let tags_filter_opt = if let Some(ref query_ast_json) = list_fields_req.query_ast {
+        let query_ast: QueryAst = serde_json::from_str(query_ast_json)
+            .map_err(|err| SearchError::InvalidQuery(err.to_string()))?;
+
+        // Refine time range from query AST if timestamp field is available
+        if let Some(ref timestamp_field) = timestamp_field_opt {
+            crate::root::refine_start_end_timestamp_from_ast(
+                &query_ast,
+                timestamp_field,
+                &mut start_timestamp,
+                &mut end_timestamp,
+            );
+        }
+
+        extract_tags_from_query(query_ast)
+    } else {
+        None
+    };
+
     let split_metadatas: Vec<SplitMetadata> = list_relevant_splits(
         index_uids,
-        list_fields_req.start_timestamp,
-        list_fields_req.end_timestamp,
-        None,
+        start_timestamp,
+        end_timestamp,
+        tags_filter_opt,
         &mut metastore,
     )
     .await?;
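
The body of `refine_start_end_timestamp_from_ast` is not part of this diff. As a rough illustration of what "refining" the request's time range could mean, the sketch below assumes the bounds found in the query AST are intersected with the `start_timestamp` / `end_timestamp` already present on the request. The helper `refine_bounds` and its parameters are hypothetical and stand in for the real function only for the sake of the example.

```rust
/// Hypothetical helper: narrows an optional half-open range [start, end)
/// with optional bounds extracted from a filter query.
/// Assumption: refinement is an intersection, so the start can only move
/// later and the end can only move earlier.
fn refine_bounds(
    start_timestamp: &mut Option<i64>,
    end_timestamp: &mut Option<i64>,
    filter_start: Option<i64>,
    filter_end: Option<i64>,
) {
    *start_timestamp = match (*start_timestamp, filter_start) {
        // Both bounds present: keep the later start.
        (Some(request), Some(filter)) => Some(request.max(filter)),
        // At most one bound present: keep whichever exists.
        (request, filter) => request.or(filter),
    };
    *end_timestamp = match (*end_timestamp, filter_end) {
        // Both bounds present: keep the earlier end.
        (Some(request), Some(filter)) => Some(request.min(filter)),
        (request, filter) => request.or(filter),
    };
}
```

Under this assumption, an `index_filter` range can only shrink the set of splits requested from the metastore, never widen it beyond what the query string parameters allow.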
