Description
New query API for the semantic index
Magda's existing search API covers dataset metadata only and is backed by the new hybrid search engine (#3549; see also epic #3503).
In V6, we're introducing a new semantic indexer that focuses on indexing dataset distribution content into a vector store (OpenSearch), using a content-type-specific indexing strategy, e.g.:
- PDF file: convert the file content to markdown and index the markdown doc content
- CSV / tabular data: index data-dictionary-type information, e.g. column name / type etc.
- API: index API access information, e.g. API reference doc etc.
The common goal of the content-type-specific indexing strategies is to collect sufficient information to help an AI agent find relevant dataset distributions for its data analysis / Q&A tasks.
The new semantic index query API serves the same purpose by offering API access to the indexed data.
The semantic indexer index mapping is defined in ticket:
Requirements of the new query API
- Ability to find the top N results that are most relevant to the user query via semantic / vector search.
  - We should expect `N` to be much lower than 100 in most cases, as the results will be used by an AI agent to resolve the user inquiry.
- Return only user-accessible documents.
Authorisation Enforcement Problem
The new semantic indexer index mapping doesn't contain any dataset / distribution metadata fields. Thus, authorisation enforcement can't be implemented using the information in this index alone.
A few potential solutions might help:
- Nested field type, i.e. define the new semantic indexer index fields as nested objects of the existing main datasets metadata index, so that we have all information in one index.
  - Problems:
    - We have to reindex the whole dataset in order to update one document in the nested fields.
    - The number of documents for a dataset / distribution could be quite large, depending on content size & chunk size.
    - Distributions are already defined as a "nested field" on the main dataset index, so we can't define distribution content index documents as a nested field on distribution.
- Parent / child relationship defined by a join field.
  - Problems:
    - Queries are more expensive and memory-intensive.
    - Documents for both parents and children must be indexed on the same shard.
    - Only one join field is allowed per index. However, distributions are defined as a "nested field" on the main dataset index (metadata source), and our semantic indexer documents might be related to either a dataset or a distribution.
- Enrich policies (or OpenSearch ingestion pipeline / Data Prepper) can help to enrich documents at ingest time by looking up reference data from another index. However, the enriched data is static as of enrich time, so this is not suitable for dynamic access control.
- Create another indexer / minion to sync selected access-control-related information to the semantic indexer index.
  - Problems:
    - Complexity
    - Development effort
    - Data duplication, and the number of documents per distribution record could be large
Proposed Solution
We propose a two-phase search strategy combining early vector-based recall with a fallback precision-mode retrieval scoped to accessible datasets.
- Implement a new registry API endpoint (named `filter records by access`) that takes:
  - a list of record IDs
  - and a proposed operation, e.g. `object/record/read`
  - and returns a list of record IDs that the user has access to.
- Phase 1: Broad Vector Search
  - Query the semantic indexer index for the top N=500 most relevant documents.
  - Filter out unauthorised documents via the `filter records by access` API, using the record IDs of the retrieved documents.
  - If any authorised results are found, return them as the result. If not, proceed to the `Phase 2` query.
- Phase 2: Access-Aware Narrowed Vector Search (Fallback)
  - If there are no results after phase 1, then to avoid relevant results being silently dropped due to authorisation filtering, we will:
    - Query the existing search API for the top `500` datasets the user has access to. The search leverages the existing hybrid search engine and makes sure that:
      - the returned datasets are the most relevant to the user query based on metadata
      - only authorised datasets are returned
    - Compile a list of record IDs from the result, including the IDs of all dataset & distribution records in the result.
    - Query the semantic indexer index for the top N=500 most relevant documents as in phase 1, but restrict the query to the compiled record ID subset using a terms filter.
    - If results are found, return them. If not, return a "no accessible results found" response.
This layered query strategy benefits us by:
- Recall-first (broad search): helps maximise semantic relevance
- Precision-second (access-aware fallback): ensures no relevant results are silently dropped due to authorisation
- Keeps the semantic indexer index light on dataset metadata, avoiding sync/duplication issues
Limitations:
- Longer response time, especially when the Phase 2 step is required.
- When users ask for the top N relevant records, we might return fewer than N. In the worst case, we might return none.
  - The chance should be very low, as the Phase 2 query retrieves datasets that are both accessible and relevant (based on metadata).
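The two-phase strategy above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `two_phase_search`, the `Chunk` type, and the three injected callables (standing in for the semantic index query, the `filter records by access` API, and the existing search API) are hypothetical names. Truncating the result to `max_num_results` is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set


@dataclass(frozen=True)
class Chunk:
    """One hit from the semantic indexer index (only the fields used here)."""
    id: str
    record_id: str
    score: float


# Hypothetical signatures for the three backing calls (assumptions):
# (query, top_n, optional record ID restriction) -> hits ordered by relevance
VectorSearch = Callable[[str, int, Optional[Sequence[str]]], List[Chunk]]
# (record IDs, operation) -> subset of record IDs the user may access
FilterByAccess = Callable[[Sequence[str], str], Set[str]]
# (query, top_n) -> accessible dataset & distribution record IDs
AccessibleRecordSearch = Callable[[str, int], List[str]]


def two_phase_search(
    query: str,
    max_num_results: int,
    vector_search: VectorSearch,
    filter_records_by_access: FilterByAccess,
    search_accessible_record_ids: AccessibleRecordSearch,
    candidate_pool: int = 500,
    operation: str = "object/record/read",
) -> List[Chunk]:
    # Phase 1: broad vector search, then drop chunks of inaccessible records.
    hits = vector_search(query, candidate_pool, None)
    record_ids = sorted({hit.record_id for hit in hits})
    allowed = filter_records_by_access(record_ids, operation) if record_ids else set()
    authorised = [hit for hit in hits if hit.record_id in allowed]
    if authorised:
        return authorised[:max_num_results]

    # Phase 2: find accessible records via the existing (authorisation-aware)
    # search API, then re-run the vector search restricted to those records.
    accessible_ids = search_accessible_record_ids(query, candidate_pool)
    if not accessible_ids:
        return []  # caller reports "no accessible results found"
    return vector_search(query, candidate_pool, accessible_ids)[:max_num_results]
```

Passing the backing calls in as parameters keeps the control flow testable without a running registry or OpenSearch cluster. An empty list maps to the "no accessible results found" response.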
Semantic Search API Query interface
The new semantic search API comes with the following endpoints:
- `/search`: search the semantic index
  - HTTP Method: GET or POST
    - When Method = `GET`, all parameters are passed as query string parameters.
    - When Method = `POST`, all parameters are passed as a JSON request body.
  - Parameters:
    - `max_num_results`: (number). Optional. Defaults to 10 if not specified. Max number of results to be retrieved.
      - Needs to be a number between 1 and 500.
    - `query`: (string). The query string.
    - `itemType`: (string). Optionally filter the results by itemType. Possible values: `storageObject` or `registryRecord`.
    - `fileFormat`: (string). Optionally filter the results by fileFormat.
    - `recordId`: (string). Optionally filter by the record ID of a registry record.
    - `subObjectId`: (string). Optionally filter by the subObjectId.
    - `subObjectType`: (string). Optionally filter by the subObjectType.
    - `minScore`: (number). Optional. Only return results with a score higher than the specified score.
      - When radial search is available (i.e. non-disk-based vector search), we can set the `min_score` field of the KNN query to filter the results.
      - When using disk-based vector search, we can retrieve 2 times `max_num_results` and filter at the API endpoint after retrieving the results.
  - Response: JSON array in which each item has the following fields:
    - `id`: the OpenSearch document ID
    - `itemType`
    - `recordId`
    - `fileFormat`
    - `subObjectId`
    - `subObjectType`
    - `text`: the original text of the indexed text chunk
    - `only_one_index_text_chunk`
    - `index_text_chunk_length`
    - `index_text_chunk_position`
    - `index_text_chunk_overlap`
- `/retrieve`: retrieve extended text context by document IDs.
  - The `/search` API only returns results at chunk level. This API enables the extended text context to be retrieved.
  - We should probably leave this endpoint for future implementation.
  - Method: `POST`
  - Parameters:
    - `ids`: (string[]) a list of document IDs (returned via the `/search` API)
    - `mode`: retrieve mode. Either:
      - `full`: assemble all chunks of the `storageObject` or `RegistryRecord` or `subObject`
      - `partial`: assemble the specified number of "preceding" chunks and/or `subsequent` chunks
      - Defaults to `full`
    - `precedingChunksNum`: (number) optional. Defaults to 0. Number of preceding chunks to be retrieved.
    - `subsequentChunksNum`: (number) optional. Defaults to 0. Number of subsequent chunks to be retrieved.
  - Response: JSON array in which each item has the following fields:
    - `id`: the originally supplied OpenSearch document ID
    - `itemType`
    - `recordId`
    - `fileFormat`
    - `subObjectId`
    - `subObjectType`
    - `text`: the extended text content produced based on `mode`, `precedingChunksNum` & `subsequentChunksNum`
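As a rough illustration of how the `/search` parameters above might translate into an OpenSearch k-NN request, the sketch below validates `max_num_results`, turns the optional filters into `term` clauses, and handles `minScore` in the two ways described (radial `min_score` versus over-fetching 2x and filtering afterwards). The function names, the `embedding` vector field name, and the assumption that the filter fields are keyword-mapped are all hypothetical. The query vector is assumed to have been produced from `query` by an embedding step not shown here.

```python
from typing import Any, Dict, List, Optional

DEFAULT_MAX_RESULTS = 10
MAX_RESULTS_LIMIT = 500
FILTER_FIELDS = ("itemType", "fileFormat", "recordId", "subObjectId", "subObjectType")


def build_knn_search_body(
    query_vector: List[float],
    params: Dict[str, Any],
    radial_search_available: bool,
    vector_field: str = "embedding",  # hypothetical field name (assumption)
) -> Dict[str, Any]:
    max_num_results = int(params.get("max_num_results", DEFAULT_MAX_RESULTS))
    if not 1 <= max_num_results <= MAX_RESULTS_LIMIT:
        raise ValueError("max_num_results must be between 1 and 500")

    knn: Dict[str, Any] = {"vector": query_vector}
    min_score = params.get("minScore")
    if min_score is not None and radial_search_available:
        # Radial search: min_score takes the place of k in the k-NN clause.
        knn["min_score"] = float(min_score)
    elif min_score is not None:
        # Disk-based search: over-fetch 2x, then filter at the API endpoint.
        knn["k"] = max_num_results * 2
    else:
        knn["k"] = max_num_results

    # Optional exact-match filters, applied inside the k-NN clause.
    term_filters = [
        {"term": {field: params[field]}}
        for field in FILTER_FIELDS
        if params.get(field) is not None
    ]
    if term_filters:
        knn["filter"] = {"bool": {"filter": term_filters}}

    return {"size": knn.get("k", max_num_results), "query": {"knn": {vector_field: knn}}}


def apply_min_score(
    hits: List[Dict[str, Any]], min_score: Optional[float], max_num_results: int
) -> List[Dict[str, Any]]:
    """Post-filter for the disk-based case: keep hits scoring above min_score."""
    kept = [h for h in hits if min_score is None or h["_score"] > min_score]
    return kept[:max_num_results]
```

The Phase 2 restriction to a compiled record ID list would fit the same shape, with a `terms` clause on `recordId` added to the filter list.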
Additional Implementation Information
- We need to create a new module in the core repo. The name could be `magda-semantic-search-api`.
- We need to add a route in the gateway default route config to make the microservice accessible, e.g.:

```yaml
semantic-search:
  to: http://magda-semantic-search/
  auth: true
```

Here, `magda-semantic-search` is the k8s svc name defined in the helm chart.
The above setup will make the `/search` API endpoint accessible at https://example.com/api/v0/semantic-search/search
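
To make the routing concrete, here is a small client-side sketch of how a caller might address the endpoint through the gateway, using either the GET (query string) or POST (JSON body) form described earlier. The helper names are hypothetical, `https://example.com` is the placeholder host from the example above, and no request is actually sent.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_PATH = "/api/v0/semantic-search/search"


def build_search_url(base_url: str, query: str, **params: Any) -> str:
    """GET form: all parameters are passed in the query string."""
    query_params = {"query": query}
    query_params.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url.rstrip('/')}{SEARCH_PATH}?{urlencode(query_params)}"


def build_search_post(base_url: str, query: str, **params: Any) -> Request:
    """POST form: all parameters are passed as a JSON body."""
    payload = {"query": query}
    payload.update({k: v for k, v in params.items() if v is not None})
    return Request(
        f"{base_url.rstrip('/')}{SEARCH_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


url = build_search_url(
    "https://example.com", "rainfall by region", max_num_results=5, itemType="storageObject"
)
print(url)
```

Any authentication headers or cookies required by the deployment would need to be added by the caller; they are left out here.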