
New query API for the semantic index #3608

@t83714

Description


Magda's existing search API covers dataset metadata only and is backed by the new hybrid search engine (#3549 and also epic: #3503).

In V6, we're introducing a new semantic indexer that will focus on indexing dataset distribution content into a vector store (OpenSearch) with content-type-specific indexing strategies, e.g.:

  • PDF file: convert the file content to markdown and index the markdown document content
  • CSV / tabular data: index data dictionary type information, e.g. column names / types etc.
  • API: index API access information, e.g. API reference docs etc.

The common goal of these content-type-specific indexing strategies is to collect sufficient information to help an AI agent find the dataset distributions relevant to its data analysis / Q&A tasks.

The new semantic index query API serves the same purpose by offering API access to the indexed data.

The semantic indexer index mapping is defined in ticket:

Requirements of the new query API

  • Ability to find the top N results that are most relevant to the user query via semantic / vector search.
    • We should expect N to be much lower than 100 in most cases, as the results will be used by an AI agent to solve the user's inquiry
  • Return only user-accessible documents

Authorisation Enforcement Problem

The new semantic indexer index mapping doesn't contain any dataset / distribution metadata fields. Thus, authorisation enforcement can't be implemented with the information in this index alone.

There are a few potential solutions that might help:

  • Nested Field Type. i.e. define the new semantic indexer index fields as a nested object of the existing main datasets metadata index, so that we have all information in one index.
    • problems:
      • Have to reindex the whole dataset in order to update one document in the nested fields
      • The number of documents for a dataset / distribution could be quite large, depending on content size & chunk size
      • Distributions are already defined as a "nested field" on the main dataset index, so we can't define distribution content index documents as a nested field on distribution.
  • Parent / child relationship defined by a join field (see the mapping sketch after this list)
    • problems:
      • Queries are more expensive and memory-intensive
      • Documents for both parents and children must be indexed on the same shard
      • Only one join field is allowed per index. However, distributions are defined as a "nested field" on the main dataset index (metadata source), and our semantic indexer documents might be related to either a dataset or a distribution.
  • Enrich Policies (or OpenSearch ingestion pipeline / Data Prepper) can help enrich documents at ingest time by looking up reference data from another index. But the enriched data is static at enrich time, which is not suitable for dynamic access control.
  • Create another indexer / minion to sync selected access-control-related data to the semantic indexer index
    • Problems:
      • Complexity
      • Development effort
      • Data duplication, and the number of documents per distribution record could be large
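
For illustration only, a minimal mapping sketch of the conflict described above (field names and structure are assumptions for this example, not Magda's actual index mappings):

```typescript
// Hypothetical mapping sketch -- field names are illustrative, not Magda's real mapping.
// Distributions are nested objects inside the dataset document, not standalone documents,
// so they cannot act as join parents; and only one join field is allowed per index, so
// semantic content-chunk documents that may relate to either a dataset or a distribution
// cannot be modelled cleanly as parent/child on this index.
const datasetsIndexMappingSketch = {
  mappings: {
    properties: {
      title: { type: "text" },
      // existing design: distributions live inside the dataset document as nested objects
      distributions: {
        type: "nested",
        properties: {
          title: { type: "text" },
          format: { type: "keyword" }
        }
      },
      // what a parent/child approach would require: a single join field relating
      // records to semantic content chunks (only one join field permitted per index)
      contentJoin: {
        type: "join",
        relations: { record: "contentChunk" }
      }
    }
  }
};

export default datasetsIndexMappingSketch;
```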

Proposed Solution

We propose a two-phase search strategy combining early vector-based recall with a fallback precision-mode retrieval scoped to accessible datasets. A code sketch of the flow follows the list below.

  • Implement a new registry API endpoint (named filter records by access) that takes:

    • a list of record IDs
    • and a proposed operation, e.g. object/record/read
    • and returns the list of record IDs that the user has access to.
  • Phase 1: Broad Vector Search

    • Query the semantic indexer index for the top N=500 most relevant documents
    • Filter out unauthorised documents via the filter records by access API, using the record IDs of the retrieved documents
    • If any authorised results are found, return them. If not, proceed to the Phase 2 query.
  • Phase 2: Access-Aware Narrowed Vector Search (Fallback)

    • If there are no results after Phase 1, to avoid relevant results being silently dropped due to authorisation filtering, we will:
      • Query the existing search API for the top 500 datasets the user has access to. The search leverages the existing hybrid search engine and makes sure that:
        • the returned datasets are most relevant to the user query based on metadata
        • only authorised datasets are returned
      • Compile a list of record IDs from the result, including the record IDs of all returned datasets & distributions.
      • Query the semantic indexer index for the top N=500 most relevant documents as in Phase 1, but restrict the query to the compiled record ID subset using a terms filter
      • If results are found, return them. If not, respond with a "no accessible results found" response.
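
A minimal TypeScript sketch of the two-phase flow above. Everything here is assumed for illustration: the index name, the embedding / recordId field names, the filter records by access endpoint path, and the shape of the existing search API response are placeholders, not the final implementation.

```typescript
import { Client } from "@opensearch-project/opensearch";

// All names below (index, fields, endpoint paths) are assumptions for illustration.
// `fetch` relies on the Node 18+ global.
const opensearch = new Client({ node: "http://localhost:9200" });
const SEMANTIC_INDEX = "semantic-index";
const REGISTRY_API = "http://magda-registry-api/v0";
const SEARCH_API = "http://magda-search-api/v0";

interface SemanticHit {
  id: string;
  recordId: string;
  text: string;
  score: number;
}

// Top-N vector search against the semantic index, optionally restricted to a
// record ID subset via a terms filter (as described for the Phase 2 query).
async function vectorSearch(
  queryVector: number[],
  size: number,
  recordIds?: string[]
): Promise<SemanticHit[]> {
  const knn = { knn: { embedding: { vector: queryVector, k: size } } };
  const body: Record<string, any> = {
    size,
    query: recordIds
      ? { bool: { must: [knn], filter: [{ terms: { recordId: recordIds } }] } }
      : knn
  };
  const res = await opensearch.search({ index: SEMANTIC_INDEX, body });
  return res.body.hits.hits.map((h: any) => ({
    id: h._id,
    recordId: h._source.recordId,
    text: h._source.text,
    score: h._score
  }));
}

// Assumed shape of the proposed "filter records by access" registry endpoint:
// takes record IDs plus an operation and returns the subset the user can access.
async function filterRecordsByAccess(
  recordIds: string[],
  operation = "object/record/read"
): Promise<string[]> {
  const res = await fetch(`${REGISTRY_API}/records/access-filter`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ recordIds, operation })
  });
  return (await res.json()) as string[];
}

// Assumed wrapper around the existing hybrid search API (endpoint & response shape
// are illustrative): returns record IDs of the top accessible, metadata-relevant
// datasets and their distributions.
async function searchAccessibleRecordIds(query: string, limit: number): Promise<string[]> {
  const res = await fetch(
    `${SEARCH_API}/datasets?query=${encodeURIComponent(query)}&limit=${limit}`
  );
  const data: any = await res.json();
  return data.dataSets.flatMap((d: any) => [
    d.identifier,
    ...(d.distributions ?? []).map((dist: any) => dist.identifier)
  ]);
}

export async function semanticSearch(
  query: string,
  queryVector: number[], // embedding of `query`, produced elsewhere
  maxNumResults: number
): Promise<SemanticHit[]> {
  // Phase 1: broad vector search, then drop unauthorised records.
  const broad = await vectorSearch(queryVector, 500);
  const allowed = new Set(await filterRecordsByAccess(broad.map((h) => h.recordId)));
  const phase1 = broad.filter((h) => allowed.has(h.recordId));
  if (phase1.length > 0) return phase1.slice(0, maxNumResults);

  // Phase 2: restrict the vector search to record IDs from an access-aware metadata search.
  const accessibleIds = await searchAccessibleRecordIds(query, 500);
  if (accessibleIds.length === 0) return [];
  const narrowed = await vectorSearch(queryVector, 500, accessibleIds);
  return narrowed.slice(0, maxNumResults);
}
```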

This layered query strategy benefits us by:

  • Recall-first (broad search): helps maximise semantic relevance
  • Precision-second (access-aware fallback): ensures no relevant results are silently dropped due to authorisation
  • Keeps the semantic indexer index light on dataset metadata, avoiding sync/duplication issues

Limitation:

  • Longer response time, especially when the Phase 2 step is required.
  • When users ask for the top N relevant records, we might return fewer than N. In the worst case, we might return none.
    • The chance should be very low, as the Phase 2 query retrieves datasets that are both accessible and relevant (based on metadata).

Semantic Search API Query interface

The new semantic search API comes with the following endpoints:

  • /search: search the semantic index (see the example request after this endpoint list)
    • HTTP Method: GET or POST
      • When Method = GET, all parameters are passed as a query string
      • When Method = POST, all parameters are passed as a JSON body.
    • parameters:
      • max_num_results: (number). Optional. Defaults to 10 if not specified. Maximum number of results to be retrieved.
        • Must be a number between 1 and 500.
      • query: (string). The query string.
      • itemType: (string). Optionally filter the result by itemType. Possible value: storageObject or registryRecord
      • fileFormat: (string). Optionally filter the result by fileFormat.
      • recordId: (string). Optionally filter by the recordID of a registry record.
      • subObjectId: (string). Optionally filter by the subObjectId
      • subObjectType: (string). Optionally filter by the subObjectType
      • minScore: (number). Optional. Only return results with a score higher than the specified value.
        • When radial search is available (i.e. non-disk-based vector search), we can set the min_score field of the k-NN query to filter the results.
        • When using disk-based vector search, we can retrieve 2 times max_num_results and filter at the API endpoint after retrieving the results.
    • Response: JSON array where each item has the following fields:
      • id: the OpenSearch document ID
      • itemType
      • recordId
      • fileFormat
      • subObjectId
      • subObjectType
      • text: the original text of the indexed text chunk
      • only_one_index_text_chunk
      • index_text_chunk_length
      • index_text_chunk_position
      • index_text_chunk_overlap
  • /retrieve: retrieve extended text context by document IDs.
    • The /search API only returns results at the chunk level. This API enables extended text context to be retrieved.
    • We should probably leave this endpoint for future implementation
    • Method: POST
    • Parameters:
      • ids: (string[]) a list of document IDs (returned via /search API)
      • mode: retrieve mode. Either:
        • full: assemble all chunks of the storageObject or registryRecord or subObject
        • partial: assemble a specified number of preceding and/or subsequent chunks
        • Defaults to full
      • precedingChunksNum: (number) optional. Defaults to 0. Number of preceding chunks to be retrieved.
      • subsequentChunksNum: (number) optional. Defaults to 0. Number of subsequent chunks to be retrieved.
    • Response: JSON array where each item has the following fields:
      • id: the originally supplied OpenSearch document ID
      • itemType
      • recordId
      • fileFormat
      • subObjectId
      • subObjectType
      • text: the extended text content produced based on mode, precedingChunksNum & subsequentChunksNum
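
For illustration, a hypothetical client call against the proposed /search endpoint via the gateway route described further below. The parameter values and the concrete types of the chunk metadata fields are assumptions; the response interface simply mirrors the fields listed above.

```typescript
// Response shape mirroring the /search fields above; the types of the chunk metadata
// fields are assumptions, as this ticket does not specify them. ESM / top-level await
// and the Node 18+ global `fetch` are assumed.
interface SemanticSearchResult {
  id: string; // OpenSearch document ID
  itemType: "storageObject" | "registryRecord";
  recordId: string;
  fileFormat: string;
  subObjectId?: string;
  subObjectType?: string;
  text: string; // original text of the indexed chunk
  only_one_index_text_chunk: boolean;
  index_text_chunk_length: number;
  index_text_chunk_position: number;
  index_text_chunk_overlap: number;
}

const BASE = "https://example.com/api/v0/semantic-search";

// GET variant: all parameters passed as a query string.
const getRes = await fetch(
  `${BASE}/search?` +
    new URLSearchParams({ query: "air quality measurements", max_num_results: "10" })
);
const getResults = (await getRes.json()) as SemanticSearchResult[];

// POST variant: all parameters passed as a JSON body.
const postRes = await fetch(`${BASE}/search`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: "air quality measurements",
    max_num_results: 10,
    itemType: "registryRecord",
    minScore: 0.6
  })
});
const postResults = (await postRes.json()) as SemanticSearchResult[];
```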

Additional Implementation Information

  • We need to create a new module in the core repo. The name could be magda-semantic-search-api
  • We need to add a route in the gateway default route config to make the micro-service accessible. e.g.:
semantic-search:
    to: http://magda-semantic-search/
    auth: true

Here, magda-semantic-search is the k8s svc name defined in the helm chart.
The above setup will make the /search API endpoint accessible at https://example.com/api/v0/semantic-search/search
