Vocabulary id field is mapped as text in OpenSearch, causing term queries to fail for non-lowercase IDs #2337

@billyziege

Package version (if known): v28.5.0

  • invenio-rdm-records (query side — uses dsl.Q("term", id=...) on a text field)
  • invenio-vocabularies (mapping side — defines id as text)

Describe the bug

The OpenSearch mapping for the vocabulary index maps the id field as type: text (with a keyword sub-field). The get_vocabulary_props function in invenio-rdm-records looks up vocabulary terms with dsl.Q("term", id=id_), which is a term query against that text field. For IDs that contain uppercase letters or that the text analyzer splits into several tokens (e.g., EHVM-H119), the term query returns zero hits and the function raises VocabularyItemNotFoundError, even though the record exists in both the database and the OpenSearch index.

Steps to Reproduce

  1. Load a vocabulary type that contains terms with non-lowercase IDs (e.g., CCMM resource types, which use identifiers like EHVM-H119).
  2. Create a dataset record that references one of these terms as its resource_type.
  3. Navigate to the dataset record's detail page in the UI.

Expected behavior

The record detail page should render. Instead, the request fails with:

invenio_rdm_records.resources.serializers.errors.VocabularyItemNotFoundError:
  The 'resourcetypes' vocabulary item 'EHVM-H119' was not found.

Additional context

Root cause

Mapping side (invenio-vocabularies)

The vocabulary index mapping defines the id field as:

{"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}

When the standard OpenSearch text analyzer processes EHVM-H119, it splits the value at the hyphen and lowercases the pieces, producing the tokens ehvm and h119. The original string EHVM-H119 is not preserved as an exact token in the text field's inverted index.
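The tokenization described above can be approximated in a few lines of Python. This is only a rough stand-in for the standard analyzer (the real one follows Unicode word-segmentation rules and differs on edge cases), and the function names approx_standard_analyze and survives_analysis are made up for this illustration:

```python
import re

# Rough approximation of the standard analyzer (an assumption, not the real
# implementation): take runs of letters, digits, and underscores as tokens,
# then lowercase each token.
_TOKEN_RE = re.compile(r"\w+")


def approx_standard_analyze(value: str) -> list[str]:
    """Return the approximate tokens a text field would index for value."""
    return [token.lower() for token in _TOKEN_RE.findall(value)]


def survives_analysis(value: str) -> bool:
    """True if the raw value is itself one of the indexed tokens.

    Only then can an unanalyzed term query for the raw value match.
    """
    return value in approx_standard_analyze(value)


if __name__ == "__main__":
    for vocab_id in ("dataset", "EHVM-H119"):
        print(vocab_id, approx_standard_analyze(vocab_id), survives_analysis(vocab_id))
```

Under this approximation, dataset is indexed unchanged, while EHVM-H119 becomes two lowercase tokens and no longer appears in the index as written.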

Query side (invenio-rdm-records)

get_vocabulary_props in invenio_rdm_records/resources/serializers/utils.py:

results = vocabulary_service.read_all(
    system_identity,
    ["id"] + fields,
    vocabulary,
    extra_filter=dsl.Q("term", id=id_),  # <-- term query on text field
)

A term query is not analyzed: the query value is compared verbatim against the tokens stored for the field. Because the text field holds ehvm and h119 rather than EHVM-H119, the query returns zero hits, and the function raises VocabularyItemNotFoundError.
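A small in-memory model may make the mismatch easier to see. It is a sketch under stated assumptions, not OpenSearch itself: the analyzer is approximated with a regex, and index_id, term_matches, and keyword_term_query are hypothetical helpers. The last helper shows what a term query aimed at the id.keyword sub-field from the mapping would look like as a raw query body; whether PR #2336 takes this route is not stated in this issue.

```python
import re


def approx_analyze(value: str) -> set[str]:
    """Approximate the standard analyzer: split on non-word chars, lowercase."""
    return {token.lower() for token in re.findall(r"\w+", value)}


def index_id(value: str) -> dict[str, set[str]]:
    """Model what gets indexed for an id mapped as text with a keyword sub-field."""
    return {
        "id": approx_analyze(value),  # text field: analyzed tokens
        "id.keyword": {value},        # keyword sub-field: exact value, unanalyzed
    }


def term_matches(indexed: dict[str, set[str]], field: str, value: str) -> bool:
    """Model a term query: the value is compared verbatim against stored terms."""
    return value in indexed[field]


def keyword_term_query(id_: str) -> dict:
    """Raw query body for a term query on the keyword sub-field (hypothetical helper)."""
    return {"term": {"id.keyword": id_}}


if __name__ == "__main__":
    doc = index_id("EHVM-H119")
    print(term_matches(doc, "id", "EHVM-H119"))          # text field
    print(term_matches(doc, "id.keyword", "EHVM-H119"))  # keyword sub-field
    print(keyword_term_query("EHVM-H119"))
```

In this model the lookup against id misses and the lookup against id.keyword hits. Note that the mapping sets ignore_above to 256, so values longer than that would not be indexed in the keyword sub-field.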

Standard InvenioRDM resource type IDs (e.g., dataset, publication, software) are lowercase and single-word, so they survive text analysis. This bug is invisible until a vocabulary type with uppercase or hyphenated IDs is loaded.

Impact

Any vocabulary term whose ID contains uppercase letters or other characters that the text analyzer transforms will be unreachable by get_vocabulary_props. This causes the DataCite serializer (called during signposting on every record detail page) to crash with a 500 error for any record using such a term.

The bug was triggered by the CCMM resource type vocabulary, which uses COAR IDs such as EHVM-H119, c_18cc, and C_dcae04ef.

Associated PR

Resolved in PR #2336 in this repository. It is a single-line fix with a test case and documentation of the bug. An alternative single-line fix could be implemented in invenio-vocabularies, but I chose to implement the fix and document the inter-package bug here.
