Package version (if known): v28.5.0
invenio-rdm-records (query side — uses dsl.Q("term", id=...) on a text field)
invenio-vocabularies (mapping side — defines id as text)
Describe the bug
The OpenSearch mapping for the vocabulary index maps the id field as type: text (with a keyword sub-field). The get_vocabulary_props function in invenio-rdm-records searches for vocabulary terms using dsl.Q("term", id=id_) — a term query on the text field. For IDs that contain uppercase letters or are split by the text analyzer (e.g., EHVM-H119), the term query returns zero hits and raises VocabularyItemNotFoundError, even though the record exists in both the database and the OpenSearch index.
Steps to Reproduce
- Load a vocabulary type that contains terms with non-lowercase IDs (e.g., CCMM resource types, which use identifiers like EHVM-H119).
- Create a dataset record that references one of these terms as its resource_type.
- Navigate to the dataset record's detail page in the UI.
Expected behavior
The record detail page renders without error. Instead, the request fails with:
invenio_rdm_records.resources.serializers.errors.VocabularyItemNotFoundError:
The 'resourcetypes' vocabulary item 'EHVM-H119' was not found.
Additional context
Root cause
Mapping side (invenio-vocabularies)
The vocabulary index mapping defines the id field as:
{"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
When the standard OpenSearch text analyzer processes EHVM-H119, it produces lowercase tokens (e.g., ehvm and h119 after hyphen splitting). The original string EHVM-H119 is not preserved as an exact token in the text field's inverted index.
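To make the tokenization step concrete, here is a rough pure-Python stand-in for the analyzer. It is only an approximation: the real standard analyzer uses Unicode text segmentation with more rules than this regex, and the function name approx_standard_tokens is hypothetical. It is enough to show which of the IDs mentioned in this issue are changed by analysis.

```python
import re


def approx_standard_tokens(text):
    """Approximate the standard analyzer: split on non-word characters, then lowercase.

    Assumption: this regex only mimics the behavior relevant to this issue
    (hyphens split tokens, underscores do not, everything is lowercased).
    """
    return [token.lower() for token in re.findall(r"\w+", text)]


for vocab_id in ["dataset", "EHVM-H119", "c_18cc", "C_dcae04ef"]:
    tokens = approx_standard_tokens(vocab_id)
    unchanged = tokens == [vocab_id]
    print(f"{vocab_id!r} -> {tokens} (survives analysis: {unchanged})")
```

Under this approximation, only IDs that are already lowercase and contain no splitting characters come out as a single token identical to the original string.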
Query side (invenio-rdm-records)
get_vocabulary_props in invenio_rdm_records/resources/serializers/utils.py:
    results = vocabulary_service.read_all(
        system_identity,
        ["id"] + fields,
        vocabulary,
        extra_filter=dsl.Q("term", id=id_),  # <-- term query on text field
    )
A term query does not analyze its input: it compares the query value verbatim against the tokens stored in the field's inverted index. Because EHVM-H119 is not stored as an exact token in the text field, the query returns zero hits, and the function raises VocabularyItemNotFoundError.
Standard InvenioRDM resource type IDs (e.g., dataset, publication, software) are lowercase and single-word, so they survive text analysis. This bug is invisible until a vocabulary type with uppercase or hyphenated IDs is loaded.
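The mismatch described above can be simulated without a running cluster. The sketch below is a simplified model, not OpenSearch itself: the helper names are hypothetical, the tokenizer is a regex approximation of the standard analyzer, and the keyword comparison only models the id.keyword sub-field from the mapping shown earlier (including its ignore_above limit).

```python
import re


def approx_standard_tokens(text):
    # Assumption: regex approximation of the standard analyzer (split + lowercase).
    return [token.lower() for token in re.findall(r"\w+", text)]


def term_matches_text_field(query_value, stored_id):
    """Model a term query on the analyzed text field.

    The query value is used verbatim and compared against the indexed tokens.
    """
    return query_value in approx_standard_tokens(stored_id)


def term_matches_keyword_field(query_value, stored_id, ignore_above=256):
    """Model a term query on the keyword sub-field, which stores the raw string."""
    if len(stored_id) > ignore_above:
        return False  # values longer than ignore_above are not indexed
    return query_value == stored_id


for vocab_id in ["dataset", "EHVM-H119", "C_dcae04ef"]:
    print(
        vocab_id,
        "text:", term_matches_text_field(vocab_id, vocab_id),
        "keyword:", term_matches_keyword_field(vocab_id, vocab_id),
    )
```

In this model, looking up an ID by its own value succeeds on the text field only when analysis leaves the ID unchanged, while the keyword comparison succeeds for every ID within the length limit.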
Impact
Any vocabulary term whose ID contains uppercase letters or other characters that the text analyzer transforms will be unreachable by get_vocabulary_props. This causes the DataCite serializer (called during signposting on every record detail page) to crash with a 500 error for any record using such a term.
The bug was triggered by the CCMM resource type vocabulary, which uses COAR IDs such as EHVM-H119, c_18cc, C_dcae04ef, etc.
Associated PR
Resolved in PR #2336 in this repository: a single-line fix with a test case and documentation of the bug. An alternative single-line fix could be implemented in invenio-vocabularies, but I chose to implement and document this inter-package bug here.
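For readers unfamiliar with the two query shapes involved, the sketch below builds both as plain query bodies. It is an illustration only: this issue does not state what the PR changes, so the keyword variant should be read as one plausible approach that follows from the mapping shown above, not as the actual fix. The helper names are hypothetical.

```python
def term_on_text_field(id_):
    """Query body equivalent to dsl.Q("term", id=id_): targets the analyzed text field."""
    return {"term": {"id": id_}}


def term_on_keyword_subfield(id_):
    """Query body targeting the keyword sub-field, which keeps the raw, unanalyzed ID.

    Assumption: with a DSL helper this could be written as
    dsl.Q("term", **{"id.keyword": id_}); whether the PR does this is not confirmed here.
    """
    return {"term": {"id.keyword": id_}}


print(term_on_text_field("EHVM-H119"))
print(term_on_keyword_subfield("EHVM-H119"))
```

Note that the keyword sub-field in this mapping sets ignore_above to 256, so IDs longer than that would not be indexed there.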