diff --git a/.claude/claude-docs/bingo-elastic-python.md b/.claude/claude-docs/bingo-elastic-python.md index 0510eae928..349c3eca65 100644 --- a/.claude/claude-docs/bingo-elastic-python.md +++ b/.claude/claude-docs/bingo-elastic-python.md @@ -4,10 +4,10 @@ The standalone Python library at `bingo/bingo-elastic/python/` that indexes Indi ## Module map -- `bingo_elastic/elastic.py` — `ElasticRepository` and `AsyncElasticRepository` (parallel sync/async classes, both take a `tau_search: bool` flag), `IndexName` enum (`BINGO_MOLECULE`, `BINGO_REACTION`, `BINGO_CUSTOM`), `build_index_body(tau_search)` builder for the index mapping, and `compile_query` (dispatches a query subject + kwargs to the right query class; reroutes substructure to the tautomer path when `options` contains `TAU`). +- `bingo_elastic/elastic.py` — `ElasticRepository` and `AsyncElasticRepository` (parallel sync/async classes, both take `tau_search: bool` and `custom_properties: CustomPropertiesMapping` flags), `IndexName` enum (`BINGO_MOLECULE`, `BINGO_REACTION`, `BINGO_CUSTOM`), `build_index_body(tau_search, custom_properties)` builder for the index mapping (merges `custom_properties` into `mappings.properties` and rejects collisions with `RESERVED_FIELDS`), and `compile_query` (dispatches a query subject + kwargs to the right query class; reroutes substructure to the tautomer path when `options` contains `TAU`). `CustomPropertiesMapping = Dict[str, Dict[str, Any]]`: keys are field names, values are ES property-mapping fragments (e.g. `{"n": {"type": "integer"}}`). - `bingo_elastic/queries.py` — `CompilableQuery` hierarchy: `SubstructureQuery`, `TautomerSubstructureQuery` (subclass swapping in the `sub-tau` fingerprint and `tau_fingerprint` field), `ExactMatch`, similarity matches (`TanimotoSimilarityMatch`, `EuclidSimilarityMatch`, `TverskySimilarityMatch`), plus `KeywordQuery`, `RangeQuery`, `WildcardQuery` for non-chemical fields. `query_factory` maps kwarg keys (`"substructure"`, `"tautomer"`, `"exact"`, …) to a class. -- `bingo_elastic/model/record.py` — `IndigoRecord` (abstract), `IndigoRecordMolecule`, `IndigoRecordReaction`, and the `WithIndigoObject` descriptor that extracts fingerprints + `cmf` + `hash` from an `IndigoObject` at construction time. The descriptor also computes the `sub-tau` fingerprint when the record was built with `tau_search=True`. -- `bingo_elastic/model/helpers.py` — file iterators (`iterate_file`, `load_reaction`). +- `bingo_elastic/model/record.py` — `IndigoRecord` (abstract), `IndigoRecordMolecule`, `IndigoRecordReaction`, and the `WithIndigoObject` descriptor that extracts fingerprints + `cmf` + `hash` from an `IndigoObject` at construction time. The descriptor also computes the `sub-tau` fingerprint when the record was built with `tau_search=True`, and copies non-reserved properties from `iterateProperties()` (SDF tags) onto the record. `IndigoRecord(custom_properties=…)` accepts an iterable of property names used as a per-record allowlist; `RESERVED_FIELDS` lists names the extractor never overwrites (`cmf`, `hash`, fingerprints, etc.). +- `bingo_elastic/model/helpers.py` — file iterators (`iterate_file` generic dispatcher plus format-specific wrappers `iterate_sdf` / `iterate_smiles` / `iterate_cml`) and single-file loaders (`load_molecule`, `load_reaction`). All iterators accept `custom_properties=` (an iterable of allowed property names) and forward it to the records they yield — pass the keys of the repo's `custom_properties` mapping so extraction and the ES mapping stay aligned. - `tests/` — its own pytest suite with `conftest.py` fixtures that connect to `localhost:9200`. ## Core flow (the non-obvious bit) @@ -29,6 +29,21 @@ The tautomer path is opt-in on **both sides**: `tau_search=True` on the record g Side effect to remember: fingerprints are computed at **record construction time** in the `WithIndigoObject` descriptor, not at `index_records()` time. By the time records reach the repo, the fingerprint is already frozen on the instance. +## Custom SDF properties + +SDF `> ` lines (and any kwargs passed to `IndigoRecord(**kwargs)`) become record attributes via `WithIndigoObject` calling `iterateProperties()`. Tag capture is **opt-in**: with no `custom_properties` configured, `WithIndigoObject` skips the `iterateProperties()` loop entirely, so no SDF tags are extracted or indexed. To capture tags, pass a `custom_properties` mapping to the repo **and** the matching keys to the iterator: + +```python +mapping = {"n": {"type": "integer"}, "CAS": {"type": "keyword"}} +repo = ElasticRepository(IndexName.BINGO_MOLECULE, custom_properties=mapping) +for rec in iterate_sdf("file.sdf", custom_properties=mapping): # keys-only OK too + repo.index_record(rec) +``` + +The same dict drives both consumers: keys are the extraction allowlist, values are the ES `properties` fragments. `build_index_body` raises `ValueError` if a key clashes with `RESERVED_FIELDS`. Leaving `custom_properties=None` (default on both sides) means no SDF tags are extracted or indexed — only fingerprints, `cmf`, `name`, `hash`, and `has_error`. Forgetting to pass `custom_properties` to the iterator side silently drops every tag even when the repo has a typed mapping declared. + +Caveat: ES mappings are immutable after first index creation. Changing `custom_properties` later requires `ElasticRepository.delete_all_records()` first — `create_index` swallows `resource_already_exists_exception` and keeps the old mapping otherwise. + ## Tests Run from `bingo/bingo-elastic/python/`: @@ -45,7 +60,7 @@ For spinning up Elasticsearch see [claude-docs/testing.md](testing.md). ## Sync/Async parity -`ElasticRepository` and `AsyncElasticRepository` are independent classes — there is no shared base. Any signature or behavior change on one must be mirrored on the other (constructors, `filter`, `index_records`, `index_record`). Tests pair every sync `test_*` with an async `test_a_*`; follow that pattern. +`ElasticRepository` and `AsyncElasticRepository` are independent classes — there is no shared base. Any signature or behavior change on one must be mirrored on the other (constructors — including `tau_search` and `custom_properties`, `filter`, `index_records`, `index_record`). Tests pair every sync `test_*` with an async `test_a_*`; follow that pattern. Note `delete_all_records` currently exists only on the sync class; the autouse `clear_index` fixture in `tests/conftest.py` uses it to wipe both indices before every test. ## Java sibling diff --git a/bingo/bingo-elastic/python/bingo_elastic/elastic.py b/bingo/bingo-elastic/python/bingo_elastic/elastic.py index 684b30dec2..87ce077738 100644 --- a/bingo/bingo-elastic/python/bingo_elastic/elastic.py +++ b/bingo/bingo-elastic/python/bingo_elastic/elastic.py @@ -25,6 +25,7 @@ from indigo import Indigo, IndigoObject # type: ignore from bingo_elastic.model.record import ( + RESERVED_FIELDS, IndigoRecord, IndigoRecordMolecule, IndigoRecordReaction, @@ -34,6 +35,13 @@ ElasticRepositoryT = TypeVar("ElasticRepositoryT") +# Mapping of custom (e.g. SDF tag) field name -> ES property mapping fragment, +# e.g. {"MolecularWeight": {"type": "float"}, "CAS": {"type": "keyword"}}. +# The same dict drives (1) the index mapping passed to Elasticsearch and +# (2) the allowlist used by record extraction (see IndigoRecord and +# iterate_file). +CustomPropertiesMapping = Dict[str, Dict[str, Any]] + MAX_ALLOWED_SIZE = 1000 @@ -110,7 +118,10 @@ def get_client( return client_type(**arguments) # type: ignore -def build_index_body(tau_search: bool = False) -> Dict: +def build_index_body( + tau_search: bool = False, + custom_properties: Optional[CustomPropertiesMapping] = None, +) -> Dict: index_body = { "mappings": { "properties": { @@ -142,6 +153,15 @@ def build_index_body(tau_search: bool = False) -> Dict: } ) + if custom_properties: + collisions = set(custom_properties).intersection(RESERVED_FIELDS) + if collisions: + raise ValueError( + "custom_properties uses reserved field name(s): " + f"{sorted(collisions)}" + ) + index_body["mappings"]["properties"].update(custom_properties) + return index_body @@ -216,7 +236,7 @@ def response_to_records( class AsyncElasticRepository: - def __init__( + def __init__( # pylint: disable=too-many-arguments self, index_name: IndexName, *, @@ -228,6 +248,7 @@ def __init__( request_timeout: int = 60, retry_on_timeout: bool = True, tau_search: bool = False, + custom_properties: Optional[CustomPropertiesMapping] = None, ) -> None: """ :param index_name: use function get_index_name for setting this argument @@ -241,10 +262,18 @@ def __init__( :param tau_search: declare tau_fingerprint in the index mapping so tautomer-aware substructure search is available via filter(..., options="TAU ...") + :param custom_properties: ES mapping fragments for caller-defined + fields (SDF tags or kwargs passed to IndigoRecord). Keys are field + names; values are ES property mappings, e.g. + {"MolecularWeight": {"type": "float"}, "CAS": {"type": "keyword"}}. + The same keys must also be passed as ``custom_properties=`` to + iterate_sdf/iterate_file — without that, no SDF tags are + extracted and the typed mapping has nothing to populate. """ self.index_name = index_name.value self.tau_search = tau_search - self.index_body = build_index_body(tau_search) + self.custom_properties = custom_properties + self.index_body = build_index_body(tau_search, custom_properties) self.el_client = get_client( client_type=AsyncElasticsearch, @@ -394,7 +423,7 @@ async def __aexit__(self, *args, **kwargs) -> None: class ElasticRepository: - def __init__( + def __init__( # pylint: disable=too-many-arguments self, index_name: IndexName, *, @@ -406,6 +435,7 @@ def __init__( request_timeout: int = 60, retry_on_timeout: bool = True, tau_search: bool = False, + custom_properties: Optional[CustomPropertiesMapping] = None, ) -> None: """ :param index_name: use function get_index_name for setting this argument @@ -419,10 +449,18 @@ def __init__( :param tau_search: declare tau_fingerprint in the index mapping so tautomer-aware substructure search is available via filter(..., options="TAU ...") + :param custom_properties: ES mapping fragments for caller-defined + fields (SDF tags or kwargs passed to IndigoRecord). Keys are field + names; values are ES property mappings, e.g. + {"MolecularWeight": {"type": "float"}, "CAS": {"type": "keyword"}}. + The same keys must also be passed as ``custom_properties=`` to + iterate_sdf/iterate_file — without that, no SDF tags are + extracted and the typed mapping has nothing to populate. """ self.index_name = index_name.value self.tau_search = tau_search - self.index_body = build_index_body(tau_search) + self.custom_properties = custom_properties + self.index_body = build_index_body(tau_search, custom_properties) self.el_client = get_client( client_type=Elasticsearch, diff --git a/bingo/bingo-elastic/python/bingo_elastic/model/helpers.py b/bingo/bingo-elastic/python/bingo_elastic/model/helpers.py index 1c6dd99f94..11fe1a3b55 100644 --- a/bingo/bingo-elastic/python/bingo_elastic/model/helpers.py +++ b/bingo/bingo-elastic/python/bingo_elastic/model/helpers.py @@ -1,5 +1,5 @@ from pathlib import Path -from typing import Callable, Generator, Optional, Union +from typing import Callable, Generator, Iterable, Optional, Union from indigo import Indigo, IndigoObject # type: ignore @@ -14,6 +14,7 @@ def iterate_file( iterator: Optional[str] = None, error_handler: Optional[Callable[[object, BaseException], None]] = None, session: Optional[Indigo] = None, + custom_properties: Optional[Iterable[str]] = None, ) -> Generator[IndigoRecordMolecule, None, None]: """ :param file: @@ -24,6 +25,11 @@ def iterate_file( :param error_handler: lambda for catching exceptions :type error_handler: Optional[Callable[[object, BaseException], None]] :type session: Optional[Indigo] + :param custom_properties: SDF tag names to extract; pass the keys of the + ElasticRepository's custom_properties mapping so extracted attributes + match what the index mapping declares. None or empty (default) means + no SDF tags are extracted from the file. + :type custom_properties: Optional[Iterable[str]] :return: """ iterators = { @@ -43,7 +49,9 @@ def iterate_file( indigo_object: IndigoObject for indigo_object in getattr(session, iterator_fn)(str(file)): yield IndigoRecordMolecule( - indigo_object=indigo_object, error_handler=error_handler + indigo_object=indigo_object, + error_handler=error_handler, + custom_properties=custom_properties, ) @@ -51,12 +59,14 @@ def iterate_sdf( file: Union[Path, str], error_handler: Optional[Callable[[object, BaseException], None]] = None, session: Optional[Indigo] = None, + custom_properties: Optional[Iterable[str]] = None, ) -> Generator: yield from iterate_file( Path(file) if isinstance(file, str) else file, "sdf", error_handler=error_handler, session=session, + custom_properties=custom_properties, ) @@ -64,12 +74,14 @@ def iterate_smiles( file: Union[Path, str], error_handler: Optional[Callable[[object, BaseException], None]] = None, session: Optional[Indigo] = None, + custom_properties: Optional[Iterable[str]] = None, ) -> Generator: yield from iterate_file( Path(file) if isinstance(file, str) else file, "smiles", error_handler=error_handler, session=session, + custom_properties=custom_properties, ) @@ -77,12 +89,14 @@ def iterate_cml( file: Union[Path, str], error_handler: Optional[Callable[[object, BaseException], None]] = None, session: Optional[Indigo] = None, + custom_properties: Optional[Iterable[str]] = None, ) -> Generator: yield from iterate_file( Path(file) if isinstance(file, str) else file, "cml", error_handler=error_handler, session=session, + custom_properties=custom_properties, ) diff --git a/bingo/bingo-elastic/python/bingo_elastic/model/record.py b/bingo/bingo-elastic/python/bingo_elastic/model/record.py index 784dd9cd91..3c48ae48ad 100644 --- a/bingo/bingo-elastic/python/bingo_elastic/model/record.py +++ b/bingo/bingo-elastic/python/bingo_elastic/model/record.py @@ -1,12 +1,33 @@ from __future__ import annotations -from typing import Callable, Dict, List, Optional +from typing import Callable, Dict, FrozenSet, Iterable, List, Optional from uuid import uuid4 from indigo import Indigo, IndigoException, IndigoObject # type: ignore MOL_TYPES = ["#02: ", "#03: ", "#12: "] REAC_TYPES = ["#04: ", "#05: "] +RESERVED_FIELDS = frozenset( + { + "cmf", + "name", + "hash", + "has_error", + "rawData", + "sim_fingerprint", + "sim_fingerprint_len", + "sub_fingerprint", + "sub_fingerprint_len", + "tau_fingerprint", + "tau_fingerprint_len", + "record_id", + "error_handler", + "skip_errors", + "tau_search", + "indigo_object", + "elastic_response", + } +) # pylint: disable=unused-argument @@ -31,7 +52,7 @@ def __set__(self, instance: IndigoRecord, value: Dict): class WithIndigoObject: - def __set__( # pylint: disable=too-many-branches + def __set__( # pylint: disable=too-many-statements, too-many-branches, too-many-locals self, instance: IndigoRecord, value: IndigoObject ) -> None: try: @@ -92,6 +113,19 @@ def __set__( # pylint: disable=too-many-branches except IndigoException as err_: check_error(instance, err_) + allowed = getattr(instance, "_custom_properties", None) + if allowed: + try: + for prop in value_dup.iterateProperties(): + prop_name = prop.name() + if prop_name in RESERVED_FIELDS: + continue + if prop_name not in allowed: + continue + setattr(instance, prop_name, prop.rawData()) + except IndigoException as err_: + check_error(instance, err_) + class IndigoRecord: """ @@ -114,6 +148,7 @@ class IndigoRecord: elastic_response = WithElasticResponse() record_id: Optional[str] = None error_handler: Optional[Callable[[object, BaseException], None]] = None + _custom_properties: Optional[FrozenSet[str]] = None def __new__(cls, *args, **kwargs): if cls is IndigoRecord: @@ -143,6 +178,13 @@ def __init__(self, **kwargs) -> None: :param skip_errors: if True, all errors will be skipped, no error_handler is required :type skip_errors: bool + :param custom_properties: iterable of SDF tag names to extract from the + IndigoObject. If None or empty (default), no SDF + tags are extracted. Pass the keys of the + ElasticRepository's custom mapping here so the + indexed schema matches what the index mapping + declares. + :type custom_properties: Optional[Iterable[str]] """ # First check if skip_errors flag passed @@ -155,12 +197,26 @@ def __init__(self, **kwargs) -> None: self.record_id = uuid4().hex self.tau_search = kwargs.pop("tau_search", False) + # Must be set before indigo_object assignment so the descriptor sees it + custom_properties: Optional[Iterable[str]] = kwargs.pop( + "custom_properties", None + ) + self._custom_properties = ( + frozenset(custom_properties) + if custom_properties is not None + else None + ) for arg, val in kwargs.items(): setattr(self, arg, val) def as_dict(self) -> Dict: # Add system fields here to exclude from indexing - filtered_fields = {"error_handler", "skip_errors", "tau_search"} + filtered_fields = { + "error_handler", + "skip_errors", + "tau_search", + "_custom_properties", + } return { key: value for key, value in self.__dict__.items() diff --git a/bingo/bingo-elastic/python/tests/test_elastic.py b/bingo/bingo-elastic/python/tests/test_elastic.py index 26b8f0e83d..1d3a9d17f1 100644 --- a/bingo/bingo-elastic/python/tests/test_elastic.py +++ b/bingo/bingo-elastic/python/tests/test_elastic.py @@ -12,7 +12,7 @@ ElasticRepository, IndexName, ) -from bingo_elastic.model.helpers import iterate_file +from bingo_elastic.model.helpers import iterate_file, iterate_sdf from bingo_elastic.model.record import ( IndigoRecord, IndigoRecordMolecule, @@ -527,10 +527,110 @@ async def test_a_custom_fields( assert iupac_inch == "RDHQFKQIGNGIED-UHFFFAOYSA-N" +def test_sdf_custom_properties(resource_loader): + custom_properties = {"n": {"type": "integer"}} + repo = ElasticRepository( + IndexName.BINGO_MOLECULE, + host="127.0.0.1", + port=9200, + custom_properties=custom_properties, + ) + repo.delete_all_records() + for rec in iterate_sdf( + resource_loader("molecules/rand_queries_small.sdf"), + custom_properties=custom_properties, + ): + repo.index_record(rec) + time.sleep(1) + + hits = list(repo.filter(n="1")) + assert len(hits) >= 1 + assert hits[0].n == "1" + + # The integer mapping enables a numeric range query that would silently + # misbehave if `n` were left as a dynamically mapped text field. + range_hits = list(repo.filter(n=RangeQuery(2, 4))) + assert {hit.n for hit in range_hits} == {"2", "3", "4"} # type: ignore + + +@pytest.mark.asyncio +async def test_a_sdf_custom_properties(resource_loader): + custom_properties = {"n": {"type": "integer"}} + + def make_repo(): + return AsyncElasticRepository( + IndexName.BINGO_MOLECULE, + host="127.0.0.1", + port=9200, + custom_properties=custom_properties, + ) + + async with make_repo() as rep: + for rec in iterate_sdf( + resource_loader("molecules/rand_queries_small.sdf"), + custom_properties=custom_properties, + ): + await rep.index_record(rec) + await rep.el_client.indices.refresh( + index=IndexName.BINGO_MOLECULE.value + ) + + async with make_repo() as rep: + result = rep.filter(n="1") + hits = [item async for item in result] + assert len(hits) >= 1 + assert hits[0].n == "1" + + async with make_repo() as rep: + result = rep.filter(n=RangeQuery(2, 4)) + range_hits = [item async for item in result] + assert {hit.n for hit in range_hits} == {"2", "3", "4"} # type: ignore + + +def test_sdf_no_custom_properties_default(resource_loader): + repo = ElasticRepository( + IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200 + ) + repo.delete_all_records() + records = list( + iterate_sdf(resource_loader("molecules/rand_queries_small.sdf")) + ) + for rec in records: + repo.index_record(rec) + time.sleep(1) + + assert not any(hasattr(rec, "n") for rec in records) + assert list(repo.filter(n="1")) == [] + + +@pytest.mark.asyncio +async def test_a_sdf_no_custom_properties_default(resource_loader): + def make_repo(): + return AsyncElasticRepository( + IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200 + ) + + records = list( + iterate_sdf(resource_loader("molecules/rand_queries_small.sdf")) + ) + async with make_repo() as rep: + for rec in records: + await rep.index_record(rec) + await rep.el_client.indices.refresh( + index=IndexName.BINGO_MOLECULE.value + ) + + assert not any(hasattr(rec, "n") for rec in records) + + async with make_repo() as rep: + result = rep.filter(n="1") + hits = [item async for item in result] + assert hits == [] + + def test_search_empty_fingerprint( elastic_repository_molecule: ElasticRepository, indigo_fixture: Indigo, - resource_loader, ): for smile in ["[H][H]", "[H][F]"]: rec = IndigoRecordMolecule( @@ -558,7 +658,6 @@ def test_search_empty_fingerprint( async def test_a_search_empty_fingerprint( a_elastic_repository_molecule: AsyncRepositoryT, indigo_fixture: Indigo, - resource_loader, ): async with a_elastic_repository_molecule() as rep: for smile in ["[H][H]", "[H][F]"]: