Add OBO/OWL/SSSOM ontology-to-KB converters #7
Conversation
Implements Tasks 1-4 of the SSSOM converter:

- SSSOM TSV parser (YAML metadata header + TSV data via csv.DictReader)
- Confidence transforms (identity, floor_ceil, rescale) with a named registry
- Config model (SSSOMConverterConfig, MappingRule) with prefix matching
- Core conversion (sssom_to_kb) with predicate mapping, label extraction, auto disjoint groups, rule overrides, prefix filters, and minimum probability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
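The named transform registry described in this commit can be sketched as follows. The transform names (identity, floor_ceil, rescale) come from the commit message, but the signatures, default bounds, and `get_transform` helper here are illustrative assumptions, not the actual boomer-py API:

```python
from collections.abc import Callable


def floor_ceil(c: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp a raw confidence into [lo, hi] so no mapping is certain."""
    return max(lo, min(hi, c))


def rescale(c: float, lo: float = 0.05, hi: float = 0.95) -> float:
    """Linearly map the [0, 1] confidence range onto [lo, hi]."""
    return lo + c * (hi - lo)


# Plain one-argument callables: no **_kw catch-all, so a typo in a
# keyword argument fails loudly instead of being silently swallowed.
NAMED_TRANSFORMS: dict[str, Callable[[float], float]] = {
    "identity": lambda c: c,
    "floor_ceil": floor_ceil,
    "rescale": rescale,
}


def get_transform(name: str) -> Callable[[float], float]:
    """Look up a transform; unknown names raise with the available options."""
    if name not in NAMED_TRANSFORMS:
        raise ValueError(
            f"Unknown transform {name!r}; available: {sorted(NAMED_TRANSFORMS)}"
        )
    return NAMED_TRANSFORMS[name]
```

For example, `get_transform("floor_ceil")(1.0)` clamps a confidence of 1.0 down to 0.99.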
- Add load_sssom_config() for loading converter config from YAML files
- Remove **_kw from NAMED_TRANSFORMS lambdas (silently swallowed typos)
- Raise ValueError with available transforms for unknown transform names
- Guard against IDs without colons in disjoint group generation
- Fix _make_fact docstring (broadMatch reverses, not narrowMatch)
- Use typed dict annotations (dict[str, Any]) in parse_sssom_tsv
- Clean up tests: remove unused imports, dead loop, use module constant
- Add tests for unknown transforms and IDs without colons
- Add sssom_config.yaml test fixture and integration tests
- Add pyboomer CLI entry point

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- extract_neighborhood(kb, seeds, max_hops) finds all entities transitively connected to the seed IDs via undirected BFS, then extracts the sub-KB with extract_sub_kb
- CLI: pyboomer extract now accepts repeatable --id flags and --max-hops for neighborhood extraction, keeping backward compatibility with the IDS_FILE positional argument
- 7 new tests covering the full component, hop limits, multiple seeds, unknown seeds, and subsumption-direction traversal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
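The undirected BFS described above can be sketched as follows. This is a minimal illustration over generic (subject, object) edge pairs, not the actual extract_neighborhood implementation, which operates on a KB's facts:

```python
from collections import deque


def neighborhood(edges: list[tuple[str, str]], seeds: set[str], max_hops: int) -> set[str]:
    """Return all node IDs reachable from the seeds within max_hops,
    treating every edge as undirected."""
    # Build an undirected adjacency map from the edge list.
    adj: dict[str, set[str]] = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    reached = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops >= max_hops:
            continue  # hop budget exhausted along this path
        for nbr in adj.get(node, ()):
            if nbr not in reached:
                reached.add(nbr)
                frontier.append((nbr, hops + 1))
    return reached
```

Seeds that appear in no edge simply come back alone, which matches the "unknown seeds" test case mentioned above.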
- Hand-rolled OBO parser (parse_obo) with OBOTerm/OBODocument dataclasses
- OntologyConverterConfig with per-prefix xref probabilities and SKOS settings
- obo_to_kb() converts structural axioms to hard facts, xrefs/SKOS to pfacts
- OWL backend (owl_to_kb) via optional py-horned-owl dependency
- ontology_to_kb() dispatch by file extension
- load_ontology_config() for YAML config files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
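A hand-rolled OBO parser of this kind typically splits the document into `[Term]` stanzas of tag/value pairs. The sketch below is a minimal illustration of that approach, not the actual parse_obo implementation (which also builds OBOTerm/OBODocument dataclasses):

```python
def parse_obo_terms(text: str) -> list[dict[str, list[str]]]:
    """Split an OBO document into [Term] stanzas of tag -> list of values."""
    terms: list[dict[str, list[str]]] = []
    current: dict[str, list[str]] | None = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {}  # start collecting a new term stanza
            terms.append(current)
        elif line.startswith("[") and line.endswith("]"):
            current = None  # some other stanza type, e.g. [Typedef]
        elif current is not None and ":" in line:
            # Tags may repeat (is_a, xref), so values accumulate in a list.
            tag, _, value = line.partition(":")
            current.setdefault(tag.strip(), []).append(value.strip())
    return terms
```

Header lines before the first stanza (format-version, ontology, etc.) are ignored because `current` is still None.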
- Add "obo" and "owl" to KBLoader.SUPPORTED_FORMATS
- Add .obo/.owl/.owx/.ofn extension detection in detect_format()
- Add _load_ontology() dispatch method in KBLoader
- Add obo/owl to all CLI input format choices
- Add optional py-horned-owl dependency in pyproject.toml
- Add loader and CLI integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
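Extension-based format detection of this kind can be sketched as below. The function name and the "unknown" fallback are hypothetical; the extensions and the .sssom.tsv special case are the ones the PR describes:

```python
def detect_format_sketch(path: str) -> str:
    """Guess a loader format from a file path's extension."""
    # Double extension must be checked before the generic suffix split.
    if path.endswith(".sssom.tsv"):
        return "sssom"
    ext = path.rsplit(".", 1)[-1].lower()
    # .owl/.owx/.ofn all route to the OWL backend.
    return {"obo": "obo", "owl": "owl", "owx": "owl", "ofn": "owl"}.get(ext, "unknown")
```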
- Make py-horned-owl a core dependency (not optional)
- Fix owl_to_kb() for the py-horned-owl v1.4.0 API (.component, get_iri(), SimpleLiteral, IRI types, PrefixMapping dict conversion)
- Add OWL Functional Syntax test fixture (test_ontology.ofn)
- Add SSSOM format to loader and CLI (auto-detects .sssom.tsv files)
- Add comprehensive tests: OWL unit tests (TestOwlToKb), OBO/OWL parity tests (TestOboOwlParity), and unified CLI tests across OBO, OWL, and SSSOM (TestCLIConvert)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add docs/ontology-conversion.md with a full guide covering OBO, OWL, config, seed extraction, and merging workflows
- Update docs/formats.md to list OBO, OWL, SSSOM as supported input formats and replace "Future: OWL Support" with actual documentation
- Add ontology-conversion.md to the mkdocs nav

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-executed notebook showing the full workflow:

- Create MONDO hierarchy (OBO), ORDO hierarchy (OWL), and cross-ontology mappings (SSSOM)
- Convert, merge, and solve from the CLI
- Demonstrates how OWL disjointness constraints reject a false mapping (prior 0.30 -> posterior 0.005)
- Includes the Python API equivalent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
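The prior-to-posterior drop in the notebook can be illustrated with a toy calculation. This is not boomer's actual inference, only a sketch of the mechanism: a hard disjointness constraint makes almost every world containing the false mapping logically inconsistent, so the surviving probability mass shifts to worlds without it. The 0.01 survival fraction below is an invented number for illustration:

```python
prior = 0.30  # prior probability of the false mapping, as in the notebook

# Hypothetical fraction of mapping-containing worlds that remain
# consistent after the disjointness axioms are applied.
consistent_with_mapping = 0.01

# Unnormalized scores for "mapping holds" vs "mapping does not hold";
# worlds without the mapping are unaffected by the constraint.
score_with = prior * consistent_with_mapping
score_without = (1 - prior) * 1.0

# Renormalize to get the posterior over the two alternatives.
posterior = score_with / (score_with + score_without)
```

With these made-up numbers the posterior lands well below 0.01, the same order-of-magnitude collapse the notebook shows (0.30 -> 0.005).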
Claude finished @cmungall's task —— View job

PR Review: OBO/OWL/SSSOM Ontology Converters

This is a substantial and well-crafted PR that adds comprehensive ontology conversion capabilities to boomer-py. The implementation is impressive in scope and quality.

✅ Strengths

Excellent Architecture & Design
Robust OBO Parser (
Pull request overview
Adds direct ontology/mapping ingestion (OBO, OWL via py-horned-owl, and SSSOM TSV) into boomer KBs, plus seed-based neighborhood extraction and CLI/loader support.
Changes:
- Introduces OBO/OWL converters and an SSSOM converter with configurable probability handling and transforms.
- Extends loader + CLI to detect/load/convert/solve these formats and adds seed-neighborhood extraction (extract_neighborhood).
- Adds extensive tests and documentation/tutorial content for end-to-end ontology alignment workflows.
Reviewed changes
Copilot reviewed 19 out of 21 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_sssom_converter.py | Adds unit/integration tests for SSSOM parsing, config, transforms, and KB conversion. |
| tests/test_splitter.py | Adds tests for new neighborhood extraction behavior in the splitter. |
| tests/test_ontology_converter.py | Adds tests for OBO/OWL parsing, conversion parity, loader/CLI integration. |
| tests/test_loaders.py | Updates supported formats expectations to include ontology/mapping formats. |
| tests/input/test_ontology.ofn | Adds OWL Functional Syntax fixture for OWL-to-KB tests. |
| tests/input/test_ontology.obo | Adds OBO fixture for OBO parser/conversion tests. |
| tests/input/test_mappings.sssom.tsv | Adds SSSOM TSV fixture with metadata + mappings. |
| tests/input/sssom_config.yaml | Adds YAML config fixture for SSSOM conversion rules/transforms. |
| tests/input/ontology_config.yaml | Adds YAML config fixture for ontology conversion probabilities/options. |
| src/boomer/sssom_converter.py | Implements SSSOM TSV parsing + row→PFact conversion + config loading. |
| src/boomer/splitter.py | Adds extract_neighborhood() for seed-based subgraph extraction. |
| src/boomer/ontology_converter.py | Implements hand-rolled OBO parser and OWL conversion backend. |
| src/boomer/loaders.py | Extends format detection and loading to support OBO/OWL/SSSOM. |
| src/boomer/cli.py | Extends format choices and enhances extract with seed neighborhood options. |
| pyproject.toml | Adds py-horned-owl dependency and pyboomer script alias. |
| mkdocs.yml | Adds ontology conversion docs and tutorial to nav. |
| docs/tutorial/ontology-alignment.ipynb | Adds executed end-to-end tutorial notebook (OBO+OWL+SSSOM→merge→solve). |
| docs/ontology-conversion.md | Adds guide for OBO/OWL conversion and seed-based extraction. |
| docs/formats.md | Documents new supported formats and how they map to KB facts. |
| .gitignore | Ignores .worktrees/. |
```python
for line in text.splitlines():
    if line.startswith("#"):
        # Strip leading '#' (and optional space) to get YAML content
        meta_lines.append(line[1:])
    else:
        tsv_lines.append(line)

# Parse YAML metadata
metadata: dict[str, Any] = {}
if meta_lines:
    yaml_text = "\n".join(meta_lines)
    parsed = yaml.safe_load(yaml_text)
    if isinstance(parsed, dict):
        metadata = parsed

# Parse TSV rows
tsv_text = "\n".join(tsv_lines)
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)
```
parse_sssom_tsv() currently appends empty/non-data lines into tsv_lines. If the TSV contains blank lines, csv.DictReader can produce rows with None values; later code calls .strip() on row.get("confidence", ""), which will crash when the value is None. Filter out empty TSV lines (and/or normalize None values to "") before feeding the data to DictReader, e.g., skip a line when it is empty or whitespace-only.
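The fix the reviewer suggests can be sketched as follows. The helper name is hypothetical; the key change is the `line.strip()` guard so DictReader never sees blank data lines:

```python
import csv
import io


def split_sssom_lines(text: str) -> tuple[list[str], list[str]]:
    """Separate '#'-prefixed YAML metadata lines from TSV data lines,
    dropping blank/whitespace-only lines from the TSV portion."""
    meta_lines: list[str] = []
    tsv_lines: list[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            meta_lines.append(line[1:])
        elif line.strip():  # the fix: skip empty/whitespace-only lines
            tsv_lines.append(line)
    return meta_lines, tsv_lines
```

With the guard in place, a trailing or embedded blank line in the mapping block cannot produce a row whose values are None.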
```python
prop_iri = str(ax.ann.ap.first)
subject_iri = str(ax.subject)
subject_id = _iri_to_curie(subject_iri, prefix_map)
```
subject_id conversion for AnnotationAssertion bypasses the local curie() helper that strips angle brackets, so subjects serialized as <...> may not match the prefix map / OBO IRI logic (and can leak raw bracketed IRIs into the KB). Use the same normalization path as the other axiom handlers (e.g., derive subject_id via the curie(...) helper), and ensure property IRIs are normalized consistently (strip angle brackets) before comparing to _RDFS_LABEL, _HAS_DBXREF, and SKOS IRIs.
Suggested change:

```python
raw_prop_iri = str(ax.ann.ap.first)
# Normalize property IRI by stripping angle brackets if present
if raw_prop_iri.startswith("<") and raw_prop_iri.endswith(">"):
    prop_iri = raw_prop_iri[1:-1]
else:
    prop_iri = raw_prop_iri
subject_iri = str(ax.subject)
# Use the same CURIE normalization path as other axioms
subject_id = curie(subject_iri)
```
```python
seen_ids: set[str] = set()
for row in rows:
    for col in ("subject_id", "object_id"):
        eid = row.get(col, "")
        if eid and eid not in seen_ids and ":" in eid:
            seen_ids.add(eid)
            prefix = id_prefix(eid)
            facts.append(MemberOfDisjointGroup(sub=eid, group=prefix))
```
Disjoint-group facts are generated from all parsed rows, even for mappings that are later skipped by rules/prefix filters/min-probability. This can introduce MemberOfDisjointGroup facts for entities that otherwise don’t appear in pfacts, making the resulting KB harder to reason about/debug (and potentially bloating facts). Consider generating groups from the entities that actually made it into the KB (e.g., collect IDs from pfacts and/or from labels tied to retained mappings) rather than iterating over raw rows.
Suggested change:

```python
# Only generate disjoint-group facts for entities that actually
# appear in the KB (i.e., those that made it into ``pfacts``).
entity_ids: set[str] = set()
for pf in pfacts:
    if isinstance(pf, (EquivalentTo, ProperSubClassOf)):
        # Both fact types are binary relations with ``sub`` / ``obj``.
        entity_ids.add(pf.sub)
        entity_ids.add(pf.obj)
for eid in entity_ids:
    if eid and ":" in eid:
        prefix = id_prefix(eid)
        facts.append(MemberOfDisjointGroup(sub=eid, group=prefix))
```
```python
except FileNotFoundError:
    raise click.ClickException(f"IDs file '{ids_file}' not found")
```
The updated IDS_FILE reading path only handles FileNotFoundError. Other common I/O failures (e.g., permission errors, encoding errors, directory passed instead of file) will raise unhandled exceptions and bypass Click’s user-friendly error reporting. Restore a broader exception handler (e.g., except OSError as e: / except Exception as e:) and re-raise as click.ClickException with context.
Suggested change:

```python
except FileNotFoundError:
    raise click.ClickException(f"IDs file '{ids_file}' not found")
except OSError as e:
    raise click.ClickException(f"Failed to read IDs file '{ids_file}': {e}")
```
```python
click.echo(f"Original KB: {len(kb.facts)} facts, {len(kb.pfacts)} pfacts")
click.echo(f"Extracted KB: {len(sub_kb.facts)} facts, {len(sub_kb.pfacts)} pfacts")
click.echo(f"Used {len(entity_ids)} entity IDs from {ids_file}")
click.echo(f"Seeds: {sorted(entity_ids)}")
```
Printing the fully-sorted seed list can become very expensive/noisy for large IDS files (and can dominate CLI output). Prefer printing a count plus a small preview (first N seeds) or only printing the list when the size is below a reasonable threshold.
Suggested change:

```python
sorted_seeds = sorted(entity_ids)
preview_limit = 10
if len(sorted_seeds) <= preview_limit:
    click.echo(f"Seeds ({len(sorted_seeds)}): {sorted_seeds}")
else:
    click.echo(
        f"Seeds ({len(sorted_seeds)} total, showing first {preview_limit}): "
        f"{sorted_seeds[:preview_limit]}"
    )
```
- Move example OBO/OWL/SSSOM files to docs/tutorial/ontology-alignment/ instead of creating them with bash heredocs
- Simplify notebook: cat files to show them, then merge+solve directly
- Remove redundant per-file convert steps
- Fewer, cleaner cells for better mkdocs rendering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename fixture dir to ontology-alignment-files/ to avoid collision with the notebook's URL path
- Use IPython Markdown display with fenced code blocks instead of bash cat, so files render as single syntax-highlighted blocks
- OBO rendered as yaml, OWL as turtle, SSSOM as tsv

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move show() helper to docs/tutorial/notebook_utils.py
- CLI merge/solve outputs saved to files, then rendered with show() instead of raw bash output, for consistent syntax-highlighted blocks
- Solution rendered as both YAML and TSV

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
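A helper like show() can be sketched as below. This is a guess at its shape, not the actual notebook_utils.py code: it reads a file and produces a fenced, language-tagged markdown block. In the notebook the returned string would be wrapped in IPython.display.Markdown; returning plain text keeps the sketch dependency-free:

```python
from pathlib import Path

# Build the fence marker programmatically so this example does not
# contain a literal triple-backtick sequence.
FENCE = "`" * 3


def show(path: str, lang: str = "") -> str:
    """Render a file's contents as a fenced, syntax-highlighted block."""
    body = Path(path).read_text().rstrip()
    return f"{FENCE}{lang}\n{body}\n{FENCE}"
```

Calling e.g. `show("merged.yaml", "yaml")` yields a single highlighted block when rendered by mkdocs, instead of raw bash output.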
- Note in OBO section that xrefs become EquivalentTo pfacts at 0.7, linking to ontology-conversion.md for the full mapping table
- Add admonition explaining why some mappings appear twice (xref + SSSOM) with link to OntologyConverterConfig docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI workflow runs `just test` but the justfile only had notebook recipes. Add test, doctest, lint, format, and lint-fix recipes matching the Makefile equivalents.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The extract command's help text and output format changed but the tests were not updated. Fix assertions to match the current output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify that MemberOfDisjointGroup facts are generated as a post-processing step by all converters (OBO, OWL, SSSOM), not during parsing. Every CURIE is split on ':' and assigned to a disjoint group by prefix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
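The prefix-based grouping described here can be sketched as a small post-processing pass. The function name and tuple output are illustrative, not the converters' actual API; the ':'-split rule and the colon guard are the behaviors the commit describes:

```python
def disjoint_groups(entity_ids: list[str]) -> list[tuple[str, str]]:
    """Assign each CURIE to a disjoint group named after its prefix.

    IDs without a colon are skipped (there is no prefix to group by),
    and duplicates produce only one fact.
    """
    facts: list[tuple[str, str]] = []
    seen: set[str] = set()
    for eid in entity_ids:
        if eid and ":" in eid and eid not in seen:
            seen.add(eid)
            # Split on the first ':' only, so IDs like "obo:MONDO:1"
            # still yield a single leading prefix.
            facts.append((eid, eid.split(":", 1)[0]))
    return facts
```

Because every converter applies this after parsing, two entities with the same prefix always land in the same disjoint group regardless of input format.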
Add two new output formats for boomer solutions:

- SSSOM TSV: standard ontology mapping format with metadata header, curie_map, and mapping_justification columns
- OBOGraphs JSON: standard graph exchange format with nodes, edges, and probability metadata

Wire both into the CLI (-O sssom / -O obographs), update output_dir to use the FORMAT_EXTENSIONS dict, and add export examples to the ontology-alignment tutorial. Includes 27 unit tests, 3 doctests, and roundtrip verification against parse_sssom_tsv.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
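An SSSOM TSV writer of the kind described can be sketched as below. The function name, column list, and header layout are illustrative assumptions (the real exporter may emit more metadata keys); the structure (commented YAML header with a curie_map, then tab-separated columns including mapping_justification) follows the commit description:

```python
import csv
import io


def write_sssom(mappings: list[dict[str, str]], curie_map: dict[str, str]) -> str:
    """Serialize mappings to SSSOM TSV with a '#'-commented YAML header."""
    buf = io.StringIO()
    # Metadata header: each line prefixed with '#', YAML underneath.
    buf.write("#curie_map:\n")
    for prefix, iri in sorted(curie_map.items()):
        buf.write(f"#  {prefix}: {iri}\n")
    writer = csv.DictWriter(
        buf,
        fieldnames=[
            "subject_id", "predicate_id", "object_id",
            "mapping_justification", "confidence",
        ],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for m in mappings:
        writer.writerow(m)
    return buf.getvalue()
```

Because the header lines start with '#' and the data is plain TSV, output in this shape can be fed straight back through a parse_sssom_tsv-style reader, which is what the roundtrip tests mentioned above verify.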
Summary
- OntologyConverterConfig (pydantic) for per-prefix xref probabilities, SKOS match probabilities, obsolete filtering, and disjoint group generation
- docs/ontology-conversion.md guide and pre-executed docs/tutorial/ontology-alignment.ipynb end-to-end tutorial (MONDO OBO + ORDO OWL + SSSOM mappings → merge → solve)

Test plan

- uv run pytest tests/test_ontology_converter.py -v — 69 passed
- uv run pytest --doctest-modules src/boomer/ontology_converter.py — 7 passed
- uv run pytest -v — 307 passed, 1 skipped (2 pre-existing CLI extract failures)
- uv run ruff check src/boomer/ontology_converter.py — clean
- uv run mkdocs build — clean

🤖 Generated with Claude Code