Commit 8efac0f

Merge main into template-update-v2-quadbio-cell-annotator-v0.6.0
Resolved conflicts:
- .pre-commit-config.yaml: Use template's pyproject-fmt v2.6.0
- pyproject.toml: Merge test dependencies (kept all providers + colors + coverage>=7.10 + flaky), kept -W flag for docs build, use Python 3.11-3.13 (dropped 3.10), removed duplicate overrides section
- .readthedocs.yaml: Keep template's hatch-based build
- .github/workflows/test.yaml: Keep template's improved test commands with codecov v5, add API key env vars from main
2 parents a192f33 + f3d0133 commit 8efac0f

51 files changed: +8152 -327 lines changed

.github/copilot-instructions.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
# Copilot Instructions for CellAnnotator
2+
3+
## Important Notes
4+
- Avoid drafting summary documents or endless markdown files. Just summarize in chat what you did, why, and any open questions.
5+
- Don't update Jupyter notebooks - those are managed manually.
6+
- When running terminal commands, activate the appropriate environment first (`mamba activate cell_annotator`).
7+
- Rather than making assumptions, ask for clarification when uncertain.
8+
- **GitHub workflows**: Use GitHub CLI (`gh`) when possible. For GitHub MCP server tools, ensure Docker Desktop is running first (`open -a "Docker Desktop"`).
9+
10+
## Project Overview
11+
12+
**CellAnnotator** is an scverse ecosystem package for automated cell type annotation in scRNA-seq data using Large Language Models (LLMs). It's provider-agnostic, supporting OpenAI, Google Gemini, and Anthropic Claude. The tool sends cluster marker genes (not expression values) to LLMs, which return structured cell type annotations with confidence scores.
13+
14+
### Domain Context (Brief)
15+
- **AnnData**: Standard single-cell data structure. Contains `.X`, `.obs` (cell metadata), `.var` (gene metadata).
16+
- **Marker genes**: Differentially expressed genes that characterize cell types/clusters (computed via scanpy).
17+
- **LLM providers**: OpenAI (GPT), Google (Gemini), Anthropic (Claude). Uses Pydantic for structured outputs.
18+
- **Workflow**: 1) Compute marker genes per cluster, 2) Send to LLM with biological context, 3) Get structured annotations, 4) Harmonize across samples.
19+
20+
### Key Dependencies
21+
- **Core**: scanpy, pydantic, python-dotenv, rich
22+
- **LLM providers**: openai, anthropic, google-genai (all optional)
23+
- **Optional**: rapids-singlecell (GPU), colorspacious (colors)
24+
25+
## Architecture & Code Organization
26+
27+
### Module Structure (follows scverse conventions)
28+
- Use `AnnData` objects as the primary data structure
29+
- Type annotations use modern syntax: `str | None` instead of `Optional[str]`
30+
- Supports Python 3.11, 3.12, 3.13 (see `pyproject.toml`)
31+
- Avoid local imports unless necessary for circular import resolution
32+
33+
### Core Components
34+
1. **`src/cell_annotator/model/cell_annotator.py`**: Main `CellAnnotator` class
35+
- Orchestrates annotation across multiple samples
36+
- `annotate_clusters()`: Main entry point for annotation
37+
2. **`src/cell_annotator/model/sample_annotator.py`**: `SampleAnnotator` class
38+
- Handles annotation for a single sample
39+
- Computes marker genes, queries LLM, stores results
40+
3. **`src/cell_annotator/model/base_annotator.py`**: `BaseAnnotator` abstract class
41+
- Shared LLM provider logic and validation
42+
4. **`src/cell_annotator/_response_formats.py`**: Pydantic models for structured LLM outputs
43+
5. **`src/cell_annotator/_prompts.py`**: LLM prompt templates
44+
6. **`src/cell_annotator/utils.py`**: Helper functions (marker gene filtering, formatting)
45+
46+
## Development Workflow
47+
48+
### Environment Management (Hatch-based)
49+
```bash
50+
# Testing - NEVER use pytest directly
51+
hatch test # test with highest Python version
52+
hatch test --all # test the full matrix: Python 3.11 & 3.13, plus pre-release deps
53+
54+
# Documentation
55+
hatch run docs:build # build Sphinx docs
56+
hatch run docs:open # open in browser
57+
hatch run docs:clean # clean build artifacts
58+
59+
# Environment inspection
60+
hatch env show # list environments
61+
```
62+
63+
### Testing Strategy
64+
- Test matrix defined in `[[tool.hatch.envs.hatch-test.matrix]]` in `pyproject.toml`
65+
- Tests Python 3.11 & 3.13 with stable deps, 3.13 with pre-release deps
66+
- Tests live in `tests/`, use pytest with `@pytest.mark.real_llm_query` for actual LLM calls
67+
- Run via `hatch test` to ensure proper environment isolation
68+
- Optional dependencies tested via `features = ["test"]` which includes all providers
69+
70+
### Code Quality Tools
71+
- **Ruff**: Linting and formatting (120 char line length)
72+
- **Biome**: JSON/JSONC formatting with trailing commas
73+
- **Pre-commit**: Auto-runs ruff, biome. Install with `pre-commit install`
74+
- Use `git pull --rebase` if pre-commit.ci commits to your branch
75+
76+
## Key Configuration Files
77+
78+
### `pyproject.toml`
79+
- **Build**: `hatchling` with `hatch-vcs` for git-based versioning
80+
- **Dependencies**: Minimal core (scanpy, pydantic); provider packages are optional extras
81+
- **Extras**: `[openai]`, `[anthropic]`, `[gemini]`, `[all-providers]`, `[test]`, `[doc]`
82+
- **Ruff**: 120 char line length, NumPy docstring convention
83+
- **Test matrix**: Python 3.11 & 3.13
84+
85+
### Version Management
86+
- Version from git tags via `hatch-vcs`
87+
- Release: Create GitHub release with tag `vX.X.X`
88+
- Follows **Semantic Versioning**
89+
90+
## Project-Specific Patterns
91+
92+
### Basic Usage
93+
```python
94+
from cell_annotator import CellAnnotator
95+
96+
# Annotate across multiple samples
97+
cell_ann = CellAnnotator(
98+
adata,
99+
species="human",
100+
tissue="heart",
101+
cluster_key="leiden",
102+
sample_key="batch",
103+
provider="openai", # or "gemini", "anthropic"
104+
).annotate_clusters()
105+
106+
# Results in adata.obs['cell_type_predicted']
107+
```
108+
109+
### LLM Provider Selection
110+
- Providers: `"openai"` (default), `"gemini"`, `"anthropic"`
111+
- API keys via environment variables or `.env` file (loaded with python-dotenv)
112+
- Models: `gpt-4o-mini`, `gemini-2.5-flash-lite`, `claude-haiku-4-5` (defaults)
113+
- Anthropic is the most expensive provider ($1/$5 per 1M tokens); minimize its usage in tests
114+
- All providers use model aliases that auto-update to latest snapshots
115+
116+
### Structured Outputs with Pydantic
117+
- `CellTypeListOutput`: List of expected cell types
118+
- `ExpectedMarkerGeneOutput`: Dict of cell type → marker genes
119+
- Ensures reliable, parseable LLM responses
120+
121+
### AnnData Conventions
122+
- Marker genes computed via `scanpy.tl.rank_genes_groups()`
123+
- Results stored in `adata.obs[cell_type_key]` (default: `"cell_type_predicted"`)
124+
- Confidence scores in `adata.obs[f"{cell_type_key}_confidence"]`
125+
126+
## Common Gotchas
127+
128+
1. **Hatch for testing**: Always use `hatch test`, never standalone `pytest`. CI matches the hatch test matrix.
129+
2. **API keys**: Must be set as env vars or in `.env` file. Package auto-loads via python-dotenv.
130+
3. **Provider packages**: Install provider extras (`pip install cell-annotator[openai]`) to use specific LLMs.
131+
4. **Real LLM tests**: Use `@pytest.mark.real_llm_query` and skip in CI unless explicitly enabled.
132+
5. **Marker gene filtering**: Package automatically filters marker genes to genes present in `adata.var_names`.
133+
6. **Pre-commit conflicts**: Use `git pull --rebase` to integrate pre-commit.ci fixes.
134+
7. **Line length**: Ruff set to 120 chars, but keep docstrings readable (~80 chars per line).
135+
136+
## Related Resources
137+
138+
- **Contributing guide**: `docs/contributing.md`
139+
- **Tutorials**: `docs/notebooks/tutorials/`
140+
- **OpenAI structured outputs**: https://platform.openai.com/docs/guides/structured-outputs
141+
- **scanpy docs**: https://scanpy.readthedocs.io/
142+
- **Pydantic docs**: https://docs.pydantic.dev/
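
Note on the instructions file above: gotcha 5 says marker genes are filtered to those present in `adata.var_names`, and the AnnData conventions name the confidence column `f"{cell_type_key}_confidence"`. The following is a minimal sketch of both ideas, using plain Python containers in place of an `AnnData` object. `filter_markers` and `confidence_key` are hypothetical helper names chosen for illustration, not functions taken from the package, and the gene lists are made-up example data.

```python
from collections.abc import Iterable


def filter_markers(markers: dict[str, list[str]], var_names: Iterable[str]) -> dict[str, list[str]]:
    """Keep only the marker genes that occur in var_names, preserving their order.

    Hypothetical helper: a stand-in for the filtering the instructions describe.
    """
    present = set(var_names)  # set lookup keeps the filter linear in the number of genes
    return {cell_type: [gene for gene in genes if gene in present] for cell_type, genes in markers.items()}


def confidence_key(cell_type_key: str = "cell_type_predicted") -> str:
    """Build the confidence column name from the cell type key, as described above."""
    return f"{cell_type_key}_confidence"


# Made-up example data; in practice var_names would come from adata.var_names.
expected = {"T cell": ["CD3E", "CD2", "FAKE1"], "B cell": ["MS4A1", "CD79A"]}
var_names = ["CD3E", "MS4A1", "CD79A", "GAPDH"]

filtered = filter_markers(expected, var_names)
print(filtered)
print(confidence_key())
```

Genes absent from `var_names` (here `CD2` and `FAKE1`) are dropped, while the cell type keys themselves are kept even if their list ends up empty.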

.github/workflows/test.yaml

Lines changed: 9 additions & 0 deletions
@@ -61,6 +61,13 @@ jobs:
6161
name: ${{ matrix.env.label }}
6262
runs-on: ${{ matrix.os }}
6363

64+
env:
65+
OS: ${{ matrix.os }}
66+
PYTHON: ${{ matrix.python }}
67+
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
68+
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
69+
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
70+
6471
steps:
6572
- uses: actions/checkout@v4
6673
with:
@@ -87,6 +94,8 @@ jobs:
8794
uvx hatch run ${{ matrix.env.name }}:coverage xml # create report for upload
8895
- name: Upload coverage
8996
uses: codecov/codecov-action@v5
97+
with:
98+
token: ${{ secrets.CODECOV_TOKEN }}
9099

91100
# Check that all tests defined above pass. This makes it easy to set a single "required" test in branch
92101
# protection instead of having to update it frequently. See https://github.com/re-actors/alls-green#why.
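
Note on the workflow change above: the new `env:` block exposes the provider API keys to the test job. As a rough sketch of how a test helper might decide which providers can be queried for real, assuming only the three variable names shown in the workflow, one could write something like the following. `PROVIDER_ENV_VARS` and `providers_with_keys` are hypothetical names, not part of the package.

```python
import os
from collections.abc import Mapping

# Environment variable names as they appear in the workflow's env block.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def providers_with_keys(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose API key variable is set and non-empty.

    An empty or whitespace-only value is treated as missing, since an unset
    secret can reach the job as an empty string.
    """
    return [provider for provider, var in PROVIDER_ENV_VARS.items() if environ.get(var, "").strip()]


# Example with an explicit mapping instead of the real environment.
example = providers_with_keys({"OPENAI_API_KEY": "sk-test", "GEMINI_API_KEY": ""})
print(example)
```

A test marked with `@pytest.mark.real_llm_query` could consult such a helper and skip itself when its provider is not in the returned list.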

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -19,3 +19,13 @@ __pycache__/
1919
# docs
2020
/docs/generated/
2121
/docs/_build/
22+
/docs/notebooks/tests/
23+
24+
# Jupyter
25+
.ipynb_checkpoints
26+
27+
# Data files
28+
*.h5ad
29+
30+
# Environment files
31+
*.env

.pylintrc

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
[FORMAT]
2+
max-line-length=120

CHANGELOG.md

Lines changed: 75 additions & 4 deletions
@@ -3,13 +3,84 @@
33
All notable changes to this project will be documented in this file.
44

55
The format is based on [Keep a Changelog][],
6-
and this project adheres to [Semantic Versioning][].
6+
and this project adheres to [Semantic Versioning][]. Full commit history is available in the [commit logs][].
77

88
[keep a changelog]: https://keepachangelog.com/en/1.0.0/
99
[semantic versioning]: https://semver.org/spec/v2.0.0.html
10+
[commit logs]: https://github.com/quadbio/cell-annotator/commits
1011

11-
## [Unreleased]
12+
## Version 0.2
1213

13-
### Added
14+
### Unreleased
1415

15-
- Basic tool, preprocessing and plotting functions
16+
### 0.2.0 (2025-07-26)
17+
18+
#### Added
19+
- Added a generic LLM backend that supports OpenAI, Claude and Gemini models {pr}`53`
20+
- Add the possibility to provide the current gene set when querying expected marker genes {pr}`53`
21+
- Add the possibility to filter expected marker genes to those present in AnnData {pr}`53`
22+
- Add a new tutorial on spatial data annotation {pr}`53`
23+
- Added and improved tests for the new classes (e.g. `ObsBeautifier`, `LLMBackend`) {pr}`53`
24+
- For each backend, add a small `test_query` method which can be used for diagnostics {pr}`53`
25+
26+
#### Changed
27+
- Moved the `reorder_and_color` utility into a new class: `ObsBeautifier` {pr}`54`
28+
- Improved class representations throughout the package {pr}`53`
29+
30+
#### Fixed
31+
- Fix the `ObsBeautifier` modifying cluster colors when only their order should be updated {pr}`54`
32+
33+
## Version 0.1
34+
35+
### 0.1.5 (2025-07-24)
36+
37+
#### Changed
38+
- Update tutorials to use `gpt-4.1` {pr}`51`
39+
40+
### 0.1.4 (2025-03-28)
41+
42+
#### Added
43+
44+
- Use `rapids_singlecell`, `cupy` and `cuml` to accelerate cluster marker computation on GPUs {pr}`37`.
45+
- Add the possibility to softly enforce adherence to expected cell types {pr}`42`.
46+
47+
#### Changed
48+
49+
- Run cluster label harmonization also for a single sample {pr}`37`.
50+
- Re-format prompts into a dataclass {pr}`42`.
51+
52+
#### Fixed
53+
54+
- Fixed a bug with integer sample labels {pr}`37`.
55+
56+
### 0.1.3 (2025-02-07)
57+
58+
#### Added
59+
60+
- Added tests for the single-sample case {pr}`29`.
61+
- Refer to issues and PRs with sphinx {pr}`30`.
62+
63+
#### Removed
64+
65+
- Removed `tenacity` for query retries {pr}`28`.
66+
67+
#### Fixed
68+
69+
- Fixed `_get_annotation_summary_string` for the single-sample case {pr}`29`.
70+
- Fixed the expected cell type marker test by adding additional marker genes {pr}`28`.
71+
72+
### 0.1.2 (2025-01-29)
73+
74+
#### Added
75+
76+
- Update the documentation, in particular the installation instructions.
77+
78+
### 0.1.1 (2025-01-29)
79+
80+
#### Added
81+
82+
- Initial push to PyPI
83+
84+
### 0.1.0 (2025-01-29)
85+
86+
Initial package release

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
MIT License
22

3-
Copyright (c) 2024, Marius Lange
3+
Copyright (c) 2024, QuaDBio Lab
44

55
Permission is hereby granted, free of charge, to any person obtaining a copy
66
of this software and associated documentation files (the "Software"), to deal
