| 1 | +# Copilot Instructions for CellAnnotator |
| 2 | + |
| 3 | +## Important Notes |
| 4 | +- Avoid drafting summary documents or endless markdown files. Just summarize in chat what you did, why, and any open questions. |
| 5 | +- Don't update Jupyter notebooks - those are managed manually. |
| 6 | +- When running terminal commands, activate the appropriate environment first (`mamba activate cell_annotator`). |
| 7 | +- Rather than making assumptions, ask for clarification when uncertain. |
| 8 | +- **GitHub workflows**: Use GitHub CLI (`gh`) when possible. For GitHub MCP server tools, ensure Docker Desktop is running first (`open -a "Docker Desktop"`). |
| 9 | + |
| 10 | +## Project Overview |
| 11 | + |
| 12 | +**CellAnnotator** is an scverse ecosystem package for automated cell type annotation in scRNA-seq data using Large Language Models (LLMs). It's provider-agnostic, supporting OpenAI, Google Gemini, and Anthropic Claude. The tool sends cluster marker genes (not expression values) to LLMs, which return structured cell type annotations with confidence scores. |
| 13 | + |
| 14 | +### Domain Context (Brief) |
| 15 | +- **AnnData**: Standard single-cell data structure. Contains `.X` (expression matrix), `.obs` (cell metadata), `.var` (gene metadata).
| 16 | +- **Marker genes**: Differentially expressed genes that characterize cell types/clusters (computed via scanpy). |
| 17 | +- **LLM providers**: OpenAI (GPT), Google (Gemini), Anthropic (Claude). Uses Pydantic for structured outputs. |
| 18 | +- **Workflow**: 1) Compute marker genes per cluster, 2) Send to LLM with biological context, 3) Get structured annotations, 4) Harmonize across samples. |
| 19 | + |
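As a rough illustration of workflow step 2, the sketch below turns per-cluster marker genes into a plain-text block that could be placed in a prompt. The function name `format_marker_genes`, the `n_top` parameter, and the output layout are hypothetical and are not taken from the package; the actual templates live in `_prompts.py`.

```python
def format_marker_genes(markers: dict[str, list[str]], n_top: int = 5) -> str:
    """Render per-cluster marker genes as plain text (hypothetical helper)."""
    lines = []
    for cluster, genes in sorted(markers.items()):
        # Only gene names are sent, never expression values.
        top = genes[:n_top]
        lines.append(f"Cluster {cluster}: {', '.join(top)}")
    return "\n".join(lines)


# Toy input: cluster label -> ranked marker genes.
markers = {"0": ["CD3D", "CD3E", "IL7R"], "1": ["MS4A1", "CD79A"]}
prompt_block = format_marker_genes(markers, n_top=2)
print(prompt_block)
```

Keeping the input to gene names only mirrors the point made in the overview: the LLM sees marker lists, not the expression matrix.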
| 20 | +### Key Dependencies
| 21 | +- **Core**: scanpy, pydantic, python-dotenv, rich |
| 22 | +- **LLM providers**: openai, anthropic, google-genai (all optional) |
| 23 | +- **Optional**: rapids-singlecell (GPU), colorspacious (colors) |
| 24 | + |
| 25 | +## Architecture & Code Organization |
| 26 | + |
| 27 | +### Module Structure (follows scverse conventions) |
| 28 | +- Use `AnnData` objects as primary data structure |
| 29 | +- Type annotations use modern syntax: `str | None` instead of `Optional[str]` |
| 30 | +- Supports Python 3.11, 3.12, 3.13 (see `pyproject.toml`) |
| 31 | +- Avoid local imports unless necessary for circular import resolution |
| 32 | + |
| 33 | +### Core Components |
| 34 | +1. **`src/cell_annotator/model/cell_annotator.py`**: Main `CellAnnotator` class |
| 35 | + - Orchestrates annotation across multiple samples |
| 36 | + - `annotate_clusters()`: Main entry point for annotation |
| 37 | +2. **`src/cell_annotator/model/sample_annotator.py`**: `SampleAnnotator` class |
| 38 | + - Handles annotation for a single sample
| 39 | + - Computes marker genes, queries LLM, stores results |
| 40 | +3. **`src/cell_annotator/model/base_annotator.py`**: `BaseAnnotator` abstract class |
| 41 | + - Shared LLM provider logic and validation |
| 42 | +4. **`src/cell_annotator/_response_formats.py`**: Pydantic models for structured LLM outputs |
| 43 | +5. **`src/cell_annotator/_prompts.py`**: LLM prompt templates |
| 44 | +6. **`src/cell_annotator/utils.py`**: Helper functions (marker gene filtering, formatting) |
| 45 | + |
| 46 | +## Development Workflow |
| 47 | + |
| 48 | +### Environment Management (Hatch-based) |
| 49 | +```bash |
| 50 | +# Testing - NEVER use pytest directly |
| 51 | +hatch test # test with highest Python version |
| 52 | +hatch test --all # test the full matrix: Python 3.11 & 3.13, plus pre-release deps
| 53 | + |
| 54 | +# Documentation |
| 55 | +hatch run docs:build # build Sphinx docs |
| 56 | +hatch run docs:open # open in browser |
| 57 | +hatch run docs:clean # clean build artifacts |
| 58 | + |
| 59 | +# Environment inspection |
| 60 | +hatch env show # list environments |
| 61 | +``` |
| 62 | + |
| 63 | +### Testing Strategy |
| 64 | +- Test matrix defined in `[[tool.hatch.envs.hatch-test.matrix]]` in `pyproject.toml` |
| 65 | +- Tests Python 3.11 & 3.13 with stable deps, 3.13 with pre-release deps |
| 66 | +- Tests live in `tests/`, use pytest with `@pytest.mark.real_llm_query` for actual LLM calls |
| 67 | +- Run via `hatch test` to ensure proper environment isolation |
| 68 | +- Optional dependencies tested via `features = ["test"]` which includes all providers |
| 69 | + |
| 70 | +### Code Quality Tools |
| 71 | +- **Ruff**: Linting and formatting (120 char line length) |
| 72 | +- **Biome**: JSON/JSONC formatting with trailing commas |
| 73 | +- **Pre-commit**: Auto-runs ruff, biome. Install with `pre-commit install` |
| 74 | +- Use `git pull --rebase` if pre-commit.ci commits to your branch |
| 75 | + |
| 76 | +## Key Configuration Files |
| 77 | + |
| 78 | +### `pyproject.toml` |
| 79 | +- **Build**: `hatchling` with `hatch-vcs` for git-based versioning |
| 80 | +- **Dependencies**: Minimal core (scanpy, pydantic); provider packages are optional extras |
| 81 | +- **Extras**: `[openai]`, `[anthropic]`, `[gemini]`, `[all-providers]`, `[test]`, `[doc]` |
| 82 | +- **Ruff**: 120 char line length, NumPy docstring convention |
| 83 | +- **Test matrix**: Python 3.11 & 3.13 |
| 84 | + |
| 85 | +### Version Management |
| 86 | +- Version from git tags via `hatch-vcs` |
| 87 | +- Release: Create GitHub release with tag `vX.X.X` |
| 88 | +- Follows **Semantic Versioning** |
| 89 | + |
| 90 | +## Project-Specific Patterns |
| 91 | + |
| 92 | +### Basic Usage |
| 93 | +```python |
| 94 | +from cell_annotator import CellAnnotator |
| 95 | + |
| 96 | +# Annotate across multiple samples |
| 97 | +cell_ann = CellAnnotator( |
| 98 | + adata, |
| 99 | + species="human", |
| 100 | + tissue="heart", |
| 101 | + cluster_key="leiden", |
| 102 | + sample_key="batch", |
| 103 | + provider="openai", # or "gemini", "anthropic" |
| 104 | +).annotate_clusters() |
| 105 | + |
| 106 | +# Results in adata.obs['cell_type_predicted'] |
| 107 | +``` |
| 108 | + |
| 109 | +### LLM Provider Selection |
| 110 | +- Providers: `"openai"` (default), `"gemini"`, `"anthropic"` |
| 111 | +- API keys via environment variables or `.env` file (loaded with python-dotenv) |
| 112 | +- Models: `gpt-4o-mini`, `gemini-2.5-flash-lite`, `claude-haiku-4-5` (defaults) |
| 113 | +- Anthropic is the most expensive provider ($1/$5 per 1M tokens), so minimize its usage in tests
| 114 | +- All providers use model aliases that auto-update to latest snapshots |
| 115 | + |
| 116 | +### Structured Outputs with Pydantic |
| 117 | +- `CellTypeListOutput`: List of expected cell types |
| 118 | +- `ExpectedMarkerGeneOutput`: Dict of cell type → marker genes |
| 119 | +- Ensures reliable, parseable LLM responses |
| 120 | + |
| 121 | +### AnnData Conventions |
| 122 | +- Marker genes computed via `scanpy.tl.rank_genes_groups()` |
| 123 | +- Results stored in `adata.obs[cell_type_key]` (default: `"cell_type_predicted"`) |
| 124 | +- Confidence scores in `adata.obs[f"{cell_type_key}_confidence"]` |
| 125 | + |
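The naming convention above can be sketched with a plain dictionary standing in for `adata.obs`. The helper `result_columns` is hypothetical and only illustrates how the confidence column name is derived from `cell_type_key`.

```python
def result_columns(cell_type_key: str = "cell_type_predicted") -> tuple[str, str]:
    """Return the (annotation, confidence) column names for a given key."""
    return cell_type_key, f"{cell_type_key}_confidence"


# A plain dict stands in for adata.obs here; real code would use the AnnData object.
obs = {
    "cell_type_predicted": ["T cell", "B cell"],
    "cell_type_predicted_confidence": ["high", "medium"],
}

label_col, conf_col = result_columns()
print(obs[label_col], obs[conf_col])
```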
| 126 | +## Common Gotchas |
| 127 | + |
| 128 | +1. **Hatch for testing**: Always use `hatch test`, never standalone `pytest`. CI matches hatch test matrix. |
| 129 | +2. **API keys**: Must be set as env vars or in `.env` file. Package auto-loads via python-dotenv. |
| 130 | +3. **Provider packages**: Install provider extras (`pip install cell-annotator[openai]`) to use specific LLMs. |
| 131 | +4. **Real LLM tests**: Use `@pytest.mark.real_llm_query` and skip in CI unless explicitly enabled. |
| 132 | +5. **Marker gene filtering**: Package automatically filters marker genes to genes present in `adata.var_names`. |
| 133 | +6. **Pre-commit conflicts**: Use `git pull --rebase` to integrate pre-commit.ci fixes. |
| 134 | +7. **Line length**: Ruff set to 120 chars, but keep docstrings readable (~80 chars per line). |
| 135 | + |
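To make gotcha 5 concrete, here is a small sketch of restricting marker genes to those present in the dataset. It uses plain lists in place of `adata.var_names`, and `filter_markers` is a hypothetical helper, not the package's own implementation in `utils.py`.

```python
def filter_markers(
    markers: dict[str, list[str]], var_names: list[str]
) -> dict[str, list[str]]:
    """Keep only marker genes that appear in `var_names`, preserving order."""
    present = set(var_names)  # set lookup avoids a linear scan per gene
    return {
        cell_type: [gene for gene in genes if gene in present]
        for cell_type, genes in markers.items()
    }


var_names = ["CD3D", "MS4A1", "NKG7"]
expected = {"T cell": ["CD3D", "CD3E"], "B cell": ["MS4A1", "CD79A"]}
filtered = filter_markers(expected, var_names)
print(filtered)
```

A cell type whose markers are all absent ends up with an empty list rather than being dropped, which makes missing coverage easy to spot.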
| 136 | +## Related Resources |
| 137 | + |
| 138 | +- **Contributing guide**: `docs/contributing.md` |
| 139 | +- **Tutorials**: `docs/notebooks/tutorials/` |
| 140 | +- **OpenAI structured outputs**: https://platform.openai.com/docs/guides/structured-outputs |
| 141 | +- **scanpy docs**: https://scanpy.readthedocs.io/ |
| 142 | +- **Pydantic docs**: https://docs.pydantic.dev/ |