|
| 1 | +# Elsevier Coordinate Extraction – Development Blueprint |
| 2 | + |
| 3 | +## Package Layout & Interfaces |
| 4 | + |
| 5 | +``` |
| 6 | +elsevier_coordinate_extraction/ |
| 7 | +├── __init__.py |
| 8 | +├── settings.py |
| 9 | +├── client.py |
| 10 | +├── cache.py |
| 11 | +├── rate_limits.py |
| 12 | +├── nimads.py |
| 13 | +├── types.py |
| 14 | +├── search/ |
| 15 | +│   ├── __init__.py |
| 16 | +│   └── api.py |
| 17 | +├── download/ |
| 18 | +│   ├── __init__.py |
| 19 | +│   └── api.py |
| 20 | +├── extract/ |
| 21 | +│   ├── __init__.py |
| 22 | +│   └── coordinates.py |
| 23 | +└── pipeline.py |
| 24 | +tests/ |
| 25 | +├── test_settings.py |
| 26 | +├── search/ |
| 27 | +│   └── test_api.py |
| 28 | +├── download/ |
| 29 | +│   └── test_api.py |
| 30 | +├── extract/ |
| 31 | +│   └── test_coordinates.py |
| 32 | +└── test_pipeline.py |
| 33 | +``` |
| 34 | + |
| 35 | +### `settings.py` |
| 36 | +- `@dataclass class Settings`: `api_key`, `base_url`, `timeout`, `concurrency`, `cache_dir`, `user_agent`. |
| 37 | +- `get_settings() -> Settings`: loads `.env` via `python-dotenv`, memoizes the resulting object, validates required fields. |
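The memoization and required-field validation described above could look roughly like the following. This is a minimal sketch, not the project's implementation: the environment variable names (`ELSEVIER_API_KEY` and friends) and all default values are assumptions, and the `.env` loading step is left as a comment so the sketch runs with the standard library alone.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path

# Hypothetical defaults; the blueprint does not specify them.
DEFAULT_BASE_URL = "https://api.elsevier.com"
DEFAULT_USER_AGENT = "elsevier-coordinate-extraction"


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str = DEFAULT_BASE_URL
    timeout: float = 30.0
    concurrency: int = 4
    cache_dir: Path = Path(".cache")
    user_agent: str = DEFAULT_USER_AGENT


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build Settings once from the environment and reuse the same object."""
    # The real module would call python-dotenv's load_dotenv() here so that
    # values from `.env` are visible in os.environ.
    api_key = os.environ.get("ELSEVIER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("ELSEVIER_API_KEY is required but not set")
    return Settings(
        api_key=api_key,
        base_url=os.environ.get("ELSEVIER_BASE_URL", DEFAULT_BASE_URL),
        timeout=float(os.environ.get("ELSEVIER_TIMEOUT", "30")),
        concurrency=int(os.environ.get("ELSEVIER_CONCURRENCY", "4")),
        cache_dir=Path(os.environ.get("ELSEVIER_CACHE_DIR", ".cache")),
        user_agent=os.environ.get("ELSEVIER_USER_AGENT", DEFAULT_USER_AGENT),
    )
```

Because `lru_cache` does not cache exceptions, a failed validation can be retried after the environment is fixed; tests can reset the memoized object with `get_settings.cache_clear()`.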
| 38 | + |
| 39 | +### `client.py` |
| 40 | +- `class ScienceDirectClient`: async context manager wrapping `httpx.AsyncClient`. |
| 41 | + - Injects API key header (`X-Elsevier-APIKey`), default query params (e.g., `httpAccept`). |
| 42 | + - Accepts `Settings`, optional external `AsyncClient`. |
| 43 | + - Exposes `async get_json(path: str, params: dict[str, str]) -> dict`. |
| 44 | + - Exposes `async get_xml(path: str, params: dict[str, str]) -> str`. |
| 45 | + - Handles retry/backoff based on response status and headers; `rate_limits.py` parses `Retry-After` and encodes known ScienceDirect policy, falling back to a static ceiling when the headers are absent. |
| 46 | + |
| 47 | +**Client usage example** |
| 48 | + |
| 49 | +```python |
| 50 | +from elsevier_coordinate_extraction.client import ScienceDirectClient |
| 51 | +from elsevier_coordinate_extraction.settings import get_settings |
| 52 | + |
| 53 | +settings = get_settings() |
| 54 | +# Run inside a coroutine (e.g. via asyncio.run()); a bare `async with` is a SyntaxError at module level. |
| 55 | +async with ScienceDirectClient(settings) as client: |
| 56 | +    result = await client.get_json( |
| 57 | +        "/search/sciencedirect", |
| 58 | +        params={"query": "TITLE(fmri)", "count": "1"}, |
| 59 | +    ) |
| 60 | +``` |
| 61 | + |
| 62 | +The client automatically applies the API key and user-agent headers, enforces the configured concurrency limit, and retries when the Elsevier API returns `Retry-After` metadata. |
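One way the `Retry-After` handling in `rate_limits.py` might be approached is sketched below. The function name `retry_delay` and both constants are hypothetical; the blueprint only states that the header is parsed and that a static ceiling applies when it is absent. The header may carry either a number of seconds or an HTTP date, so the sketch handles both.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping

# Hypothetical values; the real limits are to be determined during implementation.
FALLBACK_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 60.0


def retry_delay(headers: Mapping[str, str], *, now: datetime | None = None) -> float:
    """Return how long to wait before retrying, in seconds."""
    # Note: a plain dict lookup is case-sensitive; httpx header objects are not.
    raw = headers.get("Retry-After")
    if raw is None:
        return FALLBACK_DELAY_SECONDS
    raw = raw.strip()
    if raw.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 5".
        delay = float(raw)
    else:
        # HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            return FALLBACK_DELAY_SECONDS
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Clamp to a sane range so a bad header cannot stall the pipeline.
    return min(max(delay, 0.0), MAX_DELAY_SECONDS)
```

Accepting `now` as a parameter keeps the date branch deterministic in tests.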
| 63 | + |
| 64 | +### `cache.py` |
| 65 | +- `class FileCache`: |
| 66 | + - `async get(namespace: str, key: str) -> bytes | None`. |
| 67 | + - `async set(namespace: str, key: str, data: bytes, metadata: dict | None = None) -> None`. |
| 68 | + - Namespaces for `search`, `articles`, `assets`; keys derived from deterministic hashes. |
| 69 | +- `CacheKey` helpers to hash query params and article identifiers. |
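The deterministic keys and the async `get`/`set` interface could be sketched as follows. This is an illustration under assumptions: the helper name `cache_key`, the on-disk layout (`<root>/<namespace>/<key>`), and the sidecar `.meta.json` file are all hypothetical choices. It relies on `asyncio.to_thread`, which needs Python 3.9 or later.

```python
from __future__ import annotations

import asyncio
import hashlib
import json
from pathlib import Path
from typing import Mapping


def cache_key(params: Mapping[str, str]) -> str:
    """Hash params so that key order does not change the result."""
    canonical = json.dumps(dict(params), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FileCache:
    def __init__(self, root: Path) -> None:
        self._root = Path(root)

    def _path(self, namespace: str, key: str) -> Path:
        return self._root / namespace / key

    async def get(self, namespace: str, key: str) -> bytes | None:
        path = self._path(namespace, key)

        def _read() -> bytes | None:
            try:
                return path.read_bytes()
            except FileNotFoundError:
                return None

        # Blocking disk IO runs in a worker thread, not on the event loop.
        return await asyncio.to_thread(_read)

    async def set(
        self, namespace: str, key: str, data: bytes, metadata: dict | None = None
    ) -> None:
        path = self._path(namespace, key)

        def _write() -> None:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
            if metadata is not None:
                meta_path = path.parent / f"{key}.meta.json"
                meta_path.write_text(json.dumps(metadata))

        await asyncio.to_thread(_write)
```

The sketch does not write atomically, so the concurrent-access safety mentioned in the TDD plan would need extra work (for example, writing to a temporary file and renaming it).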
| 70 | + |
| 71 | +### `types.py` |
| 72 | +- `StudyMetadata`, `ArticleContent`, `AnalysisPayload`, `PointPayload` defined as `TypedDict`/`dataclass` to mirror NIMADS schema. |
| 73 | +- `StudysetPayload` alias for the top-level structure handed between modules. |
| 74 | + |
| 75 | +### `nimads.py` |
| 76 | +- `def build_study(study_meta: StudyMetadata, *, analyses: list[AnalysisPayload] | None = None) -> dict`. |
| 77 | +- `def build_studyset(name: str, studies: list[dict], metadata: dict | None = None) -> dict`. |
| 78 | +- Optional `validate(payload: dict) -> None` hook using LinkML schemas if we decide to integrate them later (validation deferred for now). |
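As a rough illustration of the two builders, the sketch below assembles plain dictionaries. The key names (`id`, `name`, `authors`, `publication`, `year`, `metadata`, `analyses`, `studies`) and the input keys read from `study_meta` are assumptions loosely modelled on NIMADS, not a verified rendering of the schema.

```python
from __future__ import annotations

from typing import Any


def build_study(
    study_meta: dict[str, Any],
    *,
    analyses: list[dict[str, Any]] | None = None,
) -> dict[str, Any]:
    """Assemble one study dict; identifiers are kept under `metadata`."""
    identifiers = {
        key: study_meta[key]
        for key in ("doi", "pii", "scopus_id")
        if study_meta.get(key)
    }
    return {
        "id": study_meta.get("doi") or study_meta.get("pii"),
        "name": study_meta.get("title", ""),
        "authors": study_meta.get("authors", ""),
        "publication": study_meta.get("journal", ""),
        "year": study_meta.get("year"),
        "metadata": identifiers,
        # Copy so later mutation of the caller's list does not leak in.
        "analyses": list(analyses or []),
    }


def build_studyset(
    name: str,
    studies: list[dict[str, Any]],
    metadata: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Wrap studies in a top-level studyset structure."""
    return {
        "name": name,
        "metadata": dict(metadata or {}),
        "studies": list(studies),
    }
```

For example, `build_studyset("fmri search", [build_study({"doi": "10.0000/example", "title": "Example"})])` yields a studyset with one study whose `analyses` list is empty until extraction fills it.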
| 79 | + |
| 80 | +### `search/api.py` |
| 81 | +- `async def search_articles(query: str, *, max_results: int = 25, client: ScienceDirectClient | None = None, cache: FileCache | None = None) -> StudysetPayload`. |
| 82 | + - Builds ScienceDirect search endpoint params (`query`, `count`, `start`, requested fields for DOI/title/abstract/authors/openaccess flags). |
| 83 | + - Handles pagination until `max_results`. |
| 84 | + - Collects minimal study metadata (title, abstract, authors, journal, year, open access flag, DOI/PII/Scopus ID) in NIMADS `Study` format; attaches ScienceDirect identifiers to `metadata`. |
| 85 | + - Persists search responses via cache keyed by query hash. |
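The pagination step can be sketched independently of HTTP by injecting the page fetcher. In the sketch below, `collect_results`, `fetch_page`, and `page_size` are hypothetical names; only the `query`, `count`, and `start` parameters come from the description above, and the fetcher is assumed to return a list of result entries.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable

# A page fetcher takes request params and returns the entries of one page.
FetchPage = Callable[[dict], Awaitable[list]]


async def collect_results(
    query: str,
    fetch_page: FetchPage,
    *,
    max_results: int = 25,
    page_size: int = 10,
) -> list[Any]:
    """Request pages until max_results entries are collected or results run out."""
    results: list[Any] = []
    start = 0
    while len(results) < max_results:
        # Never ask for more than is still needed.
        count = min(page_size, max_results - len(results))
        params = {"query": query, "start": str(start), "count": str(count)}
        page = await fetch_page(params)
        if not page:
            break
        results.extend(page[:count])
        start += len(page)
        if len(page) < count:
            # A short page means the server has nothing further to return.
            break
    return results
```

Injecting `fetch_page` keeps the loop testable without recordings; in the real module it would presumably wrap `ScienceDirectClient.get_json` and the cache lookup.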
| 86 | + |
| 87 | +### `download/api.py` |
| 88 | +- `async def download_articles(studies: Sequence[StudyMetadata], *, formats: Sequence[str] = ("xml", "html"), client: ScienceDirectClient | None = None, cache: FileCache | None = None) -> list[ArticleContent]`. |
| 89 | + - Resolves best-available format per study (prioritize XML; fallback to HTML/PDF). |
| 90 | + - Uses `asyncio.Semaphore(settings.concurrency)` for parallel downloads. |
| 91 | + - Stores raw payload bytes and associated metadata (`content_type`, `is_open_access`, `retrieved_at`) in `ArticleContent`. |
| 92 | + - Persists downloads to cache using article-specific keys. |
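The semaphore-bounded parallelism mentioned above follows a common asyncio pattern, sketched here with an injected fetcher. `download_all` and `fetch` are hypothetical names; the real function would also resolve formats and consult the cache.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Sequence


async def download_all(
    identifiers: Sequence[str],
    fetch: Callable[[str], Awaitable[bytes]],
    *,
    concurrency: int,
) -> list[bytes]:
    """Fetch every identifier, with at most `concurrency` requests in flight."""
    # Create the semaphore inside the running loop rather than at import time.
    semaphore = asyncio.Semaphore(concurrency)

    async def _one(identifier: str) -> bytes:
        async with semaphore:
            return await fetch(identifier)

    # gather() preserves input order in its result list.
    return list(await asyncio.gather(*(_one(i) for i in identifiers)))
```

Because `gather` preserves order, each payload can be paired with its study by position even though downloads complete out of order.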
| 93 | + |
| 94 | +### `extract/coordinates.py` |
| 95 | +- `def extract_coordinates(articles: Sequence[ArticleContent]) -> StudysetPayload`. |
| 96 | + - Parses XML with `lxml` (mirroring Pubget heuristics). |
| 97 | + - `def _iter_table_fragments(xml_root) -> Iterable[TableFragment]`: yields raw table XML and metadata. |
| 98 | + - `def _parse_table(table_fragment: str) -> list[PointPayload]`: ported logic from `pubget._coordinates`. |
| 99 | + - `def _infer_coordinate_space(article_element) -> str | None`: replicates Pubget coordinate space detection. |
| 100 | + - Returns NIMADS-ready structures: each study gets `analyses` populated with `points`, includes raw table XML snippets in `metadata`. |
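To make the table-parsing and space-inference steps concrete, here is a deliberately naive sketch. It is not the Pubget logic that the plan intends to port: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, finds coordinate columns only through header cells labelled exactly `x`, `y`, and `z`, and infers the space by keyword matching. The function names and the `{"coordinates": [...]}` point shape are assumptions.

```python
from __future__ import annotations

import re
import xml.etree.ElementTree as ET

_NUMBER = re.compile(r"^[-+]?\d+(?:\.\d+)?$")


def _local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}row' down to 'row'."""
    return tag.rsplit("}", 1)[-1]


def _to_float(text: str) -> float | None:
    # Publishers often typeset negatives with U+2212 instead of '-'.
    cleaned = text.strip().replace("\u2212", "-")
    return float(cleaned) if _NUMBER.match(cleaned) else None


def parse_table(table_xml: str) -> list[dict]:
    """Return one point per row that has numeric values in the x/y/z columns."""
    root = ET.fromstring(table_xml)
    rows = [
        ["".join(cell.itertext()).strip() for cell in element]
        for element in root.iter()
        if isinstance(element.tag, str) and _local(element.tag) in {"row", "tr"}
    ]
    columns: list[int] | None = None
    points: list[dict] = []
    for cells in rows:
        lowered = [cell.lower() for cell in cells]
        if columns is None:
            # Look for the header row that names all three axes.
            if all(axis in lowered for axis in "xyz"):
                columns = [lowered.index(axis) for axis in "xyz"]
            continue
        if len(cells) <= max(columns):
            continue
        values = [_to_float(cells[i]) for i in columns]
        if all(value is not None for value in values):
            points.append({"coordinates": values})
    return points


def infer_space(text: str) -> str | None:
    """Guess 'MNI' or 'TAL' from article text; None when absent or ambiguous."""
    lowered = text.lower()
    has_mni = bool(re.search(r"\bmni\b", lowered)) or "montreal neurological" in lowered
    has_tal = "talairach" in lowered
    if has_mni and not has_tal:
        return "MNI"
    if has_tal and not has_mni:
        return "TAL"
    return None
```

Real ScienceDirect tables have merged headers, multi-row headers, and coordinates packed into a single cell, which is why the plan ports the more robust Pubget heuristics instead of relying on something this simple.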
| 101 | + |
| 102 | +### `pipeline.py` |
| 103 | +- `async def run_pipeline(query: str, *, max_results: int = 25, settings: Settings | None = None) -> StudysetPayload`. |
| 104 | + - Glues search → download → extract, reusing shared `ScienceDirectClient` and `FileCache`. |
| 105 | + - Returns final aggregated payload with both metadata and extracted coordinates. |
| 106 | + |
| 107 | +## TDD Plan |
| 108 | + |
| 109 | +1. **Settings & Configuration** |
| 110 | + - Write `test_settings.py` verifying `.env` loading, required key enforcement, and memoization. |
| 111 | + - Implement minimal `settings.py` until tests pass. |
| 112 | + |
| 113 | +2. **HTTP Client Layer** |
| 114 | + - Draft `tests/search/test_client.py` (or inline in `test_api.py`) using `pytest-recording` to confirm headers, retries, and timeout behavior. |
| 115 | + - Implement `ScienceDirectClient` with httpx, stub out rate-limit parsing. |
| 116 | + |
| 117 | +3. **Search Module** |
| 118 | + - Author tests that: |
| 119 | + - mock ScienceDirect responses (via recordings) to validate query building, pagination, metadata extraction, and cache hits. |
| 120 | + - assert returned structure matches NIMADS study schema fields (title, abstract, authors, year, journal, open-access flag). |
| 121 | + - Build `search/api.py` to satisfy tests. |
| 122 | + |
| 123 | +4. **Download Module** |
| 124 | + - Create fixtures with recorded ScienceDirect full-text responses (XML + HTML fallback). |
| 125 | + - Tests cover: format preference, concurrency (mocked semaphore), cache usage, error handling/backoff. |
| 126 | + - Implement `download/api.py` accordingly. |
| 127 | + |
| 128 | +5. **Extraction Module** |
| 129 | + - Begin with unit tests using sample XML snippets (derived from Pubget tests/examples) to ensure coordinate parsing matches expectations. |
| 130 | + - Add integration test comparing against a known article used in Pubget to validate coordinate and space inference. |
| 131 | + - Port necessary parsing helpers into `extract/coordinates.py`. |
| 132 | + |
| 133 | +6. **Pipeline Integration** |
| 134 | + - Write an async pipeline test with mocked modules to ensure data is passed through correctly and the NIMADS payload is assembled. |
| 135 | + - Add an optional recording-based end-to-end test, gated behind a pytest marker to avoid frequent API calls. |
| 136 | + |
| 137 | +7. **NIMADS Helpers & Validation** |
| 138 | + - Tests verifying builder functions assemble schemas correctly and that point metadata includes raw table XML fragments. |
| 139 | + - Implement `nimads.py` helpers plus optional schema validation toggle. |
| 140 | + |
| 141 | +8. **Rate-Limit Handling** |
| 142 | + - Tests simulate responses with and without headers like `Retry-After` to check backoff logic. |
| 143 | + - Implement `rate_limits.py` to parse response headers and enforce delays (fall back to documented limits determined during implementation). |
| 144 | + |
| 145 | +9. **Cache Layer** |
| 146 | + - Tests ensure deterministic key generation, read/write round-trips, and concurrent access safety. |
| 147 | + - Implement `FileCache` (async methods that offload disk IO via `asyncio.to_thread` if necessary). |
| 148 | + |
| 149 | +10. **Documentation & Examples** |
| 150 | + - After functionality stabilizes, add README usage examples and docstrings, ensuring TDD artifacts remain green. |
| 151 | + |
| 152 | +## Targeted Download & Extraction TDD (using test DOIs) |
| 153 | + |
| 154 | +To exercise ScienceDirect endpoints without live dependencies, we use the `test_dois` fixture defined in `tests/conftest.py`. Recording rules: |
| 155 | +- Set `PYTEST_RECORDING_MODE=once` (default). Update recordings intentionally when request parameters change. |
| 156 | +- Recordings live under `tests/cassettes/download/` and `tests/cassettes/extract/`, named per test function. |
| 157 | + |
| 158 | +### Download module tests |
| 159 | +1. `tests/download/test_api.py::test_download_single_article_xml` |
| 160 | + - Uses cassette `download/test_download_single_article_xml.yaml`. |
| 161 | + - Asserts the API retrieves XML payload bytes, content type, and article identifiers (DOI, PII). |
| 162 | +2. `test_download_handles_cached_payload` |
| 163 | + - Mocks cache layer; ensures cached entries skip HTTP call. |
| 164 | +3. `test_download_parallel_respects_concurrency` |
| 165 | + - Parametrized with two DOIs; asserts semaphore limits concurrency via captured timestamps. |
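The cached-payload test could take a shape like the one below. Everything here is a stand-in: `MemoryCache` replaces the mocked cache layer, `download_with_cache` is a hypothetical reduction of `download_articles` to a single identifier, and the DOI is a placeholder. The point is the assertion pattern, which counts fetch calls to prove that a cache hit skips the HTTP request.

```python
from __future__ import annotations

import asyncio


class MemoryCache:
    """In-memory stand-in exposing the same get/set interface as FileCache."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str], bytes] = {}

    async def get(self, namespace: str, key: str) -> bytes | None:
        return self._data.get((namespace, key))

    async def set(self, namespace: str, key: str, data: bytes, metadata=None) -> None:
        self._data[(namespace, key)] = data


async def download_with_cache(doi: str, cache: MemoryCache, fetch) -> bytes:
    """Return the cached payload if present; otherwise fetch and store it."""
    cached = await cache.get("articles", doi)
    if cached is not None:
        return cached
    payload = await fetch(doi)
    await cache.set("articles", doi, payload)
    return payload


def test_download_handles_cached_payload() -> None:
    calls: list[str] = []

    async def fake_fetch(doi: str) -> bytes:
        calls.append(doi)
        return b"<xml/>"

    async def scenario() -> tuple[bytes, bytes]:
        cache = MemoryCache()
        first = await download_with_cache("10.0000/example", cache, fake_fetch)
        second = await download_with_cache("10.0000/example", cache, fake_fetch)
        return first, second

    first, second = asyncio.run(scenario())
    assert first == second == b"<xml/>"
    # Only the first call reached the fetcher; the second was served from cache.
    assert calls == ["10.0000/example"]
```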
| 166 | + |
| 167 | +### Extraction module tests |
| 168 | +1. `tests/extract/test_coordinates.py::test_extract_coordinates_from_sample_xml` |
| 169 | + - Loads recorded XML from download stage (fixture). |
| 170 | + - Validates we detect coordinate tables, parse xyz triplets, infer MNI/TAL space, and attach raw table XML. |
| 171 | +2. `test_extract_returns_nimads_structure` |
| 172 | + - Confirms output includes `study`, `analyses`, and `points` fields shaped like NIMADS schema. |
| 173 | +3. Integration test combining download + extract for a single DOI, using cached cassette to simulate full pipeline without external calls. |
| 174 | + |
| 175 | +Each test should fail prior to implementation, guiding incremental development of `download/api.py`, extraction helpers, and new type utilities. |
| 176 | + |
| 177 | +Throughout development, record HTTP interactions with `pytest-recording`, keep tests deterministic via cache fixtures, and use Ruff for linting. |