
Commit 2da8791: add new files

1 parent cd62473 commit 2da8791

14 files changed: +552 additions, -877 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -68,3 +68,4 @@ Thumbs.db
 .python-version
 
 .elsevier_cache
+.env

ARCHITECTURE.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# Architecture Overview

This document contains a plain-language overview of the architecture of the Elsevier Coordinate Extraction package.

## Core Components

The package is structured into three main components:

1. **Search Module**: Responsible for querying Elsevier's database to find relevant articles based on user-defined criteria.
2. **Download Module**: Handles the downloading of articles identified by the Search Module.
3. **Extraction Module**: Focuses on extracting coordinate data from the downloaded articles.

These components are modular and independent, allowing for piecewise development and testing, and for integration into other code bases.

## Data Flow

The data flow within the package follows a linear progression:

1. **Input**: The user provides search criteria (e.g., keywords, authors).
2. **Search**: The Search Module queries the Elsevier database and returns a list of articles matching the criteria.
3. **Download**: The Download Module retrieves the full text of the articles identified in the search results.
4. **Extraction**: The Extraction Module processes the downloaded articles to extract relevant coordinate data.
5. **Output**: The extracted coordinates are returned to the user in a structured format, following the NIMADS standard (https://neurostuff.github.io/NIMADS/).

### Input

The input is a search query string that aims to faithfully represent the searches that can be performed with the Elsevier API.

### Search

Using the Elsevier API, we search for articles matching the input query. The results are a list of articles with metadata, including their unique identifiers. The output of this stage is a list of article identifiers in a dictionary format. We look for DOI, PMID, and PMCID where available; if only a DOI is available, that is sufficient.
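
As a hypothetical illustration (the field names and values here are ours, not a confirmed API contract), one entry in that list might look like:

```python
# Illustrative shape of one search result entry; identifiers other than
# the DOI may be None when unavailable.
article = {
    "doi": "10.1016/j.example.2024.000001",  # sufficient on its own
    "pmid": "12345678",                      # optional
    "pmcid": None,                           # optional
}
```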

### Download

This stage takes in the list of article identifiers from the Search stage and downloads the full text of each article. The output of this stage is the full text of each article in a format suitable for processing by the Extraction stage. Downloading should be parallelized while respecting any rate limits imposed by the Elsevier API.
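
A minimal sketch of the intended parallelization pattern, assuming a hypothetical `fetch_full_text` coroutine that wraps the Elsevier full-text endpoint:

```python
import asyncio

async def fetch_full_text(doi: str) -> bytes:
    """Placeholder: the real implementation calls the Elsevier full-text API."""
    raise NotImplementedError

async def download_all(dois: list[str], concurrency: int = 4) -> list[bytes]:
    # The semaphore caps in-flight requests so the parallel downloads
    # stay within the API's rate limits.
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(doi: str) -> bytes:
        async with semaphore:
            return await fetch_full_text(doi)

    return await asyncio.gather(*(bounded(d) for d in dois))
```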

### Extraction

This stage processes the full text of each downloaded article to extract coordinate data. The output of this stage is a structured representation of the extracted coordinates, following the NIMADS standard: https://neurostuff.github.io/NIMADS/

This will also be parallelized to improve performance.

## Inspiration

The coordinate extraction is inspired by pubget (https://github.com/neuroquery/pubget) and ACE (https://github.com/neurosynth/ACE).

BUILD_STRATEGY.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Build Strategy

This document outlines the build strategy for our project, detailing the tools, processes, and best practices we follow to ensure efficient and reliable builds.

## Build Tools

- uv
- venv

## Testing

Testing is done using pytest and pytest-recording, which replays realistic HTTP interactions without hitting the live APIs every time.
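
As a sketch of what this looks like in practice (the module path and `search_articles` function are the planned interfaces from DEVELOPMENT_PLAN.md, not yet implemented):

```python
import asyncio

import pytest

@pytest.mark.vcr  # pytest-recording records the HTTP exchange once, then replays it
def test_search_returns_at_least_one_study():
    # Hypothetical import; the real module layout is sketched in DEVELOPMENT_PLAN.md.
    from elsevier_coordinate_extraction.search.api import search_articles

    studyset = asyncio.run(search_articles("TITLE(fmri)", max_results=1))
    assert studyset["studies"]  # replayed response should yield at least one study
```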

## Continuous Integration

We use GitHub Actions for continuous integration. The CI pipeline includes:

- Linting with ruff
- Running unit tests with pytest

## Writing code

After an initial implementation of the code architecture, we follow a test-driven development (TDD) approach to add new features and fix bugs. This ensures that our code is well-tested and reliable.

DEVELOPMENT_PLAN.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
# Elsevier Coordinate Extraction – Development Blueprint

## Package Layout & Interfaces

```
elsevier_coordinate_extraction/
├── __init__.py
├── settings.py
├── client.py
├── cache.py
├── rate_limits.py
├── nimads.py
├── types.py
├── search/
│   ├── __init__.py
│   └── api.py
├── download/
│   ├── __init__.py
│   └── api.py
├── extract/
│   ├── __init__.py
│   └── coordinates.py
└── pipeline.py
tests/
├── test_settings.py
├── search/
│   └── test_api.py
├── download/
│   └── test_api.py
├── extract/
│   └── test_coordinates.py
└── test_pipeline.py
```

### `settings.py`

- `@dataclass class Settings`: `api_key`, `base_url`, `timeout`, `concurrency`, `cache_dir`, `user_agent`.
- `get_settings() -> Settings`: loads `.env` via `python-dotenv`, memoizes the resulting object, validates required fields.

### `client.py`

- `class ScienceDirectClient`: async context manager wrapping `httpx.AsyncClient`.
  - Injects the API key header (`X-ELS-APIKey`) and default query params (e.g., `httpAccept`).
  - Accepts `Settings` and an optional external `AsyncClient`.
  - Exposes `async get_json(path: str, params: dict[str, str]) -> dict`.
  - Exposes `async get_xml(path: str, params: dict[str, str]) -> str`.
  - Handles retry/backoff using response status and headers; `rate_limits.py` assists with parsing `Retry-After` and known ScienceDirect policy (falling back to a static ceiling if headers are absent).

**Client usage example**

```python
from elsevier_coordinate_extraction.client import ScienceDirectClient
from elsevier_coordinate_extraction.settings import get_settings

settings = get_settings()

async with ScienceDirectClient(settings) as client:
    result = await client.get_json(
        "/search/sciencedirect",
        params={"query": "TITLE(fmri)", "count": "1"},
    )
```

The client automatically applies API key and user agent headers, enforces the configured concurrency limit, and retries when the Elsevier API returns `Retry-After` metadata.
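
For illustration, the `Retry-After` handling that `rate_limits.py` assists with could parse the header roughly as follows (a sketch, not the module's settled API; the header carries either a number of seconds or an HTTP date):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value: str | None, default: float = 1.0) -> float:
    """Sketch: turn a Retry-After header into a delay in seconds."""
    if header_value is None:
        return default
    try:
        return max(0.0, float(header_value))  # numeric form, e.g. "120"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(header_value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
```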

### `cache.py`

- `class FileCache`:
  - `async get(namespace: str, key: str) -> bytes | None`.
  - `async set(namespace: str, key: str, data: bytes, metadata: dict | None = None) -> None`.
  - Namespaces for `search`, `articles`, `assets`; keys derived from deterministic hashes.
- `CacheKey` helpers to hash query params and article identifiers.
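
A sketch of the deterministic key derivation the `CacheKey` helpers could use (the helper name and exact scheme are assumptions):

```python
import hashlib
import json

def query_cache_key(params: dict[str, str]) -> str:
    # Sort keys before hashing so logically identical queries always
    # produce the same cache key.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```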

### `types.py`

- `StudyMetadata`, `ArticleContent`, `AnalysisPayload`, `PointPayload` defined as `TypedDict`/`dataclass` to mirror the NIMADS schema.
- `StudysetPayload` alias for the top-level structure handed between modules.
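
For example, the point type might be sketched like this (the field names are our reading of the NIMADS docs, not a settled mapping):

```python
from typing import TypedDict

class PointPayload(TypedDict):
    # One activation coordinate within a NIMADS-style analysis.
    coordinates: list[float]  # [x, y, z]
    space: str | None         # e.g. "MNI" or "TAL"; None when undetected
```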

### `nimads.py`

- `def build_study(study_meta: StudyMetadata, *, analyses: list[AnalysisPayload] | None = None) -> dict`.
- `def build_studyset(name: str, studies: list[dict], metadata: dict | None = None) -> dict`.
- Optional `validate(payload: dict) -> None` hook using LinkML schemas if we decide to integrate them later (validation deferred for now).
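
Intended usage of the builders, with made-up values (a sketch against the planned signatures; the payload shapes are assumptions):

```python
from elsevier_coordinate_extraction.nimads import build_study, build_studyset

# Hypothetical inputs: a minimal study with one empty analysis.
study = build_study(
    {"title": "An example fMRI study", "doi": "10.1016/j.example.2024.000001"},
    analyses=[{"name": "main contrast", "points": []}],
)
studyset = build_studyset("fmri-search", studies=[study])
```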

### `search/api.py`

- `async def search_articles(query: str, *, max_results: int = 25, client: ScienceDirectClient | None = None, cache: FileCache | None = None) -> StudysetPayload`.
  - Builds ScienceDirect search endpoint params (`query`, `count`, `start`, requested fields for DOI/title/abstract/authors/open-access flags).
  - Handles pagination until `max_results` (sketched below).
  - Collects minimal study metadata (title, abstract, authors, journal, year, open-access flag, DOI/PII/Scopus ID) in NIMADS `Study` format; attaches ScienceDirect identifiers to `metadata`.
  - Persists search responses via cache keyed by query hash.
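
The pagination loop could be handled roughly as below (a sketch; the `"results"` response key is an assumption about the ScienceDirect payload):

```python
async def collect_results(client, query: str, max_results: int) -> list[dict]:
    # Page through the search endpoint until we have enough entries
    # or the API returns an empty page.
    results: list[dict] = []
    start = 0
    while len(results) < max_results:
        page = await client.get_json(
            "/search/sciencedirect",
            params={"query": query, "count": "25", "start": str(start)},
        )
        entries = page.get("results", [])  # key name is an assumption
        if not entries:
            break  # no more pages
        results.extend(entries)
        start += len(entries)
    return results[:max_results]
```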

### `download/api.py`

- `async def download_articles(studies: Sequence[StudyMetadata], *, formats: Sequence[str] = ("xml", "html"), client: ScienceDirectClient | None = None, cache: FileCache | None = None) -> list[ArticleContent]`.
  - Resolves the best-available format per study (prioritize XML; fall back to HTML/PDF; sketched below).
  - Uses `asyncio.Semaphore(settings.concurrency)` for parallel downloads.
  - Stores raw payload bytes and associated metadata (`content_type`, `is_open_access`, `retrieved_at`) in `ArticleContent`.
  - Persists downloads to cache using article-specific keys.
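
The format resolution might reduce to a simple preference walk (a sketch; the preference tuple mirrors the bullet above):

```python
FORMAT_PREFERENCE = ("xml", "html", "pdf")

def pick_format(available: set[str]) -> str | None:
    # Return the best-available format, preferring XML for extraction.
    for fmt in FORMAT_PREFERENCE:
        if fmt in available:
            return fmt
    return None
```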

### `extract/coordinates.py`

- `def extract_coordinates(articles: Sequence[ArticleContent]) -> StudysetPayload`.
  - Parses XML with `lxml` (mirroring Pubget heuristics).
  - `def _iter_table_fragments(xml_root) -> Iterable[TableFragment]`: yields raw table XML and metadata.
  - `def _parse_table(table_fragment: str) -> list[PointPayload]`: ported logic from `pubget._coordinates`.
  - `def _infer_coordinate_space(article_element) -> str | None`: replicates Pubget coordinate space detection.
  - Returns NIMADS-ready structures: each study gets `analyses` populated with `points` and includes raw table XML snippets in `metadata`.
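
A rough sketch of the table-walking step with `lxml` (heuristics are simplified here; Pubget's real logic is considerably more careful):

```python
from lxml import etree

def iter_candidate_rows(xml_bytes: bytes):
    # Walk every table row and yield the first three cell texts as a
    # candidate x/y/z triplet; real filtering happens downstream.
    root = etree.fromstring(xml_bytes)
    for table in root.iter("{*}table"):
        for row in table.iter("{*}tr"):
            cells = ["".join(cell.itertext()).strip() for cell in row.iter("{*}td")]
            if len(cells) >= 3:
                yield cells[:3]
```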

### `pipeline.py`

- `async def run_pipeline(query: str, *, max_results: int = 25, settings: Settings | None = None) -> StudysetPayload`.
  - Glues search → download → extract, reusing a shared `ScienceDirectClient` and `FileCache`.
  - Returns the final aggregated payload with both metadata and extracted coordinates.
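
End-to-end, the intent is a one-call entry point (a sketch; assumes a NIMADS-style `"studies"` key in the returned payload):

```python
import asyncio

from elsevier_coordinate_extraction.pipeline import run_pipeline

# Hypothetical end-to-end call; the query syntax mirrors the client example above.
studyset = asyncio.run(run_pipeline("TITLE(fmri)", max_results=5))
print(f"extracted coordinates for {len(studyset['studies'])} studies")
```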

## TDD Plan

1. **Settings & Configuration**
   - Write `test_settings.py` verifying `.env` loading, required key enforcement, and memoization.
   - Implement minimal `settings.py` until tests pass.

2. **HTTP Client Layer**
   - Draft `tests/search/test_client.py` (or inline in `test_api.py`) using `pytest-recording` to confirm headers, retries, and timeout behavior.
   - Implement `ScienceDirectClient` with httpx; stub out rate-limit parsing.

3. **Search Module**
   - Author tests that:
     - mock ScienceDirect responses (via recordings) to validate query building, pagination, metadata extraction, and cache hits.
     - assert the returned structure matches NIMADS study schema fields (title, abstract, authors, year, journal, open-access flag).
   - Build `search/api.py` to satisfy the tests.

4. **Download Module**
   - Create fixtures with recorded ScienceDirect full-text responses (XML + HTML fallback).
   - Tests cover: format preference, concurrency (mocked semaphore), cache usage, error handling/backoff.
   - Implement `download/api.py` accordingly.

5. **Extraction Module**
   - Begin with unit tests using sample XML snippets (derived from Pubget tests/examples) to ensure coordinate parsing matches expectations.
   - Add an integration test comparing against a known article used in Pubget to validate coordinate and space inference.
   - Port the necessary parsing helpers into `extract/coordinates.py`.

6. **Pipeline Integration**
   - Write an async pipeline test with mocked modules to ensure data is passed through correctly and the NIMADS payload is assembled.
   - Add an optional recording-based end-to-end test gated behind a marker to avoid frequent API calls.

7. **NIMADS Helpers & Validation**
   - Tests verifying the builder functions assemble schemas correctly and that point metadata includes raw table XML fragments.
   - Implement `nimads.py` helpers plus an optional schema validation toggle.

8. **Rate-Limit Handling**
   - Tests simulate responses with and without headers like `Retry-After` to check the backoff logic.
   - Implement `rate_limits.py` to parse response headers and enforce delays (fall back to documented limits determined during implementation).

9. **Cache Layer**
   - Tests ensure deterministic key generation, read/write round-trips, and concurrent access safety.
   - Implement `FileCache` (async wrappers around `asyncio.to_thread` for disk IO if necessary).

10. **Documentation & Examples**
    - After functionality stabilizes, add README usage examples and docstrings, ensuring TDD artifacts remain green.

## Targeted Download & Extraction TDD (using test DOIs)

To exercise ScienceDirect endpoints without live dependencies, we use the `test_dois` fixture defined in `tests/conftest.py`. Recording rules:

- Set `PYTEST_RECORDING_MODE=once` (default). Update recordings intentionally when request parameters change.
- Recordings live under `tests/cassettes/download/` and `tests/cassettes/extract/`, named per test function.

### Download module tests

1. `tests/download/test_api.py::test_download_single_article_xml`
   - Uses cassette `download/test_download_single_article_xml.yaml`.
   - Asserts the API retrieves XML payload bytes, content type, and article identifiers (DOI, PII).
2. `test_download_handles_cached_payload`
   - Mocks the cache layer; ensures cached entries skip the HTTP call.
3. `test_download_parallel_respects_concurrency`
   - Parametrized with two DOIs; asserts the semaphore limits concurrency via captured timestamps.

### Extraction module tests

1. `tests/extract/test_coordinates.py::test_extract_coordinates_from_sample_xml`
   - Loads recorded XML from the download stage (fixture).
   - Validates that we detect coordinate tables, parse x/y/z triplets, infer MNI/TAL space, and attach raw table XML.
2. `test_extract_returns_nimads_structure`
   - Confirms the output includes `study`, `analyses`, and `points` fields shaped like the NIMADS schema.
3. An integration test combining download + extract for a single DOI, using a cached cassette to simulate the full pipeline without external calls.

Each test should fail prior to implementation, guiding incremental development of `download/api.py`, the extraction helpers, and new type utilities.

Throughout development, record HTTP interactions with `pytest-recording`, keep tests deterministic via cache fixtures, and use Ruff for linting.

elsevier_coordinate_extraction/client.py

Lines changed: 14 additions & 9 deletions
@@ -91,15 +91,20 @@ async def _ensure_client(self) -> None:
         if self._settings.insttoken:
             headers["X-ELS-Insttoken"] = self._settings.insttoken
         timeout = httpx.Timeout(self._settings.timeout)
-        proxy_value = self._settings.https_proxy or self._settings.http_proxy
-        self._client = httpx.AsyncClient(
-            base_url=self._settings.base_url,
-            timeout=timeout,
-            headers=headers,
-            transport=self._transport,
-            http2=True,
-            proxy=proxy_value,
-        )
+        client_kwargs: dict[str, Any] = {
+            "base_url": self._settings.base_url,
+            "timeout": timeout,
+            "headers": headers,
+            "transport": self._transport,
+            "http2": True,
+        }
+        if self._settings.use_proxy:
+            proxy_value = self._settings.https_proxy or self._settings.http_proxy
+            if proxy_value:
+                client_kwargs["proxy"] = proxy_value
+        else:
+            client_kwargs["trust_env"] = False
+        self._client = httpx.AsyncClient(**client_kwargs)
 
     async def _request(
         self,

elsevier_coordinate_extraction/settings.py

Lines changed: 23 additions & 0 deletions
@@ -31,6 +31,23 @@ class Settings:
     insttoken: str | None
     http_proxy: str | None
     https_proxy: str | None
+    use_proxy: bool
+
+
+_TRUE_VALUES: Final[set[str]] = {"1", "true", "yes", "on"}
+_FALSE_VALUES: Final[set[str]] = {"0", "false", "no", "off"}
+
+
+def _coerce_bool(value: str | None, *, default: bool) -> bool:
+    """Convert common textual boolean representations to bool."""
+    if value is None:
+        return default
+    normalized = value.strip().lower()
+    if normalized in _TRUE_VALUES:
+        return True
+    if normalized in _FALSE_VALUES:
+        return False
+    return default
 
 
 def get_settings(*, force_reload: bool = False) -> Settings:
@@ -63,6 +80,11 @@ def get_settings(*, force_reload: bool = False) -> Settings:
     insttoken = os.getenv("ELSEVIER_INSTTOKEN")
     http_proxy = os.getenv("ELSEVIER_HTTP_PROXY")
     https_proxy = os.getenv("ELSEVIER_HTTPS_PROXY")
+    default_use_proxy = bool(http_proxy or https_proxy)
+    use_proxy = _coerce_bool(
+        os.getenv("ELSEVIER_USE_PROXY"),
+        default=default_use_proxy,
+    )
 
     _CACHED_SETTINGS = Settings(
         api_key=api_key,
@@ -74,5 +96,6 @@ def get_settings(*, force_reload: bool = False) -> Settings:
         insttoken=insttoken,
         http_proxy=http_proxy,
         https_proxy=https_proxy,
+        use_proxy=use_proxy,
     )
     return _CACHED_SETTINGS
