
Commit 2da8791: add new files

1 parent cd62473 commit 2da8791

14 files changed: +552 additions, -877 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -68,3 +68,4 @@ Thumbs.db
 .python-version
 
 .elsevier_cache
+.env

ARCHITECTURE.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# Architecture Overview

This document contains a plain-language overview of the architecture of the Elsevier Coordinate Extraction package.

## Core Components

The package is structured into three main components:

1. **Search Module**: Responsible for querying Elsevier's database to find relevant articles based on user-defined criteria.
2. **Download Module**: Handles the downloading of articles identified by the Search Module.
3. **Extraction Module**: Focuses on extracting coordinate data from the downloaded articles.

These components are modular and independent, allowing for piecewise development and testing, and for integration into other code bases.

## Data Flow

The data flow within the package follows a linear progression:

1. **Input**: The user provides search criteria (e.g., keywords, authors).
2. **Search**: The Search Module queries the Elsevier database and returns a list of articles matching the criteria.
3. **Download**: The Download Module retrieves the full text of the articles identified in the search results.
4. **Extraction**: The Extraction Module processes the downloaded articles to extract relevant coordinate data.
5. **Output**: The extracted coordinates are returned to the user in a structured format, following the NIMADS standard (https://neurostuff.github.io/NIMADS/).

### Input

The input is a search query string that aims to faithfully represent the searches that can be performed with the Elsevier API.

### Search

Using the Elsevier API, we search for articles matching the input query. The results are a list of articles with metadata, including their unique identifiers. The output of this stage is a list of article identifiers in a dictionary format. We look for DOI, PMID, and PMCID where available; if only a DOI is available, that is sufficient.
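
As a hypothetical illustration (the field names and values here are ours, not a confirmed API contract), one entry in that list might look like:

```python
# Illustrative shape of one search result entry; identifiers other than
# the DOI may be None when unavailable.
article = {
    "doi": "10.1016/j.example.2024.000001",  # sufficient on its own
    "pmid": "12345678",                      # optional
    "pmcid": None,                           # optional
}
```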

### Download

This stage takes in the list of article identifiers from the Search stage and downloads the full text of each article. The output of this stage is the full text of each article in a format suitable for processing by the Extraction stage. Downloading should be parallelized while respecting any rate limits imposed by the Elsevier API.
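
A minimal sketch of the intended parallelization pattern, assuming a hypothetical `fetch_full_text` coroutine that wraps the Elsevier full-text endpoint:

```python
import asyncio

async def fetch_full_text(doi: str) -> bytes:
    """Placeholder: the real implementation calls the Elsevier full-text API."""
    raise NotImplementedError

async def download_all(dois: list[str], concurrency: int = 4) -> list[bytes]:
    # The semaphore caps in-flight requests so the parallel downloads
    # stay within the API's rate limits.
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(doi: str) -> bytes:
        async with semaphore:
            return await fetch_full_text(doi)

    return await asyncio.gather(*(bounded(d) for d in dois))
```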

### Extraction

This stage processes the full text of each downloaded article to extract coordinate data. The output of this stage is a structured representation of the extracted coordinates, following the NIMADS standard: https://neurostuff.github.io/NIMADS/

This will also be parallelized to improve performance.

## Inspiration

The coordinate extraction is inspired by pubget (https://github.com/neuroquery/pubget) and ACE (https://github.com/neurosynth/ACE).

BUILD_STRATEGY.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Build Strategy

This document outlines the build strategy for our project, detailing the tools, processes, and best practices we follow to ensure efficient and reliable builds.

## Build Tools

- uv
- venv

## Testing

Testing is done using pytest and pytest-recording, which replays realistic HTTP interactions without hitting the live APIs every time.
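
As a sketch of what this looks like in practice (the module path and `search_articles` function are the planned interfaces from DEVELOPMENT_PLAN.md, not yet implemented):

```python
import asyncio

import pytest

@pytest.mark.vcr  # pytest-recording records the HTTP exchange once, then replays it
def test_search_returns_at_least_one_study():
    # Hypothetical import; the real module layout is sketched in DEVELOPMENT_PLAN.md.
    from elsevier_coordinate_extraction.search.api import search_articles

    studyset = asyncio.run(search_articles("TITLE(fmri)", max_results=1))
    assert studyset["studies"]  # replayed response should yield at least one study
```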

## Continuous Integration

We use GitHub Actions for continuous integration. The CI pipeline includes:

- Linting with ruff
- Running unit tests with pytest

## Writing code

After an initial implementation of the code architecture, we follow a test-driven development (TDD) approach to add new features and fix bugs. This ensures that our code is well-tested and reliable.

DEVELOPMENT_PLAN.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
# Elsevier Coordinate Extraction – Development Blueprint

## Package Layout & Interfaces

```
elsevier_coordinate_extraction/
├── __init__.py
├── settings.py
├── client.py
├── cache.py
├── rate_limits.py
├── nimads.py
├── types.py
├── search/
│   ├── __init__.py
│   └── api.py
├── download/
│   ├── __init__.py
│   └── api.py
├── extract/
│   ├── __init__.py
│   └── coordinates.py
└── pipeline.py
tests/
├── test_settings.py
├── search/
│   └── test_api.py
├── download/
│   └── test_api.py
├── extract/
│   └── test_coordinates.py
└── test_pipeline.py
```

### `settings.py`

- `@dataclass class Settings`: `api_key`, `base_url`, `timeout`, `concurrency`, `cache_dir`, `user_agent`.
- `get_settings() -> Settings`: loads `.env` via `python-dotenv`, memoizes the resulting object, validates required fields.

### `client.py`

- `class ScienceDirectClient`: async context manager wrapping `httpx.AsyncClient`.
  - Injects the API key header (`X-ELS-APIKey`) and default query params (e.g., `httpAccept`).
  - Accepts `Settings` and an optional external `AsyncClient`.
  - Exposes `async get_json(path: str, params: dict[str, str]) -> dict`.
  - Exposes `async get_xml(path: str, params: dict[str, str]) -> str`.
  - Handles retry/backoff using response status and headers; `rate_limits.py` assists with parsing `Retry-After` and known ScienceDirect policy (falling back to a static ceiling if headers are absent).

**Client usage example**

```python
from elsevier_coordinate_extraction.client import ScienceDirectClient
from elsevier_coordinate_extraction.settings import get_settings

settings = get_settings()

async with ScienceDirectClient(settings) as client:
    result = await client.get_json(
        "/search/sciencedirect",
        params={"query": "TITLE(fmri)", "count": "1"},
    )
```

The client automatically applies API key and user agent headers, enforces the configured concurrency limit, and retries when the Elsevier API returns `Retry-After` metadata.
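
For illustration, the `Retry-After` handling that `rate_limits.py` assists with could parse the header roughly as follows (a sketch, not the module's settled API; the header carries either a number of seconds or an HTTP date):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value: str | None, default: float = 1.0) -> float:
    """Sketch: turn a Retry-After header into a delay in seconds."""
    if header_value is None:
        return default
    try:
        return max(0.0, float(header_value))  # numeric form, e.g. "120"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(header_value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
```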

### `cache.py`

- `class FileCache`:
  - `async get(namespace: str, key: str) -> bytes | None`.
  - `async set(namespace: str, key: str, data: bytes, metadata: dict | None = None) -> None`.
  - Namespaces for `search`, `articles`, `assets`; keys derived from deterministic hashes.
- `CacheKey` helpers to hash query params and article identifiers.
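
A sketch of the deterministic key derivation the `CacheKey` helpers could use (the helper name and exact scheme are assumptions):

```python
import hashlib
import json

def query_cache_key(params: dict[str, str]) -> str:
    # Sort keys before hashing so logically identical queries always
    # produce the same cache key.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```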

### `types.py`

- `StudyMetadata`, `ArticleContent`, `AnalysisPayload`, `PointPayload` defined as `TypedDict`/`dataclass` to mirror the NIMADS schema.
- `StudysetPayload` alias for the top-level structure handed between modules.
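
For example, the point type might be sketched like this (the field names are our reading of the NIMADS docs, not a settled mapping):

```python
from typing import TypedDict

class PointPayload(TypedDict):
    # One activation coordinate within a NIMADS-style analysis.
    coordinates: list[float]  # [x, y, z]
    space: str | None         # e.g. "MNI" or "TAL"; None when undetected
```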

### `nimads.py`

- `def build_study(study_meta: StudyMetadata, *, analyses: list[AnalysisPayload] | None = None) -> dict`.
- `def build_studyset(name: str, studies: list[dict], metadata: dict | None = None) -> dict`.
- Optional `validate(payload: dict) -> None` hook using LinkML schemas if we decide to integrate them later (validation deferred for now).
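
Intended usage of the builders, with made-up values (a sketch against the planned signatures; the payload shapes are assumptions):

```python
from elsevier_coordinate_extraction.nimads import build_study, build_studyset

# Hypothetical inputs: a minimal study with one empty analysis.
study = build_study(
    {"title": "An example fMRI study", "doi": "10.1016/j.example.2024.000001"},
    analyses=[{"name": "main contrast", "points": []}],
)
studyset = build_studyset("fmri-search", studies=[study])
```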

### `search/api.py`

- `async def search_articles(query: str, *, max_results: int = 25, client: ScienceDirectClient | None = None, cache: FileCache | None = None) -> StudysetPayload`.
  - Builds ScienceDirect search endpoint params (`query`, `count`, `start`, requested fields for DOI/title/abstract/authors/open-access flags).
  - Handles pagination until `max_results` (sketched below).
  - Collects minimal study metadata (title, abstract, authors, journal, year, open-access flag, DOI/PII/Scopus ID) in NIMADS `Study` format; attaches ScienceDirect identifiers to `metadata`.
  - Persists search responses via cache keyed by query hash.
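
The pagination loop could be handled roughly as below (a sketch; the `"results"` response key is an assumption about the ScienceDirect payload):

```python
async def collect_results(client, query: str, max_results: int) -> list[dict]:
    # Page through the search endpoint until we have enough entries
    # or the API returns an empty page.
    results: list[dict] = []
    start = 0
    while len(results) < max_results:
        page = await client.get_json(
            "/search/sciencedirect",
            params={"query": query, "count": "25", "start": str(start)},
        )
        entries = page.get("results", [])  # key name is an assumption
        if not entries:
            break  # no more pages
        results.extend(entries)
        start += len(entries)
    return results[:max_results]
```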

### `download/api.py`

- `async def download_articles(studies: Sequence[StudyMetadata], *, formats: Sequence[str] = ("xml", "html"), client: ScienceDirectClient | None = None, cache: FileCache | None = None) -> list[ArticleContent]`.
  - Resolves the best-available format per study (prioritize XML; fall back to HTML/PDF; sketched below).
  - Uses `asyncio.Semaphore(settings.concurrency)` for parallel downloads.
  - Stores raw payload bytes and associated metadata (`content_type`, `is_open_access`, `retrieved_at`) in `ArticleContent`.
  - Persists downloads to cache using article-specific keys.
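
The format resolution might reduce to a simple preference walk (a sketch; the preference tuple mirrors the bullet above):

```python
FORMAT_PREFERENCE = ("xml", "html", "pdf")

def pick_format(available: set[str]) -> str | None:
    # Return the best-available format, preferring XML for extraction.
    for fmt in FORMAT_PREFERENCE:
        if fmt in available:
            return fmt
    return None
```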

### `extract/coordinates.py`

- `def extract_coordinates(articles: Sequence[ArticleContent]) -> StudysetPayload`.
  - Parses XML with `lxml` (mirroring Pubget heuristics).
  - `def _iter_table_fragments(xml_root) -> Iterable[TableFragment]`: yields raw table XML and metadata.
  - `def _parse_table(table_fragment: str) -> list[PointPayload]`: ported logic from `pubget._coordinates`.
  - `def _infer_coordinate_space(article_element) -> str | None`: replicates Pubget coordinate space detection.
  - Returns NIMADS-ready structures: each study gets `analyses` populated with `points` and includes raw table XML snippets in `metadata`.
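
A rough sketch of the table-walking step with `lxml` (heuristics are simplified here; Pubget's real logic is considerably more careful):

```python
from lxml import etree

def iter_candidate_rows(xml_bytes: bytes):
    # Walk every table row and yield the first three cell texts as a
    # candidate x/y/z triplet; real filtering happens downstream.
    root = etree.fromstring(xml_bytes)
    for table in root.iter("{*}table"):
        for row in table.iter("{*}tr"):
            cells = ["".join(cell.itertext()).strip() for cell in row.iter("{*}td")]
            if len(cells) >= 3:
                yield cells[:3]
```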

### `pipeline.py`

- `async def run_pipeline(query: str, *, max_results: int = 25, settings: Settings | None = None) -> StudysetPayload`.
  - Glues search → download → extract, reusing a shared `ScienceDirectClient` and `FileCache`.
  - Returns the final aggregated payload with both metadata and extracted coordinates.
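
End-to-end, the intent is a one-call entry point (a sketch; assumes a NIMADS-style `"studies"` key in the returned payload):

```python
import asyncio

from elsevier_coordinate_extraction.pipeline import run_pipeline

# Hypothetical end-to-end call; the query syntax mirrors the client example above.
studyset = asyncio.run(run_pipeline("TITLE(fmri)", max_results=5))
print(f"extracted coordinates for {len(studyset['studies'])} studies")
```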

## TDD Plan

1. **Settings & Configuration**
   - Write `test_settings.py` verifying `.env` loading, required key enforcement, and memoization.
   - Implement minimal `settings.py` until tests pass.

2. **HTTP Client Layer**
   - Draft `tests/search/test_client.py` (or inline in `test_api.py`) using `pytest-recording` to confirm headers, retries, and timeout behavior.
   - Implement `ScienceDirectClient` with httpx; stub out rate-limit parsing.

3. **Search Module**
   - Author tests that:
     - mock ScienceDirect responses (via recordings) to validate query building, pagination, metadata extraction, and cache hits.
     - assert the returned structure matches NIMADS study schema fields (title, abstract, authors, year, journal, open-access flag).
   - Build `search/api.py` to satisfy the tests.

4. **Download Module**
   - Create fixtures with recorded ScienceDirect full-text responses (XML + HTML fallback).
   - Tests cover: format preference, concurrency (mocked semaphore), cache usage, error handling/backoff.
   - Implement `download/api.py` accordingly.

5. **Extraction Module**
   - Begin with unit tests using sample XML snippets (derived from Pubget tests/examples) to ensure coordinate parsing matches expectations.
   - Add an integration test comparing against a known article used in Pubget to validate coordinate and space inference.
   - Port the necessary parsing helpers into `extract/coordinates.py`.

6. **Pipeline Integration**
   - Write an async pipeline test with mocked modules to ensure data is passed through correctly and the NIMADS payload is assembled.
   - Add an optional recording-based end-to-end test gated behind a marker to avoid frequent API calls.

7. **NIMADS Helpers & Validation**
   - Tests verifying the builder functions assemble schemas correctly and that point metadata includes raw table XML fragments.
   - Implement `nimads.py` helpers plus an optional schema validation toggle.

8. **Rate-Limit Handling**
   - Tests simulate responses with and without headers like `Retry-After` to check the backoff logic.
   - Implement `rate_limits.py` to parse response headers and enforce delays (fall back to documented limits determined during implementation).

9. **Cache Layer**
   - Tests ensure deterministic key generation, read/write round-trips, and concurrent access safety.
   - Implement `FileCache` (async wrappers around `asyncio.to_thread` for disk IO if necessary).

10. **Documentation & Examples**
    - After functionality stabilizes, add README usage examples and docstrings, ensuring TDD artifacts remain green.

## Targeted Download & Extraction TDD (using test DOIs)

To exercise ScienceDirect endpoints without live dependencies, we use the `test_dois` fixture defined in `tests/conftest.py`. Recording rules:

- Set `PYTEST_RECORDING_MODE=once` (default). Update recordings intentionally when request parameters change.
- Recordings live under `tests/cassettes/download/` and `tests/cassettes/extract/`, named per test function.

### Download module tests

1. `tests/download/test_api.py::test_download_single_article_xml`
   - Uses cassette `download/test_download_single_article_xml.yaml`.
   - Asserts the API retrieves XML payload bytes, content type, and article identifiers (DOI, PII).
2. `test_download_handles_cached_payload`
   - Mocks the cache layer; ensures cached entries skip the HTTP call.
3. `test_download_parallel_respects_concurrency`
   - Parametrized with two DOIs; asserts the semaphore limits concurrency via captured timestamps.

### Extraction module tests

1. `tests/extract/test_coordinates.py::test_extract_coordinates_from_sample_xml`
   - Loads recorded XML from the download stage (fixture).
   - Validates that we detect coordinate tables, parse x/y/z triplets, infer MNI/TAL space, and attach raw table XML.
2. `test_extract_returns_nimads_structure`
   - Confirms the output includes `study`, `analyses`, and `points` fields shaped like the NIMADS schema.
3. An integration test combining download + extract for a single DOI, using a cached cassette to simulate the full pipeline without external calls.

Each test should fail prior to implementation, guiding incremental development of `download/api.py`, the extraction helpers, and new type utilities.

Throughout development, record HTTP interactions with `pytest-recording`, keep tests deterministic via cache fixtures, and use Ruff for linting.

elsevier_coordinate_extraction/client.py

Lines changed: 14 additions & 9 deletions
@@ -91,15 +91,20 @@ async def _ensure_client(self) -> None:
         if self._settings.insttoken:
             headers["X-ELS-Insttoken"] = self._settings.insttoken
         timeout = httpx.Timeout(self._settings.timeout)
-        proxy_value = self._settings.https_proxy or self._settings.http_proxy
-        self._client = httpx.AsyncClient(
-            base_url=self._settings.base_url,
-            timeout=timeout,
-            headers=headers,
-            transport=self._transport,
-            http2=True,
-            proxy=proxy_value,
-        )
+        client_kwargs: dict[str, Any] = {
+            "base_url": self._settings.base_url,
+            "timeout": timeout,
+            "headers": headers,
+            "transport": self._transport,
+            "http2": True,
+        }
+        if self._settings.use_proxy:
+            proxy_value = self._settings.https_proxy or self._settings.http_proxy
+            if proxy_value:
+                client_kwargs["proxy"] = proxy_value
+        else:
+            client_kwargs["trust_env"] = False
+        self._client = httpx.AsyncClient(**client_kwargs)
 
     async def _request(
         self,

elsevier_coordinate_extraction/settings.py

Lines changed: 23 additions & 0 deletions
@@ -31,6 +31,23 @@ class Settings:
     insttoken: str | None
     http_proxy: str | None
     https_proxy: str | None
+    use_proxy: bool
+
+
+_TRUE_VALUES: Final[set[str]] = {"1", "true", "yes", "on"}
+_FALSE_VALUES: Final[set[str]] = {"0", "false", "no", "off"}
+
+
+def _coerce_bool(value: str | None, *, default: bool) -> bool:
+    """Convert common textual boolean representations to bool."""
+    if value is None:
+        return default
+    normalized = value.strip().lower()
+    if normalized in _TRUE_VALUES:
+        return True
+    if normalized in _FALSE_VALUES:
+        return False
+    return default
 
 
 def get_settings(*, force_reload: bool = False) -> Settings:
@@ -63,6 +80,11 @@ def get_settings(*, force_reload: bool = False) -> Settings:
     insttoken = os.getenv("ELSEVIER_INSTTOKEN")
     http_proxy = os.getenv("ELSEVIER_HTTP_PROXY")
     https_proxy = os.getenv("ELSEVIER_HTTPS_PROXY")
+    default_use_proxy = bool(http_proxy or https_proxy)
+    use_proxy = _coerce_bool(
+        os.getenv("ELSEVIER_USE_PROXY"),
+        default=default_use_proxy,
+    )
 
     _CACHED_SETTINGS = Settings(
         api_key=api_key,
@@ -74,5 +96,6 @@ def get_settings(*, force_reload: bool = False) -> Settings:
         insttoken=insttoken,
         http_proxy=http_proxy,
         https_proxy=https_proxy,
+        use_proxy=use_proxy,
     )
     return _CACHED_SETTINGS
