Commit e53c63b

feat(env and documentation): Update .env.example to remove deprecated GitHub Enterprise variable and add new context summary scout mode configuration. Enhance AGENTS.md and getting-started.md with updated Open Pulse Ontology version and additional details on the extraction process. Modify index.md and rag-indices.md to reflect changes in data storage layout and improve clarity on indexer configurations.
1 parent 9777306 commit e53c63b

111 files changed

Lines changed: 4472 additions & 542 deletions


.env.example

Lines changed: 19 additions & 6 deletions
@@ -50,10 +50,6 @@ OPENROUTER_API_KEY=your-openrouter-key-here
 # fetching (the pipeline degrades gracefully with warnings).
 # SELENIUM_REMOTE_URL=http://localhost:4444
 
-# Override for the GitHub host when targeting GitHub Enterprise. Default is
-# `https://github.com`.
-# V2_GITHUB_BASE_URL=https://github.com
-
 # ---------------------------------------------------------------------------
 # V2 pipeline runtime
 # ---------------------------------------------------------------------------
@@ -73,6 +69,21 @@ OPENROUTER_API_KEY=your-openrouter-key-here
 # `excluded_entities` with reason "critic pruning".
 # V2_APPLY_CRITIC_PRUNING=false
 
+# Scout-mode upgrade for the `context_summary` LLM stage. When `true`,
+# the summary agent gets the broad RAG-search toolkit (orcid, ror,
+# infoscience, openalex, zenodo, ethz_research_collection, huggingface,
+# renkulab, snsf, epfl_graph + selenium_fetch) on top of the legacy
+# `grep_repository_corpus` + DuckDuckGo pair, and switches to a
+# scout-style system prompt that produces a structured brief with
+# explicit People / Organizations / Articles / Affiliations / Caveats
+# sections. Per-entity agents (person, org, article, membership,
+# contribution) automatically benefit since they already consume the
+# `summary_markdown` field. Trade-off: one heavier upfront LLM call,
+# but per-entity calls send less context (raw blobs already stripped)
+# and duplicate ORCID/ROR lookups across entities collapse into the
+# scout's shared brief. Off by default.
+# V2_CONTEXT_SUMMARY_SCOUT_MODE=false
+
 # ---------------------------------------------------------------------------
 # V2 caches
 # ---------------------------------------------------------------------------
@@ -132,8 +143,10 @@ OPENROUTER_API_KEY=your-openrouter-key-here
 # INDEX_QDRANT_API_KEY=
 
 # Optional runtime overrides (rarely needed — defaults from YAML are correct
-# for the EPFL deployment).
-# INDEX_QDRANT_URL=http://localhost:6333
+# for the EPFL deployment). When running inside the devcontainer the Qdrant
+# service is reachable as `gme-qdrant`; from a host shell or external runner
+# use `localhost`.
+# INDEX_QDRANT_URL=http://gme-qdrant:6333
 # INDEX_QDRANT_PREFER_GRPC=false
 # INDEX_OPENALEX_SCOPE_ROR=https://ror.org/02s376052
 # INDEX_OPENALEX_SCOPE_COUNTRY=ch
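
The two settings touched in `.env.example` could be resolved in code roughly as follows. This is a minimal sketch, not the project's settings loader: the helper names `env_flag` and `resolve_qdrant_url`, the `in_devcontainer` parameter, and the set of accepted truthy strings are all assumptions made for illustration. Only the variable names, the `false` default, and the two host names come from the diff above.

```python
import os
from typing import Mapping, Optional

# Assumption: which strings count as "true" is a convention chosen here.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag; an unset or empty value falls back to `default`."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip().lower()
    if not raw:
        return default
    return raw in TRUTHY


def resolve_qdrant_url(in_devcontainer: bool,
                       env: Optional[Mapping[str, str]] = None) -> str:
    """Honor INDEX_QDRANT_URL when set; otherwise pick the host by location."""
    env = os.environ if env is None else env
    override = env.get("INDEX_QDRANT_URL", "").strip()
    if override:
        return override
    # Hypothetical fallback: service name inside the devcontainer,
    # localhost from a host shell or external runner.
    host = "gme-qdrant" if in_devcontainer else "localhost"
    return f"http://{host}:6333"


if __name__ == "__main__":
    print(env_flag("V2_CONTEXT_SUMMARY_SCOUT_MODE"))
    print(resolve_qdrant_url(in_devcontainer=False))
```

Passing `env` explicitly keeps both helpers easy to test without mutating the process environment.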

AGENTS.md

Lines changed: 63 additions & 9 deletions
@@ -7,7 +7,7 @@ human back-and-forth.
 ## What this tool is
 
 A FastAPI service that turns a GitHub URL (repository / user / org) into
-JSON-LD aligned with **Open Pulse Ontology v2.0.0**. The service runs the
+JSON-LD aligned with **Open Pulse Ontology v2.1.2**. The service runs the
 input through a multi-stage pipeline that combines deterministic rules,
 provider lookups (GitHub REST, ROR, ORCID, Infoscience), and optional LLM
 agents to produce a graph of `schema:SoftwareSourceCode`,
@@ -46,6 +46,11 @@ src/v2/
 # docs/v2-rag-tools.md
 rule_based/
 <kind>_agent.py # deterministic counterparts (no LLM)
+refiners/ # hybrid-runtime LLM refiners — propose targeted
+<kind>/agent.py # patches over rule-based output, whitelisted fields only
+# organization/agent.py — pulse:OrganizationType
+# repository/agent.py — pulse:discipline, pulse:repositoryType
+# person/agent.py — schema:name (handle→canonical)
 
 ingest/
 cache.py # ProviderCache (SQLite, WAL)
@@ -102,24 +107,36 @@ deterministic rule-based agents). All other stages run unconditionally.
 15. assemble_output split graph into root + related + excluded
 16. link_veracity [LLM, gated] verify every URL via Selenium fetch + LLM
 17. validate_articles drop placeholder/sentinel-DOI articles
-18. validate_ownership strip mismatched pulse:owns
-19. infer_owners stamp pulse:owns / pulse:ownedBy from handles
-20. infer_github_handle_parents fuzzy-search ROR for parent of every github
+18. validate_author_classes drop `schema:author` refs whose target is
+    not a `schema:Person` (closes a SHACL
+    class-shape violation that was previously
+    warning-only)
+19. validate_ownership strip mismatched pulse:owns
+20. infer_owners stamp pulse:owns / pulse:ownedBy from handles;
+    also coerces any residual bare-login string
+    on `pulse:ownedBy` (e.g. "luzpaz") to the
+    IRI shape `{"@id": "https://github.com/luzpaz"}`
+    so SHACL never sees a `<file:///CWD/...>` URI
+21. infer_github_handle_parents fuzzy-search ROR for parent of every github
     org; add ROR org entities, stamp unitOf
-21. org_relationships [LLM] whole-graph LLM call to refine unitOf edges
-22. infer_org_units deterministic name-token fallback for unitOf
-23. concept_tagging [gated] pull EPFL Graph concepts/keywords/disciplines
+22. org_relationships [LLM] whole-graph LLM call to refine unitOf edges
+23. infer_org_units deterministic name-token fallback for unitOf
+24. concept_tagging [gated] pull EPFL Graph concepts/keywords/disciplines
     from the README onto the root repo entity as
     internal `_concepts`/`_keywords`/`_disciplines`
     metadata (off by default; opt-in with
     `V2_CONCEPT_TAGGING_ENABLED=true`)
-24. build_jsonld_output produce the final JSON-LD graph
+25. build_jsonld_output produce the final JSON-LD graph; also
+    strips redundant `pulse:ror` from any
+    `org:Organization` whose `@id` is already
+    the ROR (closed-shape violation fix)
 ```
 
 **Gates:**
 
 - Stages tagged `[LLM]` only run in `agent_runtime=llm`.
 - `link_veracity` is `[LLM]`-only too: in `agent_runtime=rule_based` it is **always skipped** (rule-based mode is guaranteed LLM-free).
+- `agent_runtime=hybrid` runs the rule-based generators (stages 4-9) **and** an LLM refiner stage (`refine_with_llm`, between reconciliation and `guarantee_repo_author`). The refiner proposes whitelisted-field patches per entity type: organizations (`pulse:OrganizationType`), repositories (`pulse:discipline`, `pulse:repositoryType` only when current is `pulse:Other`), and persons (`schema:name` only when current looks like a GitHub handle). LLM-only stages (`llm_dedup`, `llm_critic`, `link_veracity`, `org_relationships`) are **skipped** in hybrid mode. Toggle with `V2_HYBRID_REFINER_ENABLED` (default `true`).
 - `llm_critic` is **off by default**. Set `V2_APPLY_CRITIC_PRUNING=true` to enable (LLM mode only).
 - `link_veracity` is **on by default in LLM mode**. Set `V2_LINK_VERACITY_ENABLED=false` to skip even in LLM mode (recommended for batch runs).
 - `concept_tagging` is **off by default**. Set `V2_CONCEPT_TAGGING_ENABLED=true` to opt in. Backends are pluggable via `V2_CONCEPT_TAGGING_BACKEND` ∈ {`epfl_graph` (default, calls graphai), `wikipedia` (credential-free MediaWiki opensearch), `llm` (pydantic-ai)}. Stamps `_concepts` / `_keywords` / `_disciplines` as internal `_*` metadata (stripped before JSON-LD output and strict validation). Optional OpenAlex enrichment per discipline via `V2_CONCEPT_TAGGING_OPENALEX_RELATED_ENABLED=true` (publications, people, units). Full reference at [`docs/concept-tagging.md`](docs/concept-tagging.md).
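
The hybrid-runtime gate above describes a refiner that may only patch whitelisted fields, with extra conditions on two of them. A minimal sketch of that filtering step is shown below. The function names `apply_patch` and `_allowed`, the `HANDLE_RE` heuristic for "looks like a GitHub handle", and the plain-dict entity shape are assumptions for illustration; the actual refiner code is not part of this diff.

```python
import re
from typing import Any, Dict

Entity = Dict[str, Any]

# Fields the refiner may touch, per entity kind (from the Gates text above).
WHITELIST = {
    "organization": {"pulse:OrganizationType"},
    "repository": {"pulse:discipline", "pulse:repositoryType"},
    "person": {"schema:name"},
}

# Assumption: a handle-looking name is a single token of GitHub-legal
# characters. A one-word real name would also match this heuristic.
HANDLE_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]{0,38}$")


def _allowed(kind: str, field: str, current: Any) -> bool:
    """Decide whether the refiner may overwrite `field` on this entity."""
    if field not in WHITELIST.get(kind, set()):
        return False
    if kind == "repository" and field == "pulse:repositoryType":
        # Only refine the type when the rule-based value is the fallback.
        return current == "pulse:Other"
    if kind == "person" and field == "schema:name":
        return isinstance(current, str) and bool(HANDLE_RE.match(current))
    return True


def apply_patch(kind: str, entity: Entity, patch: Entity) -> Entity:
    """Return a copy of `entity` with only the permitted patch fields applied."""
    refined = dict(entity)
    for field, value in patch.items():
        if _allowed(kind, field, entity.get(field)):
            refined[field] = value
    return refined


if __name__ == "__main__":
    person = {"@id": "https://github.com/luzpaz", "schema:name": "luzpaz"}
    print(apply_patch("person", person, {"schema:name": "Luz Paz"}))
```

Filtering after the LLM call, rather than trusting the prompt, means a non-whitelisted field in the model output is dropped instead of reaching the graph.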
@@ -131,6 +148,43 @@ deterministic rule-based agents). All other stages run unconditionally.
 - Contribution agent post-LLM: stamps `schema:author = target_person.id` and `pulse:contributionTo = target_repository.id` from the orchestrator's authoritative pair, regardless of what the LLM emits.
 - Article agent post-LLM: drops the entity if `schema:identifier` is a placeholder DOI (`10.0000/...`) or sentinel string (`UNKNOWN`, `N/A`, `TBD`, etc.) and no `pulse:infoscienceArticleIdentifier` is present.
 - Article agent (rule-based) defaults to repo-name-only Infoscience queries; opt in to the wider `include_person_queries=True` / `include_organization_queries=True` blend only when over-attribution risk is low.
+- Membership agents (both rule-based and LLM): swap `time:hasBeginning` / `time:hasEnd` when ORCID returns the pair inverted, so `hasBeginning <= hasEnd` always holds.
+
+**SHACL conformance auto-fixes (warning-only `shacl_gate`, but the graph is fixed in place):**
+
+The SHACL gate emits violations as `result.warnings` rather than aborting,
+so the responsibility for producing a SHACL-clean graph lives in the
+upstream stages and agents. The four most common violations are
+addressed deterministically:
+
+1. `pulse:ownedBy` IRI shape — `infer_owners` rewrites bare-login strings
+to `{"@id": "https://github.com/{handle}"}`. Without this, SHACL
+resolves the bare token against the working-directory base URI
+(`<file:///workspaces/project/luzpaz>`) and the closed-shape check on
+`schema:Person | org:Organization` fails.
+2. `pulse:ror` redundancy — `build_jsonld_output` strips the field on
+any `org:Organization` whose `@id` is already the ROR. The
+Organization shape is `sh:closed` and rejects `pulse:ror`; the
+`@id` already carries that information.
+3. `Membership` date order — both membership agents swap inverted dates
+(above).
+4. `schema:author` class — `validate_author_classes` filters refs whose
+target is missing from the graph or not typed `schema:Person`.
+Catches Membership / Contribution ids leaking into author lists.
+
+**Orchestrator fanout filtering:**
+
+- `_filter_person_work_items` skips a queued person fanout when the
+GitHub handle resolves to `type=Organization` (cached
+`provider.github.get_user(login)` lookup).
+- `_filter_org_work_items` mirrors the rule for org fanouts: skips
+handles whose GitHub `type` is `User`. This prevents the 4× retry
+loop in `org_agent` for personal handles encoded into
+`org:hasMembership` composite ids.
+- `_person_fanout_contexts` materialises a User-account repo owner as a
+Person when missing from `contributors` (abandoned repos, empty
+repos). Prevents the owner from leaking as a bare-string ref in
+`pulse:ownedBy` / `schema:author` with no backing entity.
 
 ## API surface
 
@@ -159,8 +213,8 @@ V1 endpoints (`/v1/extract`, `/v1/cache/*`) are still mounted but frozen
 | `SELENIUM_REMOTE_URL` | unset | enables Selenium-backed link veracity + selenium-fetch tool |
 | `V2_AGENT_RUNTIME_DEFAULT` | `llm` | default runtime when `/v2/extract` omits `agent_runtime` |
 | `V2_USE_MOCK_PROVIDERS` | `true` | swap in mock GitHub/ORCID/Infoscience/ROR providers |
-| `V2_GITHUB_BASE_URL` | `https://github.com` | for GitHub Enterprise |
 | `V2_LINK_VERACITY_ENABLED` | `true` | turn off to skip the link-veracity stage in LLM mode (rule-based mode skips unconditionally) |
+| `V2_CONTEXT_SUMMARY_SCOUT_MODE` | `false` | upgrade `context_summary` LLM stage to scout mode: broad RAG-search toolkit (orcid/ror/infoscience/openalex/zenodo/ethz/huggingface/renkulab/snsf/epfl_graph + selenium_fetch) on top of the legacy `grep_repository_corpus` + DuckDuckGo pair, plus a structured-brief prompt with explicit People / Organizations / Articles / Affiliations / Caveats sections. Per-entity LLM agents (person, org, article, membership, contribution) automatically benefit since they already consume the `summary_markdown`. Trade-off: heavier upfront LLM call, but per-entity calls send less context and duplicate ORCID/ROR lookups across entities collapse into the scout's shared brief. |
 | `V2_INFOSCIENCE_RAG_ENABLED` | `true` | enables the Infoscience RAG agent tools (Qdrant-backed semantic search + on-demand chunk/record fetch). Construction degrades gracefully when Qdrant or RCP is unreachable. |
 | `V2_ETHZ_RESEARCH_COLLECTION_RAG_ENABLED` | `true` | enables the ETH Research Collection RAG agent tools (DSpace-backed sister index to Infoscience for ETHZ research outputs). Same shape: search + fetch_chunks + fetch_records. |
 | `V2_HUGGINGFACE_RAG_ENABLED` | `true` | enables the HuggingFace Hub RAG search tool (collections: `hf_models`, `hf_datasets`, `hf_spaces`, `hf_orgs`). |
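
Two of the SHACL auto-fixes listed in the AGENTS.md changes (Membership date order and the `schema:author` class filter) are simple enough to sketch. The code below is an illustration under stated assumptions, not the repository's implementation: entities are plain dicts, dates are ISO 8601 strings of equal precision, author refs are `{"@id": ...}` objects, and the names `swap_inverted_dates`, `filter_author_refs`, and `graph_by_id` are hypothetical.

```python
from typing import Any, Dict, Optional

Node = Dict[str, Any]


def swap_inverted_dates(membership: Node) -> Node:
    """Return a copy where time:hasBeginning <= time:hasEnd."""
    begin = membership.get("time:hasBeginning")
    end = membership.get("time:hasEnd")
    # Assumption: both values are ISO 8601 strings of the same precision,
    # so plain string comparison matches chronological order.
    if isinstance(begin, str) and isinstance(end, str) and begin > end:
        return {**membership, "time:hasBeginning": end, "time:hasEnd": begin}
    return membership


def _is_person(node: Optional[Node]) -> bool:
    """True when the node exists and is typed schema:Person."""
    if node is None:
        return False
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "schema:Person" in types


def filter_author_refs(entity: Node, graph_by_id: Dict[str, Node]) -> Node:
    """Drop schema:author refs whose target is missing or not a Person."""
    refs = entity.get("schema:author")
    if not isinstance(refs, list):
        return entity
    kept = [
        ref for ref in refs
        if isinstance(ref, dict) and _is_person(graph_by_id.get(ref.get("@id")))
    ]
    return {**entity, "schema:author": kept}


if __name__ == "__main__":
    inverted = {"time:hasBeginning": "2021-01-01", "time:hasEnd": "2019-06-30"}
    print(swap_inverted_dates(inverted))
```

Both helpers return new dicts rather than mutating their input, which keeps them safe to call from any stage regardless of who else holds a reference to the graph.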

docs/getting-started.md

Lines changed: 0 additions & 1 deletion
@@ -35,7 +35,6 @@ Optional integrations:
 - `INFOSCIENCE_TOKEN` — protected Infoscience routes only.
 - `SELENIUM_REMOTE_URL` — enables the link-veracity pipeline stage and
 the `fetch_link_content_via_selenium` LLM tool.
-- `V2_GITHUB_BASE_URL` — for GitHub Enterprise.
 
 Per-indexer politeness (only needed when running the corresponding
 indexer):

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Git Metadata Extractor
 
 A FastAPI service that turns a GitHub URL (repository / user / org) into
-JSON-LD aligned with **Open Pulse Ontology v2.0.0**, plus nine sibling RAG
+JSON-LD aligned with **Open Pulse Ontology v2.1.2**, plus nine sibling RAG
 indices over EPFL/Swiss research catalogues that the v2 LLM agents can
 query during extraction.
 

docs/rag-indices.md

Lines changed: 35 additions & 12 deletions
@@ -221,21 +221,44 @@ done
 
 ## Storage layout
 
+Each per-index directory carries its own DuckDB store and any
+ingest-time scratch (raw downloads, fetch caches, run logs). Qdrant
+runs as a single shared service whose persistence lives **outside**
+`data/index/` because it backs all indices simultaneously.
+
 ```
-data/index/
-  huggingface/{duckdb,cards,logs,cache}/
-  openalex/{duckdb,cache}/
-  infoscience/{duckdb,raw,chunks,dumps}/
-  orcid/{duckdb,records,cache}/
-  ror/{duckdb,dump,cache}/
-  zenodo/{duckdb,records}/
-  ethz_research_collection/{duckdb,raw,chunks}/
-  github/{duckdb,readmes}/
-  snsf/{duckdb,records}/
-  qdrant/storage/ # shared by ALL indices, one collection per (index, entity_type)
+data/
+  index/
+    huggingface/{duckdb,cards,cache,logs}/
+    openalex/{duckdb,cache,logs}/
+    infoscience/{duckdb,raw,text,dumps,chroma,matches.jsonl,organizations.txt,persons.txt,relations.jsonl,discover_state.json}/
+    orcid-epfl/{duckdb,cache,logs}/
+    orcid-switzerland/{duckdb,cache,logs,discover.log,discover_resume.log}/
+    ror/{duckdb,dump,index}/
+    zenodo/{duckdb,cache,logs,state}/
+    ethz-research-collection/{duckdb,raw,text,matches.jsonl,organizations.txt,persons.txt,relations.jsonl,discover_state.json}/
+    github/{duckdb,cards,cache,logs}/
+    snsf/{duckdb,raw}/
+    renkulab/{duckdb,cache,logs,state}/
+    epfl_graph/{duckdb,cache,logs}/
+    swissubase/{duckdb,cache,logs,state}/
+  qdrant/storage/ # shared by ALL indices, one collection per (index, entity_type)
 ```
 
-Backups: each `.duckdb` file is a self-contained SQLite-ish snapshot — `cp` it. The Qdrant collections can be regenerated from DuckDB via `<index>-embed`, so they don't strictly need to be backed up.
+Per-subdir convention:
+
+- `duckdb/` — the canonical DuckDB store (`<index>.duckdb` + WAL).
+- `raw/` — bulk inputs from the upstream source (CSVs, JSON dumps).
+Used by indices whose ingest is local-file-driven (SNSF, Infoscience,
+ETHZ Research Collection).
+- `cache/` — per-record HTTP / API cache for incremental ingest.
+- `logs/` — ingest run logs.
+- `state/` — resumable ingest checkpoints (Zenodo, RenkuLab, SWISSUbase).
+- `cards/`, `text/`, `dumps/`, `discover_state.json`, `matches.jsonl`,
+`organizations.txt`, `persons.txt`, `relations.jsonl` — index-specific
+intermediate artefacts. See the per-index docs and CLI help.
+
+Backups: each `.duckdb` file is a self-contained SQLite-ish snapshot — `cp` it. The Qdrant collections can be regenerated from DuckDB via `<index>-embed`, so they don't strictly need to be backed up. Qdrant persistence lives in `data/qdrant/storage/` (bind-mounted into the `gme-qdrant` container at `/qdrant/storage`).
 
 ## Related documentation
 
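
The backup note in `docs/rag-indices.md` amounts to copying every `<index>.duckdb` file out of `data/index/<index>/duckdb/`. A minimal sketch of that step follows. The function name `backup_duckdb_stores` and the flat `backup_root/<index>/` target layout are hypothetical, and the sketch assumes no ingest is writing to the stores while the copy runs, since the layout above mentions a WAL next to each store.

```python
import shutil
from pathlib import Path
from typing import List


def backup_duckdb_stores(data_root: Path, backup_root: Path) -> List[Path]:
    """Copy every data/index/<index>/duckdb/*.duckdb file into backup_root.

    Qdrant storage is skipped on purpose: per the docs it can be
    regenerated from DuckDB, so only the DuckDB stores are copied.
    """
    copied: List[Path] = []
    for src in sorted((data_root / "index").glob("*/duckdb/*.duckdb")):
        index_name = src.parent.parent.name  # e.g. "ror", "zenodo"
        dest_dir = backup_root / index_name
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied


if __name__ == "__main__":
    for path in backup_duckdb_stores(Path("data"), Path("backups")):
        print(path)
```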

docs/v2-api-reference.md

Lines changed: 13 additions & 6 deletions
@@ -183,13 +183,20 @@ deterministic rule-based agents). Other stages run unconditionally.
 15. assemble_output split graph into root + related + excluded
 16. link_veracity [LLM, gated] verify every URL via Selenium fetch + LLM
 17. validate_articles drop placeholder / sentinel-DOI articles
-18. validate_ownership strip mismatched pulse:owns
-19. infer_owners stamp pulse:owns / pulse:ownedBy from handles
-20. infer_github_handle_parents fuzzy-search ROR for parent of every github
+18. validate_author_classes drop `schema:author` refs whose target is
+    not a `schema:Person`
+19. validate_ownership strip mismatched pulse:owns
+20. infer_owners stamp pulse:owns / pulse:ownedBy from handles;
+    coerces residual bare-login strings on
+    `pulse:ownedBy` to `{"@id": "https://github.com/{handle}"}`
+21. infer_github_handle_parents fuzzy-search ROR for parent of every github
     org; add ROR org entities, stamp unitOf
-21. org_relationships [LLM] whole-graph LLM call to refine unitOf edges
-22. infer_org_units deterministic name-token fallback for unitOf
-23. build_jsonld_output produce the final JSON-LD graph
+22. org_relationships [LLM] whole-graph LLM call to refine unitOf edges
+23. infer_org_units deterministic name-token fallback for unitOf
+24. build_jsonld_output produce the final JSON-LD graph; strips
+    redundant `pulse:ror` from any
+    `org:Organization` whose `@id` is already
+    the ROR (closed-shape fix)
 ```
 
 **Gates:**
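
The two deterministic rewrites added to stages 20 and 24 in this file can be sketched as small pure functions. This is an illustration only: the names `coerce_owned_by` and `strip_redundant_ror`, the plain-dict entity shape, and the "no `://` means bare login" test are assumptions. The target IRI shape and the `https://ror.org/` prefix follow the text of the diff.

```python
from typing import Any, Dict

Node = Dict[str, Any]

GITHUB_BASE = "https://github.com"
ROR_PREFIX = "https://ror.org/"


def coerce_owned_by(entity: Node) -> Node:
    """Rewrite a bare-login pulse:ownedBy string to an {"@id": ...} ref."""
    owner = entity.get("pulse:ownedBy")
    # Assumption: a value without a scheme is a bare GitHub login. Left as a
    # string, it would be resolved against the working-directory base URI.
    if isinstance(owner, str) and "://" not in owner:
        return {**entity, "pulse:ownedBy": {"@id": f"{GITHUB_BASE}/{owner}"}}
    return entity


def strip_redundant_ror(entity: Node) -> Node:
    """Drop pulse:ror from an org:Organization whose @id is already the ROR."""
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]
    is_org = "org:Organization" in types
    id_is_ror = str(entity.get("@id", "")).startswith(ROR_PREFIX)
    if is_org and id_is_ror and "pulse:ror" in entity:
        return {k: v for k, v in entity.items() if k != "pulse:ror"}
    return entity


if __name__ == "__main__":
    print(coerce_owned_by({"pulse:ownedBy": "luzpaz"}))
    print(strip_redundant_ror({
        "@id": "https://ror.org/02s376052",
        "@type": "org:Organization",
        "pulse:ror": "https://ror.org/02s376052",
    }))
```

Both functions are idempotent, so running them on an already-clean graph changes nothing.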
