feat(env and documentation): Update .env.example to remove deprecated GitHub Enterprise variable and add new context summary scout mode configuration. Enhance AGENTS.md and getting-started.md with updated Open Pulse Ontology version and additional details on the extraction process. Modify index.md and rag-indices.md to reflect changes in data storage layout and improve clarity on indexer configurations.
-24. build_jsonld_output produce the final JSON-LD graph
+25. build_jsonld_output produce the final JSON-LD graph; also
+strips redundant `pulse:ror` from any
+`org:Organization` whose `@id` is already
+the ROR (closed-shape violation fix)
 ```

 **Gates:**

 - Stages tagged `[LLM]` only run in `agent_runtime=llm`.
 - `link_veracity` is `[LLM]`-only too: in `agent_runtime=rule_based` it is **always skipped** (rule-based mode is guaranteed LLM-free).
+- `agent_runtime=hybrid` runs the rule-based generators (stages 4-9) **and** an LLM refiner stage (`refine_with_llm`, between reconciliation and `guarantee_repo_author`). The refiner proposes whitelisted-field patches per entity type: organizations (`pulse:OrganizationType`), repositories (`pulse:discipline`, `pulse:repositoryType` only when current is `pulse:Other`), and persons (`schema:name` only when current looks like a GitHub handle). LLM-only stages (`llm_dedup`, `llm_critic`, `link_veracity`, `org_relationships`) are **skipped** in hybrid mode. Toggle with `V2_HYBRID_REFINER_ENABLED` (default `true`).
 - `llm_critic` is **off by default**. Set `V2_APPLY_CRITIC_PRUNING=true` to enable (LLM mode only).
 - `link_veracity` is **on by default in LLM mode**. Set `V2_LINK_VERACITY_ENABLED=false` to skip even in LLM mode (recommended for batch runs).
 - `concept_tagging` is **off by default**. Set `V2_CONCEPT_TAGGING_ENABLED=true` to opt in. Backends are pluggable via `V2_CONCEPT_TAGGING_BACKEND` ∈ {`epfl_graph` (default, calls graphai), `wikipedia` (credential-free MediaWiki opensearch), `llm` (pydantic-ai)}. Stamps `_concepts` / `_keywords` / `_disciplines` as internal `_*` metadata (stripped before JSON-LD output and strict validation). Optional OpenAlex enrichment per discipline via `V2_CONCEPT_TAGGING_OPENALEX_RELATED_ENABLED=true` (publications, people, units). Full reference at [`docs/concept-tagging.md`](docs/concept-tagging.md).
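The gate rules above can be read as a single decision function. The following is a minimal sketch under stated assumptions, not the project's implementation: `stage_enabled`, `DEFAULT_FLAGS`, and the `flags` dict are hypothetical names, and the sketch assumes `concept_tagging` is gated only by its flag. The stage names, runtime values, and `V2_*` flag names and defaults are taken from the list.

```python
from typing import Dict, Optional

# Hypothetical illustration of the documented gates; not the project's code.
LLM_ONLY_STAGES = {"llm_dedup", "llm_critic", "link_veracity", "org_relationships"}

# Defaults as stated in the gate list.
DEFAULT_FLAGS: Dict[str, bool] = {
    "V2_HYBRID_REFINER_ENABLED": True,
    "V2_APPLY_CRITIC_PRUNING": False,
    "V2_LINK_VERACITY_ENABLED": True,
    "V2_CONCEPT_TAGGING_ENABLED": False,
}


def stage_enabled(stage: str, agent_runtime: str,
                  flags: Optional[Dict[str, bool]] = None) -> bool:
    """Return True when `stage` should run under `agent_runtime`."""
    merged = {**DEFAULT_FLAGS, **(flags or {})}
    if stage == "refine_with_llm":
        # The refiner exists only in hybrid mode and can be toggled off.
        return agent_runtime == "hybrid" and merged["V2_HYBRID_REFINER_ENABLED"]
    if stage == "concept_tagging":
        # Assumption: opt-in by flag, independent of the runtime.
        return merged["V2_CONCEPT_TAGGING_ENABLED"]
    if stage in LLM_ONLY_STAGES:
        if agent_runtime != "llm":
            return False  # skipped in rule_based and hybrid modes
        if stage == "llm_critic":
            return merged["V2_APPLY_CRITIC_PRUNING"]
        if stage == "link_veracity":
            return merged["V2_LINK_VERACITY_ENABLED"]
        return True
    return True  # all other stages run unconditionally
```

For a batch run in LLM mode, `stage_enabled("link_veracity", "llm", {"V2_LINK_VERACITY_ENABLED": False})` evaluates to `False`, matching the recommendation in the list.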
@@ -131,6 +148,43 @@ deterministic rule-based agents). All other stages run unconditionally.
 - Contribution agent post-LLM: stamps `schema:author = target_person.id` and `pulse:contributionTo = target_repository.id` from the orchestrator's authoritative pair, regardless of what the LLM emits.
 - Article agent post-LLM: drops the entity if `schema:identifier` is a placeholder DOI (`10.0000/...`) or sentinel string (`UNKNOWN`, `N/A`, `TBD`, etc.) and no `pulse:infoscienceArticleIdentifier` is present.
 - Article agent (rule-based) defaults to repo-name-only Infoscience queries; opt in to the wider `include_person_queries=True` / `include_organization_queries=True` blend only when over-attribution risk is low.
+- Membership agents (both rule-based and LLM): swap `time:hasBeginning` / `time:hasEnd` when ORCID returns the pair inverted, so `hasBeginning <= hasEnd` always holds.
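Two of the guards in this list are simple enough to sketch. The code below is an illustration only: the function names are hypothetical, the sentinel set contains just the three strings named above (the list says "etc."), and dates are assumed to be full ISO `YYYY-MM-DD` strings, which ORCID does not always provide.

```python
from datetime import date
from typing import Any, Dict

# Only the sentinels named in the text; the real list is said to be longer.
SENTINELS = {"UNKNOWN", "N/A", "TBD"}


def is_placeholder_identifier(identifier: str) -> bool:
    """True for a placeholder DOI (10.0000/...) or a sentinel string."""
    value = identifier.strip()
    return value.upper() in SENTINELS or value.startswith("10.0000/")


def keep_article(article: Dict[str, Any]) -> bool:
    """Keep the article unless its only identifier is a placeholder."""
    if article.get("pulse:infoscienceArticleIdentifier"):
        return True
    return not is_placeholder_identifier(article.get("schema:identifier") or "")


def order_membership_dates(membership: Dict[str, Any]) -> Dict[str, Any]:
    """Swap begin/end when inverted so that hasBeginning <= hasEnd."""
    begin = membership.get("time:hasBeginning")
    end = membership.get("time:hasEnd")
    # Assumption: both values are full ISO dates when present.
    if begin and end and date.fromisoformat(begin) > date.fromisoformat(end):
        return {**membership, "time:hasBeginning": end, "time:hasEnd": begin}
    return membership
```

Returning a new dict rather than mutating the input keeps the guard easy to test; an in-place swap would work equally well.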
+
+**SHACL conformance auto-fixes (warning-only `shacl_gate`, but the graph is fixed in place):**
+
+The SHACL gate emits violations as `result.warnings` rather than aborting,
+so the responsibility for producing a SHACL-clean graph lives in the
+upstream stages and agents. The four most common violations are
+addressed deterministically:
+
+1. `pulse:ownedBy` IRI shape — `infer_owners` rewrites bare-login strings
+to `{"@id": "https://github.com/{handle}"}`. Without this, SHACL
+resolves the bare token against the working-directory base URI
+(`<file:///workspaces/project/luzpaz>`) and the closed-shape check on
+`schema:Person | org:Organization` fails.
+2. `pulse:ror` redundancy — `build_jsonld_output` strips the field on
+any `org:Organization` whose `@id` is already the ROR. The
+Organization shape is `sh:closed` and rejects `pulse:ror`; the
+`@id` already carries that information.
+3. `Membership` date order — both membership agents swap inverted dates
+(above).
+4. `schema:author` class — `validate_author_classes` filters refs whose
+target is missing from the graph or not typed `schema:Person`.
+Catches Membership / Contribution ids leaking into author lists.
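Fixes 1, 2, and 4 can be sketched as plain dictionary rewrites on JSON-LD nodes. This is a hedged illustration, not the code behind `infer_owners`, `build_jsonld_output`, or `validate_author_classes`: all function names below are hypothetical, "the `@id` is already the ROR" is interpreted as "the `@id` starts with `https://ror.org/`", and author refs are assumed to be `{"@id": ...}` objects.

```python
from typing import Any, Dict, List

GITHUB_BASE = "https://github.com"
ROR_PREFIX = "https://ror.org/"  # assumption: ROR-identified orgs use this IRI prefix


def as_list(value: Any) -> List[Any]:
    """Normalize a JSON-LD value that may be absent, a scalar, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def normalize_owned_by(value: Any) -> Any:
    """Fix 1: rewrite a bare login such as "luzpaz" into an IRI reference."""
    if isinstance(value, str) and "://" not in value:
        return {"@id": f"{GITHUB_BASE}/{value}"}
    return value


def strip_redundant_ror(org: Dict[str, Any]) -> Dict[str, Any]:
    """Fix 2: drop pulse:ror when the node's @id already is a ROR IRI."""
    if str(org.get("@id", "")).startswith(ROR_PREFIX):
        return {k: v for k, v in org.items() if k != "pulse:ror"}
    return org


def filter_authors(entity: Dict[str, Any],
                   graph_by_id: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Fix 4: keep only author refs whose target exists and is a schema:Person."""
    kept = []
    for ref in as_list(entity.get("schema:author")):
        target_id = ref.get("@id") if isinstance(ref, dict) else None
        target = graph_by_id.get(target_id)
        if target is not None and "schema:Person" in as_list(target.get("@type")):
            kept.append(ref)
    return {**entity, "schema:author": kept}
```

Because the gate only warns, running rewrites like these before validation is what keeps the warning list short; the gate itself never repairs anything.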
+
+**Orchestrator fanout filtering:**
+
+- `_filter_person_work_items` skips a queued person fanout when the
+GitHub handle resolves to `type=Organization` (cached
+`provider.github.get_user(login)` lookup).
+- `_filter_org_work_items` mirrors the rule for org fanouts: skips
+handles whose GitHub `type` is `User`. This prevents the 4× retry
+loop in `org_agent` for personal handles encoded into
+`org:hasMembership` composite ids.
+- `_person_fanout_contexts` materialises a User-account repo owner as a
+Person when missing from `contributors` (abandoned repos, empty
+repos). Prevents the owner from leaking as a bare-string ref in
+`pulse:ownedBy` / `schema:author` with no backing entity.
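The two filters described above are mirror images of each other and share one cached account-type lookup. A minimal sketch, assuming work items are plain GitHub logins and that the account type arrives as the string `"User"` or `"Organization"`; `make_cached_lookup` and the `fetch_type` callable are hypothetical stand-ins for the cached `provider.github.get_user(login)` call.

```python
from typing import Callable, Dict, Iterable, List


def make_cached_lookup(fetch_type: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an account-type fetcher so each login is resolved at most once."""
    cache: Dict[str, str] = {}

    def lookup(login: str) -> str:
        if login not in cache:
            cache[login] = fetch_type(login)
        return cache[login]

    return lookup


def filter_person_work_items(logins: Iterable[str],
                             lookup: Callable[[str], str]) -> List[str]:
    """Skip person fanouts for handles that are organizations."""
    return [login for login in logins if lookup(login) != "Organization"]


def filter_org_work_items(logins: Iterable[str],
                          lookup: Callable[[str], str]) -> List[str]:
    """Skip org fanouts for handles that are personal accounts."""
    return [login for login in logins if lookup(login) != "User"]
```

Sharing the cache between both filters means a handle that appears in both queues costs one lookup, which matters when the lookup is a rate-limited API call.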

 ## API surface

@@ -159,8 +213,8 @@ V1 endpoints (`/v1/extract`, `/v1/cache/*`) are still mounted but frozen
 | `V2_AGENT_RUNTIME_DEFAULT` | `llm` | default runtime when `/v2/extract` omits `agent_runtime` |
 | `V2_USE_MOCK_PROVIDERS` | `true` | swap in mock GitHub/ORCID/Infoscience/ROR providers |
-| `V2_GITHUB_BASE_URL` | `https://github.com` | for GitHub Enterprise |
 | `V2_LINK_VERACITY_ENABLED` | `true` | turn off to skip the link-veracity stage in LLM mode (rule-based mode skips unconditionally) |
+| `V2_CONTEXT_SUMMARY_SCOUT_MODE` | `false` | upgrade `context_summary` LLM stage to scout mode: broad RAG-search toolkit (orcid/ror/infoscience/openalex/zenodo/ethz/huggingface/renkulab/snsf/epfl_graph + selenium_fetch) on top of the legacy `grep_repository_corpus` + DuckDuckGo pair, plus a structured-brief prompt with explicit People / Organizations / Articles / Affiliations / Caveats sections. Per-entity LLM agents (person, org, article, membership, contribution) automatically benefit since they already consume the `summary_markdown`. Trade-off: heavier upfront LLM call, but per-entity calls send less context and duplicate ORCID/ROR lookups across entities collapse into the scout's shared brief. |
 | `V2_INFOSCIENCE_RAG_ENABLED` | `true` | enables the Infoscience RAG agent tools (Qdrant-backed semantic search + on-demand chunk/record fetch). Construction degrades gracefully when Qdrant or RCP is unreachable. |
 | `V2_ETHZ_RESEARCH_COLLECTION_RAG_ENABLED` | `true` | enables the ETH Research Collection RAG agent tools (DSpace-backed sister index to Infoscience for ETHZ research outputs). Same shape: search + fetch_chunks + fetch_records. |
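The boolean settings in the rows above could be read from the environment roughly as follows. This is a sketch, not the project's settings loader: `env_bool`, `load_flags`, and the accepted truthy spellings are assumptions, while the variable names and defaults are copied from the table.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the table rows; only the boolean settings are covered.
BOOL_DEFAULTS: Dict[str, bool] = {
    "V2_USE_MOCK_PROVIDERS": True,
    "V2_LINK_VERACITY_ENABLED": True,
    "V2_CONTEXT_SUMMARY_SCOUT_MODE": False,
    "V2_INFOSCIENCE_RAG_ENABLED": True,
    "V2_ETHZ_RESEARCH_COLLECTION_RAG_ENABLED": True,
}

TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "true"


def env_bool(name: str, default: bool,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean environment variable, falling back to `default`."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUTHY


def load_flags(environ: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Resolve every documented boolean flag against the environment."""
    return {name: env_bool(name, default, environ)
            for name, default in BOOL_DEFAULTS.items()}
```

Passing an explicit mapping instead of relying on `os.environ` keeps the parser testable without mutating the process environment.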
qdrant/storage/ # shared by ALL indices, one collection per (index, entity_type)
 ```

-Backups: each `.duckdb` file is a self-contained SQLite-ish snapshot — `cp` it. The Qdrant collections can be regenerated from DuckDB via `<index>-embed`, so they don't strictly need to be backed up.
+Per-subdir convention:
+
+- `duckdb/` — the canonical DuckDB store (`<index>.duckdb` + WAL).
+- `raw/` — bulk inputs from the upstream source (CSVs, JSON dumps).
+Used by indices whose ingest is local-file-driven (SNSF, Infoscience,
+ETHZ Research Collection).
+- `cache/` — per-record HTTP / API cache for incremental ingest.
intermediate artefacts. See the per-index docs and CLI help.
+
+Backups: each `.duckdb` file is a self-contained SQLite-ish snapshot — `cp` it. The Qdrant collections can be regenerated from DuckDB via `<index>-embed`, so they don't strictly need to be backed up. Qdrant persistence lives in `data/qdrant/storage/` (bind-mounted into the `gme-qdrant` container at `/qdrant/storage`).
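The "`cp` it" advice can be scripted. The sketch below copies every `.duckdb` file under a data directory into a backup directory while preserving relative paths; `backup_duckdb_files` and both directory arguments are hypothetical, and it assumes no process is writing to the databases during the copy, since a file copied mid-write may not match its WAL.

```python
import shutil
from pathlib import Path
from typing import List


def backup_duckdb_files(data_dir: Path, backup_dir: Path) -> List[Path]:
    """Copy every *.duckdb file under data_dir into backup_dir.

    Assumes backup_dir lies outside data_dir and that no writer is active.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    for src in sorted(data_dir.rglob("*.duckdb")):
        dest = backup_dir / src.relative_to(data_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(dest)
    return copied
```

The Qdrant storage directory is deliberately left out, following the note that collections can be regenerated from DuckDB.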