Commit 6fdfdb7

bloxster, claude, Copilot, Copilot authored
[r3.4] docs: add llms.txt generator script and update root llms.txt
## Summary

- Adds `docs/site/scripts/generate-llms.py` — a pure Python (stdlib only, zero npm deps) script that generates LLM-friendly content exports from the Docusaurus source files directly
- Generates `docs/site/static/llms.txt` (page index, 71 pages) and `docs/site/static/llms-full.txt` (full clean markdown, ~351 KB), served at `docs.erigon.tech/llms.txt` and `docs.erigon.tech/llms-full.txt`
- Updates the repo-root `llms.txt`, which was pointing to the deleted `docs/gitbook/` folder — now mirrors the Docusaurus-generated index with live `docs.erigon.tech` URLs
- Adds a CI guard in `.github/workflows/docs-deploy.yml` that runs `generate-llms.py --check` and the unit tests before the npm build, blocking drift between any of the four committed files (root + `static/`)
- Adds a unit test suite (`docs/site/scripts/test_generate_llms.py`, 25 tests) covering placeholder preservation, fence transparency, JSX stripping, multi-line expr blocks, frontmatter parsing, and landing-page synthesis

## Why a custom script instead of `docusaurus-plugin-llms-txt` (replaces #20993)

PR #20993 used the `docusaurus-plugin-llms-txt@0.1.3` npm package. After review, we decided against it:

- **Wrong approach**: the plugin works on *compiled HTML output* and converts it back to markdown — a lossy round-trip. Our source is already markdown.
- **Supply chain risk**: the package has no declared source repo, is maintained by a personal Gmail address, and has not been updated in 16 months.
- **Unnecessary dependency**: a Python stdlib script does the same job with no external dependencies, no build-time coupling, and cleaner output.

The custom script reads `.md`/`.mdx` files directly, strips MDX-specific syntax (imports, JSX components, HTML tags, expressions), extracts frontmatter titles and descriptions, and maps file paths to their deployed `docs.erigon.tech` URLs. Both Docusaurus plugin instances (main docs and help-center) are supported. Card-grid landing pages (e.g. `docs/index.mdx`) are detected via the `lp-card` JSX pattern and synthesized into structured "## Sections" + bullet lists rather than collapsing into a soup of title/desc fragments.

## How to update

Re-run the script whenever doc content changes:

```bash
python3 docs/site/scripts/generate-llms.py
```

To verify on-disk files match what the script would generate (used by CI):

```bash
python3 docs/site/scripts/generate-llms.py --check
```

The CI guard in `docs-deploy.yml` runs `--check` and the unittest suite on every push touching `docs/site/**`, so a forgotten regeneration after a docs edit will fail the build before deploy.

## Updates after review (commit `05a81fcd`)

Addressing yperbasis CHANGES_REQUESTED + Copilot follow-ups:

**Blockers**

- ✅ Preserve `{ERIGON_VERSION}` and other ALL_CAPS identifier placeholders in prose and table cells. The brace-strip regex now skips pure-uppercase identifier braces, mirroring the existing `<IP>`/`<PID>` angle-tag guard. Verified against the install-instructions table cell (`erigon_{ERIGON_VERSION}_amd64.deb`) and the version selector prose (`(e.g., v{ERIGON_VERSION})`) the reviewer flagged.
- ✅ Test-plan H1 assertion replaced — the prior `^# ` count incorrectly counted shell comments inside `bash` fences (e.g. `# Reduce disk latency impact`). Now uses `^URL: ` (one synthetic URL line per page = 71).
- ✅ Drift guard via CI rather than `prebuild` (catches drift in all 4 files, no Python coupling in the npm build path).

**Non-blocking review items**

- ✅ Singleton "## Erigon Docs" header dropped — the Introduction bullet sits directly under the preamble now.
- ✅ Landing-page MDX synthesis (no more title/desc soup for `docs/index.mdx`, `staking/index.mdx`, `help-center/index.mdx`, etc.).
- ✅ `parse_frontmatter` hardened: skip indented YAML continuations, `_safe_int` wrapper for `sidebar_position`.
- ✅ Nested `_category_.json` honored via `ancestor_positions()` for sort tie-breaking.
- ✅ `--check` flag for CI.
- ✅ `first_description` tightened — only skip lines that *look like* JSX leaks (`^<tag`, `^{`, arrow-fn) instead of skipping any line that mentions those tokens mid-sentence.
- ✅ `# Requires: Python 3.8+` documented at the top.

## Test plan

### Deployment

- [ ] `llms.txt` renders correctly at `docs.erigon.tech/llms.txt` after deploy
- [ ] `llms-full.txt` renders at `docs.erigon.tech/llms-full.txt`
- [ ] Root `llms.txt` no longer references deleted `docs/gitbook/` paths
- [ ] Re-running the script produces identical output (`--check` returns OK)

### Output quality — run after regenerating

**Page index (`llms.txt`)**

- [ ] Every section header (`## Get Started`, `## Fundamentals`, etc.) appears exactly once
- [ ] No singleton section header (the Introduction bullet should sit directly under the preamble, no `## Erigon Docs` line above it)
- [ ] Index pages (e.g. `get-started/index.mdx`) appear before their siblings within each section
- [ ] No entry has a blank or missing title
- [ ] No entry description contains raw JSX (`<Component`, `{props.`, `import `)

**Full export (`llms-full.txt`)**

- [ ] No page has back-to-back duplicate H1 headings (synthetic title + body's own H1)
- [ ] Fenced code blocks are intact — content between fences is unchanged, including shell `export VAR=…` lines
- [ ] Inline code placeholders survive — `{ERIGON_VERSION}`, `<YOUR_ADDRESS>` style tokens are preserved both inside backtick spans and in bare prose / table cells
- [ ] No truncated shell commands — `curl`, `docker run`, `erigon` invocations with `{…}` args are complete
- [ ] Nested list indentation is preserved — sublists appear indented, not flush-left
- [ ] No raw HTML/JSX tags leak into prose (`<Link`, `<Tabs`, `<div`, `<section`)
- [ ] No raw MDX imports/exports leak (`import Link from`, `export const`)
- [ ] Landing pages (`docs/index.mdx`, `help-center/index.mdx`, etc.) emit a `## Sections` heading + bullet list, not unstructured title/desc fragments

**Sanity checks (quick greps)**

```bash
# Page count — synthetic URL line per page (should equal 71)
grep -c '^URL: ' docs/site/static/llms-full.txt

# Real JSX component leaks — uppercase-then-lowercase tag pattern (should be 0)
grep -cE '<[A-Z][a-z][a-zA-Z]+' docs/site/static/llms-full.txt

# MDX imports/exports leaked outside fences (should be 0)
grep -cE '^(import|export const|export function|export default)' docs/site/static/llms-full.txt

# Identifier placeholders preserved — should be > 0 if source uses any
grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt

# Shell `export VAR=` lines preserved inside ```bash fences — should be > 0
grep -c '^export ' docs/site/static/llms-full.txt
```

Current values (regenerated, commit `05a81fcd`): URL 71, JSX leaks 0, MDX imports/exports 0, `{ERIGON_VERSION}` 15, `^export ` 9.

### Tests

```bash
python3 -m unittest discover docs/site/scripts -v
# Ran 25 tests in 0.001s — OK
```

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

---------

Co-authored-by: Bloxster <bloxster@proton.me>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
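The placeholder guard listed under Blockers can be illustrated with a small sketch. This is not the code from `generate-llms.py` (the script body is not shown on this page); the names `BRACE_EXPR`, `PLACEHOLDER`, and `strip_mdx_expressions` are hypothetical, and the patterns are an assumption about what "skips pure-uppercase identifier braces" could look like.

```python
import re

# Assumption: an MDX expression is a single-line {...} span with no nested braces.
BRACE_EXPR = re.compile(r"\{[^{}\n]*\}")
# Assumption: a placeholder is braces wrapping only an ALL_CAPS identifier,
# e.g. {ERIGON_VERSION}.
PLACEHOLDER = re.compile(r"\{[A-Z][A-Z0-9_]*\}")


def strip_mdx_expressions(line: str) -> str:
    """Drop {...} expressions but keep ALL_CAPS placeholders untouched."""

    def keep_or_drop(match: "re.Match[str]") -> str:
        text = match.group(0)
        # Preserve the span only when it is a pure-uppercase identifier.
        return text if PLACEHOLDER.fullmatch(text) else ""

    return BRACE_EXPR.sub(keep_or_drop, line)


if __name__ == "__main__":
    print(strip_mdx_expressions("erigon_{ERIGON_VERSION}_amd64.deb"))
    print(strip_mdx_expressions("Hello {props.name}!"))
```

Under these assumptions the two strings the reviewer flagged pass through unchanged, while a lowercase expression such as `{props.name}` is removed.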
1 parent 0cb5369 commit 6fdfdb7

7 files changed

Lines changed: 15852 additions & 37 deletions

.github/workflows/docs-deploy.yml

Lines changed: 9 additions & 0 deletions
@@ -25,6 +25,15 @@ jobs:
     steps:
       - uses: actions/checkout@v6
 
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Verify llms.txt artifacts are up to date
+        run: |
+          python3 docs/site/scripts/generate-llms.py --check
+          python3 -m unittest discover docs/site/scripts -v
+
       - uses: actions/setup-node@v6
         with:
           node-version: '20'
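
The verify step in this hunk depends on `--check` exiting non-zero when a committed file is stale. A minimal sketch of that pattern follows, assuming the generator can produce a mapping of output path to expected text; `find_drift` and `run` are hypothetical names, not functions confirmed to exist in `generate-llms.py`.

```python
from pathlib import Path
from typing import Dict, List


def find_drift(expected: Dict[Path, str]) -> List[Path]:
    """Return the paths whose on-disk content differs from the generated text."""
    stale = []
    for path, text in expected.items():
        # A missing file counts as drift, the same as a differing one.
        on_disk = path.read_text(encoding="utf-8") if path.exists() else None
        if on_disk != text:
            stale.append(path)
    return stale


def run(expected: Dict[Path, str], check: bool) -> int:
    """Write the files, or in check mode only compare; returns an exit code."""
    if check:
        stale = find_drift(expected)
        for path in stale:
            print(f"out of date: {path}")
        return 1 if stale else 0
    for path, text in expected.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
    return 0
```

Returning the code from `run` and passing it to `sys.exit` would let the workflow step fail before the npm build starts, which matches the ordering shown in the hunk.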
