Commit 6fdfdb7
[r3.4] docs: add llms.txt generator script and update root llms.txt
## Summary
- Adds `docs/site/scripts/generate-llms.py` — a pure Python (stdlib
only, zero npm deps) script that generates LLM-friendly content exports
from the Docusaurus source files directly
- Generates `docs/site/static/llms.txt` (page index, 71 pages) and
`docs/site/static/llms-full.txt` (full clean markdown, ~351 KB), served
at `docs.erigon.tech/llms.txt` and `docs.erigon.tech/llms-full.txt`
- Updates the repo-root `llms.txt`, which still pointed to the deleted
`docs/gitbook/` folder; it now mirrors the Docusaurus-generated index with
live `docs.erigon.tech` URLs
- Adds a CI guard in `.github/workflows/docs-deploy.yml` that runs
`generate-llms.py --check` and the unit tests before the npm build,
blocking drift between any of the four committed files (root +
`static/`)
- Adds a unit test suite (`docs/site/scripts/test_generate_llms.py`, 25
tests) covering placeholder preservation, fence transparency, JSX
stripping, multi-line expr blocks, frontmatter parsing, and landing-page
synthesis
## Why a custom script instead of `docusaurus-plugin-llms-txt` (replaces #20993)
PR #20993 used the `docusaurus-plugin-llms-txt@0.1.3` npm package. After
review, we decided against it:
- **Wrong approach**: the plugin works on *compiled HTML output* and
converts it back to markdown — a lossy round-trip. Our source is already
markdown.
- **Supply chain risk**: the package declares no source repo, is
maintained from a personal Gmail account, and has not been updated in 16
months.
- **Unnecessary dependency**: a Python stdlib script does the same job
with no external dependencies, no build-time coupling, and cleaner
output.
The custom script reads `.md`/`.mdx` files directly, strips MDX-specific
syntax (imports, JSX components, HTML tags, expressions), extracts
frontmatter titles and descriptions, and maps file paths to their
deployed `docs.erigon.tech` URLs. Both Docusaurus plugin instances (main
docs and help-center) are supported. Card-grid landing pages (e.g.
`docs/index.mdx`) are detected via the `lp-card` JSX pattern and
synthesized into structured "## Sections" + bullet lists rather than
collapsing into a soup of title/desc fragments.
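As a rough illustration of the stripping step described above, the following is a minimal sketch and not the code in `generate-llms.py`. The names `strip_mdx`, `FENCE`, `IMPORT_EXPORT`, and `TAG` are hypothetical, and the rules are deliberately simplified: import/export lines and JSX/HTML tags are removed outside fenced code blocks, while ALL_CAPS angle placeholders such as `<IP>` and everything inside fences are left untouched.

```python
import re

# Hypothetical, simplified patterns; the real script's rules may differ.
FENCE = re.compile(r"^\s*(```|~~~)")
IMPORT_EXPORT = re.compile(r"^(import|export)\s")
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9_]*)[^<>]*>")


def _drop_tag(match):
    # Keep ALL_CAPS placeholders such as <IP> or <PID>; drop real tags.
    name = match.group(1)
    return match.group(0) if name.isupper() else ""


def strip_mdx(text):
    """Remove MDX imports/exports and JSX/HTML tags outside code fences."""
    out = []
    in_fence = False
    for line in text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
            out.append(line)
            continue
        if in_fence:
            out.append(line)  # fence content passes through unchanged
            continue
        if IMPORT_EXPORT.match(line):
            continue  # MDX import/export statement, not prose
        out.append(TAG.sub(_drop_tag, line))
    return "\n".join(out)
```

Tracking the fence state line by line is what keeps shell `export VAR=...` lines intact while the same keyword at the top level of an MDX file is dropped.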
## How to update
Re-run the script whenever doc content changes:
```bash
python3 docs/site/scripts/generate-llms.py
```
To verify on-disk files match what the script would generate (used by
CI):
```bash
python3 docs/site/scripts/generate-llms.py --check
```
The CI guard in `docs-deploy.yml` runs `--check` and the unittest suite
on every push touching `docs/site/**`, so a forgotten regeneration after
a docs edit will fail the build before deploy.
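The drift check can be pictured as a comparison between freshly generated text and the committed files. This is a minimal sketch under assumptions: `check_outputs` and its `expected` mapping (path to generated content) are hypothetical names, not the script's actual interface.

```python
from pathlib import Path


def check_outputs(expected):
    """Return the paths whose on-disk content differs from generated text.

    `expected` maps a file path to the content the generator would write.
    A missing file counts as stale.
    """
    stale = []
    for path, content in expected.items():
        p = Path(path)
        if not p.exists() or p.read_text(encoding="utf-8") != content:
            stale.append(str(path))
    return stale
```

A `--check` style entry point could then exit non-zero whenever the returned list is non-empty, which is what lets CI fail before deploy.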
## Updates after review (commit `05a81fcd`)
Addressing yperbasis CHANGES_REQUESTED + Copilot follow-ups:
**Blockers**
- ✅ Preserve `{ERIGON_VERSION}` and other ALL_CAPS identifier
placeholders in prose and table cells. The brace-strip regex now skips
pure-uppercase identifier braces, mirroring the existing `<IP>`/`<PID>`
angle-tag guard. Verified against the install-instructions table cell
(`erigon_{ERIGON_VERSION}_amd64.deb`) and the version selector prose
(`(e.g., v{ERIGON_VERSION})`) the reviewer flagged.
- ✅ Test-plan H1 assertion replaced — the prior `^# ` count incorrectly
counted shell comments inside `bash` fences (e.g. `# Reduce disk latency
impact`). Now uses `^URL: ` (one synthetic URL line per page = 71).
- ✅ Drift guard via CI rather than `prebuild` (catches drift in all 4
files, no Python coupling in the npm build path).
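The placeholder guard from the first blocker can be sketched with a negative lookahead. This is an illustrative pattern, assuming single-level braces on one line; `BRACE_EXPR` and `strip_brace_exprs` are hypothetical names and the script's actual regex may be different.

```python
import re

# Match a single-level {...} expression, but skip braces whose whole
# content is an ALL_CAPS identifier such as {ERIGON_VERSION}.
BRACE_EXPR = re.compile(r"\{(?![A-Z][A-Z0-9_]*\})[^{}\n]*\}")


def strip_brace_exprs(line):
    """Drop MDX brace expressions while preserving ALL_CAPS placeholders."""
    return BRACE_EXPR.sub("", line)


print(strip_brace_exprs("erigon_{ERIGON_VERSION}_amd64.deb"))
print(strip_brace_exprs("Hello {props.name}!"))
```

The first call leaves the filename unchanged, and the second removes the `{props.name}` expression.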
**Non-blocking review items**
- ✅ Singleton "## Erigon Docs" header dropped — the Introduction bullet
sits directly under the preamble now.
- ✅ Landing-page MDX synthesis (no more title/desc soup for
`docs/index.mdx`, `staking/index.mdx`, `help-center/index.mdx`, etc.).
- ✅ `parse_frontmatter` hardened: skip indented YAML continuations,
`_safe_int` wrapper for `sidebar_position`.
- ✅ Nested `_category_.json` honored via `ancestor_positions()` for sort
tie-breaking.
- ✅ `--check` flag for CI.
- ✅ `first_description` tightened — only skip lines that *look like* JSX
leaks (`^<tag`, `^{`, arrow-fn) instead of skipping any line that
mentions those tokens mid-sentence.
- ✅ `# Requires: Python 3.8+` documented at the top.
## Test plan
### Deployment
- [ ] `llms.txt` renders correctly at `docs.erigon.tech/llms.txt` after
deploy
- [ ] `llms-full.txt` renders at `docs.erigon.tech/llms-full.txt`
- [ ] Root `llms.txt` no longer references deleted `docs/gitbook/` paths
- [ ] Re-running the script produces identical output (`--check` returns
OK)
### Output quality — run after regenerating
**Page index (`llms.txt`)**
- [ ] Every section header (`## Get Started`, `## Fundamentals`, etc.)
appears exactly once
- [ ] No singleton section header (the Introduction bullet should sit
directly under the preamble, no `## Erigon Docs` line above it)
- [ ] Index pages (e.g. `get-started/index.mdx`) appear before their
siblings within each section
- [ ] No entry has a blank or missing title
- [ ] No entry description contains raw JSX (`<Component`, `{props.`,
`import `)
**Full export (`llms-full.txt`)**
- [ ] No page has back-to-back duplicate H1 headings (synthetic title +
body's own H1)
- [ ] Fenced code blocks are intact — content between fences is
unchanged, including shell `export VAR=…` lines
- [ ] Inline code placeholders survive — `{ERIGON_VERSION}`,
`<YOUR_ADDRESS>` style tokens are preserved both inside backtick spans
and in bare prose / table cells
- [ ] No truncated shell commands — `curl`, `docker run`, `erigon`
invocations with `{…}` args are complete
- [ ] Nested list indentation is preserved — sublists appear indented,
not flush-left
- [ ] No raw HTML/JSX tags leak into prose (`<Link`, `<Tabs`, `<div`,
`<section`)
- [ ] No raw MDX imports/exports leak (`import Link from`, `export
const`)
- [ ] Landing pages (`docs/index.mdx`, `help-center/index.mdx`, etc.)
emit a `## Sections` heading + bullet list, not unstructured title/desc
fragments
**Sanity checks (quick greps)**
```bash
# Page count — synthetic URL line per page (should equal 71)
grep -c '^URL: ' docs/site/static/llms-full.txt
# Real JSX component leaks — uppercase-then-lowercase tag pattern (should be 0)
grep -cE '<[A-Z][a-z][a-zA-Z]+' docs/site/static/llms-full.txt
# MDX imports/exports leaked outside fences (should be 0)
grep -cE '^(import|export const|export function|export default)' docs/site/static/llms-full.txt
# Identifier placeholders preserved — should be > 0 if source uses any
grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt
# Shell `export VAR=` lines preserved inside ```bash fences — should be > 0
grep -c '^export ' docs/site/static/llms-full.txt
```
Current values (regenerated, commit `05a81fcd`): URL 71, JSX leaks 0,
MDX imports/exports 0, `{ERIGON_VERSION}` 15, `^export ` 9.
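The reason the page count greps for `^URL: ` rather than `^# ` can be shown with a small fence-aware counter. This is only a sketch: `count_lines` is a hypothetical helper, and the sample text and its URLs are made up for the illustration.

```python
import re


def count_lines(text, pattern, skip_fences=False):
    """Count lines that start with `pattern`, optionally ignoring fences."""
    rx = re.compile(pattern)
    in_fence = False
    n = 0
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if skip_fences and in_fence:
            continue
        if rx.match(line):
            n += 1
    return n


# Hypothetical two-page excerpt with a shell comment inside a fence.
sample = (
    "# Page One\n"
    "URL: https://docs.erigon.tech/page-one\n"
    "```bash\n"
    "# Reduce disk latency impact\n"
    "export VAR=1\n"
    "```\n"
    "# Page Two\n"
    "URL: https://docs.erigon.tech/page-two\n"
)

print(count_lines(sample, r"# "))                    # naive H1 count
print(count_lines(sample, r"# ", skip_fences=True))  # fence-aware H1 count
print(count_lines(sample, r"URL: "))                 # synthetic URL lines
```

The naive count includes the shell comment and reports one heading too many, whereas the `URL: ` count matches the number of pages without needing fence tracking.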
### Tests
```bash
python3 -m unittest discover docs/site/scripts -v
# Ran 25 tests in 0.001s — OK
```
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Bloxster <bloxster@proton.me>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent 0cb5369 commit 6fdfdb7
7 files changed
Lines changed: 15852 additions & 37 deletions
File tree
- .github/workflows
- docs/site
- scripts
- static