🧭 Quick Return to Map
You are in a sub-page of Chunking.
To reorient, go back here:
- Chunking — text segmentation and context window management
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
How to cut documents into audit-ready sections once titles are known. The goal is a deterministic start and end for each section so citations land on the correct text and parents do not swallow child content.
- Heading tree builder: title_hierarchy.md
- Chunk identifier rules: chunk_id_schema.md
- Layout and OCR normalizing: pdf_layouts_and_ocr.md
- Block typing for code and tables: code_tables_blocks.md
- Why this snippet and citation schema: retrieval-traceability.md
- Payload contracts for RAG: data-contracts.md
- Safe reindex after small edits: reindex_migration.md
- Boundaries reproduce on two runs from the same source. Match rate ≥ 0.98 for
offsets. - Parents exclude child body text. A parent may include a short abstract block at most. Overlap budget ≤ 2 percent of bytes.
- First anchor sentence exists for every section and is inside the boundary.
- ΔS(question, retrieved) ≤ 0.45 on cite-first prompts that target a section anchor.
- No orphan blocks. Every body block belongs to exactly one section.
A section starts at the first body block after its heading. It ends before the next heading whose depth is less than or equal to the current depth.
Formally
start = next\_body\_block\_after(heading\_i)
end = block\_before(next\_heading\_with\_depth\_le(current.depth))
span = \[start.offset\_begin, end.offset\_end]
Rules
- Children consume their own body. Parents do not include child body blocks.
- A parent may include a one block abstract if the abstract sits between the parent title and the first child title.
- Figures, tables, and code are typed blocks. They stay in the section where they appear unless a caption references the next heading by id. See code_tables_blocks.md.
-
Prepare a canonical block stream
Combine lines into paragraphs. Tag block types. Normalize spaces. Remove page headers and footers. See pdf_layouts_and_ocr.md. -
Load the heading tree
Use nodes from title_hierarchy.md. Each node hassection_id,depth, and the heading block pointer. -
Sweep to set spans
For each heading node in reading order. Find the next heading with depth less than or equal to current. Assign the span from the body block after the current heading up to the block before the found heading. For the last node, end at document tail. -
Apply shields and merges
Pull a caption into the same section as its referenced figure when the caption contains a local reference token. Keep footnotes in the section where the marker appears. -
Anchors and offsets
Record the first sentence offsets and the full span offsets. Persist both in the section node.
def detect_sections(blocks, headings):
# blocks: list of typed blocks with offsets
# headings: nodes with fields {idx, depth, block_idx}
sections = []
for i, h in enumerate(headings):
# find first body block after heading
b0 = next_body_idx(blocks, h.block_idx + 1)
# find end boundary
if i + 1 < len(headings):
j = i + 1
# climb until a heading with depth <= h.depth
while j < len(headings) and headings[j].depth > h.depth:
j += 1
end_limit = headings[j].block_idx if j < len(headings) else len(blocks)
else:
end_limit = len(blocks)
b1 = prev_body_idx(blocks, end_limit - 1)
if b0 is None or b1 is None or b0 > b1:
# section with no body. allow empty span
span = None
first_anchor = None
else:
span = (blocks[b0].off_begin, blocks[b1].off_end)
first_anchor = first_sentence_offset(blocks, b0, b1)
sec = {
"section_id": h.section_id,
"depth": h.depth,
"page_start": blocks[h.block_idx].page,
"page_end": blocks[b1].page if b1 is not None else blocks[h.block_idx].page,
"offsets": span,
"anchor_offsets": first_anchor
}
sections.append(shield_repairs(blocks, sec))
return sectionsshield_repairs pulls a caption when the caption references a figure inside the same span. It also trims a running list that leaks across page breaks.
- Start at the first paragraph that contains at least eight characters after normalization.
- Strip list bullets and numbering leaders.
- If the paragraph begins with a table or code marker, skip to the next paragraph.
- If the paragraph is a one line abstract under the parent and a child heading follows within three blocks, mark both anchors. Prefer the child for retrieval. Keep the parent anchor for table of contents links.
- Use document local size bins and indentation bins. Do not rely on absolute font size thresholds.
- Reject headings that end with a period or with heavy punctuation. Reduce false positives in references and figure lists.
- Split multi column pages before boundary detection. Reading order errors break spans.
- Clamp jump size. A child depth cannot exceed parent depth plus one.
Sample ten sections across the document.
- Each has a non empty anchor sentence inside the span.
- No section span overlaps its nearest child span.
- Re run from the same source yields identical
offsetsfor at least nine of the ten samples. - Citation to the anchor yields ΔS ≤ 0.45 with correct snippet id and source offsets. See retrieval-traceability.md.
{
"section_id": "4.2",
"depth": 2,
"page_start": 33,
"page_end": 36,
"offsets": [120445, 137992],
"anchor_offsets": [120445, 120612],
"first_block_index": 918,
"last_block_index": 1110
}Use byte offsets in the canonical text. Keep block indexes for fast repairs.
-
Parent span contains child body text Fix the boundary rule to end before the next heading whose depth is less than or equal to current. Run the validator again.
-
Captions detached from figures If a caption follows its figure and contains a local id or the word “Figure”, pull the caption block into the same span.
-
Empty sections after OCR cleanup Allow empty spans for headers like “References”. Keep the node for navigation. Retrieval logic should skip empty spans.
-
Off by one block near page breaks Normalize page headers and footers first. See pdf_layouts_and_ocr.md.
You have TXT OS and WFGY Problem Map loaded.
I provide a section with fields:
section_id = {id}
offsets = {start, end}
anchor_offsets = {a_start, a_end}
title = "{title}"
Task:
1) Verify the anchor sentence lives inside the span.
2) Return a one line reason if the span is empty.
3) Suggest the minimal structural fix page if the citation would miss.
Return JSON:
{ "ok": true|false, "why": "...", "open": "retrieval-traceability.md | pdf_layouts_and_ocr.md | code_tables_blocks.md" }
- HTML with correct
h1..h6tags and style headings. Expect identicaloffsetsacross runs. - PDF multi column. Expect no span overlap between parent and child nodes.
- OCR with noisy fonts and broken hyphenation. Expect at most one false boundary per ten pages.
- Small content edits under a child section. The parent span should not change. See reindex_migration.md.
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
要我繼續下一頁就說:GO code_tables_blocks.md