
Section Detection: stable boundaries for headings and bodies


How to cut documents into audit-ready sections once titles are known. The goal is a deterministic start and end for each section so citations land on the correct text and parents do not swallow child content.

Acceptance targets

  • Boundaries reproduce on two runs from the same source. Match rate ≥ 0.98 for offsets.
  • Parents exclude child body text. A parent may include a short abstract block at most. Overlap budget ≤ 2 percent of bytes.
  • First anchor sentence exists for every section and is inside the boundary.
  • ΔS(question, retrieved) ≤ 0.45 on cite-first prompts that target a section anchor.
  • No orphan blocks. Every body block belongs to exactly one section.
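
The first two targets can be measured with a few lines of code. The sketch below is illustrative only: it assumes each run is a dict mapping section_id to a (begin, end) byte span, that spans are end-exclusive, and that the overlap budget is measured against the parent span. None of these conventions are stated in the source, and the function names are hypothetical.

```python
def offset_match_rate(run_a, run_b):
    """Fraction of section ids whose (begin, end) offsets agree across two runs."""
    ids = set(run_a) | set(run_b)
    if not ids:
        return 1.0
    same = sum(
        1 for sid in ids
        if sid in run_a and sid in run_b and run_a[sid] == run_b[sid]
    )
    return same / len(ids)


def overlap_fraction(parent, child):
    """Bytes shared by two end-exclusive spans, as a fraction of the parent span."""
    shared = max(0, min(parent[1], child[1]) - max(parent[0], child[0]))
    size = parent[1] - parent[0]
    return shared / size if size > 0 else 0.0


# Two hypothetical runs over the same source; section "1.1" drifted by ten bytes.
run_a = {"1": (0, 100), "1.1": (100, 250)}
run_b = {"1": (0, 100), "1.1": (100, 260)}
rate = offset_match_rate(run_a, run_b)                  # compare with the 0.98 target
leak = overlap_fraction(run_a["1"], (98, 250))          # compare with the 0.02 budget
```

A section that appears in only one run counts as a mismatch here, which is the stricter reading of "boundaries reproduce".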

Boundary model

A section starts at the first body block after its heading. It ends before the next heading whose depth is less than or equal to the current depth.

Formally


start = next_body_block_after(heading_i)
end   = block_before(next_heading_with_depth_le(current.depth))
span  = [start.offset_begin, end.offset_end]

Rules

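  • Children consume their own body. Parents do not include child body blocks, so a section with children has its body clipped at the first child heading. The end rule above applies in full only to leaf sections.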
  • A parent may include a one block abstract if the abstract sits between the parent title and the first child title.
  • Figures, tables, and code are typed blocks. They stay in the section where they appear unless a caption references the next heading by id. See code_tables_blocks.md.
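
A small worked example may help when reading the end rule. This sketch assumes headings are represented only by their depths in reading order. The function name is hypothetical and is not part of the pipeline.

```python
def end_heading_index(depths, i):
    """Index of the next heading with depth <= depths[i], or len(depths) at the tail."""
    j = i + 1
    while j < len(depths) and depths[j] > depths[i]:
        j += 1
    return j


# Hypothetical outline: "1", "1.1", "1.2", "2"
depths = [1, 2, 2, 1]
ends = [end_heading_index(depths, i) for i in range(len(depths))]
# Heading 0 runs up to heading 3, headings 1 and 2 each stop at the next
# heading, and heading 3 runs to the document tail (index 4).
```

For heading 0 the range up to heading 3 also covers the two children. Under the rule that parents do not include child body blocks, only the blocks before heading 1 belong to the parent itself.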

Pipeline

  1. Prepare a canonical block stream
    Combine lines into paragraphs. Tag block types. Normalize spaces. Remove page headers and footers. See pdf_layouts_and_ocr.md.

  2. Load the heading tree
    Use nodes from title_hierarchy.md. Each node has section_id, depth, and the heading block pointer.

  3. Sweep to set spans
    For each heading node in reading order, find the next heading with depth less than or equal to the current depth. Assign the span from the body block after the current heading up to the block before the found heading. For the last node, end at the document tail.

  4. Apply shields and merges
    Pull a caption into the same section as its referenced figure when the caption contains a local reference token. Keep footnotes in the section where the marker appears.

  5. Anchors and offsets
    Record the first sentence offsets and the full span offsets. Persist both in the section node.
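
Part of step 1 can be sketched as follows. This is a minimal illustration that assumes paragraphs are separated by blank lines. It only joins wrapped lines and normalizes spaces. Block typing and removal of page headers and footers are out of scope here, and the function name is hypothetical.

```python
import re


def to_paragraph_blocks(text):
    """Split on blank lines, join wrapped lines, and collapse whitespace runs."""
    blocks = []
    for raw in re.split(r"\n\s*\n", text):
        para = re.sub(r"\s+", " ", raw).strip()
        if para:
            blocks.append(para)
    return blocks


sample = "Intro line one\nwraps here.\n\n\nSecond   para.\n"
paragraphs = to_paragraph_blocks(sample)
```

Because the same input always yields the same list, later offsets computed over the joined paragraphs stay reproducible across runs.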


Pseudocode

def detect_sections(blocks, headings):
    # blocks: list of typed blocks with off_begin, off_end, and page
    # headings: nodes in reading order with fields {section_id, depth, block_idx}
    # next_body_idx, prev_body_idx, first_sentence_offset, and shield_repairs
    # are helpers assumed to be defined elsewhere
    sections = []
    for i, h in enumerate(headings):
        # first body block after the heading
        b0 = next_body_idx(blocks, h.block_idx + 1)
        # end boundary: skip deeper headings until one with depth <= h.depth
        j = i + 1
        while j < len(headings) and headings[j].depth > h.depth:
            j += 1
        end_limit = headings[j].block_idx if j < len(headings) else len(blocks)
        # Rules: parents do not include child body blocks, so a section
        # with children stops at its first child heading
        if i + 1 < len(headings) and headings[i + 1].depth > h.depth:
            end_limit = headings[i + 1].block_idx
        b1 = prev_body_idx(blocks, end_limit - 1)

        if b0 is None or b1 is None or b0 > b1:
            # section with no body: allow an empty span
            b0 = b1 = None
            span = None
            first_anchor = None
        else:
            span = (blocks[b0].off_begin, blocks[b1].off_end)
            first_anchor = first_sentence_offset(blocks, b0, b1)

        head_page = blocks[h.block_idx].page
        sec = {
            "section_id": h.section_id,
            "depth": h.depth,
            "page_start": head_page,
            # an empty section starts and ends on the heading page
            "page_end": blocks[b1].page if b1 is not None else head_page,
            "offsets": span,
            "anchor_offsets": first_anchor,
            "first_block_index": b0,
            "last_block_index": b1,
        }
        sections.append(shield_repairs(blocks, sec))
    return sections

shield_repairs pulls a caption when the caption references a figure inside the same span. It also trims a running list that leaks across page breaks.
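
The pseudocode leaves next_body_idx and prev_body_idx undefined. One possible shape for them is sketched below. It assumes every block carries a type tag and that any block not tagged as a heading counts as body. The Block class and its kind field are hypothetical stand-ins for whatever the canonical block stream provides.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str        # hypothetical type tag: "heading", "para", "figure", ...
    off_begin: int
    off_end: int
    page: int = 1


def next_body_idx(blocks, start):
    """Index of the first non-heading block at or after start, else None."""
    for i in range(max(start, 0), len(blocks)):
        if blocks[i].kind != "heading":
            return i
    return None


def prev_body_idx(blocks, start):
    """Index of the last non-heading block at or before start, else None."""
    for i in range(min(start, len(blocks) - 1), -1, -1):
        if blocks[i].kind != "heading":
            return i
    return None


# heading, body, heading with no body, heading, body
demo = [
    Block("heading", 0, 10),
    Block("para", 11, 50),
    Block("heading", 51, 60),
    Block("heading", 61, 70),
    Block("para", 71, 120),
]
```

For the heading at index 2, the forward search lands on block 4 and the backward search from index 2 lands on block 1. The start is then greater than the end, which is the case the pseudocode treats as an empty span.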


Anchor sentence rules

  • Start at the first paragraph that contains at least eight characters after normalization.
  • Strip list bullets and numbering leaders.
  • If the paragraph begins with a table or code marker, skip to the next paragraph.
  • If the paragraph is a one line abstract under the parent and a child heading follows within three blocks, mark both anchors. Prefer the child for retrieval. Keep the parent anchor for table of contents links.
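
The first three rules above can be sketched as a single pass over the blocks of a section. This is an illustration under stated assumptions: blocks are (kind, text) pairs, the eight character threshold is checked after leaders are stripped, and sentences are split with a naive punctuation rule. The leader pattern and all names are hypothetical. The dual anchor rule for parent abstracts is not covered.

```python
import re

# Bullets such as "-", "*", "•" and numbering leaders such as "1." or "2.3)"
LEADER = re.compile(r"^(?:[-*•]+|\d+(?:\.\d+)*[.)])\s+")
SKIP_KINDS = {"table", "code"}


def pick_anchor(blocks, min_chars=8):
    """Return (block_index, sentence) for the first usable paragraph, else None."""
    for i, (kind, text) in enumerate(blocks):
        if kind in SKIP_KINDS:
            continue
        clean = LEADER.sub("", re.sub(r"\s+", " ", text).strip())
        if len(clean) < min_chars:
            continue
        # naive split: first run of text ending in ., ! or ? followed by a space
        sentence = re.split(r"(?<=[.!?])\s+", clean, maxsplit=1)[0]
        return i, sentence
    return None


section_blocks = [
    ("para", "ok"),
    ("code", "x = compute_everything()"),
    ("para", "  1.  Spans are   half-open. More text."),
]
anchor = pick_anchor(section_blocks)
```

Here the two-character paragraph and the code block are skipped, and the anchor comes from the third block with its numbering leader removed.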

Stability heuristics

  • Use document local size bins and indentation bins. Do not rely on absolute font size thresholds.
  • Reject heading candidates that end with a period or other heavy punctuation. This reduces false positives in references and figure lists.
  • Split multi column pages before boundary detection. Reading order errors break spans.
  • Clamp jump size. A child depth cannot exceed parent depth plus one.
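
Two of these heuristics are easy to express in code. The sketch below is one possible reading, not the project's implementation. The set of heavy punctuation characters is an assumption, and the parent of a heading is taken to be the nearest earlier heading with a smaller raw depth. Both function names are hypothetical.

```python
HEAVY_PUNCTUATION = ".,;:"   # assumed set; the source does not enumerate it


def looks_like_heading(text):
    """Reject candidates that end with a period or other heavy punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in HEAVY_PUNCTUATION


def clamp_depths(depths):
    """Make each heading sit exactly one level below its nearest shallower ancestor."""
    out = []
    stack = []  # (raw_depth, clamped_depth) of currently open ancestors
    for d in depths:
        # close every open heading that is not shallower than this one
        while stack and stack[-1][0] >= d:
            stack.pop()
        clamped = stack[-1][1] + 1 if stack else 1
        stack.append((d, clamped))
        out.append(clamped)
    return out


raw = [1, 3, 3, 2, 4]
clamped = clamp_depths(raw)
```

With the raw depths above, the two depth 3 headings become depth 2 siblings under the first heading, and the final heading becomes depth 3 under the heading before it.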

60 second validator

Sample ten sections across the document.

  • Each has a non empty anchor sentence inside the span.
  • No section span overlaps its nearest child span.
  • A rerun from the same source yields identical offsets for at least nine of the ten samples.
  • Citation to the anchor yields ΔS ≤ 0.45 with correct snippet id and source offsets. See retrieval-traceability.md.
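
The structural checks in this list can be scripted. The sketch below assumes section records shaped like the JSON example on this page, with offsets and anchor_offsets as (begin, end) pairs or None, and it assumes the caller supplies the first child record. The ΔS check is not covered. Function names and the fixed seed are hypothetical.

```python
import random


def sample_sections(sections, k=10, seed=0):
    """Pick up to k sections with a fixed seed so the sample itself is repeatable."""
    return random.Random(seed).sample(sections, min(k, len(sections)))


def check_section(sec, first_child=None):
    """Return a list of problems found for one section record."""
    problems = []
    span = sec.get("offsets")
    anchor = sec.get("anchor_offsets")
    if span is None or anchor is None or anchor[0] >= anchor[1]:
        problems.append("missing or empty anchor")
    elif not (span[0] <= anchor[0] and anchor[1] <= span[1]):
        problems.append("anchor outside span")
    child_span = first_child.get("offsets") if first_child else None
    if span and child_span:
        if max(span[0], child_span[0]) < min(span[1], child_span[1]):
            problems.append("overlaps first child")
    return problems


good = {"section_id": "4.2", "offsets": (120445, 137992), "anchor_offsets": (120445, 120612)}
child = {"section_id": "4.2.1", "offsets": (130000, 137992), "anchor_offsets": (130000, 130080)}
report = check_section(good, first_child=child)
```

In this example the section passes the anchor check but is reported as overlapping its first child, which is the pattern the second bullet is meant to catch.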

JSON fields to persist

{
  "section_id": "4.2",
  "depth": 2,
  "page_start": 33,
  "page_end": 36,
  "offsets": [120445, 137992],
  "anchor_offsets": [120445, 120612],
  "first_block_index": 918,
  "last_block_index": 1110
}

Use byte offsets in the canonical text. Keep block indexes for fast repairs.
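
Byte offsets and character indexes diverge as soon as the text contains non-ASCII characters, so it helps to convert explicitly. The sketch below assumes the canonical text is stored as UTF-8. The helper names are hypothetical.

```python
def char_to_byte_offset(text, char_index):
    """Byte offset in the UTF-8 encoding that corresponds to a character index."""
    return len(text[:char_index].encode("utf-8"))


def slice_bytes(text, begin, end):
    """Recover the text covered by an end-exclusive byte span."""
    return text.encode("utf-8")[begin:end].decode("utf-8")


# "é" and "§" each take two bytes in UTF-8
canonical = "R\u00e9sum\u00e9 \u00a71\nBody"
char_idx = canonical.index("Body")
byte_off = char_to_byte_offset(canonical, char_idx)
snippet = slice_bytes(canonical, byte_off, byte_off + 4)
```

In this example the word "Body" starts at character index 10 but at byte offset 13. A citation that mixes the two conventions would land three bytes early.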


Common failure patterns and fixes

  • Parent span contains child body text: check that the sweep ends each span before the next heading whose depth is less than or equal to the current depth, and that a parent body stops at its first child heading. Run the validator again.

  • Captions detached from figures: if a caption follows its figure and contains a local id or the word “Figure”, pull the caption block into the same span.

  • Empty sections after OCR cleanup: allow empty spans for headers like “References”. Keep the node for navigation. Retrieval logic should skip empty spans.

  • Off by one block near page breaks: normalize page headers and footers first. See pdf_layouts_and_ocr.md.


Copy-paste prompt for a quick check

You have TXT OS and WFGY Problem Map loaded.

I provide a section with fields:
section_id = {id}
offsets = {start, end}
anchor_offsets = {a_start, a_end}
title = "{title}"

Task:
1) Verify the anchor sentence lives inside the span.
2) Return a one line reason if the span is empty.
3) Suggest the minimal structural fix page if the citation would miss.
Return JSON:
{ "ok": true|false, "why": "...", "open": "retrieval-traceability.md | pdf_layouts_and_ocr.md | code_tables_blocks.md" }

Tests to include in CI

  • HTML with correct h1..h6 tags and with style-only headings. Expect identical offsets across runs.
  • PDF multi column. Expect no span overlap between parent and child nodes.
  • OCR with noisy fonts and broken hyphenation. Expect at most one false boundary per ten pages.
  • Small content edits under a child section. The parent span should not change. See reindex_migration.md.
