Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 73 additions & 0 deletions PATCHES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Prefix-free list itemization — fork of datalab/marker

This fork removes marker's reliance on prefix-pattern heuristics (`a.`, `1.`,
`•`, etc.) for list detection. List items are surfaced as individual blocks
with their own bounding boxes, driven only by surya's layout model and
line-level layout. Useful when prefix-based detection produces false
positives on dense forms / problem sets.

Upstream: <https://github.com/datalab-to/marker>

## What changed

1. **`marker/builders/structure.py`** — disabled `group_lists` and
`unmark_lists` in `StructureBuilder.__call__`. Surya-labeled `ListItem`
blocks are no longer merged into `ListGroup`s, and isolated items are no
longer demoted to `Text`. Each item stays as its own page-level block
with its original bbox.

2. **`marker/processors/list_line_explode.py`** *(new)* — splits every
multi-line `ListItem` into one `ListItem` per `Line` child. Each new
item's polygon is the line's polygon. Prefix-free.

3. **`marker/processors/list_gap_cluster.py`** *(new)* — alternate
strategy. Same input, but clusters lines by vertical gap: lines whose
inter-line gap is at most `gap_factor * median_gap` (default 1.5×) stay
merged into one `ListItem`; larger gaps start a new item. Preserves
wrapped items in docs with even line spacing; collapses to a single item
when there's no measurable gap between sub-items.

4. **`marker/converters/pdf.py`** — registers
`ListItemLineExplodeProcessor` in the default pipeline immediately after
`LineMergeProcessor`. To switch strategies, replace it with
`ListItemGapClusterProcessor` in the `default_processors` tuple.

5. **`scripts/patch_surya_label.py`** *(new)* — patches the installed surya
package so the layout model never emits the `<form>` → `Form` label;
instead, those regions come through as `ListItem` so the patched
structure builder and list-itemization processors handle them.
Idempotent. Run once after installing or upgrading `surya-ocr`.

6. **`scripts/marker_view.py`** *(new)* — standalone HTML viewer. Renders
each PDF page and overlays marker's JSON bboxes with color-coded labels
and a "ListItems only" filter.

## Setup

```bash
pip install -e .
pip install surya-ocr pypdfium2
python scripts/patch_surya_label.py
```

## Usage

```bash
# convert
SURYA_INFERENCE_BACKEND=llamacpp marker_single path/to.pdf \
--output_dir out --output_format json --disable_multiprocessing

# visualize
python scripts/marker_view.py path/to.pdf out/<stem>/<stem>.json
```

## Choosing a strategy

| Strategy | When to use | Tradeoff |
|---|---|---|
| `ListItemLineExplodeProcessor` (default) | You want **every visual line** to be its own bbox | Wrapped 2-line items get over-split |
| `ListItemGapClusterProcessor` | Source PDF uses extra spacing between items | Collapses to one item when sub-items are visually flush |

To swap, edit `marker/converters/pdf.py` and replace the entry in
`default_processors`. Both processors are prefix-free; they look at line
bboxes only.
5 changes: 5 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
> **margati fork notice** — this fork modifies marker's list detection to
> be prefix-free (no reliance on `1.`, `a.`, `•`, etc.). See
> [PATCHES.md](./PATCHES.md) for the full list of changes (structure
> builder, two new processors, surya label patch, viewer script).

<p align="center">
<img src="data/images/datalab-logo.png" alt="Datalab Logo" width="150"/>
</p>
Expand Down
4 changes: 2 additions & 2 deletions marker/builders/structure.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ def __init__(self, config=None):
def __call__(self, document: Document):
for page in document.pages:
self.group_caption_blocks(page)
self.group_lists(page)
self.unmark_lists(page)
# self.group_lists(page) # keep each ListItem standalone (no merging into ListGroup)
# self.unmark_lists(page) # don't demote lone ListItems to Text

def group_caption_blocks(self, page: PageGroup):
gap_threshold_px = self.gap_threshold * page.polygon.height
Expand Down
3 changes: 3 additions & 0 deletions marker/converters/pdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,8 @@
from marker.processors.order import OrderProcessor
from marker.services.gemini import GoogleGeminiService
from marker.processors.line_merge import LineMergeProcessor
from marker.processors.list_line_explode import ListItemLineExplodeProcessor
from marker.processors.list_gap_cluster import ListItemGapClusterProcessor # noqa: F401 # alternate strategy
from marker.processors.llm.llm_mathblock import LLMMathBlockProcessor
from marker.processors.llm.llm_page_correction import LLMPageCorrectionProcessor
from marker.processors.llm.llm_sectionheader import LLMSectionHeaderProcessor
Expand All @@ -75,6 +77,7 @@ class PdfConverter(BaseConverter):
OrderProcessor,
BlockRelabelProcessor,
LineMergeProcessor,
ListItemLineExplodeProcessor, # swap to ListItemGapClusterProcessor for gap-clustered itemization
BlockquoteProcessor,
CodeProcessor,
DocumentTOCProcessor,
Expand Down
94 changes: 94 additions & 0 deletions marker/processors/list_gap_cluster.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
from copy import deepcopy
from statistics import median
from typing import Annotated, List

from marker.processors import BaseProcessor
from marker.schema import BlockTypes
from marker.schema.document import Document
from marker.schema.registry import get_block_class


class ListItemGapClusterProcessor(BaseProcessor):
"""
Split multi-line ListItem blocks into per-item ListItems by clustering
Line children on vertical gap. Lines whose top-to-prev-bottom gap is at
most `gap_factor * median_gap` (with a minimum of `min_gap_pt`) are
treated as continuations of the previous item; larger gaps start a new
item. Each emitted ListItem's polygon is the merge of its lines'
polygons.

Prefix-free: uses only line bboxes.
"""
block_types = (BlockTypes.ListItem,)
min_lines: Annotated[
int, "Minimum number of Line children before clustering applies.",
] = 2
gap_factor: Annotated[
float,
"Multiplier on the median inter-line gap. Anything <= this is treated as 'tight'.",
] = 1.5
min_gap_pt: Annotated[
float,
"Floor on the 'tight' gap (PDF points). Below this, lines always belong to the same item.",
] = 2.0

def __call__(self, document: Document):
ListItemCls = get_block_class(BlockTypes.ListItem)
for page in document.pages:
if not page.structure:
continue
new_structure = []
for bid in list(page.structure):
block = page.get_block(bid)
if block is None or block.block_type != BlockTypes.ListItem:
new_structure.append(bid)
continue

line_ids = list(block.structure or [])
lines = []
for lid in line_ids:
ln = page.get_block(lid)
if ln is not None and ln.block_type == BlockTypes.Line:
lines.append(ln)

if len(lines) < self.min_lines:
new_structure.append(bid)
continue

# Sort top-to-bottom (defensive — should already be in order)
lines.sort(key=lambda l: l.polygon.y_start)

gaps = [
max(0.0, lines[i].polygon.y_start - lines[i - 1].polygon.y_end)
for i in range(1, len(lines))
]
if not gaps:
new_structure.append(bid)
continue

med = median(gaps)
tight_threshold = max(self.min_gap_pt, med * self.gap_factor)

clusters: List[List] = [[lines[0]]]
for i in range(1, len(lines)):
g = lines[i].polygon.y_start - lines[i - 1].polygon.y_end
if g <= tight_threshold:
clusters[-1].append(lines[i])
else:
clusters.append([lines[i]])

if len(clusters) == 1:
# Nothing to split — keep the original block intact
new_structure.append(bid)
continue

for cluster in clusters:
merged_poly = deepcopy(cluster[0].polygon)
if len(cluster) > 1:
merged_poly = merged_poly.merge([l.polygon for l in cluster[1:]])
new_li = page.add_block(ListItemCls, merged_poly)
new_li.structure = [l.id for l in cluster]
new_structure.append(new_li.id)
block.removed = True

page.structure = new_structure
53 changes: 53 additions & 0 deletions marker/processors/list_line_explode.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
from copy import deepcopy
from typing import Annotated

from marker.processors import BaseProcessor
from marker.schema import BlockTypes
from marker.schema.document import Document
from marker.schema.registry import get_block_class


class ListItemLineExplodeProcessor(BaseProcessor):
"""
Explode every multi-line ListItem into one ListItem per Line child.
Each new ListItem inherits the line's polygon and points at that single
Line as its structure. The parent block is dropped from the page structure.

Prefix-free: relies only on line-level layout from LineBuilder.
"""
block_types = (BlockTypes.ListItem,)
min_lines: Annotated[
int,
"Minimum number of Line children before a ListItem is exploded.",
] = 2

def __call__(self, document: Document):
ListItemCls = get_block_class(BlockTypes.ListItem)
for page in document.pages:
if not page.structure:
continue
new_structure = []
for bid in list(page.structure):
block = page.get_block(bid)
if block is None or block.block_type != BlockTypes.ListItem:
new_structure.append(bid)
continue

line_ids = list(block.structure or [])
lines = []
for lid in line_ids:
ln = page.get_block(lid)
if ln is not None and ln.block_type == BlockTypes.Line:
lines.append(ln)

if len(lines) < self.min_lines:
new_structure.append(bid)
continue

for line in lines:
new_li = page.add_block(ListItemCls, deepcopy(line.polygon))
new_li.structure = [line.id]
new_structure.append(new_li.id)
block.removed = True

page.structure = new_structure
136 changes: 136 additions & 0 deletions scripts/marker_view.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""Render a marker JSON result over the original PDF pages as an HTML viewer."""
import argparse, json, html, base64, io, subprocess
from pathlib import Path
import pypdfium2 as pdfium

LABEL_COLORS = {
"Text": "#3b82f6", "SectionHeader": "#ef4444", "PageHeader": "#a855f7",
"PageFooter": "#a855f7", "Picture": "#f59e0b", "PictureGroup": "#f59e0b",
"Figure": "#f59e0b", "FigureGroup": "#f59e0b",
"Table": "#10b981", "TableGroup": "#10b981",
"ListGroup": "#06b6d4", "ListItem": "#06b6d4",
"Form": "#22d3ee",
"Equation": "#ec4899", "Caption": "#f97316", "Footnote": "#94a3b8",
"Code": "#facc15",
}
DEFAULT_COLOR = "#6b7280"

def render_pages(pdf, dpi=150):
doc = pdfium.PdfDocument(str(pdf))
scale = dpi / 72
pages = []
for i in range(len(doc)):
page = doc[i]
pil = page.render(scale=scale).to_pil()
buf = io.BytesIO()
pil.save(buf, format="PNG")
pages.append({
"b64": base64.b64encode(buf.getvalue()).decode(),
"w_px": pil.width, "h_px": pil.height,
"w_pt": page.get_width(), "h_pt": page.get_height(),
})
return pages

def collect_blocks(page_json):
"""Top-level blocks on a marker page (ListItems remain individual after patch)."""
out = []
for child in (page_json.get("children") or []):
out.append({
"label": child.get("block_type", "?"),
"bbox": child.get("bbox"),
})
return out

def build_html(pdf_name, pages, marker_pages):
parts = [f"""<!doctype html><meta charset=utf-8>
<title>Marker (patched): {html.escape(pdf_name)}</title>
<style>
body{{font-family:system-ui;margin:0;background:#111;color:#eee}}
header{{padding:12px 20px;background:#1f2937;position:sticky;top:0;z-index:10;display:flex;gap:16px;align-items:center;flex-wrap:wrap}}
h1{{font-size:16px;margin:0}}
.legend{{display:flex;gap:10px;font-size:12px;flex-wrap:wrap}}
.swatch{{display:inline-block;width:12px;height:12px;margin-right:4px;vertical-align:middle;border:1px solid #fff3}}
.page{{margin:20px auto;max-width:fit-content;background:#000;padding:8px 8px 8px;border-radius:6px}}
.page-title{{font-size:13px;margin:0 0 6px;color:#9ca3af}}
.canvas{{position:relative;line-height:0;margin-top:14px}}
.canvas img{{display:block;max-width:90vw;height:auto}}
.bbox{{position:absolute;border:1.5px solid;box-sizing:border-box;cursor:pointer}}
.bbox span{{position:absolute;top:-12px;left:-1px;font-size:9px;line-height:11px;padding:0 3px;color:#fff;white-space:nowrap;border-radius:2px 2px 0 0;font-family:ui-monospace,monospace;z-index:2;pointer-events:none;opacity:.85}}
.bbox:hover{{background:#ffffff22;z-index:5}}
.bbox:hover span{{opacity:1;z-index:6}}
.controls label{{font-size:12px;margin-right:8px}}
</style>
<header>
<h1>{html.escape(pdf_name)} — patched marker (no ListGroup merging)</h1>
<div class=controls>
<label><input type=checkbox id=lbls checked> labels</label>
<label><input type=checkbox id=boxes checked> boxes</label>
<label><input type=checkbox id=only_li> ListItems only</label>
</div>
<div class=legend id=legend></div>
</header>"""]

labels_seen = set()
for idx, (img, mp) in enumerate(zip(pages, marker_pages), 1):
# marker bbox is in PDF points; image is rendered at dpi → multiply by (px / pt)
sx = img["w_px"] / img["w_pt"]
sy = img["h_px"] / img["h_pt"]
blocks = collect_blocks(mp)
li_count = sum(1 for b in blocks if b["label"] == "ListItem")
parts.append(f'<div class=page><div class=page-title>page {idx} — {len(blocks)} blocks ({li_count} ListItem)</div>')
parts.append(f'<div class=canvas style="width:{img["w_px"]}px;height:{img["h_px"]}px">')
parts.append(f'<img src="data:image/png;base64,{img["b64"]}" width="{img["w_px"]}" height="{img["h_px"]}">')
for b in blocks:
if not b["bbox"]:
continue
x0, y0, x1, y1 = b["bbox"]
x0, y0, x1, y1 = x0*sx, y0*sy, x1*sx, y1*sy
label = b["label"]
color = LABEL_COLORS.get(label, DEFAULT_COLOR)
labels_seen.add((label, color))
parts.append(
f'<div class="bbox lbl-{html.escape(label)}" style="left:{x0:.1f}px;top:{y0:.1f}px;'
f'width:{x1-x0:.1f}px;height:{y1-y0:.1f}px;border-color:{color}" '
f'title="{html.escape(label)}">'
f'<span style="background:{color}">{html.escape(label)}</span></div>'
)
parts.append("</div></div>")

legend = "".join(
f'<span><i class=swatch style="background:{c}"></i>{html.escape(l)}</span>'
for l, c in sorted(labels_seen)
)
parts.append(f"<script>document.getElementById('legend').innerHTML={json.dumps(legend)};"
"const $=s=>document.querySelectorAll(s);"
"lbls.onchange=e=>$('.bbox span').forEach(x=>x.style.display=e.target.checked?'':'none');"
"boxes.onchange=e=>$('.bbox').forEach(x=>x.style.borderWidth=e.target.checked?'2px':'0');"
"only_li.onchange=e=>$('.bbox').forEach(x=>{x.style.display=(!e.target.checked||x.classList.contains('lbl-ListItem'))?'':'none';});"
"</script>")
return "".join(parts)

def main():
ap = argparse.ArgumentParser()
ap.add_argument("pdf")
ap.add_argument("marker_json")
ap.add_argument("--out", default="/tmp/marker_view")
ap.add_argument("--no-open", action="store_true")
ap.add_argument("--tag", default="", help="suffix added to the output html name")
args = ap.parse_args()

pdf = Path(args.pdf).resolve()
out = Path(args.out); out.mkdir(parents=True, exist_ok=True)
data = json.loads(Path(args.marker_json).read_text())
marker_pages = data["children"]
print(f"[1/2] rendering {len(marker_pages)} page(s) ...", flush=True)
pages = render_pages(pdf)
print(f"[2/2] writing HTML ...", flush=True)
suffix = f".{args.tag}" if args.tag else ""
html_path = out / f"{pdf.stem}.marker{suffix}.html"
html_path.write_text(build_html(pdf.name, pages, marker_pages))
print(f"-> {html_path}")
if not args.no_open:
subprocess.Popen(["firefox", str(html_path)], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

if __name__ == "__main__":
main()
Loading
Loading