Commit b639327

quality improvements
1 parent 4f00a49 commit b639327

19 files changed

Lines changed: 1613 additions & 51 deletions

.github/copilot-instructions.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# Copilot Instructions

## Commands

```bash
# Install (editable + dev deps)
pip install -e ".[dev]"

# Run all tests
pytest -q

# Run a single test
pytest tests/test_layout.py::test_empty_input -q

# Smoke test
png2pptx convert examples/sample_input.png -o sample_output.pptx
```

Tesseract OCR must be installed separately (not a pip package). On Windows, add it to `PATH` or set `$env:path="C:\Program Files\Tesseract-OCR;$env:path"`.

## Architecture

The pipeline is a linear sequence of transformations, one pass per input PNG:

```
PNG file
  → ocr.py           extract_words()        → list[WordBox]
  → layout.py        group_into_blocks()    → list[TextBlock]
  → styles.py        extract_text_colors()  → list[TextBlock] (colors set)
  → inpaint.py       remove_text()          → cleaned PNG (temp file)
  → pptx_builder.py  build_pptx()           → .pptx file
```

`cli.py` orchestrates all steps. `models.py` defines the shared data structures that flow between every stage.

### Data model (`models.py`)

- **`WordBox`** — one word from Tesseract with pixel bounding box (`x`, `y`, `width`, `height`), `confidence`, and Tesseract's `block_num`/`par_num`/`line_num` grouping keys.
- **`TextBlock`** — a logical text line made of one or more `WordBox`es. Exposes `text`, bounds, and `estimated_font_size_px` as computed properties. Holds the sampled RGB `color`.
- **`SlideData`** — everything needed for one PPTX slide: image path, pixel dimensions, and the list of `TextBlock`s.
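
As a rough illustration of the shapes described above, the sketch below models `WordBox` and `TextBlock` as dataclasses. The field names come from the bullets; the default values, the `(x, y, width, height)` layout of `bounds`, and the median-height font-size estimate are assumptions made for illustration and may differ from the real `models.py`.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass, field


@dataclass
class WordBox:
    """One OCR word with its pixel bounding box and Tesseract grouping keys."""
    text: str
    x: int
    y: int
    width: int
    height: int
    confidence: float
    block_num: int = 0  # defaults are assumptions, not taken from the repo
    par_num: int = 0
    line_num: int = 0


@dataclass
class TextBlock:
    """A logical text line made of one or more words."""
    words: list[WordBox] = field(default_factory=list)
    color: tuple[int, int, int] = (0, 0, 0)  # sampled RGB color

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def bounds(self) -> tuple[int, int, int, int]:
        """Enclosing box as (x, y, width, height); this layout is an assumption."""
        left = min(w.x for w in self.words)
        top = min(w.y for w in self.words)
        right = max(w.x + w.width for w in self.words)
        bottom = max(w.y + w.height for w in self.words)
        return left, top, right - left, bottom - top

    @property
    def estimated_font_size_px(self) -> float:
        # Assumption: use the median word height as a proxy for font size.
        return float(statistics.median(w.height for w in self.words))
```

Keeping the derived values as properties means later stages never see a `TextBlock` whose text or bounds are out of sync with its words.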

### Key module details

**`layout.py`** — The most complex module. Groups words by Tesseract's `(block_num, par_num, line_num)` triplet, then applies extensive noise filtering:
- Words are pre-filtered by `_is_noise_word()` before grouping.
- Lines with large horizontal gaps are split into separate blocks via `_split_wide_gaps()` (handles multi-column infographics).
- After grouping, `_clean_block_words()` trims low-confidence or noisy edge words.
- `_filter_noise()` removes entire blocks that are OCR artifacts.
- `merge_lines=False` by default — each Tesseract line becomes its own `TextBlock`. This is intentional for infographics where text is scattered; don't change this default.
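
The grouping and gap-splitting steps listed above can be sketched roughly as follows. This is a simplified illustration, not the code in `layout.py`: the `Word` tuple is a hypothetical stand-in for `WordBox`, the function names are unprefixed on purpose, and the `max_gap_px=80` threshold is an invented value.

```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical stand-in for WordBox with only the fields used here."""
    text: str
    x: int
    width: int
    block_num: int
    par_num: int
    line_num: int


def group_by_line(words):
    """Bucket words by Tesseract's (block_num, par_num, line_num) key."""
    lines = defaultdict(list)
    for w in words:
        lines[(w.block_num, w.par_num, w.line_num)].append(w)
    # Keys are unique, so sorting the items only ever compares the keys.
    return [sorted(ws, key=lambda w: w.x) for _, ws in sorted(lines.items())]


def split_wide_gaps(line, max_gap_px=80):
    """Split a left-to-right line wherever neighbouring words are far apart."""
    if not line:
        return []
    blocks = [[line[0]]]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > max_gap_px:
            blocks.append([cur])  # start a new block for the next column
        else:
            blocks[-1].append(cur)
    return blocks
```

Splitting on horizontal gaps is what keeps two columns that Tesseract reports on the same line from being emitted as one wide text box.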

**`pptx_builder.py`** — Slide dimensions are computed from the image aspect ratio, with the long edge fixed at `SLIDE_LONG_EDGE_EMU` (≈13.33 inches). All text boxes use zero margins, `word_wrap=False`, and transparent fill. Font size is determined by binary-search fitting (`_fit_font_size_px`) using `Pillow`'s `ImageFont` with `lru_cache` for performance. The hardcoded font family is `Calibri` (`_FONT_FAMILY`).
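
A minimal sketch of the two calculations described above, under stated assumptions: the constant value `12_192_000` is a guess at `SLIDE_LONG_EDGE_EMU` (13.333 inches at 914,400 EMU per inch), and the `measure` callable is a hypothetical replacement for the Pillow-based width measurement so the example runs without a font file.

```python
EMU_PER_INCH = 914_400
SLIDE_LONG_EDGE_EMU = 12_192_000  # assumption: 13.333 in * 914,400 EMU/in


def slide_size_emu(width_px, height_px):
    """Scale the image so its longer edge equals SLIDE_LONG_EDGE_EMU."""
    if width_px >= height_px:
        return SLIDE_LONG_EDGE_EMU, round(SLIDE_LONG_EDGE_EMU * height_px / width_px)
    return round(SLIDE_LONG_EDGE_EMU * width_px / height_px), SLIDE_LONG_EDGE_EMU


def fit_font_size_px(text, target_width_px, measure, lo=4, hi=400):
    """Binary-search the largest integer size whose rendered width fits.

    `measure(text, size_px)` must return the rendered width in pixels and
    must grow with size_px; the lo/hi bounds are illustrative.
    """
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure(text, mid) <= target_width_px:
            best = mid      # mid fits, try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # too wide, try something smaller
    return best
```

With Pillow, `measure` could wrap something like `ImageFont.truetype(path, size).getlength(text)`, cached per `(path, size)` pair; the font path would be whatever resolves to the configured family on the host.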

**`inpaint.py`** — Tries OpenCV (`cv2.inpaint` with Telea algorithm) first; falls back to a PIL median-color fill. The mask is built per-word using local color/contrast analysis (`_build_word_mask`) rather than simple rectangles, to avoid bleeding colors across adjacent panels. `cv2` is imported inside functions to allow graceful fallback.
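
The deferred-import fallback described above might look roughly like the sketch below. Both function names are hypothetical, and the median fill works on a plain list of pixel rows rather than a PIL image so the example has no dependencies; the real module's mask-based logic is more involved.

```python
from statistics import median


def pick_backend():
    """Return "opencv" if cv2 can be imported, otherwise "pil".

    The import lives inside the function so that importing this module
    never fails on machines without OpenCV.
    """
    try:
        import cv2  # noqa: F401
    except ImportError:
        return "pil"
    return "opencv"


def median_fill(pixels, box):
    """Overwrite box (x, y, w, h) with the median color of its 1-px border ring.

    `pixels` is a mutable list of rows, each row a list of (r, g, b) tuples.
    """
    x, y, w, h = box
    rows, cols = len(pixels), len(pixels[0])
    ring = [
        pixels[r][c]
        for r in range(max(y - 1, 0), min(y + h + 1, rows))
        for c in range(max(x - 1, 0), min(x + w + 1, cols))
        if not (y <= r < y + h and x <= c < x + w)
    ]
    if not ring:
        return  # box covers the whole image; nothing to sample from
    fill = tuple(int(median(channel)) for channel in zip(*ring))
    for r in range(max(y, 0), min(y + h, rows)):
        for c in range(max(x, 0), min(x + w, cols)):
            pixels[r][c] = fill
```

Sampling only the immediate surroundings is a crude way to keep the fill local, which matters when neighbouring panels have different background colors.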

**`ocr.py`** — Two OCR modes: `fast` (single Tesseract pass) and `aggressive` (multiple passes with different configs, results merged). Coordinates are always in source-image pixels.
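
The source does not say how `aggressive` mode merges results from several passes. One plausible approach, shown here purely as an assumption, is to treat two detections as the same word when their boxes overlap strongly and keep the one with higher confidence. The tuple layout, function names, and the `0.5` threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def merge_passes(passes, iou_threshold=0.5):
    """Merge several OCR passes, each a list of (text, box, confidence).

    A word that overlaps an already-kept word replaces it only when its
    confidence is higher; otherwise it is dropped as a duplicate.
    """
    merged = []
    for words in passes:
        for word in words:
            for i, kept in enumerate(merged):
                if iou(word[1], kept[1]) >= iou_threshold:
                    if word[2] > kept[2]:
                        merged[i] = word
                    break
            else:
                merged.append(word)  # no overlap with anything kept so far
    return merged
```

Because every pass reports coordinates in source-image pixels, boxes from different passes can be compared directly without rescaling.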

## Quality improvement loop

The `examples/` folder contains three named test images (`sample_input`, `Infographic`, `CommandCenter`), each with a set of versioned artifacts:

| File | Purpose |
|---|---|
| `<name>.png` | Source input |
| `<name>_baseline_clean.png` | Gold-standard inpainted background (do not overwrite) |
| `<name>_current_clean.png` | Output from the current code — regenerate to check regressions |
| `<name>_compare.png` | Side-by-side visual diff of baseline vs current |

### Regenerate `_current_clean.png` for all examples

Preferred repo command:

```bash
png2pptx quality-loop --examples-dir examples --output-dir quality_output
```

This writes fresh `_current_clean` images, per-example PPTX outputs, overlay review images, and `quality_output/summary.json`.

If you need the lower-level module path explicitly:

```python
# Run from the repo root
from pathlib import Path
from png2pptx.ocr import extract_words
from png2pptx.layout import group_into_blocks
from png2pptx.styles import extract_text_colors
from png2pptx.inpaint import remove_text

for name in ["sample_input", "Infographic", "CommandCenter"]:
    path = Path(f"examples/{name}.png")
    words, w, h = extract_words(path, confidence_threshold=40.0, ocr_mode="aggressive")
    blocks = group_into_blocks(words)
    blocks = extract_text_colors(path, blocks)
    remove_text(path, blocks, output_path=Path(f"examples/{name}_current_clean.png"))
    print(f"Done: {name}")
```

Then open `_current_clean.png` and `_baseline_clean.png` side by side to spot regressions in inpainting quality.
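
A numeric check can complement the visual comparison. The `quality-loop` command reports a `baseline_mae` value per example; the commit does not show how it is computed, so the sketch below is only an assumption of what a mean-absolute-error comparison between the baseline and current images could look like. It takes plain pixel sequences so that it runs without an imaging library.

```python
def mean_absolute_error(baseline, current):
    """Mean absolute per-channel difference between two pixel sequences.

    Each argument is a sequence of equally long channel tuples, for example
    (r, g, b). Returns 0.0 for identical images; larger values mean the
    current output has drifted further from the baseline.
    """
    if len(baseline) != len(current):
        raise ValueError("images must contain the same number of pixels")
    total = 0
    count = 0
    for p, q in zip(baseline, current):
        for a, b in zip(p, q):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0
```

With Pillow installed, the inputs could come from `list(Image.open(path).convert("RGB").getdata())` for each of the two files, provided both images have the same dimensions.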

To also check PPTX output (text placement, font sizes, colors):

```bash
png2pptx convert examples/sample_input.png -o examples/sample_input.pptx
png2pptx convert examples/Infographic.png -o examples/Infographic.pptx
png2pptx convert examples/CommandCenter.png -o examples/CommandCenter.pptx
```

### Diagnostic tools (in `.artifacts/`)

**`diag.py`** — Prints every detected text block with its pixel position, estimated font size, sampled color, and text content. Run from the `.artifacts/` directory:

```bash
cd .artifacts
python diag.py
```

**`debug_overlay.py`** — Renders text-box bounding boxes from a PPTX back onto the original image as a red-outline overlay, saved to `debug_overlay.png`. Useful for diagnosing misaligned or missing text boxes:

```bash
cd .artifacts
python debug_overlay.py ../examples/Infographic.pptx
```

## Conventions

- All modules use `from __future__ import annotations`.
- All domain objects are `@dataclass`es in `models.py`; no domain logic lives outside of `models.py`, `layout.py`, `styles.py`, `inpaint.py`, or `pptx_builder.py`.
- Progress/diagnostic output goes to **stderr** (`click.echo(..., err=True)`), not stdout.
- Internal helper functions in `layout.py` are module-private (prefixed `_`). The noise-filtering heuristics are deliberate and fragile — check the existing tests before modifying thresholds.
- Tests use a `_make_word()` factory helper to construct `WordBox` instances with sensible defaults. Follow this pattern for new tests in `test_layout.py`.
- Do not commit generated files: `sample_output.pptx`, `build/`, `dist/`, `*.egg-info/`, or inpainted temp PNGs.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ pip-wheel-metadata/

 # Local/session artifacts
 .artifacts/
+quality_output/

 # Generated outputs and diagnostics
 output*.pptx

README.md

Lines changed: 8 additions & 0 deletions
@@ -114,4 +114,12 @@ pip install -e ".[dev]"
 pytest -q
 ```

+Generate a repeatable quality baseline from the checked-in examples:
+
+```bash
+png2pptx quality-loop --examples-dir examples --output-dir quality_output
+```
+
+This writes per-example PPTX files, current clean images, overlay review images, and a `summary.json` file for before/after comparisons while tuning OCR, layout, or inpainting behavior.
+
 Before opening a public-facing PR, make sure docs stay aligned with real CLI behavior and avoid committing local/generated artifacts such as virtualenvs, build outputs, or ad-hoc debug exports.

examples/CommandCenter_compare.png

7.09 MB

examples/Infographic_compare.png

7.89 MB

examples/earth.png

261 KB

examples/earth.pptx

858 KB
Binary file not shown.

examples/sample_input_compare.png

279 KB

png2pptx/cli.py

Lines changed: 72 additions & 0 deletions
@@ -10,6 +10,7 @@
 from .models import SlideData
 from .ocr import extract_words
 from .pptx_builder import build_pptx
+from .quality import run_quality_loop
 from .styles import extract_text_colors
@@ -128,3 +129,74 @@ def convert(
                 tf.unlink()
             except OSError:
                 pass
+
+
+@main.command("quality-loop")
+@click.option(
+    "--examples-dir",
+    default="examples",
+    show_default=True,
+    type=click.Path(exists=True, file_okay=False, path_type=Path),
+    help="Directory containing source example PNGs and optional baseline clean images.",
+)
+@click.option(
+    "--output-dir",
+    default="quality_output",
+    show_default=True,
+    type=click.Path(file_okay=False, path_type=Path),
+    help="Directory where current clean images, PPTX files, overlays, and summary.json are written.",
+)
+@click.option(
+    "--confidence",
+    default=40.0,
+    type=float,
+    show_default=True,
+    help="Minimum OCR confidence threshold (0-100).",
+)
+@click.option(
+    "--lang",
+    default="eng",
+    show_default=True,
+    help="Tesseract language code (e.g. eng, fra, deu).",
+)
+@click.option(
+    "--ocr-mode",
+    type=click.Choice(["fast", "aggressive"], case_sensitive=False),
+    default="aggressive",
+    show_default=True,
+    help="OCR quality mode to benchmark.",
+)
+@click.option(
+    "--remove-text/--no-remove-text",
+    default=True,
+    help="Remove detected text from the background image before building each slide.",
+)
+def quality_loop(
+    examples_dir: Path,
+    output_dir: Path,
+    confidence: float,
+    lang: str,
+    ocr_mode: str,
+    remove_text: bool,
+):
+    """Generate benchmark artifacts for the example PNGs."""
+    results = run_quality_loop(
+        examples_dir=examples_dir,
+        output_dir=output_dir,
+        confidence=confidence,
+        lang=lang,
+        ocr_mode=ocr_mode,
+        remove_background_text=remove_text,
+    )
+    if not results:
+        raise click.ClickException(f"No example PNGs found in {examples_dir}")
+
+    click.echo(f"Wrote quality artifacts to {output_dir}", err=True)
+    click.echo(f"Summary: {output_dir / 'summary.json'}", err=True)
+    for result in results:
+        baseline = "n/a" if result.baseline_mae is None else f"{result.baseline_mae:.2f}"
+        click.echo(
+            f"{result.name}: words={result.word_count}, blocks={result.block_count}, "
+            f"shapes={result.pptx_text_shapes}, baseline_mae={baseline}",
+            err=True,
+        )
