| 1 | +# Copilot Instructions |
| 2 | + |
| 3 | +## Commands |
| 4 | + |
| 5 | +```bash |
| 6 | +# Install (editable + dev deps) |
| 7 | +pip install -e ".[dev]" |
| 8 | + |
| 9 | +# Run all tests |
| 10 | +pytest -q |
| 11 | + |
| 12 | +# Run a single test |
| 13 | +pytest tests/test_layout.py::test_empty_input -q |
| 14 | + |
| 15 | +# Smoke test |
| 16 | +png2pptx convert examples/sample_input.png -o sample_output.pptx |
| 17 | +``` |
| 18 | + |
| 19 | +Tesseract OCR must be installed separately (it is a system binary, not a pip package). On Windows, add its install directory to `PATH`, for example in PowerShell for the current session: `$env:Path = "C:\Program Files\Tesseract-OCR;$env:Path"`.
| 20 | + |
| 21 | +## Architecture |
| 22 | + |
| 23 | +The pipeline is a linear sequence of transformations, one pass per input PNG: |
| 24 | + |
| 25 | +``` |
| 26 | +PNG file |
| 27 | + → ocr.py extract_words() → list[WordBox] |
| 28 | + → layout.py group_into_blocks() → list[TextBlock] |
| 29 | + → styles.py extract_text_colors() → list[TextBlock] (colors set) |
| 30 | + → inpaint.py remove_text() → cleaned PNG (temp file) |
| 31 | + → pptx_builder.py build_pptx() → .pptx file |
| 32 | +``` |
| 33 | + |
| 34 | +`cli.py` orchestrates all steps. `models.py` defines the shared data structures that flow between every stage. |
| 35 | + |
| 36 | +### Data model (`models.py`) |
| 37 | + |
| 38 | +- **`WordBox`** — one word from Tesseract with pixel bounding box (`x`, `y`, `width`, `height`), `confidence`, and Tesseract's `block_num`/`par_num`/`line_num` grouping keys. |
| 39 | +- **`TextBlock`** — a logical text line made of one or more `WordBox`es. Exposes `text`, bounds, and `estimated_font_size_px` as computed properties, and holds the sampled RGB `color`.
| 40 | +- **`SlideData`** — everything needed for one PPTX slide: image path, pixel dimensions, and the list of `TextBlock`s. |
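
As a rough illustration of the shapes described above, the two core types might look like the sketch below. The field names come from the list above; the `words` field, the `bounds` property name, the defaults, and the font-size heuristic (tallest word height) are assumptions for illustration, not the actual contents of `models.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WordBox:
    """One OCR word with its pixel bounding box and Tesseract grouping keys."""
    text: str
    x: int
    y: int
    width: int
    height: int
    confidence: float
    block_num: int = 0
    par_num: int = 0
    line_num: int = 0


@dataclass
class TextBlock:
    """One logical text line built from one or more words."""
    words: list[WordBox] = field(default_factory=list)
    color: tuple[int, int, int] = (0, 0, 0)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def bounds(self) -> tuple[int, int, int, int]:
        """(x, y, width, height) of the union of all word boxes."""
        left = min(w.x for w in self.words)
        top = min(w.y for w in self.words)
        right = max(w.x + w.width for w in self.words)
        bottom = max(w.y + w.height for w in self.words)
        return left, top, right - left, bottom - top

    @property
    def estimated_font_size_px(self) -> float:
        # Assumption: the tallest word is used as a rough font-size proxy.
        return float(max(w.height for w in self.words))


block = TextBlock(words=[
    WordBox("Hello", 10, 20, 50, 18, 96.0),
    WordBox("world", 66, 21, 48, 17, 91.0),
])
print(block.text, block.bounds, block.estimated_font_size_px)
```

Keeping `text`, bounds, and font size as properties means they can never drift out of sync with the underlying word list after a block is trimmed or split.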
| 41 | + |
| 42 | +### Key module details |
| 43 | + |
| 44 | +**`layout.py`** — The most complex module. Groups words by Tesseract's `(block_num, par_num, line_num)` triplet, then applies extensive noise filtering: |
| 45 | +- Words are pre-filtered by `_is_noise_word()` before grouping. |
| 46 | +- Lines with large horizontal gaps are split into separate blocks via `_split_wide_gaps()` (handles multi-column infographics). |
| 47 | +- After grouping, `_clean_block_words()` trims low-confidence or noisy edge words. |
| 48 | +- `_filter_noise()` removes entire blocks that are OCR artifacts. |
| 49 | +- `merge_lines=False` by default — each Tesseract line becomes its own `TextBlock`. This is intentional for infographics where text is scattered; don't change this default. |
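
The grouping and gap-splitting steps above can be sketched as follows. This is a simplified illustration, not the code in `layout.py`: the `Word` type is a hypothetical stand-in for `WordBox`, and the split threshold (three times the median word height) is an assumed heuristic.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import median
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical stand-in for WordBox with only the fields used here."""
    text: str
    x: int
    width: int
    height: int
    block_num: int
    par_num: int
    line_num: int


def group_by_line(words: list[Word]) -> list[list[Word]]:
    """Group words by Tesseract's (block_num, par_num, line_num) key."""
    lines: dict[tuple[int, int, int], list[Word]] = defaultdict(list)
    for w in words:
        lines[(w.block_num, w.par_num, w.line_num)].append(w)
    # Order lines by key and words within a line from left to right.
    return [sorted(ws, key=lambda w: w.x) for _, ws in sorted(lines.items())]


def split_wide_gaps(line: list[Word], gap_factor: float = 3.0) -> list[list[Word]]:
    """Split a left-to-right line wherever the horizontal gap is unusually wide."""
    if not line:
        return []
    # Assumed threshold: a multiple of the median word height on this line.
    threshold = gap_factor * median(w.height for w in line)
    segments = [[line[0]]]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > threshold:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    return segments


words = [
    Word("Right", 300, 50, 10, 1, 1, 1),
    Word("Left", 0, 40, 10, 1, 1, 1),
    Word("side", 45, 40, 10, 1, 1, 1),
    Word("Next", 0, 40, 10, 1, 1, 2),
]
lines = group_by_line(words)
segments = split_wide_gaps(lines[0])
print([[w.text for w in seg] for seg in segments])
```

In a two-column infographic, Tesseract often reports both columns as one line; splitting on the wide gap yields one block per column.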
| 50 | + |
| 51 | +**`pptx_builder.py`** — Slide dimensions are computed from the image aspect ratio, with the long edge fixed at `SLIDE_LONG_EDGE_EMU` (≈13.33 inches). All text boxes use zero margins, `word_wrap=False`, and transparent fill. Font size is determined by binary-search fitting (`_fit_font_size_px`) using `Pillow`'s `ImageFont` with `lru_cache` for performance. The hardcoded font family is `Calibri` (`_FONT_FAMILY`). |
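
A minimal sketch of the two calculations described above, under stated assumptions: the value `12_192_000` for `SLIDE_LONG_EDGE_EMU` is inferred from "≈13.33 inches" at 914,400 EMU per inch, and the function names and the injected `measure` callable are hypothetical. The real `_fit_font_size_px` measures text with Pillow's `ImageFont`; here the measurement is passed in so the sketch runs without a font file.

```python
from typing import Callable

EMU_PER_INCH = 914_400
SLIDE_LONG_EDGE_EMU = 12_192_000  # assumed value: 13.333 in * 914,400 EMU/in


def slide_size_emu(width_px: int, height_px: int) -> tuple[int, int]:
    """Return (width, height) in EMU, keeping the image aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("image dimensions must be positive")
    if width_px >= height_px:
        # Landscape or square: the width is the long edge.
        return SLIDE_LONG_EDGE_EMU, round(SLIDE_LONG_EDGE_EMU * height_px / width_px)
    # Portrait: the height is the long edge.
    return round(SLIDE_LONG_EDGE_EMU * width_px / height_px), SLIDE_LONG_EDGE_EMU


def fit_font_size_sketch(
    text: str,
    box_width_px: float,
    measure: Callable[[str, int], float],
    lo: int = 4,
    hi: int = 400,
) -> int:
    """Binary-search the largest size in [lo, hi] whose measured width fits.

    Assumes measure(text, size) grows with size. Returns lo if nothing fits.
    """
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure(text, mid) <= box_width_px:
            best = mid      # mid fits, try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too wide, try something smaller
    return best


def fake_measure(text: str, size: int) -> float:
    """Stand-in for a real font measurement: half the font size per character."""
    return 0.5 * size * len(text)


print(slide_size_emu(1920, 1080))
print(fit_font_size_sketch("Hello", 100, fake_measure))
```

Binary search keeps the number of text measurements logarithmic in the size range, which matters when each measurement renders text.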
| 52 | + |
| 53 | +**`inpaint.py`** — Tries OpenCV (`cv2.inpaint` with Telea algorithm) first; falls back to a PIL median-color fill. The mask is built per-word using local color/contrast analysis (`_build_word_mask`) rather than simple rectangles, to avoid bleeding colors across adjacent panels. `cv2` is imported inside functions to allow graceful fallback. |
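
The lazy-import fallback described above follows a common pattern, sketched below. The function names are hypothetical and the fill logic is reduced to a per-channel median over sampled pixels; the real module builds per-word masks and calls `cv2.inpaint` when OpenCV is available.

```python
from __future__ import annotations

from statistics import median


def pick_inpaint_backend() -> str:
    """Report which backend would be used, importing cv2 only on demand."""
    try:
        import cv2  # noqa: F401  (imported inside the function on purpose)
    except ImportError:
        return "median-fill"
    return "opencv"


def median_fill_color(pixels: list[tuple[int, int, int]]) -> tuple[int, int, int]:
    """Per-channel median of sampled RGB pixels, used as a flat fill color."""
    if not pixels:
        raise ValueError("need at least one sampled pixel")
    r, g, b = zip(*pixels)
    return int(median(r)), int(median(g)), int(median(b))


print(pick_inpaint_backend())
print(median_fill_color([(0, 0, 0), (10, 20, 30), (255, 255, 255)]))
```

Because the import happens inside the function, importing the module itself never fails on machines without OpenCV.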
| 54 | + |
| 55 | +**`ocr.py`** — Two OCR modes: `fast` (single Tesseract pass) and `aggressive` (multiple passes with different configs, results merged). Coordinates are always in source-image pixels. |
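
The text above does not say how `aggressive` mode merges its passes. One plausible approach, shown here purely as an assumption, is to pool the boxes from all passes and keep the highest-confidence box among any that overlap heavily. The `Box` type, the function names, and the 0.5 IoU threshold are all hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for WordBox with only the fields used here."""
    text: str
    x: int
    y: int
    width: int
    height: int
    confidence: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    iy = max(0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    inter = ix * iy
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union else 0.0


def merge_passes(passes: list[list[Box]], iou_threshold: float = 0.5) -> list[Box]:
    """Pool all passes; among heavily overlapping boxes keep the most confident."""
    pooled = sorted(
        (box for result in passes for box in result),
        key=lambda box: box.confidence,
        reverse=True,
    )
    merged: list[Box] = []
    for box in pooled:
        if all(iou(box, kept) < iou_threshold for kept in merged):
            merged.append(box)
    return merged


pass_one = [Box("Hello", 10, 10, 50, 20, 80.0)]
pass_two = [Box("Hello", 11, 10, 50, 20, 92.0), Box("tiny", 200, 10, 30, 12, 60.0)]
merged = merge_passes([pass_one, pass_two])
print([(b.text, b.confidence) for b in merged])
```

All boxes stay in source-image pixel coordinates throughout, so the merged list can feed the layout stage unchanged.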
| 56 | + |
| 57 | +## Quality improvement loop |
| 58 | + |
| 59 | +The `examples/` folder contains three named test images (`sample_input`, `Infographic`, `CommandCenter`), each with a set of versioned artifacts: |
| 60 | + |
| 61 | +| File | Purpose | |
| 62 | +|---|---| |
| 63 | +| `<name>.png` | Source input | |
| 64 | +| `<name>_baseline_clean.png` | Gold-standard inpainted background (do not overwrite) | |
| 65 | +| `<name>_current_clean.png` | Output from the current code — regenerate to check regressions | |
| 66 | +| `<name>_compare.png` | Side-by-side visual diff of baseline vs current | |
| 67 | + |
| 68 | +### Regenerate `_current_clean.png` for all examples |
| 69 | + |
| 70 | +Preferred repo command: |
| 71 | + |
| 72 | +```bash |
| 73 | +png2pptx quality-loop --examples-dir examples --output-dir quality_output |
| 74 | +``` |
| 75 | + |
| 76 | +This writes fresh `_current_clean` images, per-example PPTX outputs, overlay review images, and `quality_output/summary.json`. |
| 77 | + |
| 78 | +If you need the lower-level module path explicitly: |
| 79 | + |
| 80 | +```python |
| 81 | +# Run from the repo root |
| 82 | +from pathlib import Path |
| 83 | +from png2pptx.ocr import extract_words |
| 84 | +from png2pptx.layout import group_into_blocks |
| 85 | +from png2pptx.styles import extract_text_colors |
| 86 | +from png2pptx.inpaint import remove_text |
| 87 | + |
| 88 | +for name in ["sample_input", "Infographic", "CommandCenter"]: |
| 89 | + path = Path(f"examples/{name}.png") |
| 90 | + words, w, h = extract_words(path, confidence_threshold=40.0, ocr_mode="aggressive") |
| 91 | + blocks = group_into_blocks(words) |
| 92 | + blocks = extract_text_colors(path, blocks) |
| 93 | + remove_text(path, blocks, output_path=Path(f"examples/{name}_current_clean.png")) |
| 94 | + print(f"Done: {name}") |
| 95 | +``` |
| 96 | + |
| 97 | +Then open each `<name>_current_clean.png` next to its `<name>_baseline_clean.png` to spot regressions in inpainting quality.
| 98 | + |
| 99 | +To also check PPTX output (text placement, font sizes, colors): |
| 100 | + |
| 101 | +```bash |
| 102 | +png2pptx convert examples/sample_input.png -o examples/sample_input.pptx |
| 103 | +png2pptx convert examples/Infographic.png -o examples/Infographic.pptx |
| 104 | +png2pptx convert examples/CommandCenter.png -o examples/CommandCenter.pptx |
| 105 | +``` |
| 106 | + |
| 107 | +### Diagnostic tools (in `.artifacts/`) |
| 108 | + |
| 109 | +**`diag.py`** — Prints every detected text block with its pixel position, estimated font size, sampled color, and text content. Run from the `.artifacts/` directory: |
| 110 | + |
| 111 | +```bash |
| 112 | +cd .artifacts |
| 113 | +python diag.py |
| 114 | +``` |
| 115 | + |
| 116 | +**`debug_overlay.py`** — Renders text-box bounding boxes from a PPTX back onto the original image as a red-outline overlay, saved to `debug_overlay.png`. Useful for diagnosing misaligned or missing text boxes: |
| 117 | + |
| 118 | +```bash |
| 119 | +cd .artifacts |
| 120 | +python debug_overlay.py ../examples/Infographic.pptx |
| 121 | +``` |
| 122 | + |
| 123 | +## Conventions |
| 124 | + |
| 125 | +- All modules use `from __future__ import annotations`. |
| 126 | +- All domain objects are `@dataclass`es in `models.py`; no domain logic lives outside of `models.py`, `layout.py`, `styles.py`, `inpaint.py`, or `pptx_builder.py`. |
| 127 | +- Progress/diagnostic output goes to **stderr** (`click.echo(..., err=True)`), not stdout. |
| 128 | +- Internal helper functions in `layout.py` are module-private (prefixed `_`). The noise-filtering heuristics are deliberate and fragile — check the existing tests before modifying thresholds. |
| 129 | +- Tests use a `_make_word()` factory helper to construct `WordBox` instances with sensible defaults. Follow this pattern for new tests in `test_layout.py`. |
| 130 | +- Do not commit generated files: `sample_output.pptx`, `build/`, `dist/`, `*.egg-info/`, or inpainted temp PNGs. |