Commit 534c8bc

Docs: add early 'Inputs & Outputs' overview and confidence safeguards; clarify stdin/files symmetry and locking via confidence=10
1 parent bea4893 commit 534c8bc

1 file changed: +20 -0 lines changed

README.md

Lines changed: 20 additions & 0 deletions
@@ -26,6 +26,26 @@ Includes improvements from 0.2:

autoPDFtagger is a CLI for semi‑automatic classification, sorting, and tagging of PDF documents. It enriches PDFs with standard metadata using OCR + AI (text and images) and is explicitly built to handle difficult inputs like low‑quality scans and image‑heavy files (e.g., presentations). Your archive remains plain files and folders (no lock‑in), with optional JSON export for review and integration.

## Inputs & Outputs at a Glance

autoPDFtagger is a file database transformer. It accepts inputs from stdin or as command‑line arguments and produces outputs in the same formats. This symmetry makes it easy to chain runs and keep your archive reproducible.

- Inputs
  - PDFs or folders of PDFs: the tool scans the files and interprets their existing metadata and content (with OCR when needed).
  - JSON or CSV database: an existing description of your PDF collection (as exported by this tool). You can mix PDFs and database files in one invocation.
- Outputs
  - JSON or CSV database: a structured view of your archive with enriched metadata (title, summary, creator, creation date, tags, confidences).
  - Files: optional export to a target directory (e.g., renamed by detected title/creator).
- Behavior selection
  - CLI options control which analyses and actions run (e.g., `-t` for text, `-i` for image, `-c` for tags, `-e` for export). Image analysis already includes page text, so `-ti` is redundant.
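
As a rough illustration of the input/output symmetry, the sketch below round-trips one database record through JSON using only the Python standard library. The record layout and field names (`file`, `confidences`, and so on) are assumptions made for this example, not the tool's documented export schema.

```python
import json

# Hypothetical record: the field names and values are illustrative only
# and are not taken from autoPDFtagger's actual export format.
records = [
    {
        "file": "scans/letter.pdf",
        "title": "Letter from insurer",
        "summary": "Premium adjustment notice",
        "creator": "Example Insurance",
        "creation_date": "2021-03-04",
        "tags": ["insurance", "letter"],
        "confidences": {"title": 7, "creation_date": 8, "tags": 5},
    }
]


def dump_database(items):
    """Serialize records to a JSON string (what one run might write)."""
    return json.dumps(items, indent=2, ensure_ascii=False)


def load_database(text):
    """Parse a JSON string back into records (what the next run might read)."""
    return json.loads(text)


exported = dump_database(records)
reloaded = load_database(exported)
```

Because the reloaded records equal the originals, a database written by one step can be handed unchanged to the next, which is the property that makes chaining runs practical.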

### Confidence protects good metadata

To avoid overwriting high‑quality metadata with worse guesses, the tool tracks a confidence (0–10) for each field. An update is applied only when its confidence is not lower than the existing one.

- To lock a field permanently, set its confidence to 10. The tool will not overwrite it.
- The overall confidence index (shown in summaries) reflects multiple fields, with extra weight on title and date, and can be used to filter items before exporting.
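
The update rule and the lock can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: the `Field` class, `merge_field`, `overall_confidence`, and the weights are all hypothetical, and treating 10 as a hard lock even against another 10 is one reading of the bullets above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Field:
    value: str
    confidence: int  # 0-10, the scale described in the text


def merge_field(existing: Field, candidate: Field) -> Field:
    """Return the field to keep after an update attempt."""
    # Assumption: a confidence of 10 is a hard lock and always wins.
    if existing.confidence >= 10:
        return existing
    # Apply the update only when the new confidence is not lower.
    if candidate.confidence >= existing.confidence:
        return candidate
    return existing


# Hypothetical weights: the text only says title and date count extra.
WEIGHTS = {
    "title": 2.0,
    "creation_date": 2.0,
    "summary": 1.0,
    "creator": 1.0,
    "tags": 1.0,
}


def overall_confidence(confidences: Mapping[str, int]) -> float:
    """Weighted mean of per-field confidences; missing fields count as 0."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(w * confidences.get(name, 0) for name, w in WEIGHTS.items())
    return weighted / total_weight
```

With these assumed weights, an item whose title and date are certain (10) but whose other fields are empty scores 40/7, roughly 5.7, so a filter threshold would need to be chosen with the weighting in mind.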

## Key Features

- OCR (via Tesseract) + AI text analysis
