Product Name
**LibraDigit AI** – AI-Based Digitization & Digital Archive Builder
Product Type
Web + Desktop hybrid app (offline-friendly for libraries)
Target Users
* Librarians
* Archivists
* Digitization teams
* Universities & public libraries
Core Goal
Convert scanned documents into **searchable, clean, metadata-rich digital archives** using a guided workflow.
This app replaces:
* Manual OCR tools
* Separate PDF editors
* Excel metadata sheets
* Folder chaos
One pipeline. One system.
---
PRD – Product Requirements Document
1. Problem Statement
Libraries have scanned files but:
* They are not searchable
* No metadata
* No workflow
* No quality control
* No standard archive structure
Current digitization = storage
Needed digitization = searchable knowledge
---
2. Key Features (MVP)
2.1 OCR Module
Function:
* Upload PDF or image
* Run OCR using:
  * Tesseract engine
  * OCRmyPDF wrapper
UI:
* Upload → Process → Download
* Before/After preview
* Search test button
Output:
* Searchable PDF
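A minimal sketch of the Upload → Process step, assuming the OCRmyPDF command-line tool is installed locally and on the PATH. The function names `build_ocr_command` and `run_ocr` are hypothetical; the `--language`, `--skip-text`, and `--deskew` options are chosen for illustration and are not part of this spec.

```python
import subprocess
from pathlib import Path


def build_ocr_command(source: Path, target: Path, language: str = "eng") -> list[str]:
    """Build an OCRmyPDF command line that adds a text layer to a scanned PDF."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language pack to use
        "--skip-text",           # leave pages that already contain text untouched
        "--deskew",              # straighten slightly rotated scans before OCR
        str(source),
        str(target),
    ]


def run_ocr(source: Path, target: Path, language: str = "eng") -> Path:
    """Run OCR locally; raises CalledProcessError if OCRmyPDF exits with an error."""
    subprocess.run(build_ocr_command(source, target, language), check=True)
    return target
```

Keeping the command builder separate from the runner lets the UI show or log the exact command before anything is executed, and keeps the step fully offline.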
---
2.2 OCR Cleanup Module
Function:
* Show extracted text
* Highlight OCR errors
* Allow:
  * Text correction
  * Formatting cleanup
  * Remove garbage characters
UI:
* Side-by-side:
  * Original image
  * OCR text editor
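One possible starting point for "remove garbage characters" and "highlight OCR errors", using only the Python standard library. Both helpers are hypothetical, and the rule that flags tokens mixing letters and digits is an assumed heuristic, not a requirement from this document. It will also flag legitimate tokens such as "1990s".

```python
import re
import unicodedata

# Heuristic (assumption): a token containing both letters and digits,
# such as "l1brary", is likely an OCR misread worth highlighting.
SUSPICIOUS = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def remove_garbage(text: str) -> str:
    """Drop control characters and collapse runs of spaces, keeping line breaks."""
    kept = "".join(
        ch for ch in text
        if ch in "\n\t" or (unicodedata.category(ch)[0] != "C" and ch != "\ufffd")
    )
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in kept.split("\n")]
    return "\n".join(lines)


def flag_suspicious(text: str) -> list[str]:
    """Return tokens the editor could highlight for manual review."""
    return [match.group(0) for match in SUSPICIOUS.finditer(text)]
```

The cleanup is deliberately conservative: it only removes characters that cannot be meaningful text and leaves every correction decision to the librarian.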
---
2.3 Metadata Module
Function:
* Manual metadata entry:
  * Title
  * Author
  * Year
  * Subject
  * Keywords
* AI suggestions for keywords (optional)
* Save as:
  * JSON
  * CSV
  * MARC-ready format (future)
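A sketch of how the metadata fields above could be modelled and serialized to JSON and CSV with the standard library. The class name `DocumentMetadata`, the helper names, and the choice to join keywords with "; " in the CSV column are assumptions for illustration.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field

FIELDS = ["title", "author", "year", "subject", "keywords"]


@dataclass
class DocumentMetadata:
    title: str
    author: str
    year: int
    subject: str
    keywords: list[str] = field(default_factory=list)


def to_json(record: DocumentMetadata) -> str:
    """Serialize one record as JSON; keywords stay a real list."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


def to_csv(records: list[DocumentMetadata]) -> str:
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["keywords"] = "; ".join(record.keywords)  # flatten the list for one cell
        writer.writerow(row)
    return buffer.getvalue()
```

Both functions return text, so saving is a single `Path.write_text(...)` call and the same output can feed a preview in the UI.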
---
2.4 Digital Archive Builder
Function:
* Standard file naming
* Auto folder structure:
```
/Archive/
  /Subject/
    /Year/
      Author_Title.pdf
```
* Attach metadata to file
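The naming rule and folder layout above can be sketched as a small path builder. The helper names `slug` and `archive_path` are hypothetical, and replacing unsafe characters with underscores is an assumed sanitization rule, since this document does not define one.

```python
import re
from pathlib import Path


def slug(value: str) -> str:
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r"[^\w\-]+", "_", value.strip())
    return cleaned.strip("_") or "Unknown"


def archive_path(root: Path, subject: str, year: int, author: str, title: str) -> Path:
    """Return <root>/<Subject>/<Year>/<Author>_<Title>.pdf for one document."""
    return root / slug(subject) / str(year) / f"{slug(author)}_{slug(title)}.pdf"
```

For example, `archive_path(Path("Archive"), "Local History", 1987, "Jane Doe", "Annual Report: 1987")` yields `Archive/Local_History/1987/Jane_Doe_Annual_Report_1987.pdf`. The function only computes the path; creating folders and copying the PDF would happen in a separate step.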
---
2.5 Workflow Tracker
Function:
Show:
* Step 1: Upload
* Step 2: OCR
* Step 3: Cleanup
* Step 4: Metadata
* Step 5: Archive Ready
Progress indicator per file.
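A minimal sketch of per-file progress tracking over the five steps listed above. The `Step` and `FileProgress` names are hypothetical, and reporting progress as the reached step divided by the total number of steps is an assumption.

```python
from enum import IntEnum


class Step(IntEnum):
    UPLOAD = 1
    OCR = 2
    CLEANUP = 3
    METADATA = 4
    ARCHIVE_READY = 5


class FileProgress:
    """Tracks which workflow step a single file has reached."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.step = Step.UPLOAD

    def advance(self) -> Step:
        """Move to the next step; stays at ARCHIVE_READY once finished."""
        if self.step < Step.ARCHIVE_READY:
            self.step = Step(self.step + 1)
        return self.step

    def percent(self) -> int:
        """Progress as a whole-number percentage for the indicator."""
        return round(100 * self.step / len(Step))
```

Because the workflow is strictly linear, a single ordered enum is enough; a file never needs more state than "which step it is on".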
---
3. Optional AI Enhancements (Phase 2)
* Auto keyword extraction (NLP)
* Document classification by subject
* Language detection
* Duplicate document detection
* Quality score for OCR accuracy
---
4. User Flow
1. Upload scanned document
2. Click “Run OCR”
3. Validate searchability
4. Clean OCR text
5. Enter metadata
6. Save to archive
7. Export searchable PDF + metadata
Simple and linear.
---
5. Non-Functional Requirements
* Works offline (important for libraries)
* No cloud dependency
* Data privacy
* Lightweight deployment
* Cross-platform support (Windows/Linux)
---
6. Tech Stack (Suggested)
Backend:
* Python
* Tesseract OCR
* OCRmyPDF
* spaCy (future)
Frontend:
* Electron / Tauri + React
* Or Web UI + local backend
Storage:
* Local file system
* SQLite for metadata
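To illustrate the "SQLite for metadata" option, here is a sketch using Python's built-in `sqlite3` module. The table name, column layout, and helper functions are assumptions, and keywords are stored as a JSON-encoded string only to keep the example to one table.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id       INTEGER PRIMARY KEY,
    path     TEXT NOT NULL UNIQUE,  -- location of the PDF in the archive
    title    TEXT NOT NULL,
    author   TEXT,
    year     INTEGER,
    subject  TEXT,
    keywords TEXT                   -- JSON-encoded list of keywords
)
"""


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local metadata database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_document(conn, path, title, author, year, subject, keywords) -> int:
    """Insert one document record and return its row id."""
    cursor = conn.execute(
        "INSERT INTO documents (path, title, author, year, subject, keywords) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (path, title, author, year, subject, json.dumps(keywords)),
    )
    conn.commit()
    return cursor.lastrowid


def titles_for_subject(conn, subject: str) -> list[str]:
    """List titles filed under a subject, oldest first."""
    rows = conn.execute(
        "SELECT title FROM documents WHERE subject = ? ORDER BY year", (subject,)
    )
    return [title for (title,) in rows]
```

A single database file next to the archive folder needs no server and no network, which fits the offline requirement.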
---
7. MVP Scope (Strict)
Must have:
* OCR
* Searchable PDF
* Cleanup editor
* Metadata form
* Archive export
* Workflow steps
Exclude:
* Chatbot
* Search engines
* Recommendation systems
* Plagiarism tools
Keep it focused.
---
8. Monetization Model
* Free: OCR + basic PDF
* Pro:
  * Cleanup editor
  * Metadata
  * Archive builder
* Enterprise:
  * Batch processing
  * Institutional branding
  * Training mode
---
9. Training Mode (Your Unique Advantage)
Add a “Training Mode”:
* Guided steps
* Progress tracking
* Certificate generation
* Project report export
Your app becomes:
> A learning platform + a digitization system
---
10. Success Metrics
* Time to convert 10-page scan → searchable PDF < 3 minutes
* Metadata creation < 2 minutes
* Zero dependency on external SaaS
* Librarian can complete workflow without tech help
---
This is a real product.
Not a demo app.
Not an AI toy.
It directly supports:
* Your training
* Your certification
* Your branding
* Your commercial offering