# EPUB LLM Cleaner - Example Prompts Configuration
# ================================================
#
# This file contains example prompts for the EPUB cleaning workflow.
# Copy this file to 'prompts.yaml' and customize for your specific needs.
#
# The cleaning process works in two stages:
# 1. ANALYSIS: Quickly scan each chapter to decide if it needs cleaning
# 2. CLEANING: Selectively rewrite only the problematic paragraphs
#
# PLACEHOLDERS:
# {chapter_text} - Used in analysis_prompt, contains the full chapter text
# {numbered_chapter} - Used in cleaning_prompt, contains paragraphs with [PARAGRAPH N] markers
#
# CUSTOMIZATION TIPS:
# - Be specific about what issues to look for (OCR errors, formatting, etc.)
# - Provide examples of common problems you've seen in your ebook
# - Define clear criteria for when to filter vs. when to skip
# - For cleaning, specify exactly how you want issues fixed
#
# ============================================================================
# ANALYSIS PROMPT
# ---------------
# Purpose: Quickly determine if a chapter needs any cleaning
# Input: {chapter_text} - The full text content of the chapter
# Expected Output: A single word, either "FILTER" (the chapter needs correction) or "CLEAN" (no correction needed)
#
# Tips for customization:
# - List specific patterns or issues to look for
# - Include examples of problematic text if helpful
# - Define what should NOT trigger filtering to reduce false positives
# - Keep the prompt focused for faster, more accurate decisions
analysis_prompt: |
  You are analyzing a chapter from a digitized book to determine if it contains issues that need correction.

  FILTER if the chapter contains ANY of these issues:

  OCR and Scanning Artifacts:
  - Misread characters (e.g., "rn" read as "m", "cl" as "d", "1" as "l")
  - Random punctuation or symbols from scanning errors
  - Words split incorrectly across lines (e.g., "to-gether" or "some thing")
  - Missing spaces between words or extra spaces within words
  - Garbled text or nonsense character sequences

  Formatting Problems:
  - Headers or footers repeated within body text
  - Page numbers embedded in paragraphs
  - Running headers mixed into content
  - Inconsistent paragraph breaks or missing line breaks
  - Bullet points or list markers that didn't convert properly

  Text Quality Issues:
  - Repeated words or phrases (e.g., "the the", "and and")
  - Incomplete sentences that appear cut off
  - Obvious typos or misspellings that appear systematic
  - Filler text or placeholder content

  DO NOT FILTER if the chapter only contains:
  - Intentional stylistic choices by the author
  - Archaic or unusual spelling that is period-appropriate
  - Technical terms or proper nouns that may look unusual
  - Normal punctuation and formatting

  Analyze this chapter:

  {chapter_text}

  Respond with ONLY ONE WORD:
  - "FILTER" if it needs correction
  - "CLEAN" if it does not
# ============================================================================
# CLEANING PROMPT
# ---------------
# Purpose: Identify and rewrite only the paragraphs that need fixing
# Input: {numbered_chapter} - Chapter text with [PARAGRAPH N] markers
# Expected Output: Either "NONE" or a list of "PARAGRAPH N: [corrected text]" entries
#
# Tips for customization:
# - Explain the specific fixes you want applied
# - Provide before/after examples for common issues
# - Specify what to preserve (author's voice, style, etc.)
# - Define when to rewrite vs. when to remove a paragraph
#
# OUTPUT FORMAT:
# - If no changes needed: respond with "NONE"
# - To rewrite a paragraph: "PARAGRAPH N: [corrected text here]"
# - To remove a paragraph: "PARAGRAPH N: [PARAGRAPH REMOVED]"
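#
# WORKED EXAMPLE (illustrative only; the sample text and the exact layout of
# the [PARAGRAPH N] markers are assumptions, not taken from the tool's output):
#   Input substituted for {numbered_chapter}:
#     [PARAGRAPH 1] The moming sun rose over the the hills.
#     [PARAGRAPH 2] 42
#     [PARAGRAPH 3] She walked to the village.
#   Response in the format described above:
#     PARAGRAPH 1: The morning sun rose over the hills.
#     PARAGRAPH 2: [PARAGRAPH REMOVED]
#   Paragraph 3 is omitted from the response because it needs no changes.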
cleaning_prompt: |
  You are editing a digitized book to fix OCR errors, scanning artifacts, and formatting issues while preserving the author's original voice and intent.

  INSTRUCTIONS:

  1. FIX OCR ERRORS:
     - Correct misread characters: "rn" misread as "m", "cl" as "d", "0" as "O"
     - Fix split words: "to-gether" should be "together"
     - Correct missing/extra spaces: "some thing" to "something", "alot" to "a lot"
     - Remove garbled characters or symbols from scanning errors

  2. FIX FORMATTING ISSUES:
     - Remove embedded page numbers, headers, or footers from body text
     - Fix paragraph breaks that were incorrectly merged or split
     - Correct list formatting that didn't convert properly
     - Remove duplicate content from scanning overlap

  3. FIX TEXT QUALITY:
     - Remove repeated words: "the the" to "the"
     - Complete obviously truncated sentences when possible
     - Fix systematic typos while preserving intentional spelling

  WHAT TO PRESERVE:
  - The author's original voice, style, and word choices
  - Intentional formatting like emphasis or paragraph structure
  - Period-appropriate spelling or archaic language
  - Technical terms, proper nouns, and specialized vocabulary
  - Dialogue formatting and attribution

  WHAT NOT TO CHANGE:
  - Do not "improve" the writing or modernize language
  - Do not change sentences that are correct but unusual
  - Do not add content that wasn't there
  - Do not change the meaning or intent of any passage

  RESPONSE FORMAT:

  If NO paragraphs need changes, respond with:
  NONE

  If paragraphs need changes, respond with:
  PARAGRAPH N: [complete corrected text]

  For paragraphs that should be removed entirely (e.g., duplicate content, page headers):
  PARAGRAPH N: [PARAGRAPH REMOVED]

  Example response:
  PARAGRAPH 3: The corrected text for paragraph three goes here with all OCR errors fixed.
  PARAGRAPH 7: [PARAGRAPH REMOVED]
  PARAGRAPH 12: Another corrected paragraph with proper spacing and punctuation.

  IMPORTANT:
  - Only include paragraphs that actually need changes
  - Provide the complete corrected paragraph text, not just the corrections
  - Preserve the exact paragraph structure - do not merge or split paragraphs
  - When in doubt, preserve the original text

  Here is the chapter with numbered paragraphs:

  {numbered_chapter}

  Provide your response:
# ============================================================================
# ADDITIONAL EXAMPLE USE CASES
# ============================================================================
#
# Below are alternative prompt ideas for different cleaning scenarios.
# Uncomment and modify as needed, or use as inspiration for your own prompts.
#
# ---
# EXAMPLE: Removing Watermarks and Copyright Notices
# ---
# analysis_prompt: |
#   Analyze if this chapter contains watermarks, copyright notices, or
#   promotional content that should be removed...
#
# ---
# EXAMPLE: Standardizing Dialogue Formatting
# ---
# analysis_prompt: |
#   Check if this chapter has inconsistent dialogue formatting such as
#   mixed quote styles, missing attribution, or incorrect punctuation...
#
# ---
# EXAMPLE: Fixing Character Encoding Issues
# ---
# analysis_prompt: |
#   Scan for character encoding problems like wrong quotation marks,
#   missing accented characters, or garbled special characters...
#
# ---
# EXAMPLE: Removing Duplicate Content
# ---
# analysis_prompt: |
#   Check if paragraphs or sentences are repeated due to scanning
#   overlap or conversion errors...
#
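# ---
# EXAMPLE: A Matching Cleaning Prompt (Duplicate Content)
# ---
# A minimal sketch of a cleaning_prompt that could accompany a custom
# analysis_prompt. The wording is an assumption and should be adapted;
# only the {numbered_chapter} placeholder and the response format come
# from the documented conventions in this file.
#
# cleaning_prompt: |
#   Remove paragraphs that are repeated due to scanning overlap or
#   conversion errors. Keep the first occurrence of each repeated passage.
#
#   If NO paragraphs need changes, respond with:
#   NONE
#
#   Otherwise respond with one line per changed paragraph:
#   PARAGRAPH N: [complete corrected text]
#   PARAGRAPH N: [PARAGRAPH REMOVED]
#
#   Here is the chapter with numbered paragraphs:
#
#   {numbered_chapter}
#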
# ============================================================================