Commit 13cb226

Merge pull request #3 from alphagov/acw-29/markdown-cleanup-extraction
ACW-29 Cleaning up markdown outputs
2 parents 8d6bfc9 + 1f73633 commit 13cb226

1 file changed

Lines changed: 7 additions & 1 deletion

src/content_extractor/base.py

@@ -64,7 +64,13 @@ def __init__(self, config: BaseExtractorConfig):
 "2. the keyword that was matched as 'keyword_matched'\n\n"
 "If a keyword appears multiple times in different sentences, extract each unique sentence. "
 "If no matches are found, return an empty list of quotes.\n"
-"Also note source document is a markdown file; consider this when pre-cleaning."
+"""IMPORTANT: The source is a Markdown file. You MUST return 'content' that matches the RENDERED text, not the raw source. "
+Apply these cleaning rules to every extracted quote:
+- Remove Markdown link syntax: change [link text](url) to just link text.
+- Strip formatting: remove all **, *, __, _, and ` symbols.
+- Strip list markers: remove leading '* ', '- ', '+ ', or '1. ' types of bullet points.
+- Strip headers: remove leading '#' symbols."""
+
 )
 )
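
The cleaning rules added to the prompt above could also be applied deterministically, for example to post-process or sanity-check a returned quote. The following is a minimal sketch only: `clean_markdown_quote` and the regex names are hypothetical and not part of this commit, and the rules are implemented literally as written in the prompt, so every `_` is removed, including those inside identifiers such as `keyword_matched`.

```python
import re

# Hypothetical helper, not part of the commit: a literal reading of the
# four cleaning rules listed in the prompt text.
_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")           # [link text](url)
_HEADER = re.compile(r"^\s*#+\s*")                     # leading '#' symbols
_LIST_MARKER = re.compile(r"^\s*(?:[*+-]|\d+\.)\s+")   # '* ', '- ', '+ ', '1. '
_FORMATTING = re.compile(r"[*_`]")                     # **, *, __, _, `


def clean_markdown_quote(text: str) -> str:
    """Approximate the rendered text of a Markdown quote, line by line."""
    cleaned_lines = []
    for line in text.splitlines():
        line = _HEADER.sub("", line)
        # List markers go before emphasis stripping, otherwise the bullet's
        # '*' would be removed and leave a stray leading space.
        line = _LIST_MARKER.sub("", line)
        # Links go before emphasis stripping so underscores inside URLs
        # are discarded together with the URL.
        line = _LINK.sub(r"\1", line)
        line = _FORMATTING.sub("", line)
        cleaned_lines.append(line.strip())
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    print(clean_markdown_quote("* **Bold** item with [a link](https://example.com)"))
```

Ordering is the main design choice in this sketch: list markers are stripped before emphasis characters, and links are unwrapped before underscores are removed. Image syntax and nested lists are not covered, because the prompt does not mention them.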
