Commit d364150

Add fuzzy heading matching examples and documentation to README.md
1 parent 8e12b29 commit d364150

1 file changed

Lines changed: 45 additions & 0 deletions

README.md

@@ -13,6 +13,7 @@ A parser for extracting headings and hierarchical structure from Markdown files.

- Parse multiple heading formats (hash `#`, asterisk `**`, inline with colon, all-caps)
- Build hierarchical structure from headings
- **Fuzzy heading matching** to extract expected headings from improperly formatted documents
- Process single documents or batches from DataFrames
- Export results to DataFrame, JSON, or tree visualizations
- Configurable parsing rules and word limits
@@ -111,6 +112,29 @@ parsed_batch.to_tree("tree_outputs/")
df_parsed.to_csv("parsed_data.csv")
```

**Extract headings from improperly formatted documents:**

```python
import headhunter

# Document where headings are embedded inline or lack proper formatting
messy_doc = """
This document has ## Heading 1 embedded in text without line breaks.
Then we have **heading 2** in bold but inline.
**Inline Heading:** with content on the same line.
"""

# Specify expected headings to extract via fuzzy matching
parsed = headhunter.process_text(
    text=messy_doc,
    expected_headings=["Heading 1", "heading 2", "Inline Heading"],
    match_threshold=80  # 0-100, higher = stricter matching
)

# Match statistics are added to metadata
print(parsed.metadata)  # includes: matched_count, expected_count, match_percentage
```
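
The README names the three metadata fields but does not say how `match_percentage` is derived. One natural reading is matched headings divided by expected headings, scaled to 0-100. The sketch below illustrates that reading only; `match_stats` is a hypothetical helper, not part of headhunter, and the rounding is an assumption.

```python
def match_stats(expected_headings, matched_headings):
    """Hypothetical helper: summarise how many expected headings were found.

    Assumes match_percentage = 100 * matched_count / expected_count,
    which the README does not confirm.
    """
    expected_count = len(expected_headings)
    matched_count = len(matched_headings)
    # Guard against an empty expectation list to avoid dividing by zero
    percentage = 100.0 * matched_count / expected_count if expected_count else 0.0
    return {
        "matched_count": matched_count,
        "expected_count": expected_count,
        "match_percentage": round(percentage, 1),
    }


stats = match_stats(
    ["Heading 1", "heading 2", "Inline Heading"],
    ["Heading 1", "Inline Heading"],
)
print(stats)
```

A low percentage on a batch is a hint that `match_threshold` is too strict for the documents at hand, or that the expected headings do not occur in them.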
137+
114138
## How Hierarchy is Built
115139

116140
Headhunter recognizes different heading styles in Markdown and builds a hierarchical structure by assigning levels to each heading. The following rules govern this process:
@@ -172,3 +196,24 @@ When a heading ends with a colon (like `**Name:** Jane Doe`), it works different
### Mixed Heading Styles

Different heading styles can be mixed in the same document. When switching from one style to another, the new heading typically goes one level deeper than the previous one. However, the specific rules for each style (described above) still apply.

## Fuzzy Heading Matching

When documents have inconsistent formatting, such as headings embedded inline within text, missing markdown markers, or improper line breaks, headhunter can use fuzzy matching to extract expected headings.

**How it works:**

Provide a list of `expected_headings` to `process_text()` or `process_batch_df()`. The matcher will:

1. **Search**: Use fuzzy string matching ([RapidFuzz](https://github.com/maxbachmann/RapidFuzz)) to find heading text within content, even with spelling variations or case differences
2. **Extract**: Identify the best matching substring and detect surrounding markers (`#`, `**`, `*`, `:`)
3. **Split**: Break up content tokens at heading boundaries
4. **Rebuild**: Reconstruct the document hierarchy with extracted headings in their proper positions
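
The search, extract, and split steps above can be sketched roughly as follows. This is an illustration and not headhunter's implementation: it uses the standard library's `difflib.SequenceMatcher` in place of RapidFuzz so that it runs without extra dependencies, and `best_window`, `split_at_heading`, and `MARKERS` are hypothetical names. The rebuild step is left out.

```python
from difflib import SequenceMatcher

# Characters treated as heading markers or padding around a match (assumption)
MARKERS = "#*: "


def best_window(heading, text):
    """Search: slide a window the size of the heading over the text and
    return (score 0-100, start, end) for the most similar window."""
    needle, hay = heading.lower(), text.lower()
    width = len(needle)
    best = (0.0, 0, 0)
    for start in range(max(1, len(hay) - width + 1)):
        window = hay[start:start + width]
        score = 100.0 * SequenceMatcher(None, needle, window).ratio()
        if score > best[0]:
            best = (score, start, start + len(window))
    return best


def split_at_heading(heading, text, match_threshold=80):
    """Extract and split: return (before, heading_text, after), or None
    when no window reaches the threshold."""
    score, start, end = best_window(heading, text)
    if score < match_threshold:
        return None
    # Extract: widen the cut over any markers adjacent to the match
    left = start
    while left > 0 and text[left - 1] in MARKERS:
        left -= 1
    right = end
    while right < len(text) and text[right] in MARKERS:
        right += 1
    # Split: content before the heading, the heading itself, content after
    return text[:left].strip(), text[start:end], text[right:].strip()


text = "This document has ## Heading 1 embedded in text."
print(split_at_heading("heading 1", text))
# ('This document has', 'Heading 1', 'embedded in text.')
```

The fixed-width window keeps the sketch short; a scorer designed for substring search, such as RapidFuzz's partial matching, handles headings whose length differs from the expected text more gracefully.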

**Parameters:**

- `expected_headings`: List of heading strings to find (case-insensitive)
- `match_threshold`: Minimum similarity score 0-100 (default: 80)
  - 80-100: Strict matching, reduces false positives
  - 60-79: Moderate matching, allows more variation
  - Below 60: Lenient matching, may produce unexpected matches
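
To get a feel for how the threshold bands behave, the sketch below scores a few candidate strings against an expected heading and checks them against different thresholds. It again uses `difflib.SequenceMatcher` as a stand-in, so the exact scores may differ from what RapidFuzz reports inside headhunter; `similarity` and `accepted` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def similarity(expected, candidate):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, expected.lower(), candidate.lower()).ratio()


def accepted(expected, candidate, match_threshold=80):
    """True when the candidate scores at or above the threshold."""
    return similarity(expected, candidate) >= match_threshold


# A small typo still clears the strict default of 80
print(accepted("Introduction", "Introducton"))

# A looser variant only passes once the threshold drops into the moderate band
print(accepted("Results", "Result summary", match_threshold=80))
print(accepted("Results", "Result summary", match_threshold=60))
```

Starting at the default and lowering the threshold only for documents that miss known headings keeps false positives easier to spot.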
