Headhunter recognizes different heading styles in Markdown and builds a hierarchical structure by assigning levels to each heading. The following rules govern this process:
@@ -172,3 +196,24 @@ When a heading ends with a colon (like `**Name:** Jane Doe`), it works different
### Mixed Heading Styles
Different heading styles can be mixed in the same document. When switching from one style to another, the new heading typically goes one level deeper than the previous one. However, the specific rules for each style (described above) still apply.
## Fuzzy Heading Matching
When documents have inconsistent formatting, such as headings embedded inline within text, missing Markdown markers, or improper line breaks, Headhunter can use fuzzy matching to extract expected headings.

**How it works:**

Provide a list of `expected_headings` to `process_text()` or `process_batch_df()`. The matcher will:
1. **Search**: Use fuzzy string matching ([RapidFuzz](https://github.com/maxbachmann/RapidFuzz)) to find heading text within content, even with spelling variations or case differences
2. **Extract**: Identify the best matching substring and detect surrounding markers (`#`, `**`, `*`, `:`)
3. **Split**: Break up content tokens at heading boundaries
4. **Rebuild**: Reconstruct the document hierarchy with extracted headings in their proper positions

**Parameters:**

- `expected_headings`: List of heading strings to find (case-insensitive)
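The search, extract, and split steps can be illustrated with a minimal sketch. This is not Headhunter's implementation: it swaps in the standard library's `difflib.SequenceMatcher` for RapidFuzz so that it runs without extra dependencies, and the helper names `find_heading` and `split_at_heading`, the word-window search, and the `0.8` threshold are all assumptions made for illustration. The rebuild step is left out.

```python
import re
from difflib import SequenceMatcher

LEFT_MARKERS = "#* "   # markers (and spaces) that may precede an inline heading
RIGHT_MARKERS = "*:"   # markers that may follow it


def find_heading(text, heading, threshold=0.8):
    """Return (start, end, score) for the best fuzzy match, or None.

    Hypothetical helper: slides a window with the same word count as
    `heading` across `text` and scores each window case-insensitively.
    """
    words = list(re.finditer(r"\S+", text))
    n = len(heading.split())
    best = None
    for i in range(len(words) - n + 1):
        start, end = words[i].start(), words[i + n - 1].end()
        # Ignore markers glued to the window before comparing.
        candidate = text[start:end].strip("#*:").lower()
        score = SequenceMatcher(None, candidate, heading.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (start, end, score)
    if best is None:
        return None
    start, end, score = best
    # Widen the span so that adjacent markers belong to the heading.
    while start > 0 and text[start - 1] in LEFT_MARKERS:
        start -= 1
    while end < len(text) and text[end] in RIGHT_MARKERS:
        end += 1
    return start, end, score


def split_at_heading(text, heading, threshold=0.8):
    """Split `text` into (before, matched heading text, after), or None."""
    match = find_heading(text, heading, threshold)
    if match is None:
        return None
    start, end, _ = match
    found = text[start:end].strip("#*: ")
    return text[:start].strip(), found, text[end:].strip()


# A misspelled, inline heading is still located and split out.
sample = "Summary of the role. **Work Experiance:** Acme Corp, 2019 to 2023"
print(split_at_heading(sample, "Work Experience"))
```

The fixed word-count window keeps the sketch short, but it will miss headings whose word count differs from the expected one; a substring-oriented scorer would be the more robust choice.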