Commit b4dee39
committed
Adds fuzzy heading matching and extraction capabilities.
Introduces fuzzy matching and extraction functionality for headings, enabling validation and recovery of headings even when they are embedded within content or use non-standard markdown syntax.
Extends the API and data models to support expected heading lists, configurable match thresholds, and hierarchical rebuilding after extraction. Batch processing now aggregates match statistics, computes average match rates, and tracks missing/matched headings across documents.
Improves heading metadata, restructures tokenization for consistency, and adds utility functions and tests for new functionality.1 parent 7dffdd2 commit b4dee39
18 files changed
Lines changed: 1910 additions & 259 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
3 | 6 | | |
4 | 7 | | |
5 | 8 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
13 | | - | |
| 13 | + | |
| 14 | + | |
14 | 15 | | |
15 | 16 | | |
16 | 17 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
| 17 | + | |
| 18 | + | |
17 | 19 | | |
18 | 20 | | |
19 | 21 | | |
| |||
22 | 24 | | |
23 | 25 | | |
24 | 26 | | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
25 | 31 | | |
26 | 32 | | |
27 | 33 | | |
| |||
46 | 52 | | |
47 | 53 | | |
48 | 54 | | |
49 | | - | |
| 55 | + | |
50 | 56 | | |
51 | 57 | | |
52 | 58 | | |
| |||
55 | 61 | | |
56 | 62 | | |
57 | 63 | | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
58 | 69 | | |
59 | 70 | | |
60 | 71 | | |
| |||
71 | 82 | | |
72 | 83 | | |
73 | 84 | | |
| 85 | + | |
| 86 | + | |
74 | 87 | | |
75 | 88 | | |
76 | 89 | | |
| |||
84 | 97 | | |
85 | 98 | | |
86 | 99 | | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
87 | 104 | | |
88 | 105 | | |
89 | 106 | | |
| |||
142 | 159 | | |
143 | 160 | | |
144 | 161 | | |
145 | | - | |
| 162 | + | |
146 | 163 | | |
147 | 164 | | |
148 | 165 | | |
| |||
157 | 174 | | |
158 | 175 | | |
159 | 176 | | |
160 | | - | |
| 177 | + | |
161 | 178 | | |
162 | 179 | | |
163 | 180 | | |
| |||
170 | 187 | | |
171 | 188 | | |
172 | 189 | | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
173 | 196 | | |
174 | 197 | | |
175 | 198 | | |
176 | 199 | | |
| 200 | + | |
177 | 201 | | |
178 | 202 | | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
179 | 207 | | |
180 | 208 | | |
181 | 209 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
| 6 | + | |
6 | 7 | | |
7 | 8 | | |
8 | 9 | | |
| |||
33 | 34 | | |
34 | 35 | | |
35 | 36 | | |
36 | | - | |
37 | | - | |
38 | | - | |
39 | | - | |
40 | | - | |
41 | | - | |
42 | 37 | | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
43 | 49 | | |
44 | 50 | | |
| 51 | + | |
| 52 | + | |
45 | 53 | | |
46 | 54 | | |
47 | 55 | | |
48 | 56 | | |
49 | 57 | | |
50 | | - | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
51 | 64 | | |
52 | 65 | | |
53 | 66 | | |
| |||
57 | 70 | | |
58 | 71 | | |
59 | 72 | | |
60 | | - | |
61 | | - | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
62 | 78 | | |
63 | 79 | | |
64 | 80 | | |
65 | 81 | | |
66 | 82 | | |
67 | | - | |
68 | | - | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
69 | 94 | | |
70 | 95 | | |
71 | 96 | | |
72 | 97 | | |
73 | 98 | | |
74 | 99 | | |
75 | 100 | | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
76 | 104 | | |
77 | 105 | | |
78 | 106 | | |
79 | 107 | | |
80 | | - | |
81 | | - | |
82 | | - | |
83 | | - | |
84 | | - | |
85 | | - | |
86 | | - | |
87 | | - | |
| 108 | + | |
| 109 | + | |
0 commit comments