Commit c65fbee
refactor: decompose monolithic modules + sliding-window crawler (#2)
* refactor: decompose monolithic modules + sliding-window crawler
- Decompose content-extractors.ts (1281 → ~400 lines) into 7 focused
sub-modules: metadata, readability, selector, json-ld, next-data,
text-density, and RSC extractors. Barrel re-exports preserve all
existing import paths.
- Decompose http-fetch.ts (1074 → ~700 lines) by extracting WP REST API
logic into wp-rest-api.ts and Next.js data route logic into
next-data-route.ts.
- Replace crawler batch-wait (Promise.all) with sliding-window
concurrency (Promise.race + Map) for better throughput with
non-uniform fetch times.
- Promote RequestContext interface to shared fetch/types.ts.
- Eliminate duplicate htmlToText by importing shared utility.
- Extract attachRawHtml helper to reduce repeated keepRawHtml patterns.
All 683 tests pass unchanged — no test modifications needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
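The sliding-window change described above (Promise.race + Map in place of a Promise.all batch wait) can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `runWithSlidingWindow`, its parameters, and the index-keyed Map are hypothetical names chosen for the example. The idea is that a new fetch starts as soon as any in-flight one settles, so one slow fetch no longer holds up an entire batch.

```typescript
// Hypothetical helper (not from the repository): runs `worker` over `items`
// while keeping at most `limit` tasks in flight at any time.
async function runWithSlidingWindow<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }

  const results: R[] = new Array<R>(items.length);
  // In-flight tasks keyed by item index. Each promise resolves to its own
  // key so the winner of Promise.race can be removed from the Map.
  const inFlight = new Map<number, Promise<number>>();

  for (let i = 0; i < items.length; i++) {
    const task = worker(items[i] as T).then((value) => {
      results[i] = value; // store by index so output order matches input order
      return i;
    });
    inFlight.set(i, task);

    // Window is full: wait for any one task to settle, then free its slot.
    if (inFlight.size >= limit) {
      const finished = await Promise.race(Array.from(inFlight.values()));
      inFlight.delete(finished);
    }
  }

  // Drain whatever is still running once every item has been started.
  await Promise.all(Array.from(inFlight.values()));
  return results;
}
```

In this sketch a rejected worker rejects the whole run; a crawler would more likely catch errors inside `worker` and return a per-URL failure result instead.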
* fix: sanitize HTML in JSON-LD articleBody extraction
JSON-LD articleBody/text fields can contain raw HTML with script tags,
event handlers, and dangerous URI schemes. Apply sanitizeHtml to strip
dangerous elements and htmlToText to produce clean textContent, matching
the pattern used by next-data-extractor.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
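The sanitize-then-convert pattern in this fix might look something like the sketch below. It makes several assumptions: the two transforms are passed in as parameters rather than imported, because the real signatures of `sanitizeHtml` and `htmlToText` are not shown in this commit; `extractArticleBody`, `ArticleBody`, and the `content` field are hypothetical names; and running `htmlToText` on the already sanitized HTML is a guess at the ordering.

```typescript
// Assumed shape of both helpers: HTML string in, string out.
type HtmlTransform = (html: string) => string;

// Hypothetical result type; only `textContent` is named in the commit message.
interface ArticleBody {
  content: string; // sanitized HTML
  textContent: string; // plain text derived from the sanitized HTML
}

// Hypothetical extractor: reads articleBody (falling back to text) from one
// JSON-LD object and never returns the raw, untrusted string.
function extractArticleBody(
  jsonLd: Record<string, unknown>,
  sanitizeHtml: HtmlTransform,
  htmlToText: HtmlTransform,
): ArticleBody | null {
  const raw = jsonLd["articleBody"] ?? jsonLd["text"];
  if (typeof raw !== "string" || raw.trim() === "") {
    return null; // nothing usable in this JSON-LD node
  }
  const content = sanitizeHtml(raw);
  return { content, textContent: htmlToText(content) };
}
```

Taking the transforms as arguments keeps the sketch independent of any particular sanitizer; the real extractor presumably imports the shared utilities directly.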
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 09eb3dd commit c65fbee
13 files changed
Lines changed: 1504 additions & 1365 deletions