Open
Description
Summary
Adopt streaming HTML parsing to reduce allocations, and sniff the content type early so that binary or oversized content is skipped unless configured otherwise.
Motivation
- Lower memory usage during large crawls
- Skip non-HTML payloads by default
Scope
- `internal/parse`:
  - Streaming parse (`net/html` and/or `goquery` on a `Reader`)
  - Extract absolute links (respect `base` tags)
  - Sniff Content-Type + size guardrails
- Config flag to allow binary downloads
Acceptance Criteria
- Heap profile shows fewer allocations vs baseline
- Tests cover: base href, meta refresh, unusual encodings
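As a rough illustration of what the base href and meta refresh cases could exercise, the sketch below resolves links with `net/url` and pulls the target out of a meta refresh `content` value. `ResolveLink` and `RefreshTarget` are hypothetical helper names, not existing functions in `internal/parse`, and the refresh parsing handles only the common `N; url=...` shape.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ResolveLink resolves href against the document base. baseHref is the value
// of a <base href> tag (may be empty or relative) and is itself resolved
// against the page URL first.
func ResolveLink(pageURL, baseHref, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	if baseHref != "" {
		b, err := url.Parse(strings.TrimSpace(baseHref))
		if err != nil {
			return "", err
		}
		base = base.ResolveReference(b)
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// RefreshTarget extracts the URL from a meta refresh content value such as
// "5; url=/next". ok is false when the value carries no URL.
func RefreshTarget(content string) (target string, ok bool) {
	_, rest, found := strings.Cut(content, ";")
	if !found {
		return "", false
	}
	rest = strings.TrimSpace(rest)
	if len(rest) < 4 || !strings.EqualFold(rest[:4], "url=") {
		return "", false
	}
	// Strip optional surrounding quotes around the target.
	target = strings.Trim(strings.TrimSpace(rest[4:]), `"'`)
	return target, target != ""
}

func main() {
	abs, err := ResolveLink("https://example.com/a/b.html", "/docs/", "guide.html")
	fmt.Println(abs, err)

	if t, ok := RefreshTarget("0; URL='/next'"); ok {
		next, err := ResolveLink("https://example.com/a/b.html", "", t)
		fmt.Println(next, err)
	}
}
```

Fixture pages for these cases would then only need to assert on the resolved strings, independent of which parser produced the raw attribute values.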
Tasks
- Implement streamed extraction
- Add content-type guards
- Unit tests with fixture pages