fix: dedup layer for scrape_batch + field filter (197 tests, 96% pass)#139

Closed
SuarezPM wants to merge 5 commits into brightdata:main from SuarezPM:feat/context-forge-nexus-integration

@SuarezPM SuarezPM commented May 23, 2026

Summary

Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.

Problem

When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens.

Solution

scrape_batch new params:

  • deduplicate (default: true) — removes duplicate content blocks via SHA-256 fingerprinting
  • include_metrics (default: false) — opt-in for {results: [...], metrics: {...}} response
  • fields — filter response to specific top-level fields
  • format — markdown (default) or raw
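The deduplicate option can be illustrated with a minimal sketch. This is not the code from context_cache.js: the names `SeenBlocks` and `dedupBatch` are hypothetical, and it assumes a "content block" is a run of text separated by blank lines, hashed in full, with the first occurrence kept.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical cache of block fingerprints seen so far in one batch.
class SeenBlocks {
  constructor() {
    this.seen = new Set();
  }

  // Returns true the first time a block is seen, false for repeats.
  admit(block) {
    const key = createHash('sha256').update(block, 'utf8').digest('hex');
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }

  // Resets the cache, e.g. between batches in a long-running process.
  clear() {
    this.seen.clear();
  }
}

// Assumption: blocks are separated by one or more blank lines.
function dedupBatch(pages) {
  const cache = new SeenBlocks();
  return pages.map(({ url, content }) => ({
    url,
    content: content
      .split(/\n{2,}/)
      .map((block) => block.trim())
      .filter((block) => block !== '' && cache.admit(block))
      .join('\n\n'),
  }));
}
```

With two pages that share `NAV` and `FOOT` blocks, the first page keeps all three blocks and the second keeps only its own body.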

search_engine_batch:

  • fields — filter result.organic array to requested keys
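A rough sketch of the field filtering described above. The function name `filterFields` appears in this PR, but the signature here is an assumption, `filterOrganic` is a hypothetical wrapper, and treating a missing field list as "return unchanged" is also an assumption.

```javascript
// Keeps only the requested top-level keys of one item.
function filterFields(item, fields) {
  // Null, undefined, and non-object items yield an empty object.
  if (item === null || typeof item !== 'object') return {};
  // Assumption: no field list means the item is returned unchanged.
  if (!Array.isArray(fields) || fields.length === 0) return item;
  const out = {};
  for (const key of fields) {
    if (Object.prototype.hasOwnProperty.call(item, key)) out[key] = item[key];
  }
  return out;
}

// Hypothetical wrapper: applies the filter to each entry of result.organic.
function filterOrganic(result, fields) {
  const organic = Array.isArray(result?.organic) ? result.organic : [];
  return { ...result, organic: organic.map((entry) => filterFields(entry, fields)) };
}
```

For example, `filterOrganic(result, ['title', 'link'])` would drop snippets and other keys from every organic entry while leaving the rest of the result untouched.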

Hash Algorithm

| Content length | Hash computation |
| --- | --- |
| ≤ 2048 chars | Full content SHA-256 |
| > 2048 chars | sha256(prefix[2048] + middle[256] + suffix[256]) |

This captures shared headers and footers while still distinguishing pages that have the same structure but different body content.
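The sampling scheme in the table could look roughly like the sketch below. It is an illustration only: `fingerprint` is a hypothetical name, and the placement of the "middle" window (centred on the midpoint of the content) is an assumption, since the PR does not say where the middle sample starts.

```javascript
const { createHash } = require('node:crypto');

const FULL_HASH_LIMIT = 2048; // content at or below this length is hashed in full
const PREFIX_LEN = 2048;
const MIDDLE_LEN = 256;
const SUFFIX_LEN = 256;

// Returns a hex SHA-256 fingerprint for one content string.
function fingerprint(content) {
  if (content.length <= FULL_HASH_LIMIT) {
    return createHash('sha256').update(content, 'utf8').digest('hex');
  }
  // Assumption: the middle sample is a window centred on the midpoint.
  const midStart = Math.floor((content.length - MIDDLE_LEN) / 2);
  const sample =
    content.slice(0, PREFIX_LEN) +
    content.slice(midStart, midStart + MIDDLE_LEN) +
    content.slice(-SUFFIX_LEN);
  return createHash('sha256').update(sample, 'utf8').digest('hex');
}
```

In this sketch, two long pages that differ only outside the three sampled windows would produce the same fingerprint, and for content only slightly longer than 2048 characters the centred middle window falls inside the prefix.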

Test Suite

| File | Tests | Coverage |
| --- | --- | --- |
| test_context_cache.js | 9 | Core dedup logic, hash correctness |
| test_dedup_edge_cases.js | 8 | Edge cases: empty, boundary, null handling |
| test_real_websites.js | 10 | Real HTTP: lablab.ai, brightdata.com, wikipedia.org |
| test_filter_fields.js | 20 | Field filtering edge cases |
| test_150_comprehensive.js | 150 | 96% pass rate on 35+ real websites |
| TOTAL | 197 | |

Real API Verification

Sites tested (35+): github, stackoverflow, wikipedia, medium, reddit,
hackernews, npmjs, pypi, crates.io, docs.python.org, mozilla.org,
rust-lang.org, golang.org, python.org, youtube, vimeo, amazon,
arstechnica, phoronix, stackexchange, wikimedia, x, mastodon,
twitch, ebay, openai, httpbin.org, and more

Results: 144/150 passed (96%)
Failures: site blocking (anti-bot), not code bugs

Backward Compatibility

Default behavior (no params, or include_metrics: false) returns a flat array, so there are no breaking changes for existing consumers.
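The two response shapes can be sketched as follows. Both helpers are hypothetical, and the metric key used in the usage example is a placeholder, since the PR does not list the fields inside `metrics`.

```javascript
// Hypothetical response builder: flat array by default, wrapper on opt-in.
function buildResponse(results, { include_metrics = false, metrics = {} } = {}) {
  if (!include_metrics) return results;
  return { results, metrics };
}

// Hypothetical consumer helper that accepts either shape.
function unwrapResults(response) {
  return Array.isArray(response) ? response : response.results;
}

// Usage: existing callers keep receiving an array.
const flat = buildResponse([{ url: 'a' }]);
// Opt-in callers receive { results, metrics }; the metric key is a placeholder.
const wrapped = buildResponse([{ url: 'a' }], {
  include_metrics: true,
  metrics: { blocks_removed: 2 },
});
```

A consumer that may talk to either mode can call `unwrapResults` and treat the result as an array in both cases.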

Prior Art

Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).

…arch_engine_batch

- Add context_cache.js: SHA-256 prefix+length fingerprint for batch dedup
- Fix prefix collision bug: now includes content.length in hash
- Fix search_engine_batch field filtering for Google organic results
- scrape_batch: deduplicate param (default true), format param, fields param
- search_engine_batch: fields param for token-efficient responses
- All 6 unit tests pass

Invariant INV-CF-1: no content block appears twice in batch output
DOI: https://doi.org/10.5281/zenodo.20277875
@SuarezPM SuarezPM force-pushed the feat/context-forge-nexus-integration branch from 5d0e584 to 6a89b6d May 23, 2026 20:44

@SuarezPM SuarezPM left a comment

feat: add dedup layer (INV-CF-1) to scrape_batch + field filter to search_engine_batch

  • Add context_cache.js: SHA-256 prefix+length fingerprint for batch dedup
  • Fix prefix collision bug: now includes content.length in hash
  • Fix search_engine_batch field filtering for Google organic results
  • scrape_batch: deduplicate param (default true), format param, fields param
  • search_engine_batch: fields param for token-efficient responses
  • All 6 unit tests pass

- Add backward-compatible response: flat array by default, opt-in metrics wrapper
- Fix hash collision: now uses prefix(2048) + suffix(64) to avoid false duplicates
- Add 9 unit tests (was 6): edge cases for collision, length, boundary conditions
- Add 8 disaster recovery tests: empty/null/long content, filterFields null handling
- Rename forge_metrics to metrics, remove ContextForge branding from tool description

Fixes:
- False positive dedup when pages share header but differ in body
- Breaking API change in response structure (now backward compatible)
- Missing edge case coverage

Tests: 17 passed, 0 failed
Real API verified: 66.7% dedup ratio on example.com
@SuarezPM SuarezPM changed the title feat: ContextForge dedup layer for scrape_batch + field filter for search_engine_batch fix: ContextForge dedup layer for scrape_batch (backward compat) + field filter May 23, 2026
SuarezPM added 2 commits May 23, 2026 18:05
- buildForgeMetrics → buildBatchMetrics
- Remove forge_version, invariant, doi fields (ContextForge-specific)
- Remove ContextForge from deduplication description
- Update tests to match
Hash collision fix:
- Use prefix(2048) + middle(256) + suffix(256) for content > 2048 chars
- Short content uses full hash
- Add clear() method for long-running processes

New test files:
- test_real_websites.js: 10 tests with real HTTP calls (lablab.ai, brightdata.com, wikipedia.org)
- test_filter_fields.js: 20 edge case tests for filterFields

Fixed behavior:
- filterFields now returns {} for null/undefined/non-object items

Total tests: 47 passing
- test_context_cache.js: 9
- test_dedup_edge_cases.js: 8
- test_real_websites.js: 10
- test_filter_fields.js: 20
@SuarezPM SuarezPM changed the title fix: ContextForge dedup layer for scrape_batch (backward compat) + field filter fix: dedup layer for scrape_batch + field filter (47 tests, real API verified) May 23, 2026
@SuarezPM SuarezPM changed the title fix: dedup layer for scrape_batch + field filter (47 tests, real API verified) fix: dedup layer for scrape_batch + field filter (47 tests) May 23, 2026
150 tests covering:
- 30 single-page load tests
- 40 dedup correctness tests
- 20 hash consistency tests
- 20 cross-domain isolation tests
- 20 edge cases (unicode, boundaries, special chars)
- 20 error handling tests

Sites tested: github, stackoverflow, wikipedia, medium, reddit,
hackernews, npmjs, pypi, crates.io, docs.python.org, mozilla.org,
rust-lang.org, golang.org, python.org, youtube, vimeo, amazon,
arstechnica, phoronix, stackexchange, wikimedia, twitter, mastodon,
twitch, ebay, openai, httpbin.org, and more

Results: 144/150 passed (96%)
Failures due to site blocking (anti-bot) not code bugs.

Also fixes JSON response handling in scrape() for httpbin endpoints.
@SuarezPM SuarezPM changed the title fix: dedup layer for scrape_batch + field filter (47 tests) fix: dedup layer for scrape_batch + field filter (197 tests, 96% pass) May 23, 2026
@SuarezPM SuarezPM closed this May 24, 2026
@SuarezPM SuarezPM deleted the feat/context-forge-nexus-integration branch May 24, 2026 20:21