fix: dedup layer for scrape_batch + field filter (197 tests, 96% pass) by SuarezPM · Pull Request #139 · brightdata/brightdata-mcp

SuarezPM · 2026-05-23T20:08:26Z

Summary

Adds a deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.

Problem

When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens.

Solution

scrape_batch new params:

deduplicate (default: true) — removes duplicate content blocks via SHA-256 fingerprinting
include_metrics (default: false) — opt-in for {results: [...], metrics: {...}} response
fields — filter response to specific top-level fields
format — markdown (default) or raw
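As a rough illustration of what deduplicate does across a batch, here is a minimal sketch. It is not the code in context_cache.js: the class name BlockDeduper, the splitBlocks helper, and the blank-line block delimiter are all assumptions made for this example, and it hashes each block in full rather than using the sampled fingerprint described under Hash Algorithm. Only the SHA-256 fingerprinting idea and the clear() method come from this PR.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: the PR does not show how content is split into
// blocks, so blank-line-separated chunks are assumed here.
function splitBlocks(content) {
    return content
        .split(/\n\s*\n/)
        .map((block) => block.trim())
        .filter((block) => block.length > 0);
}

// Hypothetical class: remembers the SHA-256 of every block seen so far
// in a batch and drops any block whose hash was already recorded.
class BlockDeduper {
    constructor() {
        this.seen = new Set();
    }

    dedupe(content) {
        const kept = [];
        for (const block of splitBlocks(content)) {
            const key = crypto
                .createHash('sha256')
                .update(block, 'utf8')
                .digest('hex');
            if (this.seen.has(key)) continue; // already returned earlier
            this.seen.add(key);
            kept.push(block);
        }
        return kept.join('\n\n');
    }

    // Resets state between batches in long-running processes.
    clear() {
        this.seen.clear();
    }
}

const deduper = new BlockDeduper();
const pages = [
    'Site Nav\n\nArticle one body\n\nFooter links',
    'Site Nav\n\nArticle two body\n\nFooter links',
];
const deduped = pages.map((page) => deduper.dedupe(page));
console.log(deduped);
// deduped[0] is unchanged; deduped[1] === 'Article two body'
```

The first page keeps its nav and footer, and the second page returns only its unique body, which is where the token savings come from.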

search_engine_batch:

fields — filter result.organic array to requested keys
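A sketch of how the fields filter might behave on organic results. The function name filterFields and the rule that null/undefined/non-object items yield {} come from the commit notes in this PR. The (item, fields) signature, the pass-through behavior when no fields are requested, and the sample keys (title, link, description, rank) are assumptions for illustration.

```javascript
// Assumed signature: filterFields(item, fields).
// Keeps only the requested top-level keys of a result object.
function filterFields(item, fields) {
    // Null, undefined, and non-object items yield an empty object.
    if (item === null || typeof item !== 'object') return {};
    // Assumption: with no field list, the item is returned unchanged.
    if (!Array.isArray(fields) || fields.length === 0) return item;

    const out = {};
    for (const key of fields) {
        if (Object.prototype.hasOwnProperty.call(item, key)) {
            out[key] = item[key];
        }
    }
    return out;
}

// Illustrative data only; real organic results may use other keys.
const organic = [
    {title: 'A', link: 'https://example.com/a', description: 'long text', rank: 1},
    null,
];
const slim = organic.map((result) => filterFields(result, ['title', 'link']));
console.log(slim);
// [ { title: 'A', link: 'https://example.com/a' }, {} ]
```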

Hash Algorithm

Content length	Hash computation
≤ 2048 chars	Full content SHA-256
> 2048 chars	sha256(prefix[2048] + middle[256] + suffix[256])

Sampling the prefix and suffix captures shared headers and footers, while the middle sample helps distinguish pages that have the same structure but different body content.
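The table can be sketched as a single function. This is an approximation, not the code in context_cache.js: the name fingerprint, the constants, and in particular the placement of the middle window (256 chars centered on the midpoint) are assumptions, since the PR does not say where the middle sample is taken.

```javascript
const crypto = require('node:crypto');

const PREFIX_LEN = 2048; // threshold and prefix size from the table
const SAMPLE_LEN = 256;  // middle and suffix sample size from the table

// Hypothetical name; returns a hex SHA-256 fingerprint of the content.
function fingerprint(content) {
    let material = content; // short content is hashed in full
    if (content.length > PREFIX_LEN) {
        // Assumption: the middle window is centered on the midpoint.
        const midStart = Math.floor(content.length / 2) - SAMPLE_LEN / 2;
        material =
            content.slice(0, PREFIX_LEN) +
            content.slice(midStart, midStart + SAMPLE_LEN) +
            content.slice(-SAMPLE_LEN);
    }
    return crypto.createHash('sha256').update(material, 'utf8').digest('hex');
}

// Two pages that share a long header but differ in the body.
const header = 'H'.repeat(3000);
const pageA = header + 'A'.repeat(1000);
const pageB = header + 'B'.repeat(1000);

console.log(fingerprint(pageA) === fingerprint(pageB)); // false
console.log(fingerprint(pageA) === fingerprint(header + 'A'.repeat(1000))); // true
```

A prefix-only hash would treat pageA and pageB as duplicates because their first 2048 characters match. The suffix sample separates them here. The trade-off of any sampled hash is that two long pages differing only outside the sampled windows would still collide.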

Test Suite

File	Tests	Coverage
test_context_cache.js	9	Core dedup logic, hash correctness
test_dedup_edge_cases.js	8	Edge cases: empty, boundary, null handling
test_real_websites.js	10	Real HTTP: lablab.ai, brightdata.com, wikipedia.org
test_filter_fields.js	20	Field filtering edge cases
test_150_comprehensive.js	150	96% pass rate on 35+ real websites
TOTAL	197

Real API Verification

Sites tested (35+): github, stackoverflow, wikipedia, medium, reddit,
hackernews, npmjs, pypi, crates.io, docs.python.org, mozilla.org,
rust-lang.org, golang.org, python.org, youtube, vimeo, amazon,
arstechnica, phoronix, stackexchange, wikimedia, x, mastodon,
twitch, ebay, openai, httpbin.org, and more

Results: 144/150 passed (96%)
Failures: site blocking (anti-bot), not code bugs

Backward Compatibility

The default behavior (no params, or include_metrics: false) returns a flat array, so there are no breaking changes for existing consumers.
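The two response shapes can be sketched as follows. The helper name buildResponse and the metric keys are hypothetical and only illustrate the flat-array default versus the opt-in {results, metrics} wrapper; the actual metrics fields are not listed in this PR.

```javascript
// Hypothetical helper: picks the response shape from include_metrics.
function buildResponse(results, {include_metrics = false, metrics = {}} = {}) {
    // Default: the flat array that existing consumers already expect.
    if (!include_metrics) return results;
    // Opt-in: wrap the same array together with batch metrics.
    return {results, metrics};
}

const results = [{url: 'https://example.com', content: 'body'}];

const flat = buildResponse(results);
console.log(Array.isArray(flat)); // true

// Metric keys below are illustrative, not the PR's actual field names.
const wrapped = buildResponse(results, {
    include_metrics: true,
    metrics: {total_blocks: 3, duplicates_removed: 2},
});
console.log(Object.keys(wrapped)); // [ 'results', 'metrics' ]
```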

Prior Art

Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).

SuarezPM

feat: add dedup layer (INV-CF-1) to scrape_batch + field filter to search_engine_batch

Add context_cache.js: SHA-256 prefix+length fingerprint for batch dedup
Fix prefix collision bug: now includes content.length in hash
Fix search_engine_batch field filtering for Google organic results
scrape_batch: deduplicate param (default true), format param, fields param
search_engine_batch: fields param for token-efficient responses
All 6 unit tests pass

Invariant INV-CF-1: no content block appears twice in batch output
DOI: https://doi.org/10.5281/zenodo.20277875

- Add backward-compatible response: flat array by default, opt-in metrics wrapper
- Fix hash collision: now uses prefix(2048) + suffix(64) to avoid false duplicates
- Add 9 unit tests (was 6): edge cases for collision, length, boundary conditions
- Add 8 disaster recovery tests: empty/null/long content, filterFields null handling
- Rename forge_metrics to metrics, remove ContextForge branding from tool description

Fixes:
- False positive dedup when pages share header but differ in body
- Breaking API change in response structure (now backward compatible)
- Missing edge case coverage

Tests: 17 passed, 0 failed
Real API verified: 66.7% dedup ratio on example.com

- buildForgeMetrics → buildBatchMetrics
- Remove forge_version, invariant, doi fields (ContextForge-specific)
- Remove ContextForge from deduplication description
- Update tests to match

Hash collision fix:
- Use prefix(2048) + middle(256) + suffix(256) for content > 2048 chars
- Short content uses full hash
- Add clear() method for long-running processes

New test files:
- test_real_websites.js: 10 tests with real HTTP calls (lablab.ai, brightdata.com, wikipedia.org)
- test_filter_fields.js: 20 edge case tests for filterFields

Fixed behavior:
- filterFields now returns {} for null/undefined/non-object items

Total tests: 47 passing
- test_context_cache.js: 9
- test_dedup_edge_cases.js: 8
- test_real_websites.js: 10
- test_filter_fields.js: 20

150 tests covering:
- 30 single-page load tests
- 40 dedup correctness tests
- 20 hash consistency tests
- 20 cross-domain isolation tests
- 20 edge cases (unicode, boundaries, special chars)
- 20 error handling tests

Sites tested: github, stackoverflow, wikipedia, medium, reddit, hackernews, npmjs, pypi, crates.io, docs.python.org, mozilla.org, rust-lang.org, golang.org, python.org, youtube, vimeo, amazon, arstechnica, phoronix, stackexchange, wikimedia, twitter, mastodon, twitch, ebay, openai, httpbin.org, and more

Results: 144/150 passed (96%)
Failures due to site blocking (anti-bot), not code bugs.

Also fixes JSON response handling in scrape() for httpbin endpoints.

SuarezPM force-pushed the feat/context-forge-nexus-integration branch from 5d0e584 to 6a89b6d May 23, 2026 20:44

SuarezPM changed the title ~~feat: ContextForge dedup layer for scrape_batch + field filter for search_engine_batch~~ fix: ContextForge dedup layer for scrape_batch (backward compat) + field filter May 23, 2026

SuarezPM added 2 commits May 23, 2026 18:05

refactor: remove ContextForge branding, rename to generic batch metrics

1bc5923

SuarezPM changed the title ~~fix: ContextForge dedup layer for scrape_batch (backward compat) + field filter~~ fix: dedup layer for scrape_batch + field filter (47 tests, real API verified) May 23, 2026

SuarezPM changed the title ~~fix: dedup layer for scrape_batch + field filter (47 tests, real API verified)~~ fix: dedup layer for scrape_batch + field filter (47 tests) May 23, 2026

SuarezPM changed the title ~~fix: dedup layer for scrape_batch + field filter (47 tests)~~ fix: dedup layer for scrape_batch + field filter (197 tests, 96% pass) May 23, 2026

SuarezPM closed this May 24, 2026

SuarezPM deleted the feat/context-forge-nexus-integration branch May 24, 2026 20:21

SuarezPM wants to merge 5 commits into brightdata:main from SuarezPM:feat/context-forge-nexus-integration