feat(calibration): add semantic response diffing signals

stanislav-web · stanislav-web · commit 736304379c10 · 2026-05-01T20:20:17.000+03:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,16 @@
 CHANGELOG
 =======
 
+v5.14.3 (01.05.2026)
+---------------------------
+- (enhancement) improved `--auto-calibrate` with lightweight semantic response diffing for soft-404 detection
+- (enhancement) added visible-text, semantic phrase, semantic term, DOM-token and text-density calibration signals
+- (enhancement) improved dynamic body normalization for emails, path-like fragments and long encoded tokens
+- (enhancement) semantic calibration remains opt-in through the existing `--auto-calibrate` flow and does not change default scan behaviour
+- (tests) added regression coverage for semantic soft-404 matching and calibration helper edge cases
+- (tests) full unittest suite passes after integration
+- (tests) coverage gate passes at `99%`
+
 v5.14.2 (01.05.2026)
 ---------------------------
 - (enhancement) extended `--header-bypass` with controlled path-manipulation probes after header-injection probes
diff --git a/README.md b/README.md
@@ -59,7 +59,7 @@ It helps security researchers, penetration testers, bug bounty hunters, DevSecOp
 - custom wordlists, prefixes, and extension filters;
 - custom request headers, cookies, and raw HTTP request templates;
 - response filters by status, size, text, regex, and body length;
-- smart auto-calibration for soft-404, wildcard, and catch-all responses;
+- smart auto-calibration for soft-404, wildcard, catch-all, and semantic response-diff cases;
 - technology fingerprint detection CMS, ecommerce platforms, frameworks;
 - passive WAF detection and WAF-safe scan mode;
 - controlled header and path bypass probes for blocked `401` and `403` resources;
@@ -85,7 +85,7 @@ OpenDoor focuses on **context-aware discovery** instead of blind enumeration.
 | **Fingerprint-first scanning** | OpenDoor can identify probable CMS platforms, frameworks, infrastructure providers, and WAF signals before deeper discovery. This helps you scan with context instead of blindly throwing a generic wordlist at the target. |
 | **WAF-aware behavior** | OpenDoor can detect probable WAF / anti-bot behavior and switch to a safer runtime profile with `--waf-safe-mode`, reducing noisy blocked scans and making defensive responses easier to understand. |
 | **Controlled bypass evidence** | OpenDoor can optionally probe blocked `401` and `403` resources with controlled header-injection and path-manipulation variants. It records exact evidence such as bypass type, header or path variant, probe value, original status code, and resulting status code without mutating global scan headers. |
-| **Multi-signal auto-calibration** | OpenDoor does not rely only on status code or response size. It compares multiple response signals such as body hashes, HTML structure, titles, redirects, stable headers, word count, line count, and normalized dynamic tokens to reduce soft-404 and wildcard false positives. |
+| **Multi-signal auto-calibration** | OpenDoor does not rely only on status code or response size. It compares multiple response signals such as body hashes, visible text, semantic soft-404 phrases, DOM-token structure, titles, redirects, stable headers, word count, line count, text density, and normalized dynamic tokens to reduce soft-404 and wildcard false positives. |
 | **Transport-level workflows** | OpenDoor supports direct, proxy, OpenVPN, and WireGuard transport modes. It can also rotate transport profiles per target in authorized batch scans, which is not the same as manually starting a VPN before running a scanner. |
 | **Resumable long scans** | OpenDoor can save scan checkpoints and resume later. This matters when scans are interrupted by crashes, unstable networks, blocked routes, terminal disconnects, or long multi-target jobs. |
 | **CI/CD-ready results** | OpenDoor can return a failing exit code only when selected result buckets are found, making it usable as a release gate or exposure regression check without custom post-processing scripts. |
diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-5.14.2
+5.14.3
diff --git a/docs/Usage.md b/docs/Usage.md
@@ -398,6 +398,7 @@ opendoor \
 ## 🧠 Auto-calibration
 
 Auto-calibration helps classify soft-404, wildcard, and catch-all responses.
+Starting with OpenDoor 5.14.3, it also uses lightweight semantic response-diff signals such as visible text, soft-404 phrases, DOM-token structure, text density, and normalized dynamic fragments.
 
 ```shell
 opendoor --host https://example.com --auto-calibrate
@@ -418,6 +419,7 @@ opendoor --host https://example.com --auto-calibrate --calibration-threshold 0.8
 The threshold accepts values from `0.01` to `1.0`.
 
 Use auto-calibration when a target returns similar pages for invalid and valid paths.
+It is especially useful when dynamic 404 templates contain changing tokens, timestamps, trace IDs, A/B wrappers, or personalized fragments.
 
 ---
 
diff --git a/docs/detection/auto-calibration.md b/docs/detection/auto-calibration.md
@@ -116,3 +116,22 @@ opendoor \
   --sniff skipempty,collation,indexof,file \
   --exclude-size-range 0-256
 ```
+
+## Semantic response diffing
+
+OpenDoor 5.14.3 extends auto-calibration with lightweight semantic response-diff signals.
+
+When `--auto-calibrate` is enabled, calibration signatures include:
+
+- normalized visible text;
+- known soft-404 phrases;
+- stable semantic terms;
+- bounded DOM-tag tokens;
+- content kind (`html`, `json`, `text`, or `empty`);
+- visible-text density;
+- existing status, bucket, size, title, redirect, body hash, skeleton hash, word count, line count, and stable headers.
+
+This helps detect dynamic soft-404 templates where the HTML wrapper changes but the response has the same meaning, such as “page not found”, “requested resource does not exist”, changing trace IDs, CSRF-like values, timestamps, or path echoes.
+
+The feature is part of the existing `--auto-calibrate` flow. It does not run unless auto-calibration is explicitly enabled.
+
diff --git a/src/lib/browser/calibration.py b/src/lib/browser/calibration.py
@@ -26,6 +26,32 @@ class Calibration(object):
         r'csrf[_-]?token=["\'][^"\']+["\']',
         r'nonce=["\'][^"\']+["\']',
         r'([?&][a-z0-9_-]+)=([^&#\s"\']+)',
+        r'\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b',
+        r'/(?:[a-z0-9._~%+-]+/)*[a-z0-9._~%+-]+',
+        r'\b[a-z0-9+/]{32,}={0,2}\b',
+    )
+    SEMANTIC_STOPWORDS = frozenset((
+        'a', 'about', 'after', 'all', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
+        'could', 'did', 'do', 'does', 'for', 'from', 'go', 'has', 'have', 'if', 'in',
+        'is', 'it', 'its', 'may', 'not', 'of', 'on', 'or', 'our', 'page', 'please',
+        'request', 'requested', 'site', 'that', 'the', 'this', 'to', 'try', 'url',
+        'was', 'we', 'were', 'with', 'you', 'your'
+    ))
+    SOFT_404_PHRASES = (
+        '404',
+        'not found',
+        'page not found',
+        'cannot be found',
+        'could not be found',
+        'does not exist',
+        'no longer exists',
+        'missing page',
+        'resource missing',
+        'resource not found',
+        'requested page',
+        'requested resource',
+        'unavailable',
+        'unknown page',
     )
 
     def __init__(self, signatures=None, threshold=None):
@@ -89,7 +115,9 @@ def build_signature(cls, response, response_data):
 
         body = cls._body(response)
         normalized_body = cls._normalize_body(body)
+        visible_text = cls._visible_text(body)
         skeleton = cls._body_skeleton(body)
+        dom_tokens = cls._dom_tokens(body)
 
         return {
             'bucket': str(response_data[0]),
@@ -102,6 +130,13 @@ def build_signature(cls, response, response_data):
             'redirect_location': cls._redirect_location(response),
             'normalized_body_hash': cls._hash(normalized_body),
             'body_skeleton_hash': cls._hash(skeleton),
+            'visible_text_hash': cls._hash(visible_text),
+            'content_kind': cls._content_kind(body),
+            'semantic_phrases': cls._semantic_phrases(visible_text),
+            'semantic_terms': cls._semantic_terms(visible_text),
+            'dom_tokens': dom_tokens,
+            'dom_token_hash': cls._hash(' '.join(dom_tokens)),
+            'text_density': cls._text_density(body),
             'header_fingerprint': cls._header_fingerprint(response),
         }
 
@@ -165,6 +200,46 @@ def _score(cls, baseline, candidate):
             score += 0.15
             reasons.append('skeleton-hash')
 
+        if baseline.get('visible_text_hash') and baseline.get('visible_text_hash') == candidate.get('visible_text_hash'):
+            score += 0.20
+            reasons.append('visible-text')
+
+        phrase_score = cls._jaccard_similarity(
+            baseline.get('semantic_phrases') or [],
+            candidate.get('semantic_phrases') or []
+        )
+        if phrase_score >= 0.50:
+            score += 0.16 * phrase_score
+            reasons.append('semantic-phrases')
+
+        term_score = cls._jaccard_similarity(
+            baseline.get('semantic_terms') or [],
+            candidate.get('semantic_terms') or []
+        )
+        if term_score >= 0.55:
+            score += 0.12 * term_score
+            reasons.append('semantic-terms')
+
+        dom_score = cls._sequence_similarity(
+            baseline.get('dom_tokens') or [],
+            candidate.get('dom_tokens') or []
+        )
+        if dom_score >= 0.65:
+            score += 0.10 * dom_score
+            reasons.append('dom-structure')
+
+        density_score = cls._ratio_similarity(
+            baseline.get('text_density'),
+            candidate.get('text_density')
+        )
+        if density_score >= 0.85:
+            score += 0.04 * density_score
+            reasons.append('text-density')
+
+        if baseline.get('content_kind') and baseline.get('content_kind') == candidate.get('content_kind'):
+            score += 0.02
+            reasons.append('content-kind')
+
         if baseline.get('title') and baseline.get('title') == candidate.get('title'):
             score += 0.06
             reasons.append('title')
@@ -247,6 +322,115 @@ def _normalize_body(cls, body):
         value = re.sub(r'\s+', ' ', value)
         return value.strip()
 
+    @classmethod
+    def _visible_text(cls, body):
+        """
+        Build normalized visible text without HTML wrappers or dynamic tokens.
+
+        :param str body:
+        :return: str
+        """
+
+        value = html.unescape(str(body or '')).lower()
+        value = re.sub(r'<!--.*?-->', ' ', value, flags=re.DOTALL)
+        value = re.sub(r'<script\b[^>]*>.*?</script>', ' ', value, flags=re.DOTALL | re.IGNORECASE)
+        value = re.sub(r'<style\b[^>]*>.*?</style>', ' ', value, flags=re.DOTALL | re.IGNORECASE)
+        value = re.sub(r'<[^>]+>', ' ', value)
+
+        for pattern in cls.DYNAMIC_PATTERNS:
+            value = re.sub(pattern, '<dynamic>', value, flags=re.IGNORECASE)
+
+        value = re.sub(r'[^a-z0-9<>]+', ' ', value)
+        value = re.sub(r'\s+', ' ', value)
+        return value.strip()
+
+    @staticmethod
+    def _content_kind(body):
+        """
+        Classify response body kind for calibration scoring.
+
+        :param str body:
+        :return: str
+        """
+
+        value = str(body or '').lstrip().lower()
+
+        if value.startswith('{') or value.startswith('['):
+            return 'json'
+
+        if '<html' in value or '<!doctype html' in value or re.search(r'<[a-z][^>]*>', value):
+            return 'html'
+
+        if value:
+            return 'text'
+
+        return 'empty'
+
+    @staticmethod
+    def _dom_tokens(body):
+        """
+        Build a bounded sequence of HTML tag tokens.
+
+        :param str body:
+        :return: list[str]
+        """
+
+        tokens = re.findall(r'<\s*/?\s*([a-z0-9:-]+)', str(body or ''), flags=re.IGNORECASE)
+        return [token.lower() for token in tokens[:120]]
+
+    @classmethod
+    def _semantic_phrases(cls, text):
+        """
+        Extract known soft-404 semantic phrases from visible text.
+
+        :param str text:
+        :return: list[str]
+        """
+
+        value = str(text or '').lower()
+        phrases = []
+
+        for phrase in cls.SOFT_404_PHRASES:
+            if phrase in value:
+                phrases.append(phrase)
+
+        return phrases
+
+    @classmethod
+    def _semantic_terms(cls, text):
+        """
+        Extract stable semantic terms from visible response text.
+
+        :param str text:
+        :return: list[str]
+        """
+
+        terms = []
+
+        for term in re.findall(r'[a-z][a-z0-9_-]{2,}', str(text or '').lower()):
+            if term in cls.SEMANTIC_STOPWORDS:
+                continue
+            if term == 'dynamic':
+                continue
+            terms.append(term)
+
+        return sorted(set(terms))[:40]
+
+    @classmethod
+    def _text_density(cls, body):
+        """
+        Estimate visible-text density relative to HTML markup volume.
+
+        :param str body:
+        :return: float
+        """
+
+        raw = str(body or '')
+        if len(raw) <= 0:
+            return 0.0
+
+        return round(len(cls._visible_text(raw)) / float(max(len(raw), 1)), 4)
+
     @staticmethod
     def _body_skeleton(body):
         """
@@ -256,10 +440,10 @@ def _body_skeleton(body):
         :return: str
         """
 
-        tags = re.findall(r'<\s*/?\s*([a-z0-9:-]+)', str(body or ''), flags=re.IGNORECASE)
+        tags = Calibration._dom_tokens(body)
 
         if len(tags) > 0:
-            return ' '.join([tag.lower() for tag in tags])
+            return ' '.join(tags)
 
         text = re.sub(r'\w+', 'w', str(body or '').lower())
         text = re.sub(r'\s+', ' ', text)
@@ -456,6 +640,76 @@ def _numeric_similarity(left, right):
         maximum = max(abs(left), abs(right), 1)
         return max(0.0, 1.0 - (abs(left - right) / float(maximum)))
 
+    @staticmethod
+    def _ratio_similarity(left, right):
+        """
+        Return float-ratio similarity in range 0..1.
+
+        :param float|None left:
+        :param float|None right:
+        :return: float
+        """
+
+        try:
+            left = float(left)
+            right = float(right)
+        except (TypeError, ValueError):
+            return 0.0
+
+        maximum = max(abs(left), abs(right), 0.0001)
+        return max(0.0, 1.0 - (abs(left - right) / maximum))
+
+    @staticmethod
+    def _jaccard_similarity(left, right):
+        """
+        Return set-overlap similarity in range 0..1.
+
+        :param list[str] left:
+        :param list[str] right:
+        :return: float
+        """
+
+        left_set = set(left or [])
+        right_set = set(right or [])
+
+        if not left_set and not right_set:
+            return 0.0
+
+        union = left_set | right_set
+        if len(union) <= 0:
+            return 0.0
+
+        return len(left_set & right_set) / float(len(union))
+
+    @staticmethod
+    def _sequence_similarity(left, right):
+        """
+        Return lightweight sequence similarity in range 0..1.
+
+        :param list[str] left:
+        :param list[str] right:
+        :return: float
+        """
+
+        left = list(left or [])
+        right = list(right or [])
+
+        if not left and not right:
+            return 0.0
+        if not left or not right:
+            return 0.0
+
+        prefix = 0
+        for left_item, right_item in zip(left, right):
+            if left_item != right_item:
+                break
+            prefix += 1
+
+        overlap = len(set(left) & set(right)) / float(len(set(left) | set(right)))
+        prefix_score = prefix / float(max(len(left), len(right), 1))
+
+        return max(overlap, prefix_score)
+
     @staticmethod
     def _header_similarity(left, right):
         """
diff --git a/tests/test_lib_browser_calibration.py b/tests/test_lib_browser_calibration.py