Commit 189914a

antoniomtz and claude committed
fix: make post-check scalable with word-boundary matching and locale awareness
The programmatic post-check had two scalability issues:

1. Substring false positives: "at" matched inside "Nature", "10" matched inside "100". Fix: use regex word tokenization instead of the string `in` operator. Both the user title and the enhanced title are tokenized into words before comparison.

2. Localization breakage: English user words were prepended to translated titles (e.g. "healthcare Aceite de Pescado..."). Fix: only apply the post-check for English locales (locale.startswith("en")). For non-English locales, the LLM translates user intent into the target language, so English word matching doesn't apply.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
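The first issue in the commit message can be reproduced in isolation. The following is a minimal sketch, not code from the repository: the helper names `missing_by_substring` and `missing_by_tokens` and the sample titles are hypothetical, and only the regex pattern is taken from the diff.

```python
import re

# Token pattern copied from the diff: alphanumeric runs, optionally joined
# by "-" or "'" (so "omega-3" and "men's" stay single tokens).
TOKEN_RE = re.compile(r"[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)*")


def missing_by_substring(user_title, enhanced_title):
    """Old behaviour: a user word counts as present if it is a substring."""
    enhanced_lower = enhanced_title.lower()
    return [w for w in user_title.split() if w.lower() not in enhanced_lower]


def missing_by_tokens(user_title, enhanced_title):
    """New behaviour: a user word counts as present only as a whole token."""
    enhanced_words = set(TOKEN_RE.findall(enhanced_title.lower()))
    user_words = TOKEN_RE.findall(user_title.lower())
    return [w for w in user_words if w not in enhanced_words]


# Hypothetical sample titles chosen to trigger the false positives.
user = "Vitamins at 10 mg"
enhanced = "Nature Made Vitamins 100 mg"

print(missing_by_substring(user, enhanced))  # "at" hides in "nature", "10" in "100"
print(missing_by_tokens(user, enhanced))     # whole-token comparison reports both
```

With these inputs the substring check reports nothing missing, while the token check flags "at" and "10", which is the behaviour the commit is after.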
1 parent c37f228 commit 189914a

1 file changed, +9 -5 lines changed

src/backend/vlm.py

Lines changed: 9 additions & 5 deletions
@@ -14,6 +14,7 @@
 # limitations under the License.
 
 import os
+import re
 import json
 import base64
 import logging
@@ -343,12 +344,15 @@ def _call_nemotron_enhance(
     enhanced = _call_nemotron_enhance_vlm(vlm_output, filtered_product_data, locale)
     logger.info("Step 1 complete (enhanced + localized to %s): enhanced_keys=%s", locale, list(enhanced.keys()))
 
-    # Post-check: guarantee user-provided title words survive the LLM pipeline
-    # Uses the ORIGINAL product_data (not filtered) — the user typed it, we keep it.
-    if product_data and product_data.get("title") and enhanced.get("title"):
+    # Post-check: guarantee user-provided title words survive the LLM pipeline.
+    # Only applies for English locales — for non-English, the LLM translates the
+    # user's words into the target language, so English word matching doesn't apply.
+    if product_data and product_data.get("title") and enhanced.get("title") and locale.startswith("en"):
         user_title = product_data["title"]
-        enhanced_lower = enhanced["title"].lower()
-        missing = [w for w in user_title.split() if w.lower() not in enhanced_lower]
+        # Word-boundary matching: tokenize both into word sets to avoid substring false positives
+        enhanced_words = set(re.findall(r'[a-zA-Z0-9]+(?:[-\'][a-zA-Z0-9]+)*', enhanced["title"].lower()))
+        user_words = re.findall(r'[a-zA-Z0-9]+(?:[-\'][a-zA-Z0-9]+)*', user_title.lower())
+        missing = [w for w in user_words if w not in enhanced_words]
         if missing:
             logger.info("Post-check: user words %s missing from title, prepending original user title", missing)
             enhanced["title"] = user_title + " " + enhanced["title"]
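The post-check as it stands after this change can also be read as a standalone function, which makes the locale gate easier to exercise. This is a sketch under stated assumptions: `apply_title_post_check` is a hypothetical helper that does not exist in `src/backend/vlm.py`, logging is omitted, and the sample titles and locale strings are illustrative.

```python
import re

# Same pattern as in the diff, compiled once for reuse.
TOKEN_RE = re.compile(r"[a-zA-Z0-9]+(?:[-'][a-zA-Z0-9]+)*")


def apply_title_post_check(user_title, enhanced_title, locale):
    """Hypothetical helper mirroring the inline post-check in the diff.

    Returns the enhanced title, with the user's original title prepended
    when the locale is English and at least one user word is missing.
    """
    # Gate: skip when either title is empty or the locale is not English.
    if not user_title or not enhanced_title or not locale.startswith("en"):
        return enhanced_title

    enhanced_words = set(TOKEN_RE.findall(enhanced_title.lower()))
    user_words = TOKEN_RE.findall(user_title.lower())
    missing = [w for w in user_words if w not in enhanced_words]

    if missing:
        return user_title + " " + enhanced_title
    return enhanced_title


# Non-English locale: the translated title is left untouched.
print(apply_title_post_check("healthcare", "Aceite de Pescado", "es-MX"))
# English locale, user word missing: the user title is prepended.
print(apply_title_post_check("healthcare", "Fish Oil Softgels", "en-US"))
# English locale, all user words present: no change.
print(apply_title_post_check("Fish Oil", "Premium Fish Oil Softgels", "en-US"))
```

Two properties of the original code carry over to this sketch: `locale.startswith` assumes `locale` is a string, and the pattern only matches ASCII letters and digits, so accented words are split at the non-ASCII character on both sides of the comparison.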
