diff --git a/AGENTS.md b/AGENTS.md index c1f7c11..67b4968 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -322,17 +322,22 @@ Given the catalog enrichment focus, pay special attention to: - Ensure new code follows established patterns - Include appropriate error handling and logging -3. **Documentation** +3. **LLM Prompt Rules** + - **NEVER hardcode specific product examples in prompts.** Rules must be generic and work across all products. For example, do NOT write rules like `"when the user says 'synthetic leather' and the camera sees 'leather', use the user's term"` — instead write `"when there is a conflict, prefer the user's terms for materials and specs"`. + - These prompts run against millions of products — every rule must generalize. + - If a specific scenario fails, fix the underlying rule, not just the example. + +4. **Documentation** - Update relevant documentation when making changes - Include examples in API documentation - Keep this AGENTS.md file current as the project evolves -4. **Communication** +5. **Communication** - Ask for clarification when requirements are ambiguous - Suggest improvements to architecture and processes - Flag potential security or performance concerns -5. **Incremental Development** +6. **Incremental Development** - Start with simple, working solutions - Iterate and improve based on feedback - Consider backwards compatibility when making changes diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..ea973d1 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +Refer to [AGENTS.md](AGENTS.md) for all project guidelines, coding standards, and AI assistant instructions. diff --git a/PRD.md b/PRD.md index 5347032..f391568 100644 --- a/PRD.md +++ b/PRD.md @@ -129,6 +129,25 @@ A GenAI-powered catalog enrichment system that transforms basic product images i - Support automated filtering or flagging of low-quality generated images - Ensure background differences from original are not penalized (backgrounds should differ) +### FR-10: Product FAQ Generation +- Generate 3-5 frequently asked questions and answers for each product +- FAQs are derived from the final enriched catalog data (after VLM analysis, user data merge, and branding) +- Questions cover practical shopper topics: materials, care instructions, sizing, use cases, compatibility, durability +- Answers are concise (1-3 sentences), factual, and grounded in the enriched product data +- Support locale-aware FAQ generation across all 10 supported regional locales +- Separate `/vlm/faqs` endpoint allows asynchronous generation — details display immediately while FAQs load in the background +- UI displays FAQs in a dedicated tab with collapsible accordion items + +### FR-11: Policy Compliance Checking +- Accept PDF policy documents through a persistent policy library (`/policies` endpoint) +- Parse and normalize uploaded PDFs into structured policy summaries +- Embed normalized policy records using NVIDIA embeddings and store in Milvus vector database +- During product analysis, perform semantic retrieval of relevant policy records +- Run compliance classification against enriched product data and retrieved policy records +- Return pass/fail status with matched policies, rule details, reasons, evidence, and warnings +- Support deduplication of repeated policy uploads by content hash +- Display compliance results in the UI with visual pass/fail indicators + ## Technical Requirements ### TR-1: Model Integration @@ -230,6 +249,16 @@ A GenAI-powered catalog enrichment system that transforms basic product images 
i **I want to** receive automated quality assessments with detailed scoring and issue detection for generated product images **So that** I can quickly identify and filter out low-quality variations without manual review, ensuring only high-quality assets enter my catalog +### US-8: Product FAQ Generation +**As an** e-commerce content manager +**I want to** automatically generate frequently asked questions and answers for each product based on its enriched catalog data +**So that** I can populate product FAQ sections without manual copywriting, improving the customer shopping experience + +### US-9: Policy Compliance Checking +**As a** catalog compliance officer +**I want to** upload policy PDFs and have the system automatically check enriched product listings against those policies +**So that** I can ensure all catalog entries comply with marketplace regulations and internal guidelines before publishing + ## Success Criteria - **Processing Time**: <1 minute per product for complete enrichment (including quality assessment) @@ -256,6 +285,8 @@ A GenAI-powered catalog enrichment system that transforms basic product images i - [ ] FR-7: Social Media Content Integration - [x] ~~FR-8: Brand Voice & Taxonomy Customization~~ *(Complete with brand_instructions parameter support)* - [x] ~~FR-9: Automated Quality Assessment for Generated Images~~ *(VLM-based reflection module integrated into image generation pipeline)* +- [x] ~~FR-10: Product FAQ Generation~~ *(Separate /vlm/faqs endpoint with async loading, Kaizen Tabs + Accordion UI)* +- [x] ~~FR-11: Policy Compliance Checking~~ *(PDF policy library with Milvus embeddings, semantic retrieval, compliance classification)* - [ ] TR-1: Model Integration - [x] ~~NVIDIA Nemotron VLM API integration~~ diff --git a/README.md b/README.md index c3d435f..df540a6 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,9 @@ A GenAI-powered catalog enrichment system that transforms basic product images i - **Cultural Image Generation**: Create culturally-appropriate product backgrounds (Spanish courtyards, Mexican family spaces, British formal settings) - **Quality Evaluation**: Automated VLM-based quality assessment of generated images with detailed scoring - **3D Asset Generation**: Transform 2D product images into interactive 3D GLB models using Microsoft TRELLIS -- **Modular API**: Separate endpoints for VLM analysis, image generation, and 3D asset generation +- **Product FAQ Generation**: Automatically generate 3-5 product FAQs from enriched catalog data +- **Policy Compliance**: Upload policy PDFs and automatically check product listings against them using RAG + Milvus +- **Modular API**: Separate endpoints for VLM analysis, FAQ generation, image generation, and 3D asset generation ## Documentation diff --git a/docs/API.md b/docs/API.md index 8f3cb2a..2e37981 100644 --- a/docs/API.md +++ b/docs/API.md @@ -36,8 +36,9 @@ Health check endpoint for monitoring service status. 
The API provides a modular approach for optimal performance and flexibility: **1) Fast VLM Analysis (POST `/vlm/analyze`)** - Get product fields quickly -**2) Image Generation (POST `/generate/variation`)** - Generate 2D variations on demand -**3) 3D Asset Generation (POST `/generate/3d`)** - Generate 3D models on demand +**2) FAQ Generation (POST `/vlm/faqs`)** - Generate product FAQs from enriched data +**3) Image Generation (POST `/generate/variation`)** - Generate 2D variations on demand +**4) 3D Asset Generation (POST `/generate/3d`)** - Generate 3D models on demand **Benefits of this approach:** - Display product information immediately to users @@ -273,7 +274,75 @@ curl -X POST \ --- -## 3️⃣ Image Generation: `/generate/variation` +## 3️⃣ FAQ Generation: `/vlm/faqs` + +Generate 3-5 frequently asked questions and answers for a product based on its enriched catalog data. Designed to be called after `/vlm/analyze` completes, using the enriched result as input. + +**Endpoint**: `POST /vlm/faqs` +**Content-Type**: `multipart/form-data` + +### Request Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `title` | string | No | Product title from VLM analysis | +| `description` | string | No | Product description from VLM analysis | +| `categories` | JSON string | No | Categories array (default: `[]`) | +| `tags` | JSON string | No | Tags array (default: `[]`) | +| `colors` | JSON string | No | Colors array (default: `[]`) | +| `locale` | string | No | Regional locale code (default: `en-US`) | + +### Response Schema + +```json +{ + "faqs": [ + { + "question": "string", + "answer": "string" + } + ] +} +``` + +### Usage Example + +```bash +# Call after /vlm/analyze to generate FAQs from enriched data +curl -X POST \ + -F "title=Craftsman 20V Cordless Lawn Mower" \ + -F "description=A cordless lawn mower featuring a black and red design..." \ + -F 'categories=["electronics"]' \ + -F 'tags=["cordless","lawn mower","Craftsman"]' \ + -F 'colors=["black","red"]' \ + -F "locale=en-US" \ + http://localhost:8000/vlm/faqs +``` + +### Example Response + +```json +{ + "faqs": [ + { + "question": "What type of battery does this mower use?", + "answer": "This mower operates on a 20V cordless battery system, providing the flexibility to mow without a power cord." + }, + { + "question": "Does this mower come with a grass collection bag?", + "answer": "Yes, it includes a rear-mounted grass collection bag for convenient clippings management." + }, + { + "question": "What are the main colors of this mower?", + "answer": "The mower features a black and red color scheme with prominent Craftsman branding." + } + ] +} +``` + +--- + +## 4️⃣ Image Generation: `/generate/variation` Generate culturally-appropriate product variations using FLUX models based on VLM analysis results. @@ -334,7 +403,7 @@ curl -X POST \ --- -## 4️⃣ 3D Asset Generation: `/generate/3d` +## 5️⃣ 3D Asset Generation: `/generate/3d` Generate interactive 3D GLB models from 2D product images using Microsoft's TRELLIS model. 
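+Because FAQ generation is split out from analysis, a client can render the `/vlm/analyze` result immediately and fetch FAQs afterwards. The sketch below illustrates that chaining from Python. It is illustrative only (the shipped UI does the equivalent in TypeScript in `src/ui/lib/api.ts`), and it assumes a local deployment on port 8000 and that the analyze endpoint's image upload field is named `image`:
+
+```python
+import json
+import requests
+
+BASE_URL = "http://localhost:8000"
+
+# 1) Fast VLM analysis -- returns the enriched product fields.
+with open("mower.jpeg", "rb") as f:
+    analysis = requests.post(
+        f"{BASE_URL}/vlm/analyze",
+        files={"image": ("mower.jpeg", f, "image/jpeg")},  # field name assumed
+        data={"locale": "en-US"},
+    ).json()
+
+# 2) FAQ generation from the enriched result -- safe to defer or run in the background.
+faqs = requests.post(
+    f"{BASE_URL}/vlm/faqs",
+    data={
+        "title": analysis["title"],
+        "description": analysis["description"],
+        "categories": json.dumps(analysis.get("categories", [])),  # array fields are JSON strings
+        "tags": json.dumps(analysis.get("tags", [])),
+        "colors": json.dumps(analysis.get("colors", [])),
+        "locale": "en-US",
+    },
+).json().get("faqs", [])
+
+for faq in faqs:
+    print(f"Q: {faq['question']}\nA: {faq['answer']}\n")
+```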
diff --git a/docs/hallucination-report.md b/docs/hallucination-report.md new file mode 100644 index 0000000..b645090 --- /dev/null +++ b/docs/hallucination-report.md @@ -0,0 +1,170 @@ +# LLM Enhancement Hallucination Report + +**Date:** 2026-04-15 +**Reported by:** Antonio Martinez +**Status:** Open — Separate task pending +**Affected component:** `src/backend/vlm.py` — `_call_nemotron_enhance_vlm()` (Step 1 enhancement) + +--- + +## Summary + +The VLM (`nemotron-nano-12b-v2-vl`, 12B parameters) introduces hallucinations at the source — misreading visible text, fabricating materials and features, and drawing from training data rather than strictly describing the image. The LLM enhancement step (`_call_nemotron_enhance_vlm`) then compounds these errors by rewriting them into confident marketing copy. Both layers contribute, but the root cause is the VLM. + +--- + +## Root Cause Analysis + +### Pipeline Flow + +``` +Image Upload + | + v +[VLM] _call_vlm() <-- Hallucinations introduced HERE + | Model: nemotron-nano-12b-v2-vl + | Output: title, description, categories, tags, colors + v +[LLM] _call_nemotron_enhance_vlm() <-- Compounds errors into marketing copy + | Model: nemotron-3-nano + | Task: "Write rich, persuasive product description" + v +[LLM] _call_nemotron_apply_branding() <-- Inherits errors from Step 1 + | (only runs if brand_instructions provided) + v +[LLM] _call_nemotron_generate_faqs() <-- Consumes VLM output directly, + (runs in parallel with Step 1) but FAQs still affected if + VLM has minor OCR issues +``` + +### Where the Problem Lives + +**Layer 1 — VLM** (`src/backend/vlm.py`, `_call_vlm()`): +The 12B VLM misreads text, fabricates materials/features, and fills in details from training data. This happens regardless of prompt complexity — even "describe this product" triggers hallucinations. Longer prompts produce *more* hallucinations, not fewer. This is confirmed by the NVIDIA research team: longer system prompts degrade VLM output quality for this model class. + +**Layer 2 — LLM Enhancement** (`src/backend/vlm.py`, `_call_nemotron_enhance_vlm()`): +The LLM takes the already-hallucinated VLM output and rewrites it into confident marketing copy, compounding errors and adding its own fabrications. Skipping this step when no user data is provided eliminates the second layer. + +--- + +## Evidence: Craftsman 2XV20 Lawn Mower + +### Test Image + +`mower.jpeg` — Craftsman battery-powered lawn mower with "2XV20" printed on the deck (indicating dual V20 battery platform). + +### VLM Direct Testing (2026-04-15) + +Three prompts were tested against the same VLM endpoint (`nemotron-nano-12b-v2-vl`) with `mower.jpeg`: + +**Prompt 1 — Minimal: "describe this product"** + +> "This product is a Craftsman 20-inch 20V MAX Lithium Ion Cordless Lawn Mower. It's a compact, electric lawn mower designed for residential use. The mower features a 20-inch cutting deck [...] The 20V MAX Lithium Ion battery provides cordless convenience [...] includes a grass collection bag [...] equipped with a safety key to prevent accidental startups." + +- Gets closest to reality: correctly identifies it as cordless/battery-powered ("20V MAX Lithium Ion") +- Still fabricates: "20-inch cutting deck", "safety key" +- Clearly pulling from Craftsman training data rather than reading "2XV20" text + +**Prompt 2 — Detailed descriptive: "In detail, give a description of this image, include everything you see including texts. 
Be extremely descriptive."** + +> "The cutting deck itself is marked with the text '20' indicating the width of the cutting blade in inches [...] a clear plastic cover over the cutting deck, allowing a view of the blades inside." + +- Misreads "2XV20" as "20" and reinterprets it as cutting width +- Fabricates "clear plastic cover over the cutting deck" +- More hallucinations than the minimal prompt + +**Prompt 3 — Catalog enrichment structured prompt (our production prompt)** + +> `"title": "Craftsman 20-Inch Electric Lawn Mower"` ... `"clear plastic front cover"` ... `"control panel on the handlebar"` ... `"model number '20' is visible on the front"` + +- Same hallucinations as prompt 2, now in JSON format +- Fabricates: "clear plastic front cover", "control panel on the handlebar" +- Misreads "2XV20" as "20" and calls it a model number + +### Key Finding: Hallucinations Originate in the VLM + +Initial analysis attributed hallucinations to the LLM enhancement step. **Direct VLM testing disproved this.** The VLM itself: +1. Misreads "2XV20" as "20" across all prompt styles +2. Fabricates materials ("clear plastic") and features ("control panel", "safety key") not visible in the image +3. Draws from training data about Craftsman products rather than strictly describing the image +4. Performs *worse* with longer, more detailed prompts — the minimal prompt produced the fewest hallucinations + +### Hallucination Inventory (VLM output, all prompts combined) + +| Claim | Reality | Type | Source | +|-------|---------|------|--------| +| "20-Inch" cutting width | "2XV20" is Craftsman's dual V20 battery platform | Text misread | VLM | +| "clear plastic cutting deck/cover" | Deck is opaque black | Fabricated material | VLM | +| "control panel on the handlebar" | Only a safety lever is visible | Fabricated feature | VLM | +| "safety key" | No safety key visible | Fabricated feature | VLM | +| "Electric Lawn Mower" (prompt 2/3) | Battery-powered (cordless) | Training data inference | VLM | +| "silver accents" on wheels | Wheels are entirely black | Fabricated detail | LLM enhancement | +| "red power button" | Not visible | Fabricated feature | LLM enhancement | + +The LLM enhancement step compounded the VLM's errors (adding "silver accents", "red power button"), but the root cause is the 12B VLM model's vision limitations. + +--- + +## Proposed Solution + +### Fix 1 (Implemented): Skip LLM Enhancement When Unnecessary + +**Status: Done** — merged in this branch. + +The LLM enhancement step is now skipped when no user product data is provided. This eliminates the second layer of hallucinations. + +| Scenario | Current Behavior | New Behavior | +|----------|-----------------|--------------| +| Image only (no user data, no brand instructions) | VLM -> LLM enhance -> output | VLM -> output directly (skip LLM) | +| Image + user product data | VLM -> LLM enhance (merge) -> output | VLM -> LLM enhance (merge) -> output (keep) | +| Image + brand instructions | VLM -> LLM enhance -> LLM brand -> output | VLM -> LLM brand -> output | +| Image + user data + brand instructions | VLM -> LLM enhance -> LLM brand -> output | VLM -> LLM enhance -> LLM brand -> output (keep) | + +### Fix 2 (Future): Shorten the VLM Prompt + +The current VLM prompt in `_call_vlm()` is ~30 lines with detailed rules, category lists, formatting instructions, and output constraints. 
Testing showed that a minimal prompt ("describe this product") produced the fewest hallucinations — the VLM correctly identified the mower as "20V MAX Lithium Ion Cordless" with that prompt, while the long structured prompt caused it to misread "2XV20" as "20" and fabricate features. + +This is confirmed by the NVIDIA research team: longer system prompts degrade output quality for this VLM class. The model spends capacity following formatting rules rather than focusing on accurate visual analysis. + +**Proposed approach:** +- Strip the VLM prompt down to a short, focused instruction — prioritize visual accuracy over output formatting +- Move structural concerns (JSON format, category validation, tag count) to a lightweight post-processing step or a separate LLM call +- Test iteratively: compare hallucination rates across prompt lengths using a set of test images (mower, shoes, skincare, etc.) + +**Trade-off:** A shorter VLM prompt may return unstructured text instead of clean JSON. This would require parsing the free-text output into structured fields, either with regex/heuristics or a fast LLM call. The benefit is more accurate visual descriptions at the source. + +### Fix 3 (Future): Upgrade VLM Model + +The `nemotron-nano-12b-v2-vl` (12B parameters) has fundamental vision limitations with stylized text and detail accuracy. A larger VLM (72B+) would likely improve OCR accuracy and reduce training-data hallucinations. This is an infrastructure/cost trade-off rather than a code change. + +--- + +## Impact on FAQ Feature + +The FAQ generation feature (`_call_nemotron_generate_faqs`) consumes the raw VLM observation directly (not the enhanced output), which reduces but does not eliminate the risk: + +- FAQs generated from accurate VLM output will be factually grounded +- Minor VLM OCR errors (e.g., "2x20" vs "2XV20") can still propagate into FAQ answers +- With Fix 1 (skip enhancement) in place, the Details tab and FAQ tab will both be grounded in the same factual VLM observation, creating consistency + +--- + +## Reproduction Steps + +1. Start the backend and frontend services +2. Upload `mower.jpeg` (Craftsman 2XV20 lawn mower) +3. Click Generate with default settings (no product data, no brand instructions) +4. Observe the enriched description in the Details tab +5. 
Compare against the VLM's raw output (visible in backend logs at `[VLM]` level) + +--- + +## Files Referenced + +| File | Relevance | +|------|-----------| +| `src/backend/vlm.py:128-205` | `_call_nemotron_enhance_vlm()` — compounds VLM hallucinations into confident marketing copy | +| `src/backend/vlm.py:167-186` | Enhancement prompt with insufficient anti-hallucination rules | +| `src/backend/vlm.py:175` | Current anti-hallucination rule (too narrow — numbers only) | +| `src/backend/vlm.py:397-439` | `_call_nemotron_enhance()` — orchestrator where the skip logic would go | +| `src/backend/vlm.py:441-510` | `_call_vlm()` — VLM analysis (root source of hallucinations) | diff --git a/src/backend/main.py b/src/backend/main.py index 938ee68..e787b9f 100644 --- a/src/backend/main.py +++ b/src/backend/main.py @@ -176,7 +176,7 @@ async def vlm_analyze( image_bytes, content_type = validation_result logger.info(f"Running VLM analysis: locale={locale} mode={'augmentation' if product_json else 'generation'}") - vlm_observation = await asyncio.to_thread(extract_vlm_observation, image_bytes, content_type) + vlm_observation = await asyncio.to_thread(extract_vlm_observation, image_bytes, content_type, locale) enrichment_task = asyncio.to_thread( build_enriched_vlm_result, @@ -195,12 +195,7 @@ async def vlm_analyze( "colors": vlm_observation.get("colors", []), }, ) - faq_task = asyncio.to_thread( - _call_nemotron_generate_faqs, - vlm_observation, - locale, - ) - result, policy_contexts, faqs = await asyncio.gather(enrichment_task, retrieval_task, faq_task) + result, policy_contexts = await asyncio.gather(enrichment_task, retrieval_task) if policy_contexts: logger.info("Policy retrieval returned %d candidate policy record(s); running compliance evaluation.", len(policy_contexts)) product_snapshot = { @@ -250,13 +245,11 @@ async def vlm_analyze( "colors": result.get("colors", []), "locale": locale } - + if result.get("enhanced_product"): payload["enhanced_product"] = result["enhanced_product"] if result.get("policy_decision"): payload["policy_decision"] = result["policy_decision"] - if faqs: - payload["faqs"] = faqs logger.info(f"/vlm/analyze success: title_len={len(payload['title'])} desc_len={len(payload['description'])} locale={locale}") return JSONResponse(payload) @@ -271,6 +264,40 @@ async def vlm_analyze( return JSONResponse({"detail": str(exc)}, status_code=500) +@app.post("/vlm/faqs") +async def vlm_faqs( + title: str = Form(""), + description: str = Form(""), + categories: str = Form("[]"), + tags: str = Form("[]"), + colors: str = Form("[]"), + locale: str = Form("en-US"), +) -> JSONResponse: + """Generate FAQs from enriched product data. Called after /vlm/analyze completes.""" + try: + if locale not in VALID_LOCALES: + logger.error(f"/vlm/faqs error: invalid locale={locale}") + return JSONResponse({"detail": f"Invalid locale. Supported locales: {sorted(VALID_LOCALES)}"}, status_code=400) + + enriched = { + "title": title, + "description": description, + "categories": json.loads(categories), + "tags": json.loads(tags), + "colors": json.loads(colors), + } + faqs = await asyncio.to_thread(_call_nemotron_generate_faqs, enriched, locale) + return JSONResponse({"faqs": faqs}) + except (APIConnectionError, httpx.ConnectError) as exc: + logger.exception("/vlm/faqs connection error: %s", exc) + return JSONResponse({ + "detail": "Unable to connect to the NIM endpoint. Please verify that the NVIDIA NIM container is running."
+ }, status_code=503) + except Exception as exc: + logger.exception("/vlm/faqs exception: %s", exc) + return JSONResponse({"detail": str(exc)}, status_code=500) + + @app.get("/policies") async def list_policies() -> JSONResponse: try: diff --git a/src/backend/vlm.py b/src/backend/vlm.py index c719203..e7cc2ed 100644 --- a/src/backend/vlm.py +++ b/src/backend/vlm.py @@ -56,7 +56,8 @@ "office", "fragrance", "skincare", - "bags" + "bags", + "outdoor" ] def _call_nemotron_filter_user_data( @@ -154,11 +155,11 @@ def _call_nemotron_enhance_vlm( existing_desc = product_data.get("description", "") if product_data else "" title_instruction = ( - f'The user provided this title: "{existing_title}". Every word from it MUST appear in the final title. Merge them naturally with visual details from the analysis to create a single compelling product name.' + f'The user provided this title: "{existing_title}". Use it as the BASE and enrich it with visual details (color, shape, design) from the analysis. Keep all user words unless printed label text on the product clearly contradicts them.' if existing_title else "Create a compelling product name." ) desc_instruction = ( - f'The user provided this description: "{existing_desc}". All its words MUST appear in your output — expand around them with visual insights.' + f'The user provided this description: "{existing_desc}". Use it as the BASE and expand it with visual details from the analysis. Keep all user terms unless printed label text on the product clearly contradicts them.' if existing_desc else "Focus on what makes this product appealing." ) @@ -172,12 +173,14 @@ def _call_nemotron_enhance_vlm( ALLOWED CATEGORIES: {json.dumps(PRODUCT_CATEGORIES)} STRICT RULES: -1. NEVER invent or fabricate specific numbers (wattage, capacity, weight, dimensions, HP, voltage, speed counts) on your own. Only use numbers that appear in the VISUAL ANALYSIS or the user-provided title/description above. -2. Numbers from the user-provided title MUST be preserved — they are trusted input, not hallucinations. +1. NEVER invent or fabricate details on your own. Only use facts from the VISUAL ANALYSIS or the EXISTING PRODUCT DATA above. +2. Printed text readable on the product (brand names, product names, dosages, model numbers) is ground truth. Drop user words that contradict printed label text. +3. Material descriptions from the visual analysis are visual guesses — the camera cannot verify composition. Always use the user's material term when provided. +4. The VISUAL ANALYSIS is authoritative for appearance (colors, shape, design) and printed text. The EXISTING PRODUCT DATA is authoritative for material composition and internal specs. YOUR TASK: - title: {title_instruction} Write in {info['language']}. -- description: Write a rich, persuasive product description highlighting materials, design, and features. Only mention specifications that appear in the visual analysis above. {desc_instruction} Write in {info['language']}. +- description: Write a rich, persuasive product description. Merge visual details with user-provided information. {desc_instruction} Write in {info['language']}. - categories: Pick from allowed list only. English. Array format. - tags: {"Keep all existing user tags AND add more from the visual analysis." if product_data else "Generate 10 relevant search tags."} English. - colors: Use the VLM colors. English. @@ -252,20 +255,15 @@ def _call_nemotron_apply_branding( - DO NOT remove existing fields - Only modify the VALUES of existing fields -2. 
**Description Field Formatting** (MANDATORY): - - CAREFULLY READ the brand instructions for ANY mention of sections, structure, or content types - - If the brand instructions mention ANY of these, you MUST create clearly labeled sections with headers in the description - - EVERY section or content type mentioned in the brand instructions MUST appear as a distinct, labeled section in the output - do NOT skip or merge any - - Each section MUST have a header followed by detailed bullet points or paragraphs - - CRITICAL: Separate each section with double newlines (\\n\\n) for readability - - Keep everything in the description field - DO NOT create separate JSON fields for sections - - The description must be a single string value with proper line breaks between sections - - When in doubt about whether the brand instructions ask for structure, ALWAYS use structured sections rather than plain prose +2. **Description Field Formatting**: + - Follow the brand instructions for format and structure — if they ask for paragraphs, write paragraphs; if they ask for sections or bullet points, use sections and bullet points + - Keep everything in the description field as a single string value + - Separate sections or paragraphs with double newlines (\\n\\n) for readability 3. **Apply Brand Voice** (in {info['language']} for {info['region']}): - Apply brand voice/tone to title, description, categories, and tags - Use brand-preferred terminology and expressions - - Maintain factual accuracy while applying brand personality + - Do NOT add ingredients, specifications, or features not present in the enhanced content above. Only rephrase and style what is already there 4. **Categories**: - Validate against the allowed categories list above @@ -310,16 +308,16 @@ def _call_nemotron_apply_branding( def _call_nemotron_generate_faqs( - vlm_observation: Dict[str, Any], + enriched_result: Dict[str, Any], locale: str = "en-US" ) -> list: - """Generate 3-5 product FAQs from VLM observation using Nemotron. + """Generate 3-5 product FAQs from the final enriched catalog result. - Runs in parallel with enrichment to add zero latency. On any parse - failure the function returns an empty list so the caller can proceed - without FAQs. + Runs after enrichment so FAQs reflect the final merged output (VLM + + user data + branding). On any parse failure returns an empty list. """ - logger.info("[FAQ] Generating FAQs: vlm_keys=%s, locale=%s", list(vlm_observation.keys()), locale) + logger.info("[FAQ] Generating FAQs: keys=%s, locale=%s", + list(enriched_result.keys()), locale) if not (api_key := os.getenv("NGC_API_KEY")): raise RuntimeError(NGC_API_KEY_NOT_SET_ERROR) @@ -328,12 +326,12 @@ def _call_nemotron_generate_faqs( llm_config = get_config().get_llm_config() client = OpenAI(base_url=llm_config['url'], api_key=api_key) - observation_json = json.dumps(vlm_observation, indent=2, ensure_ascii=False) + product_json = json.dumps(enriched_result, indent=2, ensure_ascii=False) prompt = f"""/no_think You are a retail product FAQ specialist. Generate 3 to 5 frequently asked questions and answers for the product described below. -PRODUCT VISUAL ANALYSIS: -{observation_json} +PRODUCT: +{product_json} TARGET LANGUAGE / REGION: {info['language']} ({info['region']}) {info['context']} @@ -343,7 +341,7 @@ def _call_nemotron_generate_faqs( - Each FAQ must have a "question" and an "answer" field. - Questions should cover practical topics a shopper would ask: materials, care instructions, sizing, use cases, compatibility, durability. 
- Answers must be helpful, concise (1-3 sentences), and factual. -- ONLY reference attributes visible in the product analysis above. Do NOT fabricate specifications (weight, wattage, capacity, dimensions) unless they appear in the analysis. +- ONLY reference details present in the product data above. Do NOT fabricate specifications. - Write questions and answers in {info['language']} appropriate for {info['region']}. OUTPUT FORMAT: @@ -406,10 +404,11 @@ def _call_nemotron_enhance( Pre-filter (conditional - only if product_data provided): - Removes irrelevant terms from user-provided data using category-aware LLM filter - Step 1: Content enhancement + localization (always runs): + Step 1: Content enhancement + localization (conditional - only if product_data provided): - Merges pre-filtered product_data with VLM output - Applies anti-hallucination rules (no fabricated specs) - Localizes to target language/region + - When no product_data, VLM output is used directly Step 2: Brand alignment (conditional - only if brand_instructions provided): - Applies brand voice, tone, taxonomy @@ -424,9 +423,17 @@ def _call_nemotron_enhance( logger.info("Pre-filter complete: title_before=%s, title_after=%s", repr(product_data.get("title", "")), repr(filtered_product_data.get("title", ""))) - # Step 1: Enhance VLM output and localize to target language (single call for efficiency) - enhanced = _call_nemotron_enhance_vlm(vlm_output, filtered_product_data, locale) - logger.info("Step 1 complete (enhanced + localized to %s): enhanced_keys=%s", locale, list(enhanced.keys())) + # Step 1: Only run enhancement when there is user data with actual content to merge + has_content = filtered_product_data and any( + v for k, v in filtered_product_data.items() + if isinstance(v, str) and v.strip() + ) + if has_content: + enhanced = _call_nemotron_enhance_vlm(vlm_output, filtered_product_data, locale) + logger.info("Step 1 complete (enhanced + localized to %s): enhanced_keys=%s", locale, list(enhanced.keys())) + else: + enhanced = vlm_output + logger.info("Step 1 skipped: no product_data with content — using VLM output directly") # Step 2: Apply brand instructions if provided if brand_instructions: @@ -438,59 +445,23 @@ def _call_nemotron_enhance( logger.info("Nemotron enhancement pipeline complete: final_keys=%s", list(enhanced.keys())) return enhanced -def _call_vlm(image_bytes: bytes, content_type: str) -> Dict[str, Any]: - """Call VLM to analyze product image. - - NOTE: Always analyzes in ENGLISH regardless of target locale. - This prevents hallucinations that occur when VLMs work in non-English languages. - Localization is handled separately by the LLM in a subsequent step. +def _call_vlm(image_bytes: bytes, content_type: str, locale: str = "en-US") -> Dict[str, Any]: + """Call VLM to analyze product image, then structure the output via LLM. + + Uses a short VLM prompt to minimize hallucinations (longer prompts degrade + quality on this model class), then passes the free-text observation to + _call_nemotron_structure_vlm() for JSON structuring and localization. 
""" - logger.info("Calling VLM: bytes=%d, content_type=%s (English-only analysis)", len(image_bytes or b""), content_type) - + logger.info("Calling VLM: bytes=%d, content_type=%s, locale=%s", len(image_bytes or b""), content_type, locale) + api_key = os.getenv("NGC_API_KEY") if not api_key: raise RuntimeError(NGC_API_KEY_NOT_SET_ERROR) - + vlm_config = get_config().get_vlm_config() client = OpenAI(base_url=vlm_config['url'], api_key=api_key) - categories_str = json.dumps(PRODUCT_CATEGORIES) - - prompt_text = f"""You are a product visual analyst. Analyze ONLY what is physically visible in the image. Be strictly factual. - -CRITICAL RULES: -- ONLY describe what you can physically SEE in the image. -- Do NOT infer or guess technical specifications (wattage, capacity, weight, dimensions, HP, voltage) unless the exact number is clearly printed and readable on the product label. -- Do NOT fill in details from brand knowledge or training data — if a spec is not visible, do not mention it. -- If you can read text on the product (brand names, labels, buttons), report exactly what is written. - -TASK: -1. Describe the product's visible appearance - shape, colors, materials, design elements -2. Transcribe any visible text on the product: brand names, labels, button names, measurements printed on the item -3. Write in ENGLISH - be accurate about what you see, not what you know about the brand - -CATEGORIES - Choose ONLY from this allowed set: {categories_str} -- Pick 1-2 categories that GENUINELY describe the product -- It is BETTER to pick only 1 accurate category than to force a second one that doesn't fit -- If only one category applies, return just one: e.g., "categories": ["kitchen"] -- Do NOT stretch or force-fit categories - if the product doesn't belong in a category, don't include it - -TAGS: Generate exactly 10 descriptive tags (2-4 words each) for search/filtering - -COLORS - What colors would a customer use to describe this product? (1-2 max) -- Include the main material color AND any visible hardware/accent colors (e.g., gold clasps, silver buckles) -- NEVER include the background/backdrop color -- NEVER include hidden parts (shoe soles, inner linings) -- Use simple names: red, blue, black, white, grey, green, yellow, orange, purple, pink, navy, beige, silver, gold, tan, brown, cream, burgundy, olive - -Return ONLY valid JSON: -{{ - "title": "", - "description": "", - "categories": [""], - "tags": ["", "", ...], - "colors": [""] -}}""" + prompt_text = "Describe this product in detail: appearance, shape, colors, materials, visible text, brand names, labels, and any distinctive design features." 
completion = client.chat.completions.create( model=vlm_config['model'], @@ -498,26 +469,80 @@ def _call_vlm(image_bytes: bytes, content_type: str) -> Dict[str, Any]: {"type": "image_url", "image_url": {"url": f"data:{content_type};base64,{base64.b64encode(image_bytes).decode()}"}}, {"type": "text", "text": prompt_text} ]}], - temperature=0.1, top_p=0.9, max_tokens=1024, stream=True + temperature=0.1, top_p=0.9, max_tokens=4096, stream=True ) text = "".join(chunk.choices[0].delta.content for chunk in completion if chunk.choices[0].delta and chunk.choices[0].delta.content) - logger.info("VLM response received: %d chars", len(text)) + logger.info("VLM free-text response received: %d chars", len(text)) + + return _call_nemotron_structure_vlm(text.strip(), locale) + + +def _call_nemotron_structure_vlm(vlm_text: str, locale: str = "en-US") -> Dict[str, Any]: + """Structure and enhance free-text VLM output into e-commerce catalog JSON. + + Rewrites the VLM observation into polished catalog copy while staying + faithful to the facts described. Localizes to the target language/region. + """ + logger.info("[Structure] Structuring VLM text: %d chars, locale=%s", len(vlm_text), locale) - parsed = parse_llm_json(text) + if not (api_key := os.getenv("NGC_API_KEY")): + raise RuntimeError(NGC_API_KEY_NOT_SET_ERROR) + + info = LOCALE_CONFIG.get(locale, {"language": "English", "region": "United States", "country": "United States", "context": "American English"}) + llm_config = get_config().get_llm_config() + client = OpenAI(base_url=llm_config['url'], api_key=api_key) + + categories_str = json.dumps(PRODUCT_CATEGORIES) + + prompt = f"""/no_think Convert the visual description below into e-commerce product catalog fields. Write in polished, professional catalog language in {info['language']} for {info['region']} ({info['context']}). Do NOT invent features, materials, or specifications not mentioned in the description. + +VISUAL DESCRIPTION: +{vlm_text} + +ALLOWED CATEGORIES: {categories_str} + +RULES: +- title: Compelling product name using only details from the description. Write in {info['language']}. +- description: Write as customer-facing e-commerce catalog copy in {info['language']}. Highlight the product's appeal, materials, design, and features. Do NOT describe the label or packaging text placement (no "brand name is displayed on", "text reads", "prominently displayed", "printed in white"). Instead, naturally incorporate brand and product names into the copy. +- categories: Pick 1-2 from the allowed list. Use "uncategorized" if none fit. English. +- tags: 10 search tags derived from the text. English. +- colors: 1-2 product colors mentioned in the text. English. 
+ +Return ONLY valid JSON: +{{"title": "...", "description": "...", "categories": [...], "tags": [...], "colors": [...]}}""" + + completion = client.chat.completions.create( + model=llm_config['model'], + messages=[{"role": "system", "content": "/no_think"}, {"role": "user", "content": prompt}], + temperature=0.1, top_p=0.9, max_tokens=2048, stream=True, + extra_body={"reasoning_budget": 16384, "chat_template_kwargs": {"enable_thinking": False}} + ) + + text = "".join( + chunk.choices[0].delta.content + for chunk in completion + if chunk.choices[0].delta and chunk.choices[0].delta.content + ) + logger.info("[Structure] LLM response received: %d chars", len(text)) + + parsed = parse_llm_json(text, extract_braces=True, strip_comments=True) if parsed is not None: + logger.info("[Structure] Structured successfully: keys=%s", list(parsed.keys())) return parsed - return {"title": "", "description": text.strip(), "categories": ["uncategorized"], "tags": [], "colors": []} + + logger.warning("[Structure] JSON parse failed, returning raw text as description") + return {"title": "", "description": vlm_text, "categories": ["uncategorized"], "tags": [], "colors": []} -def extract_vlm_observation(image_bytes: bytes, content_type: str) -> Dict[str, Any]: +def extract_vlm_observation(image_bytes: bytes, content_type: str, locale: str = "en-US") -> Dict[str, Any]: """Run only the raw VLM observation step.""" if not image_bytes: raise ValueError("image_bytes is required") if not isinstance(content_type, str) or not content_type.startswith("image/"): raise ValueError("content_type must be an image/* MIME type") - vlm_result = _call_vlm(image_bytes, content_type) + vlm_result = _call_vlm(image_bytes, content_type, locale) logger.info( "VLM analysis complete (English): title_len=%d desc_len=%d categories=%s", len(vlm_result.get("title", "")), @@ -580,5 +605,5 @@ def run_vlm_analysis( Dict with title, description, categories, tags, colors, and enhanced_product (if augmentation) """ logger.info("Running VLM analysis: locale=%s mode=%s brand_instructions=%s", locale, "augmentation" if product_data else "generation", bool(brand_instructions)) - vlm_result = extract_vlm_observation(image_bytes, content_type) + vlm_result = extract_vlm_observation(image_bytes, content_type, locale) return build_enriched_vlm_result(vlm_result, locale, product_data, brand_instructions) diff --git a/src/ui/app/page.tsx b/src/ui/app/page.tsx index 647138d..55c3e65 100644 --- a/src/ui/app/page.tsx +++ b/src/ui/app/page.tsx @@ -9,7 +9,7 @@ import { FieldsCard } from '@/components/FieldsCard'; import { AdvancedOptionsCard } from '@/components/AdvancedOptionsCard'; import { GeneratedVariationsSection } from '@/components/GeneratedVariationsSection'; import { ProductFields, AugmentedData, PolicyDocument, PolicyUploadResult, SUPPORTED_LOCALES } from '@/types'; -import { analyzeImage, clearPolicies, generateImageVariation, generate3DModel, listPolicies, prepareProductData, uploadPolicies } from '@/lib/api'; +import { analyzeImage, generateFaqs, clearPolicies, generateImageVariation, generate3DModel, listPolicies, prepareProductData, uploadPolicies } from '@/lib/api'; function Home() { @@ -33,6 +33,7 @@ function Home() { brandInstructions: '' }); const [augmentedData, setAugmentedData] = useState(null); + const [isLoadingFaqs, setIsLoadingFaqs] = useState(false); const [generatedImages, setGeneratedImages] = useState<(string | null)[]>([null, null]); const [qualityScores, setQualityScores] = useState<(number | null)[]>([null, null]); const 
[qualityIssues, setQualityIssues] = useState<(string[] | null)[]>([null, null]); @@ -97,6 +98,7 @@ function Home() { setUploadedImage(null); setUploadedFile(null); setAugmentedData(null); + setIsLoadingFaqs(false); setGeneratedImages([null, null]); setQualityScores([null, null]); setQualityIssues([null, null]); @@ -199,17 +201,34 @@ function Home() { brandInstructions: fields.brandInstructions }); - setAugmentedData({ + const enrichedData = { title: analyzeData.title || '', description: analyzeData.description || '', colors: analyzeData.colors || [], tags: analyzeData.tags || [], categories: analyzeData.categories || [], policyDecision: analyzeData.policyDecision, - faqs: analyzeData.faqs || [] - }); + }; + setAugmentedData(enrichedData); setIsAnalyzingFields(false); + // Fire FAQ generation in the background — details are already visible + setIsLoadingFaqs(true); + generateFaqs({ + title: enrichedData.title, + description: enrichedData.description, + categories: enrichedData.categories || [], + tags: enrichedData.tags, + colors: enrichedData.colors, + locale, + }).then((faqs) => { + setAugmentedData(prev => prev ? { ...prev, faqs } : prev); + }).catch((err) => { + console.error('Error generating FAQs:', err); + }).finally(() => { + setIsLoadingFaqs(false); + }); + setIsGeneratingImage(true); const variationParams = { @@ -388,6 +407,7 @@ function Home() { augmentedData={augmentedData} isAnalyzing={isAnalyzingFields} isGenerating={isGeneratingImage} + isLoadingFaqs={isLoadingFaqs} onFieldChange={(field, value) => setFields(prev => ({ ...prev, [field]: value }))} /> diff --git a/src/ui/components/FieldsCard.tsx b/src/ui/components/FieldsCard.tsx index 241423d..5ababae 100644 --- a/src/ui/components/FieldsCard.tsx +++ b/src/ui/components/FieldsCard.tsx @@ -1,4 +1,4 @@ -import { Card, Stack, Text, Flex, FormField, TextInput, TextArea, Tabs, Accordion } from '@/kui-foundations-react-external'; +import { Card, Stack, Text, Flex, FormField, TextInput, TextArea, Tabs, Accordion, Spinner } from '@/kui-foundations-react-external'; import { ProductFields, AugmentedData, PolicyDecision, FAQ } from '@/types'; import { ProcessingSteps } from './ProcessingSteps'; @@ -99,7 +99,20 @@ function PolicyComplianceCard({ decision }: { decision: PolicyDecision }) { ); } -function FaqTabContent({ faqs }: { faqs?: FAQ[] }) { +function FaqTabContent({ faqs, isLoading }: { faqs?: FAQ[]; isLoading?: boolean }) { + if (isLoading) { + return ( +
+ <Flex> + <Spinner /> + <Text>Generating FAQs...</Text> + </Flex>
+ ); + } + if (!faqs || faqs.length === 0) { return (
@@ -137,10 +150,11 @@ interface Props { augmentedData: AugmentedData | null; isAnalyzing: boolean; isGenerating: boolean; + isLoadingFaqs?: boolean; onFieldChange: (field: keyof ProductFields, value: string) => void; } -export function FieldsCard({ fields, augmentedData, isAnalyzing, isGenerating, onFieldChange }: Props) { +export function FieldsCard({ fields, augmentedData, isAnalyzing, isGenerating, isLoadingFaqs, onFieldChange }: Props) { const disabled = isAnalyzing || isGenerating; const detailsContent = ( @@ -293,7 +307,7 @@ export function FieldsCard({ fields, augmentedData, isAnalyzing, isGenerating, o { children: "FAQs", value: "faqs", - slotContent: <FaqTabContent faqs={augmentedData?.faqs} />
+ slotContent: <FaqTabContent faqs={augmentedData?.faqs} isLoading={isLoadingFaqs} />
} ]} /> diff --git a/src/ui/lib/api.ts b/src/ui/lib/api.ts index 04e6990..d10c4c7 100644 --- a/src/ui/lib/api.ts +++ b/src/ui/lib/api.ts @@ -33,11 +33,42 @@ export async function analyzeImage({ file, locale, productData, brandInstruction const data = await response.json(); return { ...data, - policyDecision: data.policy_decision, - faqs: data.faqs || [] + policyDecision: data.policy_decision }; } +interface GenerateFaqsParams { + title: string; + description: string; + categories: string[]; + tags: string[]; + colors: string[]; + locale: string; +} + +export async function generateFaqs(params: GenerateFaqsParams): Promise<{ question: string; answer: string }[]> { + const formData = new FormData(); + formData.append('title', params.title); + formData.append('description', params.description); + formData.append('categories', JSON.stringify(params.categories)); + formData.append('tags', JSON.stringify(params.tags)); + formData.append('colors', JSON.stringify(params.colors)); + formData.append('locale', params.locale); + + const response = await fetch(`${API_BASE}/vlm/faqs`, { + method: 'POST', + body: formData + }); + + if (!response.ok) { + const error = await response.json(); + throw new Error(error.detail || 'Failed to generate FAQs'); + } + + const data = await response.json(); + return data.faqs || []; +} + export async function listPolicies(): Promise { const response = await fetch(`${API_BASE}/policies`, { method: 'GET' }); diff --git a/tests/test_vlm_unit.py b/tests/test_vlm_unit.py index 3378189..f296a20 100644 --- a/tests/test_vlm_unit.py +++ b/tests/test_vlm_unit.py @@ -23,6 +23,7 @@ from unittest.mock import Mock, patch, MagicMock from backend.vlm import ( _call_vlm, + _call_nemotron_structure_vlm, _call_nemotron_enhance_vlm, _call_nemotron_apply_branding, _call_nemotron_generate_faqs, @@ -34,109 +35,173 @@ class TestCallVLM: - """Tests for _call_vlm function with mocked OpenAI client.""" - + """Tests for _call_vlm function with mocked VLM + structuring.""" + + @patch('backend.vlm._call_nemotron_structure_vlm') @patch('backend.vlm.OpenAI') @patch('backend.vlm.get_config') - def test_call_vlm_success_with_valid_json(self, mock_get_config, mock_openai_class, sample_image_bytes, sample_vlm_response, mock_env_vars): - """Test successful VLM call with valid JSON response.""" - # Mock config + def test_call_vlm_passes_free_text_to_structuring(self, mock_get_config, mock_openai_class, mock_structure, sample_image_bytes, sample_vlm_response, mock_env_vars): + """Test that VLM free text is passed to the structuring LLM call.""" mock_config = Mock() - mock_config.get_vlm_config.return_value = { - 'url': 'http://test:8000/v1', - 'model': 'test-model' - } + mock_config.get_vlm_config.return_value = {'url': 'http://test:8000/v1', 'model': 'test-model'} mock_get_config.return_value = mock_config - - # Mock OpenAI client + mock_client = Mock() mock_openai_class.return_value = mock_client - - # Mock streaming response + + vlm_free_text = "A black and red Craftsman lawn mower with 2XV20 printed on the deck." 
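+        # Simulate the OpenAI streaming interface: a single chunk whose delta carries
+        # the VLM's free-text observation, which _call_vlm should hand to the structuring step.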
+ mock_chunk = Mock() + mock_delta = Mock() + mock_delta.content = vlm_free_text + mock_choice = Mock() + mock_choice.delta = mock_delta + mock_chunk.choices = [mock_choice] + mock_client.chat.completions.create.return_value = [mock_chunk] + + mock_structure.return_value = sample_vlm_response + + result = _call_vlm(sample_image_bytes, "image/png", "en-US") + + mock_structure.assert_called_once_with(vlm_free_text, "en-US") + assert result == sample_vlm_response + + @patch('backend.vlm._call_nemotron_structure_vlm') + @patch('backend.vlm.OpenAI') + @patch('backend.vlm.get_config') + def test_call_vlm_uses_short_prompt(self, mock_get_config, mock_openai_class, mock_structure, sample_image_bytes, sample_vlm_response, mock_env_vars): + """Test that the VLM prompt is short (not the old ~35 line prompt).""" + mock_config = Mock() + mock_config.get_vlm_config.return_value = {'url': 'http://test:8000/v1', 'model': 'test-model'} + mock_get_config.return_value = mock_config + + mock_client = Mock() + mock_openai_class.return_value = mock_client + + mock_chunk = Mock() + mock_delta = Mock() + mock_delta.content = "A product description" + mock_choice = Mock() + mock_choice.delta = mock_delta + mock_chunk.choices = [mock_choice] + mock_client.chat.completions.create.return_value = [mock_chunk] + mock_structure.return_value = sample_vlm_response + + _call_vlm(sample_image_bytes, "image/png") + + call_args = mock_client.chat.completions.create.call_args + messages = call_args.kwargs["messages"] + prompt_text = messages[0]["content"][1]["text"] + assert len(prompt_text) < 200 + + @patch('backend.vlm._call_nemotron_structure_vlm') + @patch('backend.vlm.OpenAI') + @patch('backend.vlm.get_config') + def test_call_vlm_with_different_image_types(self, mock_get_config, mock_openai_class, mock_structure, sample_jpeg_bytes, sample_vlm_response, mock_env_vars): + """Test VLM call with different image content types.""" + mock_config = Mock() + mock_config.get_vlm_config.return_value = {'url': 'http://test:8000/v1', 'model': 'test-model'} + mock_get_config.return_value = mock_config + + mock_client = Mock() + mock_openai_class.return_value = mock_client + + mock_chunk = Mock() + mock_delta = Mock() + mock_delta.content = "A product" + mock_choice = Mock() + mock_choice.delta = mock_delta + mock_chunk.choices = [mock_choice] + mock_client.chat.completions.create.return_value = [mock_chunk] + mock_structure.return_value = sample_vlm_response + + result = _call_vlm(sample_jpeg_bytes, "image/jpeg") + assert isinstance(result, dict) + + +class TestCallNemotronStructureVlm: + """Tests for _call_nemotron_structure_vlm function.""" + + @patch('backend.vlm.OpenAI') + @patch('backend.vlm.get_config') + def test_structure_success(self, mock_get_config, mock_openai_class, sample_vlm_response, mock_env_vars): + """Test successful structuring of free text into JSON.""" + mock_config = Mock() + mock_config.get_llm_config.return_value = {'url': 'http://test:8000/v1', 'model': 'test-llm-model'} + mock_get_config.return_value = mock_config + + mock_client = Mock() + mock_openai_class.return_value = mock_client + mock_chunk = Mock() mock_delta = Mock() mock_delta.content = json.dumps(sample_vlm_response) mock_choice = Mock() mock_choice.delta = mock_delta mock_chunk.choices = [mock_choice] - mock_client.chat.completions.create.return_value = [mock_chunk] - - # Call function - result = _call_vlm(sample_image_bytes, "image/png") - - # Assertions + + result = _call_nemotron_structure_vlm("A black handbag with gold accents.") + assert 
isinstance(result, dict) assert result["title"] == sample_vlm_response["title"] - assert result["description"] == sample_vlm_response["description"] - assert result["categories"] == sample_vlm_response["categories"] - assert "tags" in result - assert "colors" in result - + assert "description" in result + @patch('backend.vlm.OpenAI') @patch('backend.vlm.get_config') - def test_call_vlm_with_invalid_json_fallback(self, mock_get_config, mock_openai_class, sample_image_bytes, mock_env_vars): - """Test VLM call with non-JSON response uses fallback.""" - # Mock config + def test_structure_fallback_on_parse_failure(self, mock_get_config, mock_openai_class, mock_env_vars): + """Test fallback to raw text when LLM returns unparseable output.""" mock_config = Mock() - mock_config.get_vlm_config.return_value = { - 'url': 'http://test:8000/v1', - 'model': 'test-model' - } + mock_config.get_llm_config.return_value = {'url': 'http://test:8000/v1', 'model': 'test-llm-model'} mock_get_config.return_value = mock_config - - # Mock OpenAI client with non-JSON response + mock_client = Mock() mock_openai_class.return_value = mock_client - + mock_chunk = Mock() mock_delta = Mock() - mock_delta.content = "This is not valid JSON" + mock_delta.content = "Not valid JSON at all" mock_choice = Mock() mock_choice.delta = mock_delta mock_chunk.choices = [mock_choice] - mock_client.chat.completions.create.return_value = [mock_chunk] - - # Call function - result = _call_vlm(sample_image_bytes, "image/png") - - # Should return fallback structure - assert isinstance(result, dict) + + vlm_text = "A red lawn mower with Craftsman branding." + result = _call_nemotron_structure_vlm(vlm_text) + assert result["title"] == "" - assert result["description"] == "This is not valid JSON" + assert result["description"] == vlm_text assert result["categories"] == ["uncategorized"] - assert result["tags"] == [] - assert result["colors"] == [] - + @patch('backend.vlm.OpenAI') @patch('backend.vlm.get_config') - def test_call_vlm_with_different_image_types(self, mock_get_config, mock_openai_class, sample_jpeg_bytes, sample_vlm_response, mock_env_vars): - """Test VLM call with different image content types.""" - # Mock config + def test_structure_extracts_from_markdown(self, mock_get_config, mock_openai_class, sample_vlm_response, mock_env_vars): + """Test extraction from markdown-fenced JSON.""" mock_config = Mock() - mock_config.get_vlm_config.return_value = { - 'url': 'http://test:8000/v1', - 'model': 'test-model' - } + mock_config.get_llm_config.return_value = {'url': 'http://test:8000/v1', 'model': 'test-llm-model'} mock_get_config.return_value = mock_config - - # Mock OpenAI client + mock_client = Mock() mock_openai_class.return_value = mock_client - + + wrapped = f"```json\n{json.dumps(sample_vlm_response)}\n```" mock_chunk = Mock() mock_delta = Mock() - mock_delta.content = json.dumps(sample_vlm_response) + mock_delta.content = wrapped mock_choice = Mock() mock_choice.delta = mock_delta mock_chunk.choices = [mock_choice] - mock_client.chat.completions.create.return_value = [mock_chunk] - - # Test with JPEG - result = _call_vlm(sample_jpeg_bytes, "image/jpeg") - assert isinstance(result, dict) + + result = _call_nemotron_structure_vlm("A handbag.") + + assert result["title"] == sample_vlm_response["title"] + + def test_structure_raises_without_api_key(self, monkeypatch): + """Test RuntimeError when NGC_API_KEY is not set.""" + monkeypatch.delenv("NGC_API_KEY", raising=False) + + with pytest.raises(RuntimeError, match="NGC_API_KEY is not 
set"): + _call_nemotron_structure_vlm("Some text") class TestCallNemotronEnhanceVLM: @@ -514,36 +579,69 @@ def test_generate_faqs_raises_without_api_key(self, sample_vlm_response, monkeyp class TestCallNemotronEnhance: """Tests for _call_nemotron_enhance orchestration function.""" - + + @patch('backend.vlm._call_nemotron_apply_branding') + @patch('backend.vlm._call_nemotron_enhance_vlm') + def test_enhance_skips_step1_without_product_data(self, mock_enhance_vlm, mock_apply_branding, sample_vlm_response): + """Test that Step 1 is skipped when no product_data — VLM output used directly.""" + result = _call_nemotron_enhance(sample_vlm_response, None, "en-US", None) + + # Step 1 should be SKIPPED (no product data to merge) + mock_enhance_vlm.assert_not_called() + # Step 2 should NOT be called + mock_apply_branding.assert_not_called() + assert result == sample_vlm_response + + @patch('backend.vlm._call_nemotron_apply_branding') + @patch('backend.vlm._call_nemotron_enhance_vlm') + def test_enhance_with_brand_instructions_skips_step1(self, mock_enhance_vlm, mock_apply_branding, sample_vlm_response): + """Test that Step 1 is skipped but Step 2 runs on raw VLM output when only brand instructions provided.""" + branded_data = {"title": "Branded", "description": "Branded"} + mock_apply_branding.return_value = branded_data + + brand_instructions = "Use playful tone" + result = _call_nemotron_enhance(sample_vlm_response, None, "en-US", brand_instructions) + + # Step 1 should be SKIPPED (no product data) + mock_enhance_vlm.assert_not_called() + # Step 2 should run on raw VLM output + mock_apply_branding.assert_called_once_with(sample_vlm_response, brand_instructions, "en-US") + assert result == branded_data + + @patch('backend.vlm._call_nemotron_filter_user_data') @patch('backend.vlm._call_nemotron_apply_branding') @patch('backend.vlm._call_nemotron_enhance_vlm') - def test_enhance_without_brand_instructions(self, mock_enhance_vlm, mock_apply_branding, sample_vlm_response): - """Test enhancement pipeline without brand instructions (Step 2 skipped).""" + def test_enhance_runs_step1_with_product_data(self, mock_enhance_vlm, mock_apply_branding, mock_filter, sample_vlm_response, sample_product_data): + """Test that Step 1 runs when product_data is provided.""" enhanced_data = {"title": "Enhanced", "description": "Enhanced"} + mock_filter.return_value = sample_product_data mock_enhance_vlm.return_value = enhanced_data - - result = _call_nemotron_enhance(sample_vlm_response, None, "en-US", None) - - # Step 1 should be called + + result = _call_nemotron_enhance(sample_vlm_response, sample_product_data, "en-US", None) + + # Pre-filter and Step 1 should run + mock_filter.assert_called_once() mock_enhance_vlm.assert_called_once() - # Step 2 should NOT be called + # Step 2 should NOT run mock_apply_branding.assert_not_called() assert result == enhanced_data - + + @patch('backend.vlm._call_nemotron_filter_user_data') @patch('backend.vlm._call_nemotron_apply_branding') @patch('backend.vlm._call_nemotron_enhance_vlm') - def test_enhance_with_brand_instructions(self, mock_enhance_vlm, mock_apply_branding, sample_vlm_response): - """Test enhancement pipeline with brand instructions (both steps).""" + def test_enhance_runs_full_pipeline_with_product_data_and_brand(self, mock_enhance_vlm, mock_apply_branding, mock_filter, sample_vlm_response, sample_product_data): + """Test full pipeline (Step 1 + Step 2) when both product_data and brand_instructions provided.""" enhanced_data = {"title": "Enhanced", "description": 
"Enhanced"} branded_data = {"title": "Branded", "description": "Branded"} - + mock_filter.return_value = sample_product_data mock_enhance_vlm.return_value = enhanced_data mock_apply_branding.return_value = branded_data - + brand_instructions = "Use playful tone" - result = _call_nemotron_enhance(sample_vlm_response, None, "en-US", brand_instructions) - - # Both steps should be called + result = _call_nemotron_enhance(sample_vlm_response, sample_product_data, "en-US", brand_instructions) + + # All steps should run + mock_filter.assert_called_once() mock_enhance_vlm.assert_called_once() mock_apply_branding.assert_called_once_with(enhanced_data, brand_instructions, "en-US") assert result == branded_data @@ -658,10 +756,10 @@ class TestSplitVLMFlow: def test_extract_vlm_observation_returns_raw_vlm_output(self, mock_call_vlm, sample_image_bytes, sample_vlm_response): mock_call_vlm.return_value = sample_vlm_response - result = extract_vlm_observation(sample_image_bytes, "image/png") + result = extract_vlm_observation(sample_image_bytes, "image/png", "en-US") assert result == sample_vlm_response - mock_call_vlm.assert_called_once_with(sample_image_bytes, "image/png") + mock_call_vlm.assert_called_once_with(sample_image_bytes, "image/png", "en-US") @patch('backend.vlm._call_nemotron_enhance') def test_build_enriched_vlm_result_uses_existing_vlm_observation(self, mock_enhance, sample_vlm_response):