Merged
11 changes: 8 additions & 3 deletions AGENTS.md
@@ -322,17 +322,22 @@ Given the catalog enrichment focus, pay special attention to:
- Ensure new code follows established patterns
- Include appropriate error handling and logging

3. **Documentation**
3. **LLM Prompt Rules**
- **NEVER hardcode specific product examples in prompts.** Rules must be generic and work across all products. For example, do NOT write rules like `"when the user says 'synthetic leather' and the camera sees 'leather', use the user's term"` — instead write `"when there is a conflict, prefer the user's terms for materials and specs"`.
- Prompts run against millions of products — every rule must generalize.
- If a specific scenario fails, fix the underlying rule, not just the example.

4. **Documentation**
- Update relevant documentation when making changes
- Include examples in API documentation
- Keep this AGENTS.md file current as the project evolves

4. **Communication**
5. **Communication**
- Ask for clarification when requirements are ambiguous
- Suggest improvements to architecture and processes
- Flag potential security or performance concerns

5. **Incremental Development**
6. **Incremental Development**
- Start with simple, working solutions
- Iterate and improve based on feedback
- Consider backwards compatibility when making changes
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -0,0 +1 @@
Refer to [AGENTS.md](AGENTS.md) for all project guidelines, coding standards, and AI assistant instructions.
31 changes: 31 additions & 0 deletions PRD.md
@@ -129,6 +129,25 @@ A GenAI-powered catalog enrichment system that transforms basic product images i
- Support automated filtering or flagging of low-quality generated images
- Ensure background differences from original are not penalized (backgrounds should differ)

### FR-10: Product FAQ Generation
- Generate 3-5 frequently asked questions and answers for each product
- FAQs are derived from the final enriched catalog data (after VLM analysis, user data merge, and branding)
- Questions cover practical shopper topics: materials, care instructions, sizing, use cases, compatibility, durability
- Answers are concise (1-3 sentences), factual, and grounded in the enriched product data
- Support locale-aware FAQ generation across all 10 supported regional locales
- Separate `/vlm/faqs` endpoint allows asynchronous generation — details display immediately while FAQs load in the background
- UI displays FAQs in a dedicated tab with collapsible accordion items
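The FR-10 constraints (3-5 FAQs, answers of 1-3 sentences) lend themselves to a simple validation pass. A minimal sketch, assuming a hypothetical `validate_faqs` helper that is not part of the shipped code:

```python
import re

def validate_faqs(faqs: list[dict]) -> list[str]:
    """Check generated FAQs against FR-10: 3-5 items, each answer
    1-3 sentences. Returns a list of violation messages (empty = valid).
    Hypothetical helper for illustration, not the actual implementation."""
    problems = []
    if not 3 <= len(faqs) <= 5:
        problems.append(f"expected 3-5 FAQs, got {len(faqs)}")
    for i, faq in enumerate(faqs):
        answer = faq.get("answer", "").strip()
        # Split on sentence-ending punctuation followed by whitespace
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s]
        if not 1 <= len(sentences) <= 3:
            problems.append(f"FAQ {i}: answer has {len(sentences)} sentences")
    return problems
```

A check like this could run after generation to flag responses that need regeneration rather than silently shipping malformed FAQs.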

### FR-11: Policy Compliance Checking
- Accept PDF policy documents through a persistent policy library (`/policies` endpoint)
- Parse and normalize uploaded PDFs into structured policy summaries
- Embed normalized policy records using NVIDIA embeddings and store in Milvus vector database
- During product analysis, perform semantic retrieval of relevant policy records
- Run compliance classification against enriched product data and retrieved policy records
- Return pass/fail status with matched policies, rule details, reasons, evidence, and warnings
- Support deduplication of repeated policy uploads by content hash
- Display compliance results in the UI with visual pass/fail indicators
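The deduplication-by-content-hash requirement above can be sketched as follows. Function and registry names are illustrative assumptions, not the actual implementation:

```python
import hashlib

# Hypothetical in-memory registry: content hash -> stored policy id.
# The real system would persist this alongside the Milvus records.
_policy_registry: dict[str, str] = {}

def register_policy_pdf(pdf_bytes: bytes, policy_id: str) -> tuple[str, bool]:
    """Return (policy_id, is_new). Re-uploads of byte-identical PDFs
    resolve to the existing record instead of creating a duplicate
    embedding in the vector store."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    if digest in _policy_registry:
        return _policy_registry[digest], False
    _policy_registry[digest] = policy_id
    return policy_id, True
```

Hashing the raw bytes catches exact re-uploads; catching re-uploads of re-exported PDFs would require hashing the normalized policy text instead.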

## Technical Requirements

### TR-1: Model Integration
@@ -230,6 +249,16 @@ A GenAI-powered catalog enrichment system that transforms basic product images i
**I want to** receive automated quality assessments with detailed scoring and issue detection for generated product images
**So that** I can quickly identify and filter out low-quality variations without manual review, ensuring only high-quality assets enter my catalog

### US-8: Product FAQ Generation
**As an** e-commerce content manager
**I want to** automatically generate frequently asked questions and answers for each product based on its enriched catalog data
**So that** I can populate product FAQ sections without manual copywriting, improving the customer shopping experience

### US-9: Policy Compliance Checking
**As a** catalog compliance officer
**I want to** upload policy PDFs and have the system automatically check enriched product listings against those policies
**So that** I can ensure all catalog entries comply with marketplace regulations and internal guidelines before publishing

## Success Criteria

- **Processing Time**: <1 minute per product for complete enrichment (including quality assessment)
@@ -256,6 +285,8 @@ A GenAI-powered catalog enrichment system that transforms basic product images i
- [ ] FR-7: Social Media Content Integration
- [x] ~~FR-8: Brand Voice & Taxonomy Customization~~ *(Complete with brand_instructions parameter support)*
- [x] ~~FR-9: Automated Quality Assessment for Generated Images~~ *(VLM-based reflection module integrated into image generation pipeline)*
- [x] ~~FR-10: Product FAQ Generation~~ *(Separate /vlm/faqs endpoint with async loading, Kaizen Tabs + Accordion UI)*
- [x] ~~FR-11: Policy Compliance Checking~~ *(PDF policy library with Milvus embeddings, semantic retrieval, compliance classification)*

- [ ] TR-1: Model Integration
- [x] ~~NVIDIA Nemotron VLM API integration~~
4 changes: 3 additions & 1 deletion README.md
@@ -25,7 +25,9 @@ A GenAI-powered catalog enrichment system that transforms basic product images i
- **Cultural Image Generation**: Create culturally-appropriate product backgrounds (Spanish courtyards, Mexican family spaces, British formal settings)
- **Quality Evaluation**: Automated VLM-based quality assessment of generated images with detailed scoring
- **3D Asset Generation**: Transform 2D product images into interactive 3D GLB models using Microsoft TRELLIS
- **Modular API**: Separate endpoints for VLM analysis, image generation, and 3D asset generation
- **Product FAQ Generation**: Automatically generate 3-5 product FAQs from enriched catalog data
- **Policy Compliance**: Upload policy PDFs and automatically check product listings against them using RAG + Milvus
- **Modular API**: Separate endpoints for VLM analysis, FAQ generation, image generation, and 3D asset generation

## Documentation

77 changes: 73 additions & 4 deletions docs/API.md
@@ -36,8 +36,9 @@ Health check endpoint for monitoring service status.
The API provides a modular approach for optimal performance and flexibility:

**1) Fast VLM Analysis (POST `/vlm/analyze`)** - Get product fields quickly
**2) Image Generation (POST `/generate/variation`)** - Generate 2D variations on demand
**3) 3D Asset Generation (POST `/generate/3d`)** - Generate 3D models on demand
**2) FAQ Generation (POST `/vlm/faqs`)** - Generate product FAQs from enriched data
**3) Image Generation (POST `/generate/variation`)** - Generate 2D variations on demand
**4) 3D Asset Generation (POST `/generate/3d`)** - Generate 3D models on demand

**Benefits of this approach:**
- Display product information immediately to users
@@ -273,7 +274,75 @@ curl -X POST \

---

## 3️⃣ Image Generation: `/generate/variation`
## 3️⃣ FAQ Generation: `/vlm/faqs`

Generate 3-5 frequently asked questions and answers for a product based on its enriched catalog data. Designed to be called after `/vlm/analyze` completes, using the enriched result as input.

**Endpoint**: `POST /vlm/faqs`
**Content-Type**: `multipart/form-data`

### Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `title` | string | No | Product title from VLM analysis |
| `description` | string | No | Product description from VLM analysis |
| `categories` | JSON string | No | Categories array (default: `[]`) |
| `tags` | JSON string | No | Tags array (default: `[]`) |
| `colors` | JSON string | No | Colors array (default: `[]`) |
| `locale` | string | No | Regional locale code (default: `en-US`) |
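Note that the array parameters are sent as JSON-encoded strings within the multipart form, not as repeated form keys. A minimal client-side sketch of assembling the form fields per the table above (helper name is an illustration, not part of the API):

```python
import json

def build_faq_form(title=None, description=None, categories=None,
                   tags=None, colors=None, locale="en-US"):
    """Assemble multipart form fields for POST /vlm/faqs.

    Array fields are JSON-encoded strings, matching the parameter
    table; optional string fields are omitted when not provided.
    """
    form = {
        "categories": json.dumps(categories or []),
        "tags": json.dumps(tags or []),
        "colors": json.dumps(colors or []),
        "locale": locale,
    }
    if title:
        form["title"] = title
    if description:
        form["description"] = description
    return form
```

The resulting dict can be passed as the form payload of any multipart-capable HTTP client.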

### Response Schema

```json
{
"faqs": [
{
"question": "string",
"answer": "string"
}
]
}
```

### Usage Example

```bash
# Call after /vlm/analyze to generate FAQs from enriched data
curl -X POST \
-F "title=Craftsman 20V Cordless Lawn Mower" \
-F "description=A cordless lawn mower featuring a black and red design..." \
-F 'categories=["electronics"]' \
-F 'tags=["cordless","lawn mower","Craftsman"]' \
-F 'colors=["black","red"]' \
-F "locale=en-US" \
http://localhost:8000/vlm/faqs
```

### Example Response

```json
{
"faqs": [
{
"question": "What type of battery does this mower use?",
"answer": "This mower operates on a 20V cordless battery system, providing the flexibility to mow without a power cord."
},
{
"question": "Does this mower come with a grass collection bag?",
"answer": "Yes, it includes a rear-mounted grass collection bag for convenient clippings management."
},
{
"question": "What are the main colors of this mower?",
"answer": "The mower features a black and red color scheme with prominent Craftsman branding."
}
]
}
```

---

## 4️⃣ Image Generation: `/generate/variation`

Generate culturally-appropriate product variations using FLUX models based on VLM analysis results.

@@ -334,7 +403,7 @@ curl -X POST \

---

## 4️⃣ 3D Asset Generation: `/generate/3d`
## 5️⃣ 3D Asset Generation: `/generate/3d`

Generate interactive 3D GLB models from 2D product images using Microsoft's TRELLIS model.

170 changes: 170 additions & 0 deletions docs/hallucination-report.md
@@ -0,0 +1,170 @@
# LLM Enhancement Hallucination Report

**Date:** 2026-04-15
**Reported by:** Antonio Martinez
**Status:** Open — Separate task pending
**Affected component:** `src/backend/vlm.py` — `_call_nemotron_enhance_vlm()` (Step 1 enhancement)

---

## Summary

The VLM model (`nemotron-nano-12b-v2-vl`, 12B parameters) introduces hallucinations at the source — misreading visible text, fabricating materials and features, and drawing from training data rather than strictly describing the image. The LLM enhancement step (`_call_nemotron_enhance_vlm`) then compounds these errors by rewriting them into confident marketing copy. Both layers contribute, but the root cause is the VLM.

---

## Root Cause Analysis

### Pipeline Flow

```
Image Upload
|
v
[VLM] _call_vlm() <-- Accurate visual analysis
| Model: nemotron-nano-12b-v2-vl
| Output: title, description, categories, tags, colors
v
[LLM] _call_nemotron_enhance_vlm() <-- Hallucinations introduced HERE
| Model: nemotron-3-nano
| Task: "Write rich, persuasive product description"
v
[LLM] _call_nemotron_apply_branding() <-- Inherits errors from Step 1
| (only runs if brand_instructions provided)
v
[LLM] _call_nemotron_generate_faqs() <-- Consumes VLM output directly,
(runs in parallel with Step 1) but FAQs still affected if
VLM has minor OCR issues
```

### Where the Problem Lives

**Layer 1 — VLM** (`src/backend/vlm.py`, `_call_vlm()`):
The 12B VLM model misreads text, fabricates materials/features, and fills in details from training data. This happens regardless of prompt complexity — even "describe this product" triggers hallucinations. Longer prompts produce *more* hallucinations, not fewer. This is confirmed by the NVIDIA research team: longer system prompts degrade VLM output quality for this model class.

**Layer 2 — LLM Enhancement** (`src/backend/vlm.py`, `_call_nemotron_enhance_vlm()`):
The LLM takes the already-hallucinated VLM output and rewrites it into confident marketing copy, compounding errors and adding its own fabrications. Skipping this step when no user data is provided eliminates the second layer.

---

## Evidence: Craftsman 2XV20 Lawn Mower

### Test Image

`mower.jpeg` — Craftsman battery-powered lawn mower with "2XV20" printed on the deck (indicating dual V20 battery platform).

### VLM Direct Testing (2026-04-15)

Three prompts were tested against the same VLM endpoint (`nemotron-nano-12b-v2-vl`) with `mower.jpeg`:

**Prompt 1 — Minimal: "describe this product"**

> "This product is a Craftsman 20-inch 20V MAX Lithium Ion Cordless Lawn Mower. It's a compact, electric lawn mower designed for residential use. The mower features a 20-inch cutting deck [...] The 20V MAX Lithium Ion battery provides cordless convenience [...] includes a grass collection bag [...] equipped with a safety key to prevent accidental startups."

- Gets closest to reality: correctly identifies it as cordless/battery-powered ("20V MAX Lithium Ion")
- Still fabricates: "20-inch cutting deck", "safety key"
- Clearly pulling from Craftsman training data rather than reading "2XV20" text

**Prompt 2 — Detailed descriptive: "In detail, give a description of this image, include everything you see including texts. Be extremely descriptive."**

> "The cutting deck itself is marked with the text '20' indicating the width of the cutting blade in inches [...] a clear plastic cover over the cutting deck, allowing a view of the blades inside."

- Misreads "2XV20" as "20" and reinterprets it as cutting width
- Fabricates "clear plastic cover over the cutting deck"
- More hallucinations than the minimal prompt

**Prompt 3 — Catalog enrichment structured prompt (our production prompt)**

> `"title": "Craftsman 20-Inch Electric Lawn Mower"` ... `"clear plastic front cover"` ... `"control panel on the handlebar"` ... `"model number '20' is visible on the front"`

- Same hallucinations as prompt 2, now in JSON format
- Fabricates: "clear plastic front cover", "control panel on the handlebar"
- Misreads "2XV20" as "20" and calls it a model number

### Key Finding: Hallucinations Originate in the VLM

Initial analysis attributed hallucinations to the LLM enhancement step. **Direct VLM testing disproved this.** The VLM itself:
1. Misreads "2XV20" as "20" across all prompt styles
2. Fabricates materials ("clear plastic") and features ("control panel", "safety key") not visible in the image
3. Draws from training data about Craftsman products rather than strictly describing the image
4. Performs *worse* with longer, more detailed prompts — the minimal prompt produced the fewest hallucinations

### Hallucination Inventory (VLM output, all prompts combined)

| Claim | Reality | Type | Source |
|-------|---------|------|--------|
| "20-Inch" cutting width | "2XV20" is Craftsman's dual V20 battery platform | Text misread | VLM |
| "clear plastic cutting deck/cover" | Deck is opaque black | Fabricated material | VLM |
| "control panel on the handlebar" | Only a safety lever is visible | Fabricated feature | VLM |
| "safety key" | No safety key visible | Fabricated feature | VLM |
| "Electric Lawn Mower" (prompt 2/3) | Battery-powered (cordless) | Training data inference | VLM |
| "silver accents" on wheels | Wheels are entirely black | Fabricated detail | LLM enhancement |
| "red power button" | Not visible | Fabricated feature | LLM enhancement |

The LLM enhancement step compounded the VLM's errors (adding "silver accents", "red power button"), but the root cause is the 12B VLM model's vision limitations.

---

## Proposed Solution

### Fix 1 (Implemented): Skip LLM Enhancement When Unnecessary

**Status: Done** — merged in this branch.

The LLM enhancement step is now skipped when no user product data is provided. This eliminates the second layer of hallucinations.

| Scenario | Current Behavior | New Behavior |
|----------|-----------------|--------------|
| Image only (no user data, no brand instructions) | VLM -> LLM enhance -> output | VLM -> output directly (skip LLM) |
| Image + user product data | VLM -> LLM enhance (merge) -> output | VLM -> LLM enhance (merge) -> output (keep) |
| Image + brand instructions | VLM -> LLM enhance -> LLM brand -> output | VLM -> LLM brand -> output |
| Image + user data + brand instructions | VLM -> LLM enhance -> LLM brand -> output | VLM -> LLM enhance -> LLM brand -> output (keep) |
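The routing table above can be expressed as a small planner. Step names follow the functions in `src/backend/vlm.py`, but the planner itself is a sketch of the intended control flow, not the merged code:

```python
def plan_enrichment_steps(has_user_data: bool,
                          has_brand_instructions: bool) -> list[str]:
    """Return the ordered pipeline steps per the new behavior.

    LLM enhancement runs only when user product data must be merged;
    branding runs only when brand_instructions are provided.
    """
    steps = ["vlm_analyze"]
    if has_user_data:
        steps.append("llm_enhance")   # merge user data with VLM output
    if has_brand_instructions:
        steps.append("llm_brand")     # apply brand voice
    return steps
```

In the image-only case this reduces the pipeline to the VLM alone, which is exactly what eliminates the second hallucination layer.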

### Fix 2 (Future): Shorten the VLM Prompt

The current VLM prompt in `_call_vlm()` is ~30 lines with detailed rules, category lists, formatting instructions, and output constraints. Testing showed that a minimal prompt ("describe this product") produced the fewest hallucinations — the VLM correctly identified the mower as "20V MAX Lithium Ion Cordless" with that prompt, while the long structured prompt caused it to misread "2XV20" as "20" and fabricate features.

This is confirmed by the NVIDIA research team: longer system prompts degrade output quality for this VLM model class. The model spends capacity following formatting rules rather than focusing on accurate visual analysis.

**Proposed approach:**
- Strip the VLM prompt down to a short, focused instruction — prioritize visual accuracy over output formatting
- Move structural concerns (JSON format, category validation, tag count) to a lightweight post-processing step or a separate LLM call
- Test iteratively: compare hallucination rates across prompt lengths using a set of test images (mower, shoes, skincare, etc.)

**Trade-off:** A shorter VLM prompt may return unstructured text instead of clean JSON. This would require parsing the free-text output into structured fields, either with regex/heuristics or a fast LLM call. The benefit is more accurate visual descriptions at the source.
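The parsing fallback could look like the following sketch: try strict JSON first, then salvage fields heuristically. All names and the heuristic itself are assumptions for illustration:

```python
import json
import re

def parse_vlm_output(text: str) -> dict:
    """Parse VLM output into structured fields.

    Tries strict JSON first; if a shorter prompt returned free text,
    falls back to a crude heuristic (first sentence as title). A real
    implementation might hand the free text to a fast LLM call instead.
    """
    try:
        data = json.loads(text)
        if isinstance(data, dict):
            return data
    except ValueError:
        pass
    # Heuristic fallback: first sentence -> title, full text -> description
    first_sentence = re.split(r"(?<=[.!?])\s", text.strip(), maxsplit=1)[0]
    return {"title": first_sentence.rstrip("."), "description": text.strip()}
```

Hallucination-rate comparisons across prompt lengths would then measure the combined VLM + parser output, since the parser can only restructure, not correct, what the VLM saw.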

### Fix 3 (Future): Upgrade VLM Model

The `nemotron-nano-12b-v2-vl` (12B parameters) has fundamental vision limitations with stylized text and detail accuracy. A larger VLM (72B+) would likely improve OCR accuracy and reduce training-data hallucinations. This is an infrastructure/cost trade-off rather than a code change.

---

## Impact on FAQ Feature

The FAQ generation feature (`_call_nemotron_generate_faqs`) consumes the raw VLM observation directly (not the enhanced output), which reduces but does not eliminate the risk:

- FAQs generated from accurate VLM output will be factually grounded
- Minor VLM OCR errors (e.g., "2x20" vs "2XV20") can still propagate into FAQ answers
- If the proposed fix (skip enhancement) is implemented, the Details tab and FAQ tab will both be grounded in the same factual VLM observation, creating consistency

---

## Reproduction Steps

1. Start the backend and frontend services
2. Upload `mower.jpeg` (Craftsman 2XV20 lawn mower)
3. Click Generate with default settings (no product data, no brand instructions)
4. Observe the enriched description in the Details tab
5. Compare against the VLM's raw output (visible in backend logs at `[VLM]` level)

---

## Files Referenced

| File | Relevance |
|------|-----------|
| `src/backend/vlm.py:128-205` | `_call_nemotron_enhance_vlm()` — where hallucinations are introduced |
| `src/backend/vlm.py:167-186` | Enhancement prompt with insufficient anti-hallucination rules |
| `src/backend/vlm.py:175` | Current anti-hallucination rule (too narrow — numbers only) |
| `src/backend/vlm.py:397-439` | `_call_nemotron_enhance()` — orchestrator where the skip logic would go |
| `src/backend/vlm.py:441-510` | `_call_vlm()` — VLM analysis (produces accurate output) |