Commit 6ea2cb5

feat: update experiment summary for clarity and detail; refine corruption strategy evaluation description
1 parent 7a1d179 commit 6ea2cb5

File tree

1 file changed: +2 −2 lines changed


index.html

Lines changed: 2 additions & 2 deletions
@@ -87,9 +87,9 @@ <h2>Use Cases</h2>
 
 <section id="experiments-llm-selection">
 <h2>RQ1: Can LLMs Generate Consistent and Context-Aware Corruption Strategies?</h2>
-<p class="exp-summary" data-exp="llm-selection"><strong>Answer:</strong> Yes, LLMs produce stable, context-aware corruption sets, enabling automated domain-specific robustness evaluation. <strong>Setup:</strong> We prompted GPT‑4o (temperature 0) with each domain description and the predefined corruption list, repeating 10 runs per domain. Final sets were chosen by majority vote and sanity-checked against simple domain-specific whitelist/blacklist rules. <strong>Results:</strong> Core transforms (Brightness, Contrast, ImageRotation) are selected reliably across domains, while domain-specific choices (e.g., CloudGenerator for Satellite, Rain for Driving) reflect contextual adaptation; all final sets satisfy whitelist/blacklist constraints.</p>
+<p class="exp-summary" data-exp="llm-selection"><strong>Answer:</strong> Across a broad set of open- and closed-source LLMs, corruption strategies remain largely consistent and domain-appropriate. Several models across different sizes and architectures produce fully plausible selections, while others show occasional deviations. Performance does not follow a simple trend with model scale. Among these models, GPT-4o provides fully consistent selections and is used as the reference for downstream robustness experiments. <strong>Setup:</strong> We evaluated corruption-selection behavior across eleven LLMs: closed-source (GPT-4o, GPT-4o-mini), open-source large (GPT-OSS-120B, Qwen2.5-110B, Llama-3.3-70B, DeepSeek-R1-70B), and open-source medium/small (Gemma-3-27B, Llama-4-Scout-17B, Phi-4-14B, Qwen2.5-8B, Mistral-8x7B). Each model produced 10 sampling runs per domain using different seeds. Final corruption sets were obtained by majority vote and checked against domain-specific whitelist/blacklist constraints encoding expert prior knowledge. <strong>Results:</strong> GPT-4o, Llama-3.3-70B, and Llama-4-Scout-17B exhibit zero violations, demonstrating fully domain-consistent behavior. Performance does not correlate with scale: some large models (e.g., DeepSeek-R1-70B) show more violations than mid-size models such as Gemma-3-27B or Phi-4-14B. The mixture-of-experts Mistral-8x7B shows the highest number of violations. Core transforms (Brightness, Contrast, ImageRotation) are selected consistently across domains, while domain-specific choices (e.g., CloudGenerator for Satellite, Rain for Driving) demonstrate contextual adaptation.</p>
 <div class="plot-block">
-<h3>Corruption Selection Frequencies (GPT4o)</h3>
+<h3>Corruption Selection Frequencies (GPT-4o)</h3>
 <img src="assets/experiments/augmentation_selection_gpt-4o_heatmap_flipped.png" alt="Heatmap of GPT-4o corruption selection frequencies across domains (10 runs per domain). Green = whitelist, Red = blacklist. Majority vote used for final sets." />
 </div>
 </section>
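The selection pipeline described in the updated summary (10 sampling runs per domain, majority vote over the runs, then a check against whitelist/blacklist constraints) can be sketched as below. This is a minimal illustration, not the project's implementation; the function names and the example domain data are hypothetical.

```python
from collections import Counter

def majority_vote(runs, threshold=0.5):
    """Keep corruptions selected in more than `threshold` of the sampling runs."""
    counts = Counter(c for run in runs for c in run)
    n = len(runs)
    return {c for c, k in counts.items() if k / n > threshold}

def check_constraints(selected, whitelist, blacklist):
    """Return (whitelist items missing from the set, blacklist items wrongly picked)."""
    missing = whitelist - selected
    violations = selected & blacklist
    return missing, violations

# Hypothetical data: 10 runs for a "Driving"-style domain.
runs = [{"Brightness", "Contrast", "Rain"}] * 7 + [{"Brightness", "Rain"}] * 3
final = majority_vote(runs)  # {"Brightness", "Contrast", "Rain"}

missing, violations = check_constraints(
    final,
    whitelist={"Brightness", "Rain"},   # must appear for this domain
    blacklist={"CloudGenerator"},       # must not appear for this domain
)
# Zero missing items and zero violations would count as fully
# domain-consistent behavior in the sense used in the summary.
```

Under this sketch, a model's violation count is simply the number of blacklist hits (plus missed whitelist items) aggregated over domains.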
