Commit 6ea2cb5

feat: update experiment summary for clarity and detail; refine corruption strategy evaluation description
1 parent 7a1d179 commit 6ea2cb5

File tree

1 file changed: +2 −2 lines changed


index.html

Lines changed: 2 additions & 2 deletions
@@ -87,9 +87,9 @@ <h2>Use Cases</h2>
 
 <section id="experiments-llm-selection">
 <h2>RQ1: Can LLMs Generate Consistent and Context-Aware Corruption Strategies?</h2>
-<p class="exp-summary" data-exp="llm-selection"><strong>Answer:</strong> Yes, LLMs produce stable, context-aware corruption sets, enabling automated domain-specific robustness evaluation. <strong>Setup:</strong> We prompted GPT‑4o (temperature 0) with each domain description and the predefined corruption list, repeating 10 runs per domain. Final sets were chosen by majority vote and sanity-checked against simple domain-specific whitelist/blacklist rules. <strong>Results:</strong> Core transforms (Brightness, Contrast, ImageRotation) are selected reliably across domains, while domain-specific choices (e.g., CloudGenerator for Satellite, Rain for Driving) reflect contextual adaptation; all final sets satisfy whitelist/blacklist constraints.</p>
+<p class="exp-summary" data-exp="llm-selection"><strong>Answer:</strong> Across a broad set of open- and closed-source LLMs, corruption strategies remain largely consistent and domain-appropriate. Several models across different sizes and architectures produce fully plausible selections, while others show occasional deviations. Performance does not follow a simple trend with model scale. Among these models, GPT-4o provides fully consistent selections and is used as the reference for downstream robustness experiments. <strong>Setup:</strong> We evaluated corruption-selection behavior across eleven LLMs: closed-source (GPT-4o, GPT-4o-mini), open-source large (GPT-OSS-120B, Qwen2.5-110B, Llama-3.3-70B, DeepSeek-R1-70B), and open-source medium/small (Gemma-3-27B, Llama-4-Scout-17B, Phi-4-14B, Qwen2.5-8B, Mistral-8x7B). Each model produced 10 sampling runs per domain using different seeds. Final corruption sets were obtained by majority vote and checked against domain-specific whitelist/blacklist constraints encoding expert prior knowledge. <strong>Results:</strong> GPT-4o, Llama-3.3-70B, and Llama-4-Scout-17B exhibit zero violations, demonstrating fully domain-consistent behavior. Performance does not correlate with scale: some large models (e.g., DeepSeek-R1-70B) show more violations than mid-size models such as Gemma-3-27B or Phi-4-14B. The mixture-of-experts Mistral-8x7B shows the highest number of violations. Core transforms (Brightness, Contrast, ImageRotation) are selected consistently across domains, while domain-specific choices (e.g., CloudGenerator for Satellite, Rain for Driving) demonstrate contextual adaptation.</p>
 <div class="plot-block">
-<h3>Corruption Selection Frequencies (GPT4o)</h3>
+<h3>Corruption Selection Frequencies (GPT-4o)</h3>
 <img src="assets/experiments/augmentation_selection_gpt-4o_heatmap_flipped.png" alt="Heatmap of GPT-4o corruption selection frequencies across domains (10 runs per domain). Green = whitelist, Red = blacklist. Majority vote used for final sets." />
 </div>
 </section>
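The selection pipeline described in the updated summary (10 sampling runs per domain, majority vote over the runs, then a check against whitelist/blacklist constraints) can be sketched as below. This is a minimal illustration, not the project's implementation; the function names and the example domain data are hypothetical.

```python
from collections import Counter

def majority_vote(runs, threshold=0.5):
    """Keep corruptions selected in more than `threshold` of the sampling runs."""
    counts = Counter(c for run in runs for c in run)
    n = len(runs)
    return {c for c, k in counts.items() if k / n > threshold}

def check_constraints(selected, whitelist, blacklist):
    """Return (whitelist items missing from the set, blacklist items wrongly picked)."""
    missing = whitelist - selected
    violations = selected & blacklist
    return missing, violations

# Hypothetical data: 10 runs for a "Driving"-style domain.
runs = [{"Brightness", "Contrast", "Rain"}] * 7 + [{"Brightness", "Rain"}] * 3
final = majority_vote(runs)  # {"Brightness", "Contrast", "Rain"}

missing, violations = check_constraints(
    final,
    whitelist={"Brightness", "Rain"},   # must appear for this domain
    blacklist={"CloudGenerator"},       # must not appear for this domain
)
# Zero missing items and zero violations would count as fully
# domain-consistent behavior in the sense used in the summary.
```

Under this sketch, a model's violation count is simply the number of blacklist hits (plus missed whitelist items) aggregated over domains.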
