Commit 00b08c3

feat: implement RQ1 model selector with dynamic button generation and heatmap updates
1 parent 6ea2cb5 commit 00b08c3

File tree: 3 files changed (+83, -3 lines)

css/style.css

Lines changed: 28 additions & 0 deletions

@@ -689,3 +689,31 @@ body.coming-soon-active #coming-soon-overlay .coming-soon-content {
   width: 40px;
   height: 40px;
 }
+
+/* RQ1 model selector buttons */
+.rq1-model-btn {
+  appearance: none;
+  -webkit-appearance: none;
+  background: #ffffff;
+  color: #495057;
+  border: 1px solid var(--border-color);
+  border-radius: 9999px;
+  padding: 0.35rem 0.75rem;
+  font-family: inherit;
+  font-size: 0.8rem;
+  font-weight: 600;
+  cursor: pointer;
+  transition: background-color 0.15s ease, border-color 0.15s ease, color 0.15s ease, box-shadow 0.15s ease;
+  white-space: nowrap;
+}
+.rq1-model-btn:hover {
+  background: var(--primary-light);
+  border-color: rgba(0, 123, 255, 0.3);
+  color: var(--primary-color);
+}
+.rq1-model-btn.active {
+  background: var(--primary-color);
+  border-color: var(--primary-color);
+  color: #ffffff;
+  box-shadow: 0 0 0 3px rgba(0, 123, 255, 0.25);
+}

index.html

Lines changed: 4 additions & 3 deletions

@@ -88,9 +88,10 @@ <h2>Use Cases</h2>
 <section id="experiments-llm-selection">
   <h2>RQ1: Can LLMs Generate Consistent and Context-Aware Corruption Strategies?</h2>
   <p class="exp-summary" data-exp="llm-selection"><strong>Answer:</strong> Across a broad set of open- and closed-source LLMs, corruption strategies remain largely consistent and domain-appropriate. Several models across different sizes and architectures produce fully plausible selections, while others show occasional deviations. Performance does not follow a simple trend with model scale. Among these models, GPT-4o provides fully consistent selections and is used as the reference for downstream robustness experiments. <strong>Setup:</strong> We evaluated corruption-selection behavior across eleven LLMs: closed-source (GPT-4o, GPT-4o-mini), open-source large (GPT-OSS-120B, Qwen2.5-110B, Llama-3.3-70B, DeepSeek-R1-70B), and open-source medium/small (Gemma-3-27B, Llama-4-Scout-17B, Phi-4-14B, Qwen2.5-8B, Mistral-8x7B). Each model produced 10 sampling runs per domain using different seeds. Final corruption sets were obtained by majority vote and checked against domain-specific whitelist/blacklist constraints encoding expert prior knowledge. <strong>Results:</strong> GPT-4o, Llama-3.3-70B, and Llama-4-Scout-17B exhibit zero violations, demonstrating fully domain-consistent behavior. Performance does not correlate with scale: some large models (e.g., DeepSeek-R1-70B) show more violations than mid-size models such as Gemma-3-27B or Phi-4-14B. The mixture-of-experts Mistral-8x7B shows the highest number of violations. Core transforms (Brightness, Contrast, ImageRotation) are selected consistently across domains, while domain-specific choices (e.g., CloudGenerator for Satellite, Rain for Driving) demonstrate contextual adaptation.</p>
+  <div id="rq1-model-thumbnails" class="thumbnail-container"></div>
   <div class="plot-block">
-    <h3>Corruption Selection Frequencies (GPT-4o)</h3>
-    <img src="assets/experiments/augmentation_selection_gpt-4o_heatmap_flipped.png" alt="Heatmap of GPT-4o corruption selection frequencies across domains (10 runs per domain). Green = whitelist, Red = blacklist. Majority vote used for final sets." />
+    <h3 id="rq1-plot-title">Corruption Selection Frequencies (GPT-4o)</h3>
+    <img id="rq1-heatmap" src="assets/experiments/augmentation_selection/gpt-4o/augmentation_selection_gpt-4o_heatmap_flipped.png" alt="Heatmap of corruption selection frequencies" />
   </div>
 </section>

@@ -114,7 +115,7 @@ <h2>RQ4: How Does Model Architecture Affect Robustness?</h2>

 <section id="experiments-pretraining">
   <h2>RQ5: How Does Pretraining Data Influence Robustness?</h2>
-  <p class="exp-summary" data-exp="pretraining"><strong>Answer:</strong> Curated, semantically aligned pretraining data yields higher accuracy and robustness than noisy corpora. <strong>Setup:</strong> Evaluate CLIP ViT-L/14 on five datasets: LAION-400M (Schuhmann et al., 2021), MetaCLIP FullCC (Xu et al., 2024), CommonPool, DataComp XL (Gadre et al., 2023), DFN2B (Fang et al., 2023), and OpenAI (Radford et al., 2021). Tested on six domains with domain-specific corruptions. <strong>Results:</strong> Curated models outperform noisy ones in accuracy and mean corruption error (mce).</p>
+  <p class="exp-summary" data-exp="pretraining"><strong>Answer:</strong> Models pretrained on curated, semantically aligned data generally achieve higher accuracy and enhanced robustness compared to those trained on larger, noisier corpora. However, the OpenAI baseline, while often weaker, can become the top performer when its proprietary training data is highly aligned with the target domain, as seen in People. <strong>Setup:</strong> To isolate the effect of pretraining data, we evaluate CLIP ViT-L/14 models trained on five distinct datasets representing a spectrum of data collection philosophies: LAION-400M (Schuhmann et al., 2021) and MetaCLIP FullCC (Xu et al., 2024) as large-scale, minimally filtered web scrapes; CommonPool and DataComp XL (Gadre et al., 2023) and DFN2B (Fang et al., 2023) as corpora prioritizing high-quality, semantic image-text alignment; and OpenAI (Radford et al., 2021) as a proprietary baseline. All models are evaluated under domain-specific corruption sets across six application domains. <strong>Results:</strong> Curated datasets consistently outperform larger, noisier corpora in both clean accuracy and mean corruption error (mCE). For instance, CommonPool achieves top-tier performance in Driving and Manufacturing, while LAION-400M lags behind with a notable robustness deficit in Medical. A critical exception is People, where the OpenAI baseline surpasses all other models, highlighting that domain-specific alignment of pretraining data can be more decisive than general data quality or scale.</p>
   <div class="experiment" data-exp="pretraining"></div>
 </section>

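The RQ1 summary in index.html describes forming final corruption sets by majority vote over 10 sampling runs and checking them against whitelist/blacklist constraints. A minimal sketch of that procedure, assuming illustrative function names and a simple "blacklisted pick" notion of violation, neither of which is code from this repository:

```javascript
// Sketch of the majority-vote selection summarized in the RQ1 paragraph.
// Function names and the violation definition are illustrative assumptions,
// not code from this repository.
function majorityVote(runs) {
  // runs: one array of corruption names per sampling run.
  const counts = {};
  for (const run of runs) {
    for (const corruption of run) {
      counts[corruption] = (counts[corruption] || 0) + 1;
    }
  }
  // Keep corruptions chosen in a strict majority of runs.
  const threshold = runs.length / 2;
  return Object.keys(counts).filter(c => counts[c] > threshold);
}

function countViolations(selected, blacklist) {
  // A violation here is any selected corruption on the domain blacklist.
  return selected.filter(c => blacklist.includes(c)).length;
}
```

For example, with runs `[['Brightness','Rain'], ['Brightness','Contrast'], ['Brightness','Rain']]`, the majority vote keeps Brightness and Rain, and `countViolations` reports zero if neither is blacklisted for the domain.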
js/main.js

Lines changed: 51 additions & 0 deletions

@@ -486,10 +486,61 @@
   return thumbnail;
 }

+// --- RQ1 MODEL SELECTOR ---
+const rq1Models = [
+  { id: 'gpt-4o', label: 'GPT-4o' },
+  { id: 'gpt-4o-mini', label: 'GPT-4o-mini' },
+  { id: 'gpt-oss_120b', label: 'GPT-OSS 120B' },
+  { id: 'qwen_110b', label: 'Qwen 110B' },
+  { id: 'llama3.3_70b', label: 'Llama-3.3 70B' },
+  { id: 'deepseek-r1_70b', label: 'DeepSeek-R1 70B' },
+  { id: 'gemma3_27b', label: 'Gemma-3 27B' },
+  { id: 'llama_4_scout_17b', label: 'Llama-4 Scout 17B' },
+  { id: 'phi4_14b', label: 'Phi-4 14B' },
+  { id: 'qwen3_8b', label: 'Qwen 8B' },
+  { id: 'mistral_latest', label: 'Mistral 8x7B' },
+];
+
+function buildRQ1ModelSelector() {
+  const container = document.getElementById('rq1-model-thumbnails');
+  const heatmap = document.getElementById('rq1-heatmap');
+  const title = document.getElementById('rq1-plot-title');
+  if (!container || !heatmap) return;
+
+  let selectedModel = rq1Models[0].id;
+
+  function update() {
+    const model = rq1Models.find(m => m.id === selectedModel);
+    const path = `assets/experiments/augmentation_selection/${selectedModel}/augmentation_selection_${selectedModel}_heatmap_flipped.png`;
+    heatmap.src = path;
+    heatmap.alt = `Heatmap of ${model.label} corruption selection frequencies`;
+    if (title) title.textContent = `Corruption Selection Frequencies (${model.label})`;
+    container.querySelectorAll('.rq1-model-btn').forEach(btn => {
+      btn.classList.toggle('active', btn.dataset.model === selectedModel);
+    });
+  }
+
+  rq1Models.forEach(model => {
+    const btn = document.createElement('button');
+    btn.type = 'button';
+    btn.className = 'rq1-model-btn';
+    btn.dataset.model = model.id;
+    btn.textContent = model.label;
+    btn.addEventListener('click', () => {
+      selectedModel = model.id;
+      update();
+    });
+    container.appendChild(btn);
+  });
+
+  update();
+}
+
 // --- INITIALIZATION ---
 function initialize() {
   buildCorruptionGallery();
   buildUseCasesSection();
+  buildRQ1ModelSelector();
   document.querySelectorAll('.experiment').forEach(expDiv => {
     const expKey = expDiv.dataset.exp;
     const basePath = `assets/experiments/${expKey}`;

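The asset path that `update()` builds inline could also be factored into a small helper, which makes the naming convention easy to check in isolation. A sketch, assuming the hypothetical name `heatmapPath`, which is not part of main.js:

```javascript
// Hypothetical helper mirroring the path template used inside update();
// an illustrative sketch, not code from main.js.
function heatmapPath(modelId) {
  return `assets/experiments/augmentation_selection/${modelId}/` +
    `augmentation_selection_${modelId}_heatmap_flipped.png`;
}
```

Note that the static `src` on `#rq1-heatmap` in index.html equals `heatmapPath('gpt-4o')`, so the page shows the GPT-4o heatmap even before the script runs.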