<p class="exp-summary" data-exp="llm-selection"><strong>Answer:</strong> Across a broad set of open- and closed-source LLMs, corruption strategies remain largely consistent and domain-appropriate. Several models of different sizes and architectures produce fully plausible selections, while others show occasional deviations. Performance does not follow a simple trend with model scale. Among these models, GPT-4o provides fully consistent selections and is used as the reference for downstream robustness experiments. <strong>Setup:</strong> We evaluated corruption-selection behavior across eleven LLMs: closed-source (GPT-4o, GPT-4o-mini), open-source large (GPT-OSS-120B, Qwen2.5-110B, Llama-3.3-70B, DeepSeek-R1-70B), and open-source medium/small (Gemma-3-27B, Llama-4-Scout-17B, Phi-4-14B, Qwen2.5-8B, Mixtral-8x7B). Each model produced 10 sampling runs per domain with different seeds. Final corruption sets were obtained by majority vote and checked against domain-specific whitelist/blacklist constraints encoding expert prior knowledge. <strong>Results:</strong> GPT-4o, Llama-3.3-70B, and Llama-4-Scout-17B exhibit zero violations, demonstrating fully domain-consistent behavior. Performance does not correlate with scale: some large models (e.g., DeepSeek-R1-70B) show more violations than mid-size models such as Gemma-3-27B or Phi-4-14B. The mixture-of-experts Mixtral-8x7B shows the highest number of violations. Core transforms (Brightness, Contrast, ImageRotation) are selected consistently across domains, while domain-specific choices (e.g., CloudGenerator for Satellite, Rain for Driving) demonstrate contextual adaptation.</p>
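The aggregation step described in the setup (majority vote over seeded runs, followed by a whitelist/blacklist check) can be sketched as below; the function name, the 0.5 vote threshold, and the set-based data shapes are illustrative assumptions, not the experiment's actual implementation:

```python
from collections import Counter

def aggregate_corruptions(runs, whitelist, blacklist, threshold=0.5):
    """Majority-vote corruption selection across sampling runs.

    runs: list of sets of corruption names, one set per seeded run.
    threshold: fraction of runs a corruption must appear in (assumed 0.5).
    Returns (selected, violations); a violation is any selected
    corruption that is blacklisted or absent from the domain whitelist.
    """
    counts = Counter(c for run in runs for c in run)
    n = len(runs)
    # Keep corruptions chosen in a strict majority of the runs.
    selected = {c for c, k in counts.items() if k / n > threshold}
    # Check the voted set against the domain's expert constraints.
    violations = {c for c in selected if c in blacklist or c not in whitelist}
    return selected, violations
```

For example, three runs selecting {Brightness, Rain}, {Brightness}, and {Brightness, Rain} yield a majority set of {Brightness, Rain}; with both transforms whitelisted for the domain, the violation count is zero, matching the "zero violations" criterion reported above.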