| name | experiment-design |
|---|---|
| description | Use when the user wants to design experiments, plan ablation studies, structure baselines, or create incremental evaluation strategies. Triggers on phrases like "design ablation", "plan experiment", "what experiments should I run", "baseline comparison", or "experiment matrix". |
You are helping a researcher design rigorous experiments. Follow this methodology systematically.
Before designing any experiment:
- Ask what specific hypothesis or claim the experiment should support
- Identify the dependent variable (metric) and independent variables (factors)
- Clarify the baseline: what is the current best result or default configuration?
Every ablation study must change exactly ONE variable at a time. For each factor:
- Define the factor — what is being varied (e.g., loss function, learning rate, architecture component)
- List levels — all values this factor will take (e.g., CE, focal, VAR)
- Fix everything else — document what stays constant (seed, data split, epochs, hardware)
- Predict outcome — before running, state what you expect and why
Template for each ablation row:
| Run ID | Factor | Value | Fixed Config | Expected Outcome |
|--------|--------|-------|-------------|-----------------|
For multi-factor studies, use a structured matrix:
- Full factorial — if factors are few (≤3) and levels are few (≤3 each)
- Sequential elimination — if factors are many: run single-factor ablations first, then combine winners
- Latin square — if full factorial is too expensive: sample representative combinations
Always calculate total runs before committing:
Total runs = product of all factor levels
GPU hours = total runs × hours_per_run
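The counting rule above can be sketched in a few lines. The factor names, levels, and `hours_per_run` below are placeholders for illustration, not values from any real study:

```python
from itertools import product

# Hypothetical factors for illustration; replace with the study's own.
factors = {
    "loss": ["ce", "focal"],
    "lr": [1e-3, 3e-4],
    "dropout": [0.0, 0.1],
}
hours_per_run = 2.5  # should come from a timed pilot run, not a guess

# Full factorial: one run per combination of factor levels.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

total_runs = len(runs)                 # 2 * 2 * 2 = 8
gpu_hours = total_runs * hours_per_run

print(f"{total_runs} runs, {gpu_hours:.1f} GPU hours")
```

Enumerating the runs as dicts also gives you the experiment matrix for free: each dict is one row.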
For each experiment plan, estimate:
- GPU hours: runs × time_per_run (validate time_per_run with a pilot run on the user's actual hardware)
- API costs: if using external APIs (Gemini, OpenAI), estimate tokens × price
- Wall clock time: accounting for sequential dependencies and GPU availability
- Storage: checkpoint sizes × number of runs
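The estimate can be bundled into one helper. Every numeric constant in this sketch is a placeholder assumption, and the wall-clock figure ignores sequential dependencies between runs:

```python
# Rough resource estimate; all inputs below are placeholder assumptions.
def estimate_resources(n_runs, hours_per_run, ckpt_gb_per_run,
                       parallel_gpus=1, api_tokens_per_run=0,
                       usd_per_million_tokens=0.0):
    gpu_hours = n_runs * hours_per_run
    # Lower bound on wall clock: assumes perfect parallelism, no dependencies.
    wall_clock_hours = gpu_hours / max(parallel_gpus, 1)
    storage_gb = n_runs * ckpt_gb_per_run
    api_cost_usd = n_runs * api_tokens_per_run / 1e6 * usd_per_million_tokens
    return {
        "gpu_hours": gpu_hours,
        "wall_clock_hours": wall_clock_hours,
        "storage_gb": storage_gb,
        "api_cost_usd": api_cost_usd,
    }

est = estimate_resources(n_runs=8, hours_per_run=2.5,
                         ckpt_gb_per_run=1.2, parallel_gpus=4)
```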
Flag any plan whose total cost exceeds the user's stated budget (or, absent one, a reasonable bound for the project) and suggest which runs to prioritize or cut.
Generate configuration stubs that match the user's existing config format. Read existing configs first to match:
- File format (YAML, JSON, TOML)
- Key naming conventions
- Directory structure for outputs
- Logging/tracking integration (wandb, neptune, tensorboard)
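Once the project's format is known, stubs can be generated mechanically. The keys, paths, and the choice of JSON here are stand-ins; mirror whatever the existing configs actually use:

```python
import json
from pathlib import Path

# Hypothetical base config; copy the real project's keys and defaults instead.
base = {"seed": 42, "epochs": 50, "data_split": "v1", "loss": "ce"}

def write_stub(run_id, override, out_dir="configs/ablation"):
    """Write one per-run config: base settings plus a single-factor override."""
    cfg = {**base, **override, "run_id": run_id,
           "output_dir": f"runs/{run_id}"}
    path = Path(out_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cfg, indent=2))
    return path

stub_path = write_stub("abl_loss_focal", {"loss": "focal"})
```

Keeping the override to a single key enforces the one-variable-per-ablation rule at the config level.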
Create a concrete execution plan:
- Order runs by dependency (baselines first, then ablations)
- Identify which runs can be parallelized across GPUs
- Create a shell script or batch runner matching the project's existing patterns
- Include checkpointing strategy for long runs
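A dependency-ordered runner can be emitted as a shell script. In this sketch, `train.py`, the config paths, and the GPU ids are all hypothetical placeholders for the project's real entry point and layout:

```python
# Emit a bash runner: baselines first (sequential), then ablations
# round-robined across available GPUs. All paths below are placeholders.
baselines = ["configs/baseline.json"]
ablations = ["configs/abl_loss_focal.json", "configs/abl_lr_3e-4.json"]
gpus = [0, 1]

lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
# Baselines run first so every ablation compares against a finished run.
for cfg in baselines:
    lines.append(f"python train.py --config {cfg}")
lines.append("")
# Ablations are independent of each other, so they can run in parallel.
for i, cfg in enumerate(ablations):
    gpu = gpus[i % len(gpus)]
    lines.append(f"CUDA_VISIBLE_DEVICES={gpu} python train.py --config {cfg} &")
lines.append("wait")  # block until all parallel ablations finish

script = "\n".join(lines) + "\n"
```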
Before running, define how results will be analyzed:
- Which metrics to compare (primary + secondary)
- Statistical significance test if applicable (paired t-test, bootstrap CI)
- How to handle failed/crashed runs
- Visualization: what plots to generate (comparison tables, bar charts, learning curves)
Before finalizing the experiment plan:
- Each ablation changes exactly one variable
- Baseline is clearly defined and will be run with the same setup
- Resource estimate is within budget
- Config stubs match existing project format
- Analysis plan is defined before execution begins
- Seeds are fixed for reproducibility
Always produce:
- Experiment matrix table — all runs with their configurations
- Resource estimate — GPU hours, API costs, storage
- Execution script — ready-to-run commands matching project conventions
- Analysis plan — metrics, comparisons, visualizations