MassGen Case Study Template

MassGen is focused on case-driven development. New features should be discovered and reflected through performance improvements grounded in real-world use cases.

Importantly, each new version of MassGen will be associated with a case study detailed in this way.

Use this template to detail the "why" behind development changes. Once a new version is created, a corresponding case study in this format will be produced that explains the motivation behind the new features, the new features themselves, and the corresponding improvements that result.

🤝 Contributing

To guide future versions of MassGen, we encourage anyone to submit an issue using the corresponding case-study issue template based on the "PLANNING PHASE" section found in this template.

Then, this issue will be resolved by a PR (use the link in the issue template), describing new features and with a link to a full case study doc. This can happen in multiple ways: 1) the development team will try to address the issue, scheduling it for a further release, or 2) you can contribute yourself, as in the CONTRIBUTING.md.

Regardless, please make sure you join our Discord to discuss your changes and get more involved with MassGen.

The more case studies we have, the more we can improve the features according to real use-cases and the better MassGen will be!

📋 PLANNING PHASE
🚀 TESTING PHASE
📊 EVALUATION & ANALYSIS
- Results
- 🎯 Conclusion
📌 Status Tracker

📋 PLANNING PHASE

Complete this section before implementation

📝 Evaluation Design

Prompt

The prompt should be a specific task that is driving the development of the new desired features. It should be clear that the new features are used when the prompt is answered. This need not be the original prompt you were using when you thought of the new set of desired features; instead, it should be carefully chosen such that it is relatively simple, clear, and unambiguous to allow for better evaluation.

Baseline Config

Provide the yaml file or link to the yaml file that describes the configuration file for the baseline.

Baseline Command

Place the command here that describes how to test MassGen on the baseline version:

massgen --config @examples/xxx "<prompt>"

🔧 Evaluation Analysis

Results & Failure Modes

Explain the results of running your command with the current version of MassGen, then how and why MassGen fails to perform well. Include both general details and specifics. If there is auxiliary information such as code or external artifacts that explain why it fails, include detailed descriptions, snippets, and/or screenshots here.

Success Criteria

Define success for this case study. Given we have outputs from two different versions of MassGen, how would we know the new version of MassGen is more successful? This will likely be based not only on the prompt, but also on how MassGen behaved when answering the prompt.

🎯 Desired Features

Describe the desired features we need to implement in MassGen to complete this case study well. This is an important step, as it will drive development of the new version. Thinking of new features to develop will often go hand-in-hand with trying out various targeted prompts and analyzing their outputs in detail.

🚀 TESTING PHASE

Complete this section after implementation

📦 Implementation Details

Version

Place the version or branch of MassGen here with the new features.

✨ New Features

Describe the new features that were implemented to improve MassGen to perform better on the case study.

New Config

Provide the yaml file or link to the yaml file that describes how to run the configuration file for the new version.

Note that this may be the same as the baseline config or not depending on what the new features look like. If it is the same, delete this section.

Command

Place the command here that describes how to test MassGen on the new version:

massgen --config @examples/xxx "<prompt>"

Note that this may be the same as the baseline command or not depending on what the new features look like. If it is the same, delete this section.

🤖 Agents

Describe the agents used and any relevant information about them.

Agent 1: ...
Agent 2: ...
Agent 3: ...

🎥 Demo

Each case study should be accompanied by a recording demonstrating the new functionalities obtained by running MassGen with the described command.

Recording Your Demo

You can record terminal sessions automatically using the run_massgen_with_recording tool:

# Using the terminal evaluation config
massgen --config massgen/configs/tools/custom_tools/terminal_evaluation.yaml \
  "Record a demo of the case study config running: '<your case study prompt>'"

The tool will:

Record the MassGen session as MP4/GIF/WebM
Save the video to the workspace
Optionally evaluate the terminal display quality

Prerequisites:

Install VHS: brew install vhs (macOS) or go install github.com/charmbracelet/vhs@latest
OpenAI API key configured in .env

Recommended Settings:

Format: MP4 for high quality, GIF for easy embedding
Frames: 8-12 for evaluation feedback
Timeout: Adjust based on task complexity (60-600 seconds)

See the Terminal Evaluation User Guide for detailed instructions.

Demo Link

Replace with your actual recording. If using the recording tool, the video will be saved as workspace/massgen_terminal.{format}

📊 EVALUATION & ANALYSIS

Results

Given the success criteria described above, how does the new version of MassGen perform relative to the old one? Describe in specific terms and address each portion of the MassGen orchestration pipeline, highlighting different areas as applicable. It is important here to directly connect the results to the new features.

Like with baseline analysis, be specific enough such that it is clear to anyone reading this why the new version is better. If there is auxiliary information such as code or external artifacts, include detailed descriptions, snippets, and/or screenshots here.

Below are important aspects to consider describing:

The Collaborative Process

How do agents generate new answers? Did our new features change any aspects of this?

The Voting Pattern

How do agents vote? Did our new features change any aspects of this?

The Final Answer

How is the final answer created? Did our new features change any aspects of this?

Anything Else

Highlight anything else that our new features may have impacted, if applicable.

Note: If MassGen performs similarly to the baseline, we likely need to 1) change or improve the implementation of the desired features, or 2) change the prompt used in the case study to something simpler, easier to evaluate, or better representative of the changes.

🎯 Conclusion

Explain why the new features resulted in an improvement in performance after they were implemented. Also discuss the broader implications of the new features and what they can enable.

📌 Status Tracker

Planning phase completed
Features implemented
Testing completed
Demo recorded (consider using run_massgen_with_recording tool)
Terminal display evaluated (optional - use terminal evaluation tool for UX feedback)
Results analyzed
Case study reviewed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MassGen Case Study Template

🤝 Contributing

Table of Contents

📋 PLANNING PHASE

📝 Evaluation Design

Prompt

Baseline Config

Baseline Command

🔧 Evaluation Analysis

Results & Failure Modes

Success Criteria

🎯 Desired Features

🚀 TESTING PHASE

📦 Implementation Details

Version

✨ New Features

New Config

Command

🤖 Agents

🎥 Demo

Recording Your Demo

Demo Link

📊 EVALUATION & ANALYSIS

Results

The Collaborative Process

The Voting Pattern

The Final Answer

Anything Else

🎯 Conclusion

📌 Status Tracker

FilesExpand file tree

case-study-template.md

Latest commit

History

case-study-template.md

File metadata and controls

MassGen Case Study Template

🤝 Contributing

Table of Contents

📋 PLANNING PHASE

📝 Evaluation Design

Prompt

Baseline Config

Baseline Command

🔧 Evaluation Analysis

Results & Failure Modes

Success Criteria

🎯 Desired Features

🚀 TESTING PHASE

📦 Implementation Details

Version

✨ New Features

New Config

Command

🤖 Agents

🎥 Demo

Recording Your Demo

Demo Link

📊 EVALUATION & ANALYSIS

Results

The Collaborative Process

The Voting Pattern

The Final Answer

Anything Else

🎯 Conclusion

📌 Status Tracker