-
Notifications
You must be signed in to change notification settings - Fork 756
[Feature] Question: where would a long horizon "tension crash test" TXT pack fit in OpenCompass? #2396
Description
Describe the feature
Hi OpenCompass team, and thanks for building this platform.
From what I understand, OpenCompass is becoming a central place to run many different LLM benchmarks in a unified way, including long context tasks. I have a question about how to place a slightly unusual kind of workload.
On my side, I maintain an open TXT pack called "WFGY 3.0 · Singularity Demo (BlackHole-131)". It is a plain text universe of 131 S-class questions that cover things like alignment, extreme physics, long horizon reasoning, and high tension decision problems. The whole pack is designed to be read directly by an LLM that supports file input.
The idea is not just to ask one question, measure accuracy, and stop. Instead, you push a model through dozens of very hard questions in a row and watch when its internal world state quietly drifts or collapses. In other words, it is more of a long-horizon stability and "reasoning under semantic tension" test than a standard single-task benchmark.
My question is:
From the perspective of OpenCompass, is there a natural way to host this kind of TXT-only, long-horizon tension crash test?
For example:
- would this be closer to a "long context" dataset,
- or closer to a stress-test suite that people opt into when they care about failure modes and collapse patterns?
Right now everything is open source under MIT, fully readable as text, and already lives in a main GitHub repo with existing experiments and numbers. I am trying to understand whether this belongs inside OpenCompass, or whether it is too far from your current design.
If this seems at least directionally relevant, I am happy to:
- share a small subset as a prototype dataset, and
- provide traces of how different models behave when they run through the TXT pack.
Thanks for any guidance on whether and where a workload like this should live inside OpenCompass.
Will you implement it?
- I would like to implement this feature and create a PR!