Idea: using a 131-question "tension crash test" TXT pack as a long-horizon stress profile in EvalScope #1195
onestardao
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Hi, and thanks for building EvalScope.
The way you position it as a one-stop evaluation and stress-testing framework for large models is very close to what I have been looking for.
I wanted to share a slightly unusual asset I have been working on, and ask how you would think about integrating something like this into EvalScope, either now or in the future.
I maintain an open TXT pack called “WFGY 3.0 · Singularity Demo (BlackHole-131)”.
It is a plain-text test field with 131 “S-class” questions designed to push models into long-horizon, high-tension reasoning: alignment edge cases, extreme physics, long-term planning, and other situations where models often look fine for 5–10 turns and then suddenly drift or collapse.
Some properties that might be relevant for EvalScope:
It is pure text, no code, no external network calls. Any LLM that supports file upload can run it.
It is designed as a long-horizon tension crash test rather than a one-shot benchmark. You can either sample individual questions, or keep a model inside the whole sequence and watch when it breaks.
The project is fully open source (MIT) and has been developed in the open on GitHub with ~1.4k stars.
What I am trying to figure out is:
From the EvalScope point of view, does a “tension crash test” TXT pack like this make sense as
a custom dataset that users can plug into your existing pipelines,
a candidate “stress profile” for long-context / long-horizon evaluation, or
something better kept as an external experiment until the scoring schemes are more standardized?
If you think this kind of workload is interesting, what would be the minimal, cleanest way to wrap it so it fits the EvalScope architecture?
For example, would you prefer:
a very small, clearly defined subset of questions with explicit metrics, or
a more open-ended “stress lab” where users inspect traces and failure cases?
I am not looking for marketing here.
I mostly want to align with your design philosophy before I invest time in adapting BlackHole-131 into an EvalScope-shaped benchmark.
Happy to share a tiny subset of the TXT plus example traces if that helps clarify what the test actually looks like in practice.
Thanks again for maintaining EvalScope and documenting it so thoroughly.
Beta Was this translation helpful? Give feedback.
All reactions