Note
This issue was translated by Claude.
Issue Checklist
- I understand that issues are for reporting problems and requesting features, not for off-topic comments, and I will provide as much detail as possible to help resolve the issue.
- I have checked the pinned issues and searched through the existing open issues, closed issues, and discussions and did not find a similar suggestion.
- I have provided a short and descriptive title so that developers can quickly understand the issue when browsing the issue list, rather than vague titles like "A suggestion" or "Stuck."
- The latest version of Cherry Studio does not include the feature I am suggesting.
Platform
Windows
Version
v2.0.0.0
Is your feature request related to an existing issue?
Is your feature request related to a problem? Please describe.
As LLM applications deepen, Prompt Engineering alone is no longer enough; fine-tuning and multi-agent collaboration are becoming inevitable. However, the current pain point is the "absence of evaluation":
- Difficult Evaluation: Whether for fine-tuned models or complex agent logic, there is no objective way to verify effectiveness. Testing today relies mainly on manual "vibe checks" and cannot be quantified.
- Data Scarcity: Building high-quality test sets or fine-tuning datasets is very time-consuming.
- Fragmented Workflow: Even with excellent open-source tools like easy-dataset, users still have to switch back and forth between the command line, scripts, and Cherry Studio, which makes for a disjointed experience.
Describe the solution you'd like
I suggest that Cherry Studio draw on the ideas behind easy-dataset and build in an integrated "Dataset & Eval Workbench" that closes the loop from data generation to effect evaluation.
Specific feature suggestions are as follows:
Automated Dataset Generation:
- Leverage Cherry Studio's existing model-provider connections and let users feed in raw corpora (plain text, PDF, or Markdown).
- Ship built-in prompt templates (modeled on easy-dataset) that use a highly capable model (e.g. DeepSeek-V3 or GPT-4) to automatically extract QA pairs, reasoning chains (CoT), or dialogue samples.
- Support exporting to JSONL in Alpaca or ShareGPT format so the data can go straight into fine-tuning (see the format sketch below).
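As a concrete illustration of the proposed export step, here is a minimal TypeScript sketch of serializing extracted QA pairs into Alpaca- and ShareGPT-style JSONL. The `QAPair` type and function names are hypothetical, not existing Cherry Studio code; only the two output formats themselves are standard.

```ts
// Hypothetical shape of one extracted sample (not an existing Cherry Studio type).
interface QAPair {
  question: string;
  answer: string;
  context?: string; // optional source passage the pair was extracted from
}

// Alpaca format: one {instruction, input, output} object per line.
function toAlpacaJsonl(pairs: QAPair[]): string {
  return pairs
    .map((p) =>
      JSON.stringify({
        instruction: p.question,
        input: p.context ?? "",
        output: p.answer,
      }),
    )
    .join("\n");
}

// ShareGPT format: one {conversations: [{from, value}, ...]} object per line.
function toShareGptJsonl(pairs: QAPair[]): string {
  return pairs
    .map((p) =>
      JSON.stringify({
        conversations: [
          { from: "human", value: p.question },
          { from: "gpt", value: p.answer },
        ],
      }),
    )
    .join("\n");
}
```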
Batch Inference & LLM-as-a-Judge:
- Provide an "evaluation mode" in which the user picks a generated test set and runs the target model (the model under test) against it in batch.
- Introduce LLM-as-a-Judge: configure a judge model (e.g. Claude 3.5 Sonnet or DeepSeek-R1) to score the outputs or compare them pairwise (win/loss rate), as sketched below.
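To make the judge mechanism concrete, below is a rough TypeScript sketch of a pairwise comparison call. The `chatCompletion` parameter stands in for whatever provider client Cherry Studio already exposes internally; it is an assumed placeholder, not a real API.

```ts
type Verdict = "A" | "B" | "tie";

// Ask a judge model to compare two answers to the same question.
// `chatCompletion` is a placeholder for an existing provider call.
async function judgePair(
  chatCompletion: (model: string, prompt: string) => Promise<string>,
  judgeModel: string,
  question: string,
  answerA: string, // e.g. baseline model output
  answerB: string, // e.g. fine-tuned / candidate model output
): Promise<Verdict> {
  const prompt = [
    "You are an impartial judge. Compare the two answers to the question below.",
    `Question: ${question}`,
    `Answer A: ${answerA}`,
    `Answer B: ${answerB}`,
    'Reply with exactly one word: "A", "B", or "tie".',
  ].join("\n\n");

  const reply = (await chatCompletion(judgeModel, prompt)).trim();
  return reply === "A" || reply === "B" ? reply : "tie";
}
```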
Visualization Reports:
- Display evaluation results in a simple view so users can quickly tell whether a prompt change or a fine-tune actually improved the model (see the summary sketch below).
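For the report itself, a simple aggregation over the judge verdicts may already be enough. The sketch below (reusing the hypothetical `Verdict` type from the previous example) computes win/loss/tie rates for the candidate model.

```ts
type Verdict = "A" | "B" | "tie"; // same hypothetical type as in the judge sketch

interface EvalSummary {
  total: number;
  winRate: number;  // share of samples where B (the candidate) beat A (the baseline)
  lossRate: number; // share where A beat B
  tieRate: number;
}

function summarize(verdicts: Verdict[]): EvalSummary {
  const total = verdicts.length;
  if (total === 0) return { total: 0, winRate: 0, lossRate: 0, tieRate: 0 };
  const wins = verdicts.filter((v) => v === "B").length;
  const losses = verdicts.filter((v) => v === "A").length;
  return {
    total,
    winRate: wins / total,
    lossRate: losses / total,
    tieRate: (total - wins - losses) / total,
  };
}
```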
Describe alternatives you've considered
- Manual Testing: Extremely inefficient and unable to cover corner cases.
- External Scripts: Python scripts can call libraries such as easy-dataset or Ragas, but this raises the barrier for non-technical users and cannot reuse the model services already configured in Cherry Studio.
- Commercial Evaluation Platforms: High cost and involves data privacy concerns.
Additional context
easy-dataset (https://github.com/ConardLi/easy-dataset) is an excellent reference implementation; it demonstrates that synthesizing data with LLMs is feasible.
As agents move toward industrial-grade deployment (looking ahead to 2026), data asset management and automated testing will be core competitive strengths for client tools. If Cherry Studio builds this capability in, it will become an indispensable model-iteration tool in developers' hands.
Desired Solution
No
Alternative Solutions
No
Additional Information
No response