Merged
1725 commits
a6d6a0e
[TASK] Added MindCube Task (#876)
oscarqjh Nov 3, 2025
da97bf2
Fix data loading for coco_karpathy_test (#884)
felifri Nov 6, 2025
0bfcd41
fix hallusionbench processing for distributed eval (#885)
felifri Nov 6, 2025
a1dd1a4
[feat] Add llava ov 1.5 chat (#887)
kcz358 Nov 6, 2025
706251f
Add Qwen3-VL models (w/o vllm or sglang) (#883)
ArdalanM Nov 7, 2025
9f3c23b
Update README.md (#891)
kcz358 Nov 10, 2025
e3a9a36
【Task】add UEval benchmark to lmms-eval (#890)
primerL Nov 12, 2025
5422162
[TASK] MME-SCI Benchmark (#878)
Xian-Gao Nov 16, 2025
062a03c
[benchmark] add SciVideoBench benchmark to lmms-eval (#875)
dengandong Nov 16, 2025
7058207
fix qwen to handle bsz>1 (#889)
ArdalanM Nov 16, 2025
fd310e9
[Fix] LongVila prepare request and gqa doc to visual (#906)
kcz358 Nov 16, 2025
7274632
Add OmniSpatial task (#896)
pangyyyyy Nov 17, 2025
97fefa8
edit current_tasks to include mathvision which is already implemented…
CLRT19 Nov 17, 2025
2aaeff4
[Docs] Add MME-SCI to current_tasks.md (#909)
Xian-Gao Nov 25, 2025
cada749
refactor mmstar with default template (#907)
ArdalanM Nov 25, 2025
74e6681
Update available models in __init__.py (#914)
Lornatang Nov 27, 2025
4848636
[bugfix] Fix nested dictionary input for vllm mm_processor_kwargs (#915)
dwlim-nota Nov 28, 2025
66f7e95
Update Qwen3VL video generation logic to match reproduce VideoMME off…
ArdalanM Nov 28, 2025
40a2ce4
Update apply method to handle empty responses (#917)
yhyang201 Dec 2, 2025
0e08d99
update qwen3vl (simple model) processor (#922)
ArdalanM Dec 6, 2025
f546c16
[bugfix] Filter unsupported model_kwargs for LLaVA-OneVision-1.5 (#924)
mwxely Dec 6, 2025
351a8b9
[feat] Add reasoning version of image and text dataset (#926)
kcz358 Dec 10, 2025
450b4d2
[Task] Spatial benchmarks: Blink, CV_Bench, Embspatial, ERQA (#927)
superAIyah Dec 11, 2025
ba9e26d
add VLMs are biased benchmark (#928)
fabianandresgrob Dec 11, 2025
c195eb1
Update MMMU and MMMUpro with Qwen3VL prompt (#929)
ArdalanM Dec 15, 2025
ebc5d88
add snsbench (#930)
ArdalanM Dec 15, 2025
4e5fa54
Implement 'Vision Language Models are Blind' benchmark (#931)
claasdeboer Dec 15, 2025
be6537d
[fix] Fixing the tqdm bar for the qwen vl series for batch infer (#936)
kcz358 Dec 15, 2025
59315b5
[feat] Bagel lmms-engine eval inference pipeline (#938)
kcz358 Dec 15, 2025
e46c534
[Dataset] Add Gedit Bench for Bagel in lmms-eval (#939)
kcz358 Dec 16, 2025
7d01e37
add jmmmu_pro (#937)
AtsuMiyai Dec 16, 2025
97dcc10
[Task] Pointing benchmarks: RefSpatial, Where2Place (#940)
superAIyah Dec 16, 2025
c16d574
[NEW TASK] Add FALCON-Bench to tasks (#942)
cplou99 Dec 18, 2025
58a2132
feat: add LongVT evaluation tasks for long video understanding with t…
mwxely Dec 18, 2025
eca29db
docs: update current_tasks.md with comprehensive model and task listings
Luodian Dec 22, 2025
442985a
Add claude GitHub actions 1767101104580 (#958)
Luodian Dec 30, 2025
9b1c44f
[bugfix] mmsi bugfix (#945)
superAIyah Dec 30, 2025
1ba114b
sub_task metrics added (#954)
superAIyah Dec 30, 2025
bc92dcb
[Task] Video Streaming Benchmark: OVOBench (#957)
Mikhail-Repin Dec 30, 2025
9be51c8
feat: add automated PR code review skill
Luodian Dec 30, 2025
69f9804
[feat]: add generate_until_multi_round for qwen_2_5_vl and qwen_2_vl …
Mikhail-Repin Dec 30, 2025
ae4d9a9
Add Qwen 3 Omni and Video Salmonn 2 (#955)
ngquangtrung57 Dec 30, 2025
f0dbb32
[Task] Add imgedit bench (#941)
kcz358 Dec 30, 2025
9f4254f
Add STARE task (#893)
pangyyyyy Dec 30, 2025
4b265b7
bugfix: missing fields in doc when using --log_samples (#731)
VincentYCYao Dec 30, 2025
e499653
The evaluation strategy is changed from LLM-judge to rule-judge + LLM…
mathCrazyy Dec 30, 2025
5da1952
feat: add task groundingme (#949)
lirang04 Dec 30, 2025
d667d10
[Task] add AV-SpeakerBench (#943)
plnguyen2908 Dec 30, 2025
399d087
add-task-seephys (#903)
RitzChow Dec 30, 2025
cdc5aef
WhisperTT evals (#899)
idjuricTT Dec 30, 2025
a015ba8
Add SpatialViz task (#894)
pangyyyyy Dec 30, 2025
d740e8b
easier code for multiple images (#879)
Kyunnilee Dec 30, 2025
d1bbf70
fix: filter multimodal content from log samples while preserving meta…
Luodian Dec 30, 2025
9e7e520
fix: improve spatialviz utils quality (#961)
Luodian Dec 30, 2025
c446df4
[Fix] Fix imgedit eval logic for call openai client (#966)
kcz358 Jan 4, 2026
e5b95ca
Add intern vl3 and internvl3_5 (#963)
ngquangtrung57 Jan 6, 2026
930faea
Update current_tasks.md (#965)
plnguyen2908 Jan 6, 2026
9a31194
[Fix] Qwen2VL batchsize>1 visual alignment (#971)
Terry-Uv Jan 6, 2026
d1d6faf
use deps lower bounds (#969)
Laurent2916 Jan 6, 2026
3a82c68
chore: remove automatic Claude code review workflow (#973)
Luodian Jan 6, 2026
dd751c4
[Task] Added VSIBench debiased & pruned (#975)
oscarqjh Jan 8, 2026
7f16618
docs: add i18n README translations for 18 languages (#979)
Luodian Jan 10, 2026
a1079e4
docs: improve Chinese translation quality (#980)
Luodian Jan 10, 2026
38c29c9
[TASK] Add Mantis-Eval Task (#978)
MYMY-young Jan 10, 2026
a877383
[Model] Added Cambrian-S model (#977)
oscarqjh Jan 10, 2026
901df9e
[feat] Init an http eval server and entrypoints for lmms_eval (#972)
kcz358 Jan 10, 2026
c3139de
docs: clarify that batch_size=auto is not implemented (#981)
Luodian Jan 10, 2026
60e6389
fix: add missing 'all' extra to pyproject.toml (#982)
songyuc Jan 11, 2026
cdde2a1
[Task] Added ViewSpatial task (#983)
oscarqjh Jan 12, 2026
fd03cff
[Task] Added SiteBench task (#984)
oscarqjh Jan 12, 2026
8abf4fb
[fix] Align the image text order in evaluation with original evaluati…
kcz358 Jan 12, 2026
082c103
fix: properly handle qwen2.5 video frames edge case (#987)
Luodian Jan 12, 2026
74049fb
[feat] Add decontamination probing settings for video benchmarks (#990)
mwxely Jan 13, 2026
62c6dff
[Task] Add vsibench multi-image variant (#993)
oscarqjh Jan 15, 2026
d54cf3a
[feat] Add CLT and clustered standard error estimation for statistica…
mwxely Jan 16, 2026
38db407
[feat] ignore opencode files
Luodian Jan 17, 2026
7999a52
fix: qwen2.5vl nframes bug (#992)
oscarqjh Jan 17, 2026
d9e3753
Add CaptionQA benchmark task (#991)
bronyayang Jan 17, 2026
eebd8db
Revert "Add CaptionQA benchmark task (#991)" (#1002)
Luodian Jan 17, 2026
189e900
[Bug] internvl3 duplicate <image> token issue (#999)
oscarqjh Jan 17, 2026
3fdca38
[Task] Added SiteBench multi-image variant and bug fix (#996)
oscarqjh Jan 17, 2026
54aba46
[TASK] Add SpatialTreeBench task (#994)
loongfeili Jan 17, 2026
020d1e4
Add CaptionQA benchmark task (#1004)
bronyayang Jan 19, 2026
aa82ee1
Add BabyVision benchmark task (#1008)
Luodian Jan 19, 2026
ff5f96d
Revert "Add BabyVision benchmark task (#1008)" (#1009)
Luodian Jan 19, 2026
22f21ca
Update README and PR Code Review documentation
Luodian Jan 19, 2026
d02ff94
update README.md
Luodian Jan 19, 2026
a48c435
update README.md
Luodian Jan 19, 2026
d210177
update i18n README.md
Luodian Jan 19, 2026
3702b7f
[Task] Added SPAR-bench (#1011)
oscarqjh Jan 21, 2026
7c9650b
scalable choice selection added (#1005)
superAIyah Jan 21, 2026
70bf209
add structeditbench task (#1016)
KemingWu Jan 22, 2026
fe72d38
[Model] Bagel UMM (#1012)
oscarqjh Jan 22, 2026
a83deeb
feat(tasks): add BabyVision Und task with LLM-based evaluation (#1015)
mwxely Jan 22, 2026
28f0981
feat(tasks): add BabyVision Gen task with LLM-based evaluation (#1010)
kcz358 Jan 22, 2026
82860f4
[feat] Add baseline comparison with paired t-test (#1006)
mwxely Jan 23, 2026
3654895
chore: add .worktrees to gitignore
Luodian Jan 23, 2026
52a07fd
[feat] Add Power Analysis for Pre-Evaluation Planning (#1007)
mwxely Jan 23, 2026
a74c71f
[Release] v0.6 Development Branch - TUI, CLT/Clustered SE, Paired T-T…
Luodian Jan 24, 2026
a1afe62
feat: add reasoning task versions for multiple benchmarks (#1038)
kcz358 Jan 26, 2026
5fa7b0a
Update BLINK benchmark link in current_tasks.md (#1036)
Ryoo72 Jan 27, 2026
8b40b20
fix: support for `partial` used in vsibench metric calculation. (#1041)
akawincent Jan 27, 2026
016886e
add kris_bench task (#1017)
KemingWu Jan 27, 2026
a2aeb5f
feat: add WenetSpeech test_net split for evaluation (#1027)
Luodian Jan 29, 2026
42f5843
feat(tasks): add MMVP task with ground truth corrections (#1028)
Luodian Jan 29, 2026
6d1ec98
feat(tasks): add RealUnify benchmark (#1033)
Luodian Jan 29, 2026
99cf539
feat(tasks): add Spatial457 benchmark for 6D spatial reasoning (#1031)
Luodian Jan 29, 2026
7c0fe4f
feat(tasks): add AuxSolidMath benchmark (#1034)
Luodian Jan 29, 2026
64f0959
feat(tasks): add IllusionBench (#1035)
Luodian Jan 29, 2026
9b4ff82
feat(tasks): add Uni-MMMU benchmark (#1029)
Luodian Jan 29, 2026
69be029
feat(tasks): add Geometry3K benchmark for geometry problem solving (#…
Luodian Jan 29, 2026
d2d2ef9
fix: replace hardcoded .cuda() with .to(self._device) for multi-GPU s…
Luodian Jan 29, 2026
b19c071
fix: add resource cleanup in video loaders to prevent memory leaks (#…
Luodian Jan 29, 2026
89e574a
[Model]: add InternVL-HF model support (#1039)
CodeTeamster Jan 29, 2026
dd4ab93
Refine StructEditBench utils logic more simple (#1044)
KemingWu Jan 29, 2026
1563379
add dependency for reasoning tasks (#1048)
Salieri0515 Jan 29, 2026
ab3263b
[Task] add PAIBench-U (#1050)
Jialuo-Li Jan 29, 2026
4410bde
[Task] Add MMSI-Video-Bench (#1053)
oscarqjh Feb 2, 2026
635740e
docs: restructure README with HTTP eval server and custom integration…
kcz358 Feb 4, 2026
abc9fa9
Add MMSearch-Plus (#1054)
xijia-tao Feb 4, 2026
cc14e3b
[Model] Add Audio Flamingo 3 and Kimi Audio (#1055)
ngquangtrung57 Feb 4, 2026
47a39eb
[Task] Add mmar benchmark (#1057)
ngquangtrung57 Feb 4, 2026
0d65b89
Update README.md
Luodian Feb 7, 2026
edbd4f3
[Model] Add Uni-MoE-2.0-Omni and Baichuan-Omni-1d5 (#1059)
ngquangtrung57 Feb 8, 2026
4fffb56
[Fix] Add dynamic max_num calculation to InternVL3 to align with VLME…
oscarqjh Feb 8, 2026
6c1090f
[Task] Added OSI-bench (#1068)
oscarqjh Feb 8, 2026
96f998a
[Model] Add GLM4V and LLaMA 4 (#1056)
ngquangtrung57 Feb 8, 2026
9448348
[Task] Add PRISMM-Bench (#1063)
da-luggas Feb 8, 2026
ae7bcfc
Add --offset option (#1042)
chiungyit Feb 8, 2026
b1527f6
[Model] Add OmniVinci and MiniCPM-o-2_6 (#1060)
ngquangtrung57 Feb 8, 2026
abeb3aa
[Task] Add MMSU benchmark (#1058)
ngquangtrung57 Feb 8, 2026
3e35292
[Task] Add CoreCognition bench (#1064)
Irisicy4 Feb 8, 2026
a57df0b
feat: add VLMEvalKit-compatible Qwen task variants for MMMU and MMSta…
Luodian Feb 8, 2026
229b97a
[Task] Add 3DSR bench (#1072)
oscarqjh Feb 10, 2026
35d2d43
fix: Add revision field to LLaVA-OneVision-1.5 model (#1073)
oscarqjh Feb 10, 2026
be65420
Improve community contribution infrastructure and contributor funnel …
Luodian Feb 10, 2026
6f50b3f
refactor: introduce manifest-driven model registry v2 (#1070)
Luodian Feb 10, 2026
ff258a6
refactor: deduplicate model manifest logic in __init__.py
Luodian Feb 10, 2026
8c5c355
Remove extra opening of chat_template file. (#1071)
dstnluong Feb 11, 2026
8857a70
fix: insert correct number of image tokens for multi-image in Cambria…
oscarqjh Feb 11, 2026
985334c
feat: add SiteBench image/video result merge script (#1076)
oscarqjh Feb 11, 2026
81168fe
fix: raise minimum supported Python version to 3.10 (#1079)
Luodian Feb 14, 2026
81a569c
feat: show throughput metrics in final results table (#1078)
Luodian Feb 14, 2026
387e7e2
Add OpenRouter Molmo throughput compare and lint fixes (#1080)
Luodian Feb 14, 2026
9cbfdf0
fix: resolve async_openai lint and adaptive concurrency issues (#1082)
Luodian Feb 14, 2026
cb53f59
[Model] Fixed the parameter name error in qwen2_audio (#1081)
YichenG170 Feb 15, 2026
76b9298
refactor: unify OpenAI model naming via Registry V2 conventions (#1084)
Luodian Feb 15, 2026
5ab23ac
refactor: unify async openai compatible naming and update v0.6 docs (…
Luodian Feb 16, 2026
71c1938
docs: add comprehensive developer guidance for AI agents and contribu…
Luodian Feb 16, 2026
bb1d8aa
docs: restructure README and v0.6 release notes (#1086)
Luodian Feb 16, 2026
13d9166
docs: apply missing v0.6 What's New and Why section refinements
Luodian Feb 16, 2026
83f0c18
feat: add GitHub Actions workflow for publishing to PyPI (#1087)
pufanyi Feb 16, 2026
87190ea
Update README.md
Luodian Feb 17, 2026
e36cb87
Update Discord link in README
Luodian Feb 17, 2026
17c651a
docs: add Qwen3.5 runtime compatibility docs and examples (#1094)
Luodian Feb 18, 2026
431e2cf
docs: add agent skill for lmms-eval (agentskills.io standard) (#1092)
Luodian Feb 18, 2026
a5cce07
fix simple qwen3_vl inference when batch_size > 1 (#1090)
ArdalanM Feb 18, 2026
f6a676c
Fix multi-GPU metric gather key ordering (#1089)
YasserdahouML Feb 18, 2026
43a9e98
Add SAM3 model and SA-Co/Gold benchmark integration (#1088)
YasserdahouML Feb 18, 2026
45014fb
chore: update all OpenAI model references to latest versions (#1096)
Luodian Feb 18, 2026
2c0f3f8
feat: response-level cache with determinism-aware bypass
Luodian Feb 18, 2026
055cb0e
chore: consolidate OpenRouter throughput benchmark scripts (#1097)
Luodian Feb 19, 2026
60fd7fd
fix: resolve model registry conflict and saco_gold tag warning (#1098)
Luodian Feb 19, 2026
fc1bd43
fix: bump version to 0.6.1 and auto-sync version from git tag in publ…
Luodian Feb 19, 2026
2bb5e98
feat: strengthen response cache fingerprint contract (#1149)
Luodian Feb 22, 2026
66a5e42
feat: add token-only efficiency metrics and TTFT coverage docs (#1125)
Luodian Feb 22, 2026
8b62dec
feat: integrate DUDE benchmark task (#1151)
Luodian Feb 22, 2026
dae4340
chore: keep openrouter smoke script simple with stronger defaults (#1…
Luodian Feb 22, 2026
ca006f8
feat: integrate OmniDocBench benchmark task (#1152)
Luodian Feb 22, 2026
bb57887
LMMs-Eval v0.7 - Audio Update (#1124)
YichenG170 Feb 22, 2026
d5eb952
feat: integrate OfficeQA benchmark task (#1150)
Luodian Feb 22, 2026
fd9afcc
chore: refresh agent memory and add qwen3 run scripts
Luodian Feb 22, 2026
af81f63
refactor(models/chat): improve async_openai code structure and readab…
kcz358 Feb 19, 2026
ee50410
feat: integrate EgoTempo benchmark task (#1155)
Luodian Feb 23, 2026
0b71775
[Benchmark Backfill] Integrate CountBench into lmms-eval (#1156)
Luodian Feb 23, 2026
cfc5a25
feat: integrate Point-Bench benchmark task (#1142) (#1157)
Luodian Feb 23, 2026
8d79eb6
feat: integrate MathKangaroo benchmark task (#1135) (#1158)
Luodian Feb 23, 2026
e818fea
feat: backfill VisuLogic benchmark integration (LMM-288) (#1159)
Luodian Feb 23, 2026
468973b
feat: integrate TVBench benchmark tasks (#1160)
Luodian Feb 23, 2026
f149485
feat: integrate MathCanvas benchmark task (#1161)
Luodian Feb 23, 2026
9d11279
fix: harden mmsi-bench utils parsing (#1162)
Luodian Feb 23, 2026
1a0b4fa
feat: integrate FSC-147 benchmark task (#1163)
Luodian Feb 23, 2026
18f2cc1
feat: integrate MMLongBench-Doc benchmark task (#1164)
Luodian Feb 23, 2026
b98934b
fix: align osworld_g polygon scoring with osworld-verified annotation…
Luodian Feb 23, 2026
11392d9
feat: integrate ViVerBench benchmark task (#1166)
Luodian Feb 23, 2026
dca47e5
feat: integrate mtvqa benchmark task (#1167)
Luodian Feb 23, 2026
0059a43
feat: integrate worldvqa benchmark task (#1168)
Luodian Feb 23, 2026
3fc5184
feat: integrate Neptune long-video benchmark tasks (#1187)
Luodian Feb 23, 2026
e40e04e
feat: add ZeroBench benchmark task (#1182)
Luodian Feb 23, 2026
84b7602
feat: add simplevqa benchmark task (#1184)
Luodian Feb 23, 2026
3890a13
feat: add vpct benchmark task (#1183)
Luodian Feb 23, 2026
a1353c4
feat: add mme-cc benchmark task (#1185)
Luodian Feb 23, 2026
867404b
feat: add HiPhO benchmark task (#1186)
Luodian Feb 23, 2026
df8872b
feat: integrate MMLongBench benchmark task (#1133) (#1169)
Luodian Feb 23, 2026
2fa592c
feat: add branded evaluation banner with version metadata
Luodian Feb 23, 2026
0d93809
fix: distinguish image vs video file paths in auto_doc_to_messages fa…
Luodian Feb 23, 2026
5959c84
docs: add MMMU eval discrepancy report and TLDR FP definitions
Luodian Feb 23, 2026
ff32722
feat: add ARC-AGI-1, ARC-AGI-2, and BrowseComp benchmark tasks (#1190)
Luodian Feb 24, 2026
16021b4
docs(spatialtreebench): document TreeBench naming variants (#1189)
Luodian Feb 24, 2026
94da674
test: unified CLI dispatch and task pipeline tests (#1203)
Luodian Feb 24, 2026
05ed5a7
feat: integrate six traceable benchmarks with unified smoke test (#1202)
Luodian Feb 24, 2026
716e192
feat: add async hf model multi-gpu worker backend (#1204)
Luodian Feb 24, 2026
466ab83
refactor: remove dead read_video_pyav_pil and deduplicate _resize_ima…
Luodian Feb 23, 2026
53df90f
refactor: rename read_video_pyav -> read_video, remove dead code
Luodian Feb 24, 2026
71a97c9
docs: rewrite Section 7.1 to document read_video backends, remove dea…
Luodian Feb 24, 2026
a591d49
docs: add external usage guide for CLI and library access
Luodian Feb 24, 2026
3bffeea
feat(tasks): switch benchmark media/tasks to HF-source resolution
Luodian Feb 24, 2026
8ff2395
perf(tasks): cache media path expansion in resolver
Luodian Feb 24, 2026
83f6fb5
test: include neptune in benchmark registration coverage
Luodian Feb 25, 2026
405258d
feat(models): add NanoVLM chat model with async multi-GPU eval (#1207)
Jinghao-Guo Feb 25, 2026
3e25556
fix(ci): make lint workflow fork-PR safe
Luodian Feb 25, 2026
711ace3
clean coutix
Luodian Feb 26, 2026
418bfe6
test: add pytest infra, prompt stability tests, and trim dead test code
Luodian Feb 26, 2026
6ba4641
docs: add comprehensive test suite README
Luodian Feb 26, 2026
f3a9151
refactor: convert test_protocol.py from unittest to pytest style
Luodian Feb 26, 2026
49ed2c1
refactor: convert test_construct_requests.py from unittest to pytest …
Luodian Feb 26, 2026
bd85e00
refactor: convert test_evaluator.py from unittest.TestCase to pure py…
Luodian Feb 26, 2026
2dabc74
refactor: convert test_task_pipeline.py from unittest to pytest style
Luodian Feb 26, 2026
be4b7f8
refactor: convert remaining tests to pure pytest, dedup CLI tests, up…
Luodian Feb 26, 2026
772a0e8
docs: add concrete code examples to each test README section
Luodian Feb 26, 2026
45049d4
docs: rewrite docs/README.md with pipeline diagram, code examples, an…
Luodian Feb 26, 2026
61b792c
docs: switch examples to OpenAI API model, promote MMMU/VideoMMU/Long…
Luodian Feb 26, 2026
e3e6641
fix(ci): restore task_input_specs/redundancy_refactor.yaml deleted by…
Luodian Feb 27, 2026
b12f87a
[Docs] Added dedicated changelogs folder (#1206)
oscarqjh Feb 27, 2026
38a82e5
feat(tasks): add reasoning collection for LLaVA-OV 1.5 RL (#1208)
kcz358 Feb 27, 2026
2a2bdb0
style: fix isort import ordering across 20 files
Luodian Feb 27, 2026
e26854b
refactor(tasks): deduplicate reasoning utils via factory functions
Luodian Feb 27, 2026
111497b
feat(models): add Phi4 multimodal backend (#1211)
kcz358 Feb 28, 2026
b0ef792
chore: untrack AGENTS.md and internalize into skills/lmms-eval-guide
Luodian Feb 28, 2026
68c4c73
docs(skills): update agent skill to cover all v0.7 features
Luodian Feb 28, 2026
6cd1d9a
chore: bump version to v0.7.0 and add v0.7 changelog/readme entry
Luodian Feb 28, 2026
74a9da8
feat: add MCP server for AI agent integration (#1209)
Luodian Feb 28, 2026
1552141
docs: reframe v0.7 theme as operational simplicity + pipeline maturity
Luodian Feb 28, 2026
f011c32
merge: resolve conflicts with main (keep dev-v0d7 v0.7 versions)
Luodian Feb 28, 2026
59e6a58
ci: add line-stats workflow for push-to-main diffs
Luodian Feb 28, 2026
e57e15d
docs: reorder v0.7 release notes and changelog to match editorial TOC
Luodian Feb 28, 2026
eca09bb
docs: renumber v0.7 sections sequentially 1-12
Luodian Feb 28, 2026
5f0558f
docs: add §8 Agentic Task Evaluation to v0.7 release notes
Luodian Feb 28, 2026
ac871cd
refactor: rename agentic tasks to drop _seed postfix
Luodian Feb 28, 2026
9aaa442
docs: add headline speedup summary to §3 I/O section
Luodian Feb 28, 2026
d526509
docs: explain why video I/O speedups are achieved in §3
Luodian Feb 28, 2026
1806d57
improve doc
Luodian Feb 28, 2026
a484427
improve doc
Luodian Feb 28, 2026
313bda3
docs: reorder v0.7 release note sections, condense skill intro, updat…
Luodian Feb 28, 2026
2ed548b
improve doc
Luodian Feb 28, 2026
a229f1c
fix(tasks): use submission metric for mmbench reasoning test splits
kcz358 Feb 28, 2026
8f5c4a4
feat(tasks): add infovqa reasoning test split and group
kcz358 Feb 28, 2026
854ff07
feat(tasks): add docvqa reasoning test split and group
kcz358 Feb 28, 2026
af74616
fix: replace custom admonition syntax with standard markdown blockquotes
Luodian Feb 28, 2026
355b0bb
merge: resolve conflicts with main, keeping dev-v0d7 as source of truth
Luodian Feb 28, 2026
106 changes: 106 additions & 0 deletions .github/ISSUE_TEMPLATE/design_proposal.yml
@@ -0,0 +1,106 @@
name: Design Proposal / Follow-up
description: Propose architecture-level follow-up work with clear scope and acceptance criteria
labels: ["needs decision"]
body:
  - type: markdown
    attributes:
      value: |
        Use this template for non-trivial design or architecture work.
        Keep it concise and evidence-driven, aligned with PR template style.

  - type: checkboxes
    attributes:
      label: Checklist
      options:
        - label: I have searched for related issues/PRs and linked them below.
          required: true
        - label: I have separated in-scope and out-of-scope items.
          required: true

  - type: textarea
    id: summary
    attributes:
      label: Summary
      description: Max 3 bullets. What is the problem and what outcome is expected?
      placeholder: |
        - Problem: ...
        - Impact: ...
        - Desired outcome: ...
    validations:
      required: true

  - type: textarea
    id: in_scope
    attributes:
      label: In Scope
      description: Explicit list of what this issue will change.
      placeholder: |
        - ...
        - ...
    validations:
      required: true

  - type: textarea
    id: out_of_scope
    attributes:
      label: Out of Scope
      description: Explicit list of what this issue will NOT change.
      placeholder: |
        - ...
    validations:
      required: true

  - type: textarea
    id: proposal
    attributes:
      label: Proposed Plan
      description: 3-6 concrete steps.
      placeholder: |
        1. ...
        2. ...
        3. ...
    validations:
      required: true

  - type: textarea
    id: validation_plan
    attributes:
      label: Validation Plan
      description: How will we verify success? Include commands/benchmarks where applicable.
      placeholder: |
        - `command` | sample size: `N=<...>` | key metrics: `<...>` | result: `pass/fail`
    validations:
      required: true

  - type: textarea
    id: risk
    attributes:
      label: Risk / Compatibility
      description: 1-3 bullets on behavior changes, migration risk, or blockers.
      placeholder: |
        - ...
    validations:
      required: true

  - type: textarea
    id: acceptance_criteria
    attributes:
      label: Acceptance Criteria
      description: Objective done conditions.
      placeholder: |
        - [ ] ...
        - [ ] ...
    validations:
      required: true

  - type: textarea
    id: references
    attributes:
      label: References
      description: Related PRs, issues, docs, benchmark artifacts, or Linear links.
      placeholder: |
        - PR: ...
        - Issue: ...
        - Doc: ...
    validations:
      required: false
54 changes: 29 additions & 25 deletions .github/pull_request_template.md
@@ -1,31 +1,35 @@
## Description

<!-- Briefly describe what this PR does and why. Focus on the problem it solves. -->
## Summary
<!-- Max 3 bullets. -->
-
-
-

## In scope
<!-- Explicitly list what this PR changes. -->
-

## Out of scope
<!-- Explicitly list what this PR does NOT change. -->
-

## Validation
<!--
Max 3 bullets.
Use this format:
`<command>` | sample size: `N=<...>` | key metrics: `<...>` | result: `pass/fail`
If you ran tests/benchmarks with metrics, include concrete numbers.
-->
-

## Risk / Compatibility
<!-- 1-2 bullets. Note breaking changes, behavior changes, or migration impact. -->
-

## Type of Change

- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Bug fix (non-breaking change)
- [ ] New feature
- [ ] New benchmark/task
- [ ] New model integration
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Breaking change
- [ ] Documentation update
- [ ] Refactoring (no functional changes)

## Changes Made

<!-- List the key changes. Keep it high-level; the diff tells the details. -->

-

## Testing

<!-- Describe how you tested your changes. -->

- [ ] Tested locally with: `python -m lmms_eval --model <model> --tasks <task> --limit 8`
- [ ] Ran pre-commit: `pre-commit run --all-files`
- [ ] Added/updated tests (if applicable)

## Additional Notes

<!-- Any context, screenshots, or related issues. -->
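The Validation format requested by the template is regular enough to check mechanically. The following is a minimal sketch in Python, assuming bullets are written exactly in the `<command>` | sample size | key metrics | result shape shown in the template. The helper name `parse_validation_line` and the returned field names are hypothetical and are not part of lmms-eval; the metric value in the example is made up for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical helper: parses one bullet written in the template's
# "Validation" format. Field names in the result are illustrative only.
_VALIDATION_RE = re.compile(
    r"^\s*-?\s*`(?P<command>[^`]+)`\s*\|"
    r"\s*sample size:\s*`N=(?P<n>[^`]+)`\s*\|"
    r"\s*key metrics:\s*`(?P<metrics>[^`]+)`\s*\|"
    r"\s*result:\s*`(?P<result>pass|fail)`\s*$"
)


def parse_validation_line(line: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields, or None if the line does not follow the format."""
    match = _VALIDATION_RE.match(line)
    if match is None:
        return None
    return match.groupdict()


# Illustrative bullet; the metric value is invented for this example.
example = (
    "- `python -m lmms_eval --model <model> --tasks <task> --limit 8` "
    "| sample size: `N=8` | key metrics: `accuracy=0.62` | result: `pass`"
)
fields = parse_validation_line(example)
print(fields)
```

A check like this could run in a PR bot or a pre-merge script, but whether the project wants to enforce the format automatically is a separate decision that this diff does not make.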
31 changes: 31 additions & 0 deletions .github/workflows/line-stats.yml
@@ -0,0 +1,31 @@
name: Line Stats

on:
  push:
    branches: [main]

permissions:
  contents: read

jobs:
  line-stats:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2
      - name: Count lines added/deleted
        run: |
          echo "### Commit: ${{ github.sha }}"
          STATS=$(git diff --shortstat HEAD~1 HEAD)
          echo "$STATS"
          ADDED=$(git diff --numstat HEAD~1 HEAD | awk '{s+=$1} END {print s+0}')
          DELETED=$(git diff --numstat HEAD~1 HEAD | awk '{s+=$2} END {print s+0}')
          echo "Lines added: $ADDED"
          echo "Lines deleted: $DELETED"
          echo "### Summary" >> "$GITHUB_STEP_SUMMARY"
          echo "| Metric | Count |" >> "$GITHUB_STEP_SUMMARY"
          echo "|--------|-------|" >> "$GITHUB_STEP_SUMMARY"
          echo "| Lines added | $ADDED |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Lines deleted | $DELETED |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Net change | $((ADDED - DELETED)) |" >> "$GITHUB_STEP_SUMMARY"
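The two awk one-liners in this workflow sum the first and second columns of `git diff --numstat`. The same arithmetic can be sketched in Python for local experimentation. `sum_numstat` is a hypothetical helper, the sample file names are invented, and counting the `-` placeholders that Git prints for binary files as zero is an assumption chosen to mirror awk's numeric coercion.

```python
from typing import Tuple


def sum_numstat(numstat_output: str) -> Tuple[int, int]:
    """Sum added and deleted line counts from `git diff --numstat` text."""
    added = 0
    deleted = 0
    for line in numstat_output.splitlines():
        # Each row is "<added>\t<deleted>\t<path>".
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        # Binary files are reported as "-" in both columns; count them as 0.
        added += int(parts[0]) if parts[0].isdigit() else 0
        deleted += int(parts[1]) if parts[1].isdigit() else 0
    return added, deleted


# Invented sample resembling numstat output, including one binary file.
sample = "10\t2\tREADME.md\n-\t-\tdocs/logo.png\n3\t0\tlmms_eval/__main__.py\n"
added, deleted = sum_numstat(sample)
print(f"added={added} deleted={deleted} net={added - deleted}")
```

Parsing the numstat output once, as here, also avoids running `git diff` twice, which the workflow does for the two columns.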
126 changes: 126 additions & 0 deletions .github/workflows/task-input-ab.yml
@@ -0,0 +1,126 @@
name: task-input-ab

on:
  pull_request:
    paths:
      - "lmms_eval/tasks/**"
      - "lmms_eval/api/**"
      - "tools/task_input_capture.py"
      - "test/eval/task_input_specs/**"
      - ".github/workflows/task-input-ab.yml"
  workflow_dispatch:
    inputs:
      base_sha:
        description: "Optional base commit SHA"
        required: false
        type: string

jobs:
  compare-task-input-boundary:
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - name: Checkout head
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install uv
        uses: astral-sh/setup-uv@v3

      - name: Sync dependencies
        run: uv sync

      - name: Resolve BASE revision
        id: base
        run: |
          BASE_SHA="${{ github.event.pull_request.base.sha }}"
          if [ -z "$BASE_SHA" ]; then
            BASE_SHA="${{ github.event.inputs.base_sha }}"
          fi
          if [ -z "$BASE_SHA" ]; then
            DEFAULT_HEAD="$(git symbolic-ref refs/remotes/origin/HEAD)"
            DEFAULT_BRANCH="${DEFAULT_HEAD#refs/remotes/origin/}"
            BASE_SHA="$(git merge-base HEAD "origin/${DEFAULT_BRANCH}")"
          fi
          BASE_WORKTREE="/tmp/lmms-base-${{ github.run_id }}"
          echo "base_sha=${BASE_SHA}" >> "$GITHUB_OUTPUT"
          echo "base_worktree=${BASE_WORKTREE}" >> "$GITHUB_OUTPUT"

      - name: Prepare BASE worktree
        run: git worktree add "${{ steps.base.outputs.base_worktree }}" "${{ steps.base.outputs.base_sha }}"

      - name: Resolve pinned checker
        id: checker
        run: |
          BASE_CHECKER="${{ steps.base.outputs.base_worktree }}/tools/task_input_capture.py"
          BASE_SPEC="${{ steps.base.outputs.base_worktree }}/test/eval/task_input_specs/redundancy_refactor.yaml"
          if [ -f "$BASE_CHECKER" ] && [ -f "$BASE_SPEC" ]; then
            CHECKER_PATH="$BASE_CHECKER"
            SPEC_PATH="$BASE_SPEC"
          else
            echo "Pinned checker/spec missing in base revision: ${{ steps.base.outputs.base_sha }}"
            echo "Bootstrap mode: use HEAD checker/spec for this run."
            CHECKER_PATH="tools/task_input_capture.py"
            SPEC_PATH="test/eval/task_input_specs/redundancy_refactor.yaml"
          fi

          if [ ! -f "$CHECKER_PATH" ] || [ ! -f "$SPEC_PATH" ]; then
            echo "Checker/spec not found in current checkout."
            exit 1
          fi

          echo "checker_path=${CHECKER_PATH}" >> "$GITHUB_OUTPUT"
          echo "spec_path=${SPEC_PATH}" >> "$GITHUB_OUTPUT"

      - name: Capture HEAD snapshot
        run: |
          source .venv/bin/activate
          HF_HOME=/tmp/hf-cache python "${{ steps.checker.outputs.checker_path }}" \
            --repo-root . \
            --spec "${{ steps.checker.outputs.spec_path }}" \
            --output /tmp/task-input-head.json

      - name: Capture BASE snapshot
        run: |
          source .venv/bin/activate
          HF_HOME=/tmp/hf-cache python "${{ steps.checker.outputs.checker_path }}" \
            --repo-root "${{ steps.base.outputs.base_worktree }}" \
            --spec "${{ steps.checker.outputs.spec_path }}" \
            --output /tmp/task-input-base.json

      - name: Compare snapshots
        run: |
          source .venv/bin/activate
          python - <<'PY'
          import json
          from pathlib import Path

          base = json.loads(Path('/tmp/task-input-base.json').read_text(encoding='utf-8'))
          head = json.loads(Path('/tmp/task-input-head.json').read_text(encoding='utf-8'))
          if base != head:
              print('Task input snapshot mismatch detected.')
              raise SystemExit(1)
          print('Task input snapshots match.')
          PY

      - name: Upload snapshots on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: task-input-snapshots
          path: |
            /tmp/task-input-base.json
            /tmp/task-input-head.json

      - name: Cleanup BASE worktree
        if: always()
        run: |
          if [ -n "${{ steps.base.outputs.base_worktree }}" ]; then
            git worktree remove --force "${{ steps.base.outputs.base_worktree }}"
          fi
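The comparison step in this workflow only reports that the two snapshots differ. When debugging a failure from the uploaded artifacts, a key-level diff can narrow things down. The sketch below assumes each snapshot is a JSON object at the top level; the actual layout produced by `tools/task_input_capture.py` is not shown in this diff, so that is an assumption. `diff_snapshots` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def diff_snapshots(base: Dict[str, Any], head: Dict[str, Any]) -> Dict[str, List[str]]:
    """Classify top-level keys as removed, added, or changed between snapshots."""
    base_keys, head_keys = set(base), set(head)
    return {
        "removed": sorted(base_keys - head_keys),
        "added": sorted(head_keys - base_keys),
        "changed": sorted(k for k in base_keys & head_keys if base[k] != head[k]),
    }


def load_snapshot(path: Path) -> Dict[str, Any]:
    """Read a snapshot file the same way the workflow does (UTF-8 JSON)."""
    return json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Paths match the artifact names in the workflow; skip quietly if absent.
    base_path = Path("/tmp/task-input-base.json")
    head_path = Path("/tmp/task-input-head.json")
    if base_path.exists() and head_path.exists():
        report = diff_snapshots(load_snapshot(base_path), load_snapshot(head_path))
        print(json.dumps(report, indent=2))
```

If the snapshots turn out to be lists or deeply nested, the same idea applies one level down, but the helper would need to be adapted to that structure.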
4 changes: 3 additions & 1 deletion .gitignore
@@ -32,7 +32,7 @@ cache_dir
ckpt
pretrained/
LLaVA/
/*logs
*logs
*.isorted
temp/
InternVL/
@@ -58,3 +58,5 @@ remote_code/*
docs/plans/
.opencode
.worktrees/
AGENTS.md
.ignored/
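Reading the first hunk in the usual removed-then-added order, `/*logs` appears to give way to `*logs`. In gitignore syntax a leading slash anchors a pattern to the directory containing the `.gitignore`, while a pattern with no slash can match at any depth. The Python sketch below is a deliberately simplified model of just that distinction: it ignores negation, `**`, trailing-slash directory rules, and patterns with a slash in the middle. `is_ignored` is a hypothetical name and the example paths are invented.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(pattern: str, path: str) -> bool:
    """Simplified gitignore check for slash-free or root-anchored patterns."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    if pattern.startswith("/"):
        # Anchored pattern: only the top-level path component is tested.
        return fnmatchcase(parts[0], pattern[1:])
    # Unanchored pattern without a slash: any path component may match.
    return any(fnmatchcase(part, pattern) for part in parts)


nested = "outputs/run1/eval_logs/results.json"  # invented example path
print(is_ignored("/*logs", "logs/run.txt"))  # top-level directory
print(is_ignored("/*logs", nested))          # nested directory, anchored pattern
print(is_ignored("*logs", nested))           # nested directory, unanchored pattern
```

Under this model the unanchored form also matches any file or directory whose name merely ends in `logs`, at any depth, so the new pattern is broader than the old one.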