Add deterministic agentic tool runtime #875
Open
Zigfreidish wants to merge 6 commits into
Melix PR Scoped Performance Report
Changed Files
Benchmark evaluation report running aggregates
Evaluation job-id high-water mark
Evaluation sample probe aggregation
Evaluation answer normalization fast path
Evaluation latency percentile vector reuse
Evaluation dialogue diagnostics top-k
Evaluation compare target lookup early stop
Evaluation compare target lookup short-circuit
Training dataset quality/token summary
Training dataset validation split partial selection
Training dataset validation sample-limit loading
Maintenance bench report readback
Maintenance percentile vector reuse
Maintenance prompt shape vector repeat
Maintenance benchmark parameter normalization single convert
Upload receipt published-files scandir
MLX-LM structured result tail parse
Model ops bundle artifact byte accounting
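The scoped report judges each entry by comparing a base timing with a head timing. As a rough illustration of that arithmetic, the sketch below recomputes the absolute and relative deltas for the two base/head pairs quoted in this PR's description. The helper name `relative_delta` and the formula `(head - base) / base` are assumptions made for illustration; the report's actual calculation is not shown in this PR.

```python
def relative_delta(base_ms: float, head_ms: float) -> tuple[float, float]:
    """Return (absolute delta in ms, relative delta in percent) of head vs base.

    Hypothetical helper: assumes the percentage is (head - base) / base.
    """
    if base_ms <= 0:
        raise ValueError("base_ms must be positive")
    delta_ms = head_ms - base_ms
    return delta_ms, 100.0 * delta_ms / base_ms


# Base/head timings in milliseconds, as quoted in the PR description.
measurements = {
    "evaluation-job-id-high-water-mark": (9.235, 7.773),
    "evaluation-compare-target-lookup-early-stop": (0.691256, 0.535589),
}

for name, (base, head) in measurements.items():
    delta_ms, delta_pct = relative_delta(base, head)
    print(f"{name}: {delta_ms:+.3f}ms ({delta_pct:+.2f}%)")
```

Under that assumed formula, both pairs come out negative (about -15.83% and -22.52%), i.e. head is faster than base in those measurements.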
Code Review
This pull request introduces the 'Unified Agentic Tool Runtime Contract' documentation, which standardizes tool registry, execution, and observation across SFT, RL, benchmarks, and evaluations. The review feedback identifies discrepancies between the document and the current implementation regarding tool kind and observation naming for the 'image_search' and 'visit' tools. Additionally, it suggests improving traceability by linking specific issue numbers within the milestone matrix.
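One way to catch the kind of drift the review describes is to diff the documented tool table against the registry the code actually exposes. The sketch below is only an illustration: `ToolSpec`, `find_contract_drift`, and every kind and observation value in it are hypothetical placeholders, not the names used in the contract document or the implementation.

```python
from typing import NamedTuple


class ToolSpec(NamedTuple):
    """Hypothetical record of what a tool declares."""
    kind: str
    observation: str


def find_contract_drift(
    documented: dict[str, ToolSpec], implemented: dict[str, ToolSpec]
) -> list[str]:
    """List every tool whose documented spec differs from the implemented one."""
    problems = []
    for name in sorted(documented.keys() | implemented.keys()):
        doc, impl = documented.get(name), implemented.get(name)
        if doc is None:
            problems.append(f"{name}: implemented but not documented")
        elif impl is None:
            problems.append(f"{name}: documented but not implemented")
        else:
            for field in ToolSpec._fields:
                doc_value, impl_value = getattr(doc, field), getattr(impl, field)
                if doc_value != impl_value:
                    problems.append(
                        f"{name}: {field} is {doc_value!r} in the doc "
                        f"but {impl_value!r} in code"
                    )
    return problems


# Placeholder values only; the real kinds and observation names are not shown here.
documented = {
    "image_search": ToolSpec("doc-kind", "shared-observation"),
    "visit": ToolSpec("shared-kind", "shared-observation"),
}
implemented = {
    "image_search": ToolSpec("code-kind", "shared-observation"),
    "visit": ToolSpec("shared-kind", "shared-observation"),
}
for problem in find_contract_drift(documented, implemented):
    print(problem)
```

A check like this could run as a unit test so that the document and the registry cannot diverge silently, assuming both can be loaded into a comparable structure.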
0343bd3 to 33cb239
Keith-CY
approved these changes
May 12, 2026
14bb6e5 to 2fe05c0
2fe05c0 to 558f85a
Summary
agentic_tool.* metrics and updated the governing docs.
Plan or Spec
docs/unified-agentic-tool-runtime-contract.md
docs/plans/2026-05-12-unified-agentic-tool-runtime-execution.md
Commands Run
Coverage and Metrics
97% (451/463 executable changed lines).
100% (7/7 executable changed lines).
agentic_tool.call_count
agentic_tool.observation_count
agentic_tool.completed_count
agentic_tool.timeout_count
agentic_tool.failed_count
agentic_tool.observation_emitted_bytes
training-dataset-token-percentiles-single-sort no longer fails on ModuleNotFoundError: google when the lightweight stats probe loads training_dataset.py.
training-dataset-token-percentiles-single-sort: ok, +0.341ms (+0.10%), 0 verification failures.
model-ops-bundle-artifact-byte-accounting: ok, -0.017ms (-0.77%), bundle_scandir_calls_mean=0.0.
model-ops-bundle-artifact-byte-accounting repeated runs stayed between 1.325ms and 1.478ms with bundle_scandir_calls_mean=0.0; the previous GitHub +0.456ms delta was over an unchanged conversion/quantization code path and was not reproduced.
evaluation-job-id-high-water-mark +0.687ms (+6.39%) and evaluation-compare-target-lookup-early-stop +0.002ms (+6.10%).
2fe05c0f measured evaluation-job-id-high-water-mark at base 9.235ms vs head 7.773ms, and evaluation-compare-target-lookup-early-stop at base 0.691256ms vs head 0.535589ms.
agentic_tool_metrics
Known Gaps
python for the final coverage parser on a shell without a python alias; pytest and coverage json had already passed, and the parser was rerun successfully with python3.
Evidence Checklist
N/A is stated explicitly with the reason.
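To make the agentic_tool.* counters listed under Coverage and Metrics more concrete, here is a minimal sketch of how such counters could be derived from a batch of tool-call records. Only the six metric names come from this PR; the `ToolCallRecord` type, the `summarize_tool_calls` function, the three status strings, and the rule that bytes are counted as UTF-8 length of the observation text are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed terminal states; each maps onto one "agentic_tool.<status>_count" metric.
STATUSES = ("completed", "timeout", "failed")


@dataclass(frozen=True)
class ToolCallRecord:
    """Hypothetical record of one tool call and the observation it emitted."""
    status: str
    observation: Optional[str] = None


def summarize_tool_calls(records: list[ToolCallRecord]) -> dict[str, int]:
    """Aggregate per-call records into the agentic_tool.* counters."""
    metrics = {
        "agentic_tool.call_count": 0,
        "agentic_tool.observation_count": 0,
        "agentic_tool.completed_count": 0,
        "agentic_tool.timeout_count": 0,
        "agentic_tool.failed_count": 0,
        "agentic_tool.observation_emitted_bytes": 0,
    }
    for record in records:
        if record.status not in STATUSES:
            raise ValueError(f"unknown status: {record.status!r}")
        metrics["agentic_tool.call_count"] += 1
        metrics[f"agentic_tool.{record.status}_count"] += 1
        if record.observation is not None:
            metrics["agentic_tool.observation_count"] += 1
            # Assumption: emitted bytes are the UTF-8 size of the observation text.
            metrics["agentic_tool.observation_emitted_bytes"] += len(
                record.observation.encode("utf-8")
            )
    return metrics


records = [
    ToolCallRecord("completed", "result: ok"),
    ToolCallRecord("timeout"),
    ToolCallRecord("failed", "error: boom"),
]
print(summarize_tool_calls(records))
```

Initializing every counter to zero keeps the output shape stable across runs, which tends to matter for a deterministic runtime: a run with no timeouts still reports `agentic_tool.timeout_count` as 0 rather than omitting the key.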