Add deterministic agentic tool runtime #875

Open
Zigfreidish wants to merge 6 commits into main from issue-674-opensearch-vl

Collaborator

@Zigfreidish Zigfreidish commented May 12, 2026

Summary

Plan or Spec

Commands Run

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --frozen --project services/mlx-worker-python pytest -q services/mlx-worker-python/tests/test_agentic_tools.py services/mlx-worker-python/tests/test_evaluation_schemas.py services/mlx-worker-python/tests/test_evaluation_core.py services/mlx-worker-python/tests/test_benchmark_suites.py services/mlx-worker-python/tests/test_benchmark_schemas.py services/mlx-worker-python/tests/test_benchmark_evaluation_report.py -k 'agentic_tool or preserves_sample_payload or run_local_suite_persists_agentic_tool_evidence or canonical_fields or benchmark_rows_preserve_phase_probe_fields'
# 17 passed, 166 deselected in 0.28s

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --frozen --project services/mlx-worker-python pytest -q services/mlx-worker-python/tests/test_lora_model_ops_unit.py services/mlx-worker-python/tests/test_maintenance_service.py -k 'tool or benchmark_helper_parsers_cover_invalid_and_boundary_inputs'
# 5 passed, 267 deselected in 0.39s

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --frozen --project services/mlx-worker-python coverage run -m pytest -q services/mlx-worker-python/tests/test_agentic_tools.py services/mlx-worker-python/tests/test_evaluation_schemas.py services/mlx-worker-python/tests/test_evaluation_core.py services/mlx-worker-python/tests/test_benchmark_suites.py services/mlx-worker-python/tests/test_benchmark_schemas.py services/mlx-worker-python/tests/test_benchmark_evaluation_report.py services/mlx-worker-python/tests/test_lora_model_ops_unit.py services/mlx-worker-python/tests/test_maintenance_service.py -k 'agentic_tool or preserves_sample_payload or run_local_suite_persists_agentic_tool_evidence or canonical_fields or benchmark_rows_preserve_phase_probe_fields or benchmark_helper_parsers_cover_invalid_and_boundary_inputs or tool'
# 22 passed, 433 deselected in 0.39s

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --frozen --project services/mlx-worker-python coverage json -o coverage.json
# Wrote JSON report to coverage.json

python3 scripts/changed_scope_coverage.py --coverage-json coverage.json services/mlx-worker-python/worker/runtime/agentic_tools.py services/mlx-worker-python/worker/productization/evaluation_schemas.py services/mlx-worker-python/worker/engine/evaluation_core.py services/mlx-worker-python/worker/model_ops/rl_alignment_training.py services/mlx-worker-python/worker/productization/benchmark_suites.py services/mlx-worker-python/worker/productization/benchmark_schemas.py services/mlx-worker-python/worker/engine/maintenance_core.py services/mlx-worker-python/worker/productization/benchmark_evaluation_report.py services/mlx-worker-python/worker/productization/benchmark_export.py services/mlx-worker-python/tests/test_agentic_tools.py services/mlx-worker-python/tests/test_evaluation_schemas.py services/mlx-worker-python/tests/test_evaluation_core.py services/mlx-worker-python/tests/test_benchmark_suites.py services/mlx-worker-python/tests/test_benchmark_schemas.py services/mlx-worker-python/tests/test_lora_model_ops_unit.py services/mlx-worker-python/tests/test_maintenance_service.py services/mlx-worker-python/tests/test_benchmark_evaluation_report.py
# TOTAL 463 12 97%

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --frozen --project services/mlx-worker-python coverage run -m pytest -q services/mlx-worker-python/tests/test_benchmark_export.py services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_scope_report_selects_benchmark_export_probe services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_registered_probes_expose_focused_commands services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_probe_smokes_return_metrics_against_current_repo services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_dispatch_probe_impl_supports_benchmark_export_probe services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_probe_benchmark_export_run_scan_rejects_unexpected_job_count services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_probe_benchmark_export_run_scan_rejects_unexpected_evaluation_job_count services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_probe_benchmark_export_run_scan_rejects_unexpected_result_count services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_probe_benchmark_export_run_scan_rejects_unexpected_summary_csv_count services/mlx-worker-python/tests/test_pr_scoped_performance.py::test_dispatch_probe_impl_supports_pr_scoped_scope_matcher_probe
# 66 passed in 15.62s

PYTHONPYCACHEPREFIX="$PWD/.runtime/pycache" python3 -m py_compile services/mlx-worker-python/worker/runtime/agentic_tools.py services/mlx-worker-python/worker/productization/benchmark_suites.py services/mlx-worker-python/worker/productization/benchmark_schemas.py services/mlx-worker-python/worker/engine/maintenance_core.py services/mlx-worker-python/worker/productization/benchmark_evaluation_report.py services/mlx-worker-python/worker/productization/benchmark_export.py services/mlx-worker-python/worker/engine/evaluation_core.py services/mlx-worker-python/worker/model_ops/rl_alignment_training.py services/mlx-worker-python/worker/productization/evaluation_schemas.py
# passed

git diff --check
# passed

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --frozen --project services/mlx-worker-python python scripts/validate_pr_evidence.py --body-file .runtime/issue-674/pr-body.md
# passed

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --project services/mlx-worker-python pytest -q services/mlx-worker-python/tests/test_training_dataset_builder.py::test_prompt_completion_quality_stats_do_not_import_agentic_runtime services/mlx-worker-python/tests/test_training_dataset_builder.py::test_load_training_dataset_package_replays_agentic_tool_calls_with_shared_runtime services/mlx-worker-python/tests/test_training_dataset_builder.py::test_agentic_tool_trace_replay_covers_context_validation_and_preserved_evidence
# 3 passed in 0.08s

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --project services/mlx-worker-python coverage run -m pytest -q services/mlx-worker-python/tests/test_training_dataset_builder.py
# 75 passed in 0.08s

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --project services/mlx-worker-python coverage json -o coverage.json
# Wrote JSON report to coverage.json

python3 scripts/changed_scope_coverage.py --coverage-json coverage.json services/mlx-worker-python/worker/model_ops/training_dataset.py services/mlx-worker-python/tests/test_training_dataset_builder.py
# TOTAL 7 0 100%

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --project services/mlx-worker-python python scripts/pr_scoped_performance_run.py --registry infra/perf/pr_scoped_probes.json --probe-id training-dataset-token-percentiles-single-sort --base-repo /private/tmp/melix-pr875 --head-repo /private/tmp/melix-pr875 --output .runtime/pr875-training-probe.json
# head verification ok, base/head probes ok; coverage 100%

PYTHONPATH="$PWD:$PWD/services/mlx-worker-python" uv run --project services/mlx-worker-python python -c "import json; from pathlib import Path; from worker.productization.pr_scoped_performance import _probe_model_ops_bundle_artifact_bytes as p; print([p(Path.cwd()) for _ in range(5)])"
# elapsed_ms_mean samples: 1.338, 1.325, 1.468, 1.478, 1.403; bundle_scandir_calls_mean: 0.0 for all samples
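
For readers unfamiliar with the `# TOTAL 463 12 97%` style of output above, the three numbers read as statements, missed, and percent covered, so (463 - 12) / 463 is roughly 97.4%. The following is a minimal sketch of that roll-up, not the actual `scripts/changed_scope_coverage.py`. It assumes the standard coverage.py JSON layout (`files` -> path -> `summary` with `num_statements` and `missing_lines`) and sums whole files, whereas the real script reports "executable changed lines" and may restrict itself to the diff. The function name and sample paths are hypothetical.

```python
def changed_scope_totals(report: dict, changed_paths: list[str]) -> tuple[int, int, float]:
    """Return (statements, missed, percent) summed over the given files.

    `report` is a parsed coverage.py JSON report. Each entry under "files"
    carries a "summary" with "num_statements" and "missing_lines" counts.
    """
    statements = 0
    missed = 0
    for path in changed_paths:
        entry = report.get("files", {}).get(path)
        if entry is None:
            continue  # assumption: files absent from the report are skipped
        summary = entry["summary"]
        statements += summary["num_statements"]
        missed += summary["missing_lines"]
    # Assumption: an empty scope is reported as fully covered.
    percent = 100.0 if statements == 0 else 100.0 * (statements - missed) / statements
    return statements, missed, percent


# Tiny in-memory stand-in for coverage.json; the values are made up.
sample_report = {
    "files": {
        "pkg/a.py": {"summary": {"num_statements": 40, "missing_lines": 2}},
        "pkg/b.py": {"summary": {"num_statements": 10, "missing_lines": 0}},
    }
}

stmts, miss, pct = changed_scope_totals(sample_report, ["pkg/a.py", "pkg/b.py"])
print(f"TOTAL {stmts} {miss} {pct:.0f}%")
```

In real use the dictionary would come from `json.load` on the `coverage.json` file written by `coverage json -o coverage.json`.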

Coverage and Metrics

  • Changed-scope coverage for the original runtime slice: 97% (451/463 executable changed lines).
  • Follow-up changed-scope coverage for the performance-probe fix: 100% (7/7 executable changed lines).
  • Runtime metrics emitted and aggregated:
    • agentic_tool.call_count
    • agentic_tool.observation_count
    • agentic_tool.completed_count
    • agentic_tool.timeout_count
    • agentic_tool.failed_count
    • agentic_tool.observation_emitted_bytes
  • Follow-up performance evidence:
    • training-dataset-token-percentiles-single-sort no longer fails on ModuleNotFoundError: google when the lightweight stats probe loads training_dataset.py.
    • refreshed GitHub training-dataset-token-percentiles-single-sort: ok, +0.341ms (+0.10%), 0 verification failures.
    • refreshed GitHub model-ops-bundle-artifact-byte-accounting: ok, -0.017ms (-0.77%), bundle_scandir_calls_mean=0.0.
    • local model-ops-bundle-artifact-byte-accounting repeated runs stayed between 1.325ms and 1.478ms with bundle_scandir_calls_mean=0.0; the previous GitHub +0.456ms delta was over an unchanged conversion/quantization code path and was not reproduced.
    • remaining refreshed GitHub regressions are tiny direct-probe timing deltas on unchanged lookup paths: evaluation-job-id-high-water-mark +0.687ms (+6.39%) and evaluation-compare-target-lookup-early-stop +0.002ms (+6.10%).
    • follow-up optimization caches the evaluation runs root and keeps compare-target resolution on the early-stop fast path; local PR-scoped base/head comparisons on rebased head 2fe05c0f measured evaluation-job-id-high-water-mark at base 9.235ms vs head 7.773ms, and evaluation-compare-target-lookup-early-stop at base 0.691256ms vs head 0.535589ms.
  • Persisted local artifacts covered by tests:
    • benchmark context, batch, and matrix request row JSONL payloads can include registry receipts, tool calls, observations, and agentic_tool_metrics
    • evaluation sample JSONL payloads can include the same evidence shape
    • RL trace rows can include the same evidence shape and assistant/tool turns
  • GitHub issue metadata updated:
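
As a rough illustration of the runtime metrics and the persisted evidence shape listed above, here is a minimal sketch of folding tool calls and observations into the six `agentic_tool.*` counters and attaching them to a JSONL row under `agentic_tool_metrics`. Only the metric names and the `agentic_tool_metrics` key come from this PR. The observation fields (`status`, `text`), the row keys `tool_calls` and `observations`, the `calculator` tool, and the function name are assumptions made for the example, and the actual runtime in `agentic_tools.py` may count things differently.

```python
import json

# Metric names as listed in this PR description.
METRIC_NAMES = (
    "agentic_tool.call_count",
    "agentic_tool.observation_count",
    "agentic_tool.completed_count",
    "agentic_tool.timeout_count",
    "agentic_tool.failed_count",
    "agentic_tool.observation_emitted_bytes",
)

# Hypothetical mapping from an observation's "status" field to its counter.
STATUS_TO_METRIC = {
    "completed": "agentic_tool.completed_count",
    "timeout": "agentic_tool.timeout_count",
    "failed": "agentic_tool.failed_count",
}


def aggregate_tool_metrics(tool_calls: list[dict], observations: list[dict]) -> dict[str, int]:
    """Fold tool calls and observations into the six counters."""
    metrics = {name: 0 for name in METRIC_NAMES}
    metrics["agentic_tool.call_count"] = len(tool_calls)
    metrics["agentic_tool.observation_count"] = len(observations)
    for observation in observations:
        counter = STATUS_TO_METRIC.get(observation.get("status"))
        if counter is not None:
            metrics[counter] += 1
        # Assumption: emitted bytes are the UTF-8 size of the observation text.
        text = observation.get("text", "")
        metrics["agentic_tool.observation_emitted_bytes"] += len(text.encode("utf-8"))
    return metrics


# Illustrative data only; not taken from any real run.
tool_calls = [
    {"id": "call-1", "tool": "calculator", "arguments": {"expression": "2+2"}},
    {"id": "call-2", "tool": "calculator", "arguments": {"expression": "1/0"}},
]
observations = [
    {"call_id": "call-1", "status": "completed", "text": "4"},
    {"call_id": "call-2", "status": "failed", "text": "division by zero"},
]

row = {
    "tool_calls": tool_calls,
    "observations": observations,
    "agentic_tool_metrics": aggregate_tool_metrics(tool_calls, observations),
}
# sort_keys keeps the serialized row byte-stable across runs.
jsonl_line = json.dumps(row, sort_keys=True)
print(jsonl_line)
```

Serializing with `sort_keys=True` is one simple way to keep persisted rows deterministic, which suits a runtime slice that avoids network-backed tools.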

Known Gaps

  • Network-backed search, browser access, and unsafe arbitrary Python execution remain out of scope for this deterministic local runtime slice.
  • No protocol schema changes were made, so generated protocol artifacts were not regenerated.
  • No dependency changes were made, so lockfiles were not updated.
  • The local combined coverage command first invoked the final coverage parser as python on a shell that has no python alias. pytest and coverage json had already passed by that point, and the parser was rerun successfully with python3.

Evidence Checklist

  • Relevant plan or spec is identified.
  • Behavior changes are reflected in the relevant docs.
  • Protocol changes include regenerated generated artifacts. N/A: no protocol changes.
  • Dependency changes include updated lockfiles. N/A: no dependency changes.
  • Relevant tests were run.
  • A metrics report is included, or N/A is stated explicitly with the reason.
  • Deferred work and known gaps are stated explicitly.


github-actions Bot commented May 12, 2026

Melix PR Scoped Performance Report

  • Status: regression
  • Changed files: 22
  • Selected probes: 18
  • Direct/gated probes: 18
  • Regressions: 2
  • Context regressions: 0
  • Verification failures: 0
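
The summary counts above can be read as a roll-up over the per-probe sections that follow. The sketch below shows one plausible way to derive them. It assumes that only probes with a `direct` gate contribute to `Regressions` and that any such regression flips the overall status, which matches this report but is not confirmed by it. The record fields and function name are hypothetical.

```python
def summarize_probes(probes: list[dict]) -> dict:
    """Roll per-probe results up into report-level counts."""
    direct = [p for p in probes if p["gate"] == "direct"]
    regressions = sum(1 for p in direct if p["status"] == "regression")
    return {
        # Assumption: any direct-gate regression marks the whole report.
        "status": "regression" if regressions else "ok",
        "selected_probes": len(probes),
        "direct_probes": len(direct),
        "regressions": regressions,
    }


# Three of the eighteen probes from this report, with their reported statuses.
probes = [
    {"name": "Evaluation job-id high-water mark", "gate": "direct", "status": "ok"},
    {"name": "Evaluation answer normalization fast path", "gate": "direct", "status": "regression"},
    {"name": "Maintenance prompt shape vector repeat", "gate": "direct", "status": "regression"},
]
print(summarize_probes(probes))
```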

Changed Files

  • docs/README.md
  • docs/plans/2026-05-12-unified-agentic-tool-runtime-execution.md
  • docs/unified-agentic-tool-runtime-contract.md
  • services/mlx-worker-python/tests/test_agentic_tools.py
  • services/mlx-worker-python/tests/test_benchmark_evaluation_report.py
  • services/mlx-worker-python/tests/test_benchmark_schemas.py
  • services/mlx-worker-python/tests/test_benchmark_suites.py
  • services/mlx-worker-python/tests/test_evaluation_core.py
  • services/mlx-worker-python/tests/test_evaluation_schemas.py
  • services/mlx-worker-python/tests/test_lora_model_ops_unit.py
  • services/mlx-worker-python/tests/test_maintenance_service.py
  • services/mlx-worker-python/tests/test_training_dataset_builder.py
  • services/mlx-worker-python/worker/engine/evaluation_core.py
  • services/mlx-worker-python/worker/engine/maintenance_core.py
  • services/mlx-worker-python/worker/model_ops/rl_alignment_training.py
  • services/mlx-worker-python/worker/model_ops/training_dataset.py
  • services/mlx-worker-python/worker/productization/benchmark_evaluation_report.py
  • services/mlx-worker-python/worker/productization/benchmark_schemas.py
  • services/mlx-worker-python/worker/productization/benchmark_suites.py
  • services/mlx-worker-python/worker/productization/evaluation_compare.py
  • ... (+2 more)

Benchmark evaluation report running aggregates

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (98.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| load_input_ms_mean | 7.856 | 6.367 | -1.489 (-18.96%) | improvement |
| elapsed_ms_mean | 340.423 | 333.019 | -7.404 (-2.17%) | neutral |
| peak_bytes_mean | 13625369.600 | 13625374.400 | +4.800 (+0.00%) | neutral |
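
The Delta column in these probe tables appears to be head minus base, followed by the change relative to base, and the percentage is omitted when base is zero. The sketch below reproduces that formatting and adds a threshold-based status. The 5% threshold, the lower-is-better direction, the neutral status for a zero base, and the function names are all assumptions inferred from the rows in this report, not the probe runner's documented rules. Because the report rounds its inputs for display, recomputing some rows from the printed values can differ in the last digit.

```python
def format_delta(base: float, head: float) -> str:
    """Format the absolute change and, when base is nonzero, the relative change."""
    delta = head - base
    if base == 0:
        return f"{delta:+.3f}"  # no relative change can be computed
    return f"{delta:+.3f} ({100.0 * delta / base:+.2f}%)"


def classify(base: float, head: float, threshold_pct: float = 5.0) -> str:
    """Classify a lower-is-better metric; the threshold is a hypothetical default."""
    if base == 0:
        return "neutral"  # assumption: no percentage means no verdict
    pct = 100.0 * (head - base) / base
    if pct > threshold_pct:
        return "regression"
    if pct < -threshold_pct:
        return "improvement"
    return "neutral"


# elapsed_ms_mean for "Maintenance prompt shape vector repeat" in this report.
print(format_delta(154.652, 201.605), classify(154.652, 201.605))
```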

Evaluation job-id high-water mark

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 16.246 | 13.692 | -2.554 (-15.72%) | improvement |
| per_call_ms_mean | 0.081 | 0.068 | -0.013 (-15.72%) | improvement |

Evaluation sample probe aggregation

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 20.920 | 21.838 | +0.918 (+4.39%) | neutral |
| per_call_ms_mean | 0.001 | 0.001 | +0.000 (+4.40%) | neutral |

Evaluation answer normalization fast path

  • Status: regression
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 637.866 | 680.347 | +42.482 (+6.66%) | regression |
| numeric_extract_calls_mean | 300.000 | 300.000 | +0.000 (+0.00%) | neutral |
| option_extract_calls_mean | 300.000 | 300.000 | +0.000 (+0.00%) | neutral |

Evaluation latency percentile vector reuse

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 243.120 | 243.313 | +0.193 (+0.08%) | neutral |
| sorted_calls_mean | 1.000 | 1.000 | +0.000 (+0.00%) | neutral |

Evaluation dialogue diagnostics top-k

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 299.926 | 298.949 | -0.977 (-0.33%) | neutral |
| peak_bytes_mean | 3132185.600 | 3132185.600 | +0.000 (+0.00%) | neutral |

Evaluation compare target lookup early stop

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 1.763 | 1.761 | -0.002 (-0.14%) | neutral |
| get_loaded_model_calls_mean | 12.000 | 12.000 | +0.000 (+0.00%) | neutral |

Evaluation compare target lookup short-circuit

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 9.238 | 8.955 | -0.282 (-3.06%) | neutral |
| get_loaded_model_calls_mean | 3.000 | 3.000 | +0.000 (+0.00%) | neutral |

Training dataset quality/token summary

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 330.478 | 341.176 | +10.698 (+3.24%) | neutral |
| peak_bytes_mean | 2202812.000 | 2202812.000 | +0.000 (+0.00%) | neutral |

Training dataset validation split partial selection

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 1031.615 | 1039.138 | +7.523 (+0.73%) | neutral |
| peak_bytes_mean | 620968.000 | 620968.000 | +0.000 (+0.00%) | neutral |
| validation_count | 1500.000 | 1500.000 | +0.000 (+0.00%) | neutral |
| checksum | 22440493.000 | 22440493.000 | +0.000 (+0.00%) | neutral |

Training dataset validation sample-limit loading

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 0.553 | 0.545 | -0.007 (-1.32%) | neutral |
| peak_bytes_mean | 29710.200 | 29710.200 | +0.000 (+0.00%) | neutral |
| validation_sample_count_mean | 1.000 | 1.000 | +0.000 (+0.00%) | neutral |

Maintenance bench report readback

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 78.332 | 52.059 | -26.273 (-33.54%) | improvement |
| bench_report_read_calls_mean | 0.000 | 0.000 | +0.000 | neutral |
| request_id_calls_mean | 1.000 | 1.000 | +0.000 (+0.00%) | neutral |

Maintenance percentile vector reuse

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 306.507 | 296.497 | -10.010 (-3.27%) | neutral |
| sort_calls_mean | 300.000 | 300.000 | +0.000 (+0.00%) | neutral |

Maintenance prompt shape vector repeat

  • Status: regression
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 154.652 | 201.605 | +46.953 (+30.36%) | regression |
| token_count_mean | 5160960.000 | 5160960.000 | +0.000 (+0.00%) | neutral |

Maintenance benchmark parameter normalization single convert

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 929.280 | 940.160 | +10.880 (+1.17%) | neutral |
| calls_per_value_mean | 1.000 | 1.000 | +0.000 (+0.00%) | neutral |
| peak_bytes_mean | 814430.600 | 814430.600 | +0.000 (+0.00%) | neutral |

Upload receipt published-files scandir

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 7.458 | 7.464 | +0.006 (+0.08%) | neutral |
| special_entry_follow_dir_checks_mean | 0.000 | 0.000 | +0.000 | neutral |

MLX-LM structured result tail parse

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 0.171 | 0.166 | -0.004 (-2.53%) | neutral |
| peak_bytes_mean | 1767.000 | 1781.400 | +14.400 (+0.81%) | neutral |

Model ops bundle artifact byte accounting

  • Status: ok
  • Gate: direct
  • Targeted tests: pass
  • Coverage: pass (100.0%)
| Metric | Base | Head | Delta | Status |
| --- | --- | --- | --- | --- |
| elapsed_ms_mean | 2.053 | 1.669 | -0.384 (-18.70%) | improvement |
| bundle_scandir_calls_mean | 0.000 | 0.000 | +0.000 | neutral |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the 'Unified Agentic Tool Runtime Contract' documentation, which standardizes tool registry, execution, and observation across SFT, RL, benchmarks, and evaluations. The review feedback identifies discrepancies between the document and the current implementation regarding tool kind and observation naming for the 'image_search' and 'visit' tools. Additionally, it suggests improving traceability by linking specific issue numbers within the milestone matrix.

Comment thread docs/unified-agentic-tool-runtime-contract.md Outdated
Comment thread docs/unified-agentic-tool-runtime-contract.md Outdated
Comment thread docs/unified-agentic-tool-runtime-contract.md Outdated
@Zigfreidish Zigfreidish changed the title from "Add unified agentic tool runtime contract" to "Add deterministic agentic tool runtime" May 12, 2026
@Zigfreidish Zigfreidish force-pushed the issue-674-opensearch-vl branch from 0343bd3 to 33cb239 May 12, 2026 09:50
@Zigfreidish Zigfreidish force-pushed the issue-674-opensearch-vl branch from 14bb6e5 to 2fe05c0 May 12, 2026 15:16
@Zigfreidish Zigfreidish force-pushed the issue-674-opensearch-vl branch from 2fe05c0 to 558f85a May 12, 2026 15:21
