Skip to content

llnl/dftracer-agents

Repository files navigation

dftracer-agents

MCP (Model Context Protocol) server that exposes the dftracer I/O tracing toolkit as tools for LLM agents.

Includes three services:

  • dftracer_utils — 21 tools wrapping the dftracer_* CLI binaries (info, stats, merge, split, index, reader, pgzip, tar, organize, replay, …)
  • dfanalyzer — 5 tools for trace analysis (analyze, summarize_trace, detect_preset, query, list_presets)
  • dftracer_plot — 2 tools for generating charts from trace data (plot, plot_all)

The server speaks the MCP stdio transport and works with any MCP-compatible agent (Goose, Claude Desktop, custom agents).


Setup

git clone <this-repo>
cd dftracer-agents

python -m venv venv
source venv/bin/activate

pip install -e .

This installs two console scripts: dftracer-mcp-server and dftracer-install-skills.


Installing agent skills

The package bundles a set of agent skills (Claude Code / Goose SKILL.md files) that teach the AI agent how to use dftracer effectively. After pip install, copy them into your project so the agent harness can find them:

# Copy skills into ./.agents/skills/ (current directory)
dftracer-install-skills

# Copy into a specific project directory
dftracer-install-skills /path/to/my-project

# Replace any existing skills with the packaged versions
dftracer-install-skills --overwrite

# Print the path to the bundled skills without copying
dftracer-install-skills --list

Skills are installed to <target>/.agents/skills/ and cover:

Skill Purpose
dftracer-pipeline Full annotation + trace pipeline workflow
dftracer-annotate-c / cpp / python Per-language annotation rules
dftracer-trace-utils When and how to use MCP trace tools
dftracer-install Installation and configuration
dftracer-io-optimization I/O bottleneck analysis and optimization
dftracer-lessons / dftracer-pitfalls Hard-won annotation lessons and common mistakes
dftracer-cheatsheet Quick reference for dftracer macros and APIs
18 skills total

You can also locate or install skills programmatically:

from dftracer_agents import bundled_skills_dir, install_skills

# Path to the skills inside the installed package
path = bundled_skills_dir()

# Copy to a directory (skips existing by default)
install_skills("/path/to/project", overwrite=False)

Pipeline

The full annotation → trace → diagnosis → optimization pipeline is documented with a Mermaid flowchart:

docs/pipeline.md

It covers every MCP tool call in order, which sub-service owns each one, the workspace directory layout, and how per-file annotation parallelism and L1 optimization iterations work.


Running the tests

Tests call the real MCP tool functions and compare output against direct subprocess calls — no mocking.

source venv/bin/activate
cd dftracer-agents

# Run everything
python -m pytest test/ -v

# Run only dfanalyzer service tests
python -m pytest test/test_dfanalyzer_service.py -v

# Run only dftracer_utils tool tests
python -m pytest test/test_dftracer_utils_mcp_tools.py -v

# Run a single test by name
python -m pytest test/test_dfanalyzer_service.py::test_summarize_trace_on_sample_data -v

Example output:

test/test_dfanalyzer_service.py::test_hydra_args_minimal PASSED
test/test_dfanalyzer_service.py::test_summarize_trace_on_sample_data PASSED

  summarize_trace output:
  DFTracer Trace Summary: test/data/cm1_1_48_20240926
  ============================================================
    Trace files : 48 total
    Total events: 284,041
    Processes   : 48
    Threads     : 48
    Duration    : 145.364s
    Bytes read  : 2.7 MB
    Bytes written: 688.3 MB

  Top I/O operations:
    write                  112,353
    __xstat                 46,899
    fclose                  25,992
    open                    21,814
    close                   21,563
    ...

test/test_dftracer_utils_mcp_tools.py::test_pgzip_on_empty_dir_succeeds PASSED
test/test_dftracer_utils_mcp_tools.py::test_tool_and_subprocess_fail_consistently[info] PASSED
...
33 passed, 3 skipped

The 3 skipped tests are server (blocks forever), aggregator_mpi, and call_tree_mpi (require an MPI build).

What the tests verify

Each parametrized test in test_dftracer_utils_mcp_tools.py runs both the direct binary and the MCP tool function, then asserts both produce the same outcome (both succeed or both fail). This catches drift between the MCP wrapper and the underlying binary.

test_tool_and_subprocess_fail_consistently[info]
  [info] segfaults on this platform (io_uring probe failure)
  direct: dftracer_info -d test/data/cm1_1_48_20240926 --query summary
    rc=-11
  mcp tool (info): raised=True
    CalledProcessError(returncode=-11)
  both failed — direct rc=-11, mcp rc=-11  ✓

Interactive MCP REPL

A text REPL for manual tool exploration — useful for debugging tools without writing test code.

source venv/bin/activate

# All services: utils + analyzer + plot (28 tools, default)
python test/mcp_repl.py --service both

# dftracer_utils only (21 tools)
python test/mcp_repl.py --service utils

# dfanalyzer + plot (7 tools)
python test/mcp_repl.py --service analyzer

Example session:

Starting MCP server (both)…
============================================================
  dftracer-agents MCP REPL
============================================================
  28 tools available.  Type 'list' to see them.
  Syntax:  <tool_name> [<json-args>]
  Example: info {"directory": "/path/to/traces"}
  Commands: list, desc <tool>, quit
============================================================

mcp> list
  aggregator              Aggregate trace files …
  analyze                 Analyze an I/O trace using dfanalyzer …
  info                    Show a summary of trace files in a directory …
  summarize_trace         Summarize a dftracer I/O trace directory …
  …

mcp> desc summarize_trace
Tool: summarize_trace
Parameters:
  trace_path: string (required)
  max_files:  integer  default=50

mcp> summarize_trace {"trace_path": "/path/to/traces"}
  → calling summarize_trace({'trace_path': '/path/to/traces'}) ...

DFTracer Trace Summary: /path/to/traces
============================================================
  Trace files : 48 total
  Total events: 284,041
  Processes   : 48
  Duration    : 145.364s
  Bytes read  : 2.7 MB
  Bytes written: 688.3 MB
  …

mcp> quit

LLM agent (ollama)

test/mcp_agent.py connects the MCP server to a local LLM via the OpenAI-compatible API. You ask questions in plain English; the LLM decides which tools to call.

Prerequisites

# Install and start ollama
ollama serve
ollama pull qwen2.5-coder:7b

Configure the LLM endpoint

Edit test/openai_client_config.json:

{
  "provider": "openai",
  "base_url": "http://localhost:11434/v1",
  "api_key": "ollama",
  "model": "qwen2.5-coder:7b"
}

If running inside Docker and ollama is on the host:

{
  "base_url": "http://host.docker.internal:11434/v1",
  "api_key": "ollama",
  "model": "qwen2.5-coder:7b"
}

Run

source venv/bin/activate

# All services: utils + analyzer + plot (28 tools)
python test/mcp_agent.py --service both

# dfanalyzer + plot (7 tools)
python test/mcp_agent.py --service analyzer

# dftracer_utils only (21 tools)
python test/mcp_agent.py --service utils

# Custom LLM config file
python test/mcp_agent.py --config /path/to/config.json

Example session

Starting MCP server (both)…
Connecting to LLM at http://localhost:11434/v1 (model=qwen2.5-coder:7b)

============================================================
  dftracer-agents LLM Agent
============================================================
  Model : qwen2.5-coder:7b
  Tools : 28 MCP tools available
  Type your question in plain English.  Ctrl-C or 'quit' to exit.
============================================================

you> summarize the trace at /data/cm1_trace

  [tool] summarize_trace({"trace_path":"/data/cm1_trace"})
  [result] DFTracer Trace Summary: /data/cm1_trace …

agent> The trace contains 48 processes, ran for 145 seconds, and performed
       112,353 write calls transferring 688 MB. The dominant operation was
       write (40% of all calls), followed by stat and open/close pairs typical
       of HPC checkpoint I/O.

you> which files were accessed most?

  [tool] analyze({"trace_path":"/data/cm1_trace","view_types":["file_name"]})
  …

Goose integration

Goose is an open-source AI agent that natively supports MCP servers as extensions.

Install Goose

curl -fsSL https://github.com/aaif-goose/goose/releases/download/stable/download_cli.sh \
  | CONFIGURE=false GOOSE_BIN_DIR=/usr/local/bin bash

Node.js ≥ 20 is required for goose tui. Install it if not present:

curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt-get install -y nodejs

Configure the dftracer extension

Add to ~/.config/goose/config.yaml:

extensions:
  dftracer:
    type: stdio
    cmd: dftracer-mcp-server
    args: []
    enabled: true
    description: >
      DFTracer I/O tracing toolkit. Use summarize_trace() for a quick overview
      of any trace directory, analyze() for full dfanalyzer pipeline output,
      or the dftracer_utils tools (info, stats, merge, split, …) for file-level
      operations.

If dftracer-mcp-server is not on your PATH (e.g. installed in a venv), point directly at the Python interpreter:

extensions:
  dftracer:
    type: stdio
    cmd: /path/to/dftracer-agents/venv/bin/python
    args:
      - /path/to/dftracer-agents/dftracer_mcp_server.py
    enabled: true

Expose only one service if you want a smaller tool set:

extensions:
  dftracer:
    type: stdio
    cmd: dftracer-mcp-server
    args: [--service, analyzer]   # or: --service utils
    enabled: true

Using Goose to annotate an application and run the full pipeline

The annotation pipeline clones a target application, detects languages, auto-annotates source files with dftracer macros, verifies correctness with a build + smoke test, pauses for your review, then collects and analyzes traces.

Pipeline files live in dftracer-agents/recipes/:

File Purpose
pipeline.yaml Main orchestration recipe (headless)
annotate-c.yaml Per-file C annotation sub-recipe
annotate-cpp.yaml Per-file C++ annotation sub-recipe
annotate-python.yaml Per-file Python annotation sub-recipe
_inc-top.inc Shared: Step 0 (lessons), General Rules, Step 1 (read)
_inc-write.inc Shared: Step 5 (write back)
_inc-report.inc Shared: Step 7 (report format) + General Pitfalls

A lessons-learned file at .agents/skills/dftracer-annotation-lessons/SKILL.md is read by every annotation agent at startup and appended after each session.

Option A — Interactive session (recommended)

Runs interactively: the agent asks you questions, pauses for confirmation before the trace run, and lets you request fixes between steps.

# Prerequisites: Node.js >=20 (for goose tui, if using that route)
# The recommended interactive method uses the pipeline skill + wrapper script:

./dftracer-agents/run-pipeline.sh

This starts a goose session, injects the startup prompt automatically, then connects your keyboard for the Q&A. The agent will ask:

What is the Git URL of the application you want to annotate?
> https://github.com/org/myapp

Which branch or tag? (default: main)
> main

Smoke test command? (leave blank to auto-detect)
> make test

Extra CMake build flags? (leave blank to skip)
>

After annotation and build verification it pauses:

┌─────────────────────────────────────────────────────────┐
│  ANNOTATION REPORT — myapp/20260616_120000              │
│  src/io.c       DONE   12 annotated  2 skipped          │
│  src/main.c     DONE    4 annotated  0 skipped          │
│  Coverage: 16 / 18   io=10  comm=4  mem=2  cpu=0        │
│  Build: PASSED   Smoke test: PASSED                      │
└─────────────────────────────────────────────────────────┘

Proceed with dftracer trace run? [yes / no / fix <file>]
> yes

Alternatively, start a plain goose session and paste this prompt manually:

Load and follow the dftracer annotation pipeline skill from:
/workspaces/dftracer-agents/.agents/skills/dftracer-pipeline/SKILL.md

Read the full SKILL.md file now, then immediately start Step 1 by asking me:
  'What is the Git URL of the application you want to annotate?'

Option B — Headless goose run (CI / scripted use)

Pass all inputs as --params. The pipeline runs end-to-end unattended, annotating files in parallel via sub-recipes, then prints the full report and trace analysis.

goose run --recipe dftracer-agents/recipes/pipeline.yaml \
  --params app_url="https://github.com/org/myapp" \
  --params ref="main" \
  --params smoke_cmd="make test" \
  --params extra_flags="-DENABLE_MPI=ON"

smoke_cmd and extra_flags are optional (auto-detected if omitted).

Annotation recipes directly

You can also invoke a single-file annotation recipe without the pipeline:

goose run --recipe dftracer-agents/recipes/annotate-c.yaml \
  --params run_id="myapp/20260616_120000" \
  --params filepath="src/io.c"

Pass build_errors to fix a specific compiler failure:

goose run --recipe dftracer-agents/recipes/annotate-c.yaml \
  --params run_id="myapp/20260616_120000" \
  --params filepath="src/io.c" \
  --params build_errors="io.c:42: error: 'DFTRACER_C_FUNCTION_END' undeclared"

Example Goose session — trace analysis

goose session
Goose is running! Enter your instructions, or try asking for help.

( O )> summarize the I/O trace at /data/cm1_1_48_20240926

─── Tool use: summarize_trace ────────────────────────────────────
{
  "trace_path": "/data/cm1_1_48_20240926"
}
──────────────────────────────────────────────────────────────────

DFTracer Trace Summary: /data/cm1_1_48_20240926
============================================================
  Trace files : 48 total
  Total events: 284,041
  Processes   : 48
  Threads     : 48
  Duration    : 145.364s
  Bytes read  : 2.7 MB
  Bytes written: 688.3 MB

Top I/O operations:
  write                  112,353
  __xstat                 46,899
  fclose                  25,992
  open                    21,814
  close                   21,563
  fopen64                 20,496
  read                    13,489
  opendir                  9,215

This trace shows a write-heavy HPC workload (CM1 atmospheric model) with
48 MPI ranks. The high write volume (688 MB) relative to reads (2.7 MB)
is typical of checkpoint I/O. The __xstat and opendir calls suggest
frequent directory polling, common in parallel file systems.

( O )> compress all the .pfw files in that directory

─── Tool use: pgzip ──────────────────────────────────────────────
{
  "directory": "/data/cm1_1_48_20240926"
}
──────────────────────────────────────────────────────────────────
…

The annotation pipeline prompt and the analysis workflow below can be combined — after the pipeline completes, ask Goose to continue with analyze and query on the trace directory it just produced.

Recommended analysis workflow

When working with a new trace, follow this sequence — the agent will do this automatically if you ask it to "analyse" a trace:

1. detect_preset(trace_path)
   → reads event categories from the trace
   → returns "posix" (HPC/scientific) or "dlio" (AI/ML deep learning)

2. summarize_trace(trace_path)
   → pure-Python overview: file count, duration, bytes read/written, top ops
   → always works, no native C++ required

3. query(trace_path, view_type="proc_name")
   → per-process I/O breakdown (who writes the most?)

4. query(trace_path, view_type="file_name",
         filter_expr='cat == "POSIX" and dur > 1000')
   → hot-file analysis (which files take the longest per call?)

5. query(trace_path, view_type="time_range")
   → I/O rate over time (when does the burst happen?)

6. analyze(trace_path, analyzer_preset=<from step 1>)
   → full dfanalyzer pipeline (requires native C++ and dfanalyzer install)

AI/ML auto-detection (detect_preset): inspects the cat field of every event against the AI/ML category signatures from dftracer's ai_common.py. If any of COMPUTE, DATA, DATALOADER, COMM, DEVICE, CHECKPOINT, or PIPELINE categories are present — or AI/ML function names like forward, backward, epoch, fetch — the dlio preset is recommended; otherwise posix.

Available tools in Goose

Tool Service Description
detect_preset analyzer Auto-detect posix vs dlio preset from trace event categories
summarize_trace analyzer Pure-Python trace summary (always works, no native deps)
query analyzer Exploratory groupby views: file_name, proc_name, time_range, raw
analyze analyzer Full dfanalyzer pipeline (POSIX, DLIO, Darshan presets)
list_presets analyzer Show all dfanalyzer configuration options
plot plot Generate a chart from trace data (5 types, PNG/HTML/SVG)
plot_all plot Generate all 5 standard plots in one call
info utils Trace directory summary
stats utils Per-function I/O statistics
merge utils Merge multiple trace files
split utils Split a trace by process/time
pgzip utils Parallel gzip compression
index utils Build bloom-filter index for fast queries
reader utils Read and dump trace events
organize utils Organize traces by run
replay utils Replay trace I/O for benchmarking
tar utils Archive trace directories
utils 21 dftracer_utils tools total

query filter expression syntax

The filter_expr parameter uses the same DSL as dftracer_stats --query. Available fields:

Field Type Example
cat string cat == "POSIX"
name string name in ("read", "write")
dur int (µs) dur > 1000
ts int (µs)
pid, tid int pid == 3537780

Expressions can be combined with and / or:

'cat == "POSIX" and dur > 500'
'name in ("read", "write") and dur > 10000'
'cat == "COMPUTE"'

MCP server directly

You can also run the server manually for debugging or to wire it into any MCP client:

# Start the server (it blocks, reading MCP requests from stdin)
dftracer-mcp-server --service both

# Or via Python
python dftracer_mcp_server.py --service analyzer

The server writes MCP JSON-RPC to stdout and reads from stdin. Any MCP client (Goose, Claude Desktop, a custom script) can connect by spawning it as a subprocess.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors