DFDiagnoser

DFDiagnoser turns DFAnalyzer's per-window analysis facts into longitudinal diagnosis findings that DFOptimizer can act on: is this bottleneck persistent, is it getting worse, where does it occur, and what class of fix does it call for? It runs the same way offline (replaying a saved fact bundle) and online (consuming a live stream over Mofka or ZMQ), producing identical findings.

What it does

DFAnalyzer emits analyzer.fact-envelope.v1 facts — one per analysis window per view (a temporal view like window/epoch/step/time_range, or a spatial view like file_name/proc_name). Each fact carries a fact_type (e.g. fetch_pressure), a two-level scope (layer:view aggregate, or layer:view:entity detail), a continuous severity, and opportunity_tags.
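As a rough illustration of the two-level scope described above, the following sketch splits a scope string into its parts. The `Scope` class, the `parse_scope` helper, and the `posix` layer name are hypothetical and not part of the DFDiagnoser API; only the `layer:view` and `layer:view:entity` shapes come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical container for a fact scope (not the library's own type)."""
    layer: str
    view: str
    entity: Optional[str] = None  # None means an aggregate (layer:view) scope

    @property
    def is_detail(self) -> bool:
        return self.entity is not None


def parse_scope(raw: str) -> Scope:
    """Split 'layer:view' or 'layer:view:entity' into its parts."""
    # maxsplit=2 keeps any ':' inside the entity (for example a path) intact.
    parts = raw.split(":", 2)
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected layer:view[:entity], got {raw!r}")
    return Scope(*parts)


# 'posix' is an assumed layer name used only for illustration.
aggregate = parse_scope("posix:file_name")
detail = parse_scope("posix:file_name:/data/train_0.npz")
print(aggregate.is_detail, detail.entity)
```

Keeping the entity optional lets one type represent both aggregate and detail scopes, which is convenient when facts are later keyed by (fact_type, scope).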

DFDiagnoser is a pure fact consumer (scoring/fact-building live in the analyzer). It tracks each (fact_type, scope) along its temporal axis and summarizes the trajectory:

  • persistence — longest run of consecutive windows the fact appears in
  • prevalence — fraction of windows it appears in
  • trend — improving / stable / worsening
  • motif — the classified pattern (e.g. persistent_pressure, metadata_bound)
  • recommendation_bundle + opportunity_tags — the class of fix (e.g. input_pipeline_tuning)

Temporal views (window online; epoch/step/time_range offline) are the longitudinal axis; spatial views (file/proc/host) are one-shot. Online, it can also emit per-window control findings so the optimizer acts on fresh state.
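The trajectory summary listed above can be sketched in a few lines. This is a minimal illustration, not DFDiagnoser's implementation: persistence and prevalence follow the definitions given in the list, while the trend rule (comparing the mean severity of the later half against the earlier half, with a tolerance `tol`) is an assumption made for this example.

```python
from typing import Dict, Optional, Sequence


def summarize(severities: Sequence[Optional[float]], tol: float = 0.05) -> Dict[str, object]:
    """Summarize one (fact_type, scope) trajectory.

    severities[i] is the fact's severity in window i, or None if the
    fact did not appear in that window.
    """
    total = len(severities)
    present = [s for s in severities if s is not None]

    # persistence: longest run of consecutive windows the fact appears in
    longest = run = 0
    for s in severities:
        run = run + 1 if s is not None else 0
        longest = max(longest, run)

    # prevalence: fraction of windows the fact appears in
    prevalence = len(present) / total if total else 0.0

    # trend (assumed rule): later-half mean severity vs earlier-half mean
    trend = "stable"
    if len(present) >= 2:
        half = len(present) // 2
        early = sum(present[:half]) / half
        late = sum(present[-half:]) / half
        if late - early > tol:
            trend = "worsening"
        elif early - late > tol:
            trend = "improving"

    return {"persistence": longest, "prevalence": prevalence, "trend": trend}


print(summarize([0.2, 0.3, None, 0.6, 0.7, 0.8]))
```

In this sample the fact is missing from one of six windows, so the longest unbroken run is 3 and the rising severities are classified as worsening under the assumed rule.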

Installation

pip install dfdiagnoser                 # core (offline)
pip install "dfdiagnoser[streaming]"    # + online transports (pyzmq / mofka)

From source:

git clone https://github.com/LLNL/dfdiagnoser.git && cd dfdiagnoser
uv sync && uv pip install -e .

Usage

The CLI is a Hydra app selecting an input (file / mofka / zmq) and an output (console / file).

Offline — replay a DFAnalyzer fact bundle

This example uses a minimal time_range rule. Offline rules are workload-specific, and the shipped dlio.yaml rules target the streaming epoch axis, so the rule is defined inline here. Save it as /tmp/tr.yaml:

schema_version: analysisfact-rules.v1
defaults: {rule_version: "1.0.0", emit_mode: aggregate, confidence: "0.80"}
rules:
  - id: tr.reader_pressure.v1
    priority: 100
    source_view: time_range
    fact_type: reader_pressure
    required_metrics: [reader_posix_time_proc_max, app_time_proc_max]
    derived_metrics:
      reader_frac: "fillna0(reader_posix_time_proc_max) / max(fillna0(app_time_proc_max), 1e-9)"
    when: "reader_frac >= 0.10"
    severity_score: "clip01(reader_frac)"
    opportunity_tags: [dataloader_prefetch, reader_parallelism]

Then generate the bundle and replay it:

# 1. DFAnalyzer writes a fact bundle (facts.jsonl) with output=file. time_range is
#    the offline longitudinal axis (epoch/window come from the streaming path below).
dfanalyzer analyzer/preset=dlio trace_path=tests/data/extracted/dftracer-dlio \
    view_types=[time_range] facts.enabled=true \
    facts.eval_rule_file=/tmp/tr.yaml output=file output.path=/tmp/bundle

# 2. DFDiagnoser replays it into findings
dfdiagnoser input=file input.path=/tmp/bundle output=console

Example output:

╭─ DFDiagnoser Diagnosis ─╮
│ Findings   1            │
│ Scopes     1            │
│ Severity   critical: 1  │
╰─────────────────────────╯
Findings
└── view_type: time_range (1)
    └── time_range (1)
        └── [C] reader_pressure: unclassified (severity critical 1.00, conf 0.50)
            ├── prevalence 0.93, persistence 39, trend stable -> investigate
            └── (fact) reader_pressure @ time_range

(Read it as: reader_pressure is rated critical, appears in 93% of time_range windows, and its longest unbroken run spans 39 consecutive windows; its opportunity_tags carry the fix class the optimizer acts on.)
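The arithmetic in the tr.reader_pressure.v1 rule from the offline example can be traced by hand. The sketch below mirrors the rule's derived_metrics, when, and severity_score expressions in plain Python. The bodies of `fillna0` and `clip01` are assumptions inferred from their names (missing value becomes 0, result clamped to [0, 1]), and `evaluate_reader_pressure` is a hypothetical helper, not the analyzer's rule engine.

```python
import math
from typing import Optional


def fillna0(x: Optional[float]) -> float:
    """Assumed meaning of fillna0: a missing (None or NaN) metric counts as 0."""
    if x is None or math.isnan(x):
        return 0.0
    return float(x)


def clip01(x: float) -> float:
    """Assumed meaning of clip01: clamp a value into [0, 1]."""
    return min(1.0, max(0.0, x))


def evaluate_reader_pressure(
    reader_posix_time_proc_max: Optional[float],
    app_time_proc_max: Optional[float],
) -> Optional[float]:
    """Return the severity score if the rule fires, else None."""
    # derived_metrics.reader_frac
    reader_frac = fillna0(reader_posix_time_proc_max) / max(
        fillna0(app_time_proc_max), 1e-9
    )
    # when: "reader_frac >= 0.10"
    if reader_frac < 0.10:
        return None
    # severity_score: "clip01(reader_frac)"
    return clip01(reader_frac)


print(evaluate_reader_pressure(5.0, 10.0))   # reader takes half the app time
print(evaluate_reader_pressure(0.5, 10.0))   # below the 0.10 threshold
```

The `max(..., 1e-9)` guard in the rule avoids division by zero when the application time is missing or zero; in that case any non-zero reader time saturates the severity at 1.0.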

Online — live stream (ZMQ); window/epoch axis

In streaming mode the event flow is split into windows, so each window carries epoch/window coordinates, which form the per-epoch longitudinal axis. DFAnalyzer streams fact envelopes with output=zmq; DFDiagnoser consumes them live and prints the summary on idle (or, with input.output_address set, streams control findings onward to the optimizer):

# DFDiagnoser binds and waits for facts
dfdiagnoser input=zmq input.address="tcp://*:5556" output=console

# DFAnalyzer side (separate process) pushes window-view facts to it
dfanalyzer analyzer/preset=dlio input=zmq input.address="tcp://*:5555" \
    view_types=[window] facts.enabled=true \
    output=zmq output.address="tcp://127.0.0.1:5556"

Example output:

Findings
└── view_type: window (1)
    └── window (1)
        └── [C] fetch_pressure: persistent_pressure (severity critical 1.00, conf 0.80)
            ├── prevalence 1.00, persistence 18, trend stable -> input_pipeline_tuning
            └── (fact) fetch_pressure @ window

Online — live stream (Mofka, LiveFlow)

dfdiagnoser input=mofka \
    input.group_file="$MOFKA_GROUP_FILE" \
    input.topic_name=analyzer_facts \
    input.output_topic=diagnosis_findings \
    output=console

All three paths yield the same DiagnosisResult.findings; only the transport differs.

Inputs and outputs

  • Input: analyzer.fact-envelope.v1 envelopes — a .jsonl file / bundle dir (offline), or a Mofka/ZMQ stream (online).
  • Output: findings rendered to the console, written to JSON (output=file), or streamed onward. Each finding's wire form carries finding_type, scope, motif, severity_score, prevalence, persistence, trend_direction, opportunity_tags, recommendation_bundle, and key_metrics — the fields DFOptimizer gates actuation on.
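To make the offline input concrete, here is a minimal sketch of reading JSONL fact envelopes and grouping them by (fact_type, scope), the key DFDiagnoser tracks along the temporal axis. The top-level JSON keys (`fact_type`, `scope`, `severity`), the `dataloader` layer name, and the `group_facts` helper are assumptions for illustration; the actual analyzer.fact-envelope.v1 layout may nest or name these fields differently.

```python
import io
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_facts(lines: Iterable[str]) -> Dict[Tuple[str, str], List[dict]]:
    """Group JSONL fact records by (fact_type, scope), preserving order."""
    groups: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the bundle
        fact = json.loads(line)
        # Assumed key names; adjust to the real envelope layout.
        groups[(fact["fact_type"], fact["scope"])].append(fact)
    return dict(groups)


# In-memory stand-in for a facts.jsonl file (made-up sample records).
sample = io.StringIO(
    '{"fact_type": "fetch_pressure", "scope": "dataloader:window", "severity": 0.9}\n'
    '{"fact_type": "fetch_pressure", "scope": "dataloader:window", "severity": 1.0}\n'
    '{"fact_type": "reader_pressure", "scope": "dataloader:window", "severity": 0.4}\n'
)

for key, facts in group_facts(sample).items():
    print(key, [f["severity"] for f in facts])
```

With a real bundle, the same function could be fed an open file handle, since it only iterates over lines.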

Requirements

  • Python >= 3.9
  • DFAnalyzer fact envelopes (run DFAnalyzer with facts.enabled=true)
  • Online transports: pyzmq (ZMQ) or mochi-mofka (Mofka), via the [streaming] extra

License

MIT — see LICENSE.
