prevent WorkflowMonitor OOM by projecting only metric fields#1145

Open
mohammaddanishali-bit wants to merge 5 commits into
conductor-oss:mainfrom
mohammaddanishali-bit:fix/workflow-monitor-oom
@mohammaddanishali-bit mohammaddanishali-bit commented Jun 1, 2026

Contributor

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • WHOSUSING.md
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

WorkflowMonitor.reportMetrics() runs on every node (it is enabled by default).
Until now it loaded every version of every workflow/task definition via
getWorkflowDefs()/getTaskDefs(), deserializing the full json_data into
WorkflowDef/TaskDef objects just to read name, ownerApp, and concurrencyLimit.
At ~100K definitions this peaked at hundreds of MB of heap per refresh, causing
a recurring OutOfMemoryError that cascaded into sweeper, polling, and HikariPool
failures.

This PR adds lightweight projections:

  • New DTOs WorkflowMetricInfo(name, ownerApp) and TaskMetricInfo(name, ownerApp, concurrencyLimit).
  • getWorkflowMetricInfo() / getTaskMetricInfo() on MetadataDAO with a default
    implementation projecting from getAllWorkflowDefsLatestVersions() /
    getAllTaskDefs() (Redis/Cassandra/SQLite unchanged); Postgres and MySQL
    override with DB-side projection [...] of catalog size, with no change to
    the emitted metrics.
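
To illustrate the shape of the projection described above, here is a minimal sketch, not the code in this PR. The WorkflowDef/TaskDef classes are trimmed stand-ins for the real definition classes, the field types are assumptions, and the class name MetricProjectionSketch is made up for the example. Only the DTO names, the two new method names, and the two source methods come from the description.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MetricProjectionSketch {

    // Stand-ins for the full definitions; the real classes carry many more fields.
    static final class WorkflowDef {
        final String name;
        final String ownerApp;
        final int version;
        WorkflowDef(String name, String ownerApp, int version) {
            this.name = name;
            this.ownerApp = ownerApp;
            this.version = version;
        }
    }

    static final class TaskDef {
        final String name;
        final String ownerApp;
        final Integer concurrencyLimit;
        TaskDef(String name, String ownerApp, Integer concurrencyLimit) {
            this.name = name;
            this.ownerApp = ownerApp;
            this.concurrencyLimit = concurrencyLimit;
        }
    }

    // Lightweight DTOs holding only the fields the monitor reads.
    static final class WorkflowMetricInfo {
        final String name;
        final String ownerApp;
        WorkflowMetricInfo(String name, String ownerApp) {
            this.name = name;
            this.ownerApp = ownerApp;
        }
    }

    static final class TaskMetricInfo {
        final String name;
        final String ownerApp;
        final Integer concurrencyLimit;
        TaskMetricInfo(String name, String ownerApp, Integer concurrencyLimit) {
            this.name = name;
            this.ownerApp = ownerApp;
            this.concurrencyLimit = concurrencyLimit;
        }
    }

    interface MetadataDAO {
        List<WorkflowDef> getAllWorkflowDefsLatestVersions();
        List<TaskDef> getAllTaskDefs();

        // Default: project in memory, so stores without an override keep working.
        default List<WorkflowMetricInfo> getWorkflowMetricInfo() {
            return getAllWorkflowDefsLatestVersions().stream()
                    .map(d -> new WorkflowMetricInfo(d.name, d.ownerApp))
                    .collect(Collectors.toList());
        }

        default List<TaskMetricInfo> getTaskMetricInfo() {
            return getAllTaskDefs().stream()
                    .map(d -> new TaskMetricInfo(d.name, d.ownerApp, d.concurrencyLimit))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        MetadataDAO dao = new MetadataDAO() {
            public List<WorkflowDef> getAllWorkflowDefsLatestVersions() {
                return List.of(new WorkflowDef("order_flow", "billing", 3));
            }
            public List<TaskDef> getAllTaskDefs() {
                return List.of(new TaskDef("charge_card", "billing", 5));
            }
        };
        WorkflowMetricInfo w = dao.getWorkflowMetricInfo().get(0);
        System.out.println(w.name + " owned by " + w.ownerApp);
    }
}
```

The default still materializes the full definitions before projecting, so it mainly helps by skipping older versions and by letting the caller hold only the small DTOs. A store that overrides the method can avoid building the full objects at all.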

Issue #1006

Alternatives considered

Set conductor.workflow-monitor.enabled=false on deployments that don't need these metrics and increase heap on deployments where it stays enabled.

WorkflowMonitor.reportMetrics() loaded every version of every workflow/task
definition and deserialized full json_data into WorkflowDef/TaskDef object
graphs, only to read name, ownerApp and concurrencyLimit. At ~100K definitions
this peaked at hundreds of MB of heap per refresh, causing a recurring
OutOfMemoryError on every node (the monitor is enabled by default) that cascaded
into sweeper, polling and Hikari pool failures.

Add lightweight projections: WorkflowMetricInfo/TaskMetricInfo DTOs and
getWorkflowMetricInfo()/getTaskMetricInfo() on MetadataDAO with a default that
projects from getAllWorkflowDefsLatestVersions()/getAllTaskDefs(); Postgres and
MySQL override with DB-side projection. WorkflowMonitor holds the DTOs and the
per-name grouping helper is removed.
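
As a rough illustration of the monitor side, the sketch below shows a monitor that caches the small DTOs and reports one gauge per entry without grouping by name. It assumes the projection yields one entry per workflow name, as a latest-versions projection would. MonitorSketch, GaugeSink, the metric label, and the pending-count function are hypothetical placeholders, not the actual WorkflowMonitor API.

```java
import java.util.List;
import java.util.function.ToLongFunction;

final class MonitorSketch {

    // Same shape as the DTO named in this PR; field types are assumed.
    static final class WorkflowMetricInfo {
        final String name;
        final String ownerApp;
        WorkflowMetricInfo(String name, String ownerApp) {
            this.name = name;
            this.ownerApp = ownerApp;
        }
    }

    // Hypothetical sink standing in for the real metrics registry.
    interface GaugeSink {
        void record(String metric, String name, String ownerApp, long value);
    }

    // The monitor keeps only the DTOs between refreshes, never full definitions.
    private List<WorkflowMetricInfo> cached = List.of();

    void refresh(List<WorkflowMetricInfo> latest) {
        cached = List.copyOf(latest);
    }

    // One gauge per entry; no per-name grouping is needed when names are unique.
    void report(ToLongFunction<String> pendingCount, GaugeSink sink) {
        for (WorkflowMetricInfo info : cached) {
            sink.record("workflow_running", info.name, info.ownerApp,
                    pendingCount.applyAsLong(info.name));
        }
    }

    public static void main(String[] args) {
        MonitorSketch monitor = new MonitorSketch();
        monitor.refresh(List.of(new WorkflowMetricInfo("order_flow", "billing")));
        // The pending count is faked here; a real monitor would query the execution store.
        monitor.report(name -> 7L,
                (metric, name, owner, value) ->
                        System.out.println(metric + " " + name + " " + owner + " " + value));
    }
}
```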