
[Bug]: Heartbeat re-fires in tight loop when agent uses exec during heartbeat turn #858

@Cstewart-HC

Description


Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • I am using the latest version of Moltis
  • If this happened during a chat session, I included as much full session context as possible and redacted secrets

What happened?

When heartbeat is enabled and the heartbeat agent turn invokes exec (directly or via a skill that shells out), the heartbeat re-fires every few seconds to minutes in a tight, self-reinforcing loop instead of respecting its configured interval (e.g. 30m).

Observed: ~20 heartbeat runs in ~30 minutes. The __heartbeat__ cron job reports lastStatus: "ok" for every run, but its nextRunAtMs keeps being reset to "now" instead of advancing by the configured interval.

Expected behavior

The heartbeat fires once per configured interval. exec calls made within the heartbeat turn should not cause the heartbeat to reschedule itself immediately.

Steps to reproduce

  1. Enable heartbeat in moltis.toml ([heartbeat] enabled = true)
  2. Set a heartbeat prompt or HEARTBEAT.md that instructs the agent to run diagnostic shell commands (e.g. check service status, run a collection script)
  3. Observe the heartbeat cron job firing repeatedly — every few seconds to minutes — instead of at the configured interval

Root cause

A feedback loop between the exec completion callback and the heartbeat wake mechanism:

  1. Heartbeat fires → agent starts LLM turn
  2. Agent calls exec (any shell command)
  3. exec completes → ExecCompletionFn fires (crates/gateway/src/server/prepare_core/post_state.rs:622-629)
  4. Callback enqueues a system event and unconditionally calls cs.wake("exec-event")
  5. CronService::wake() (crates/cron/src/service.rs:227-238) sets next_run_at_ms = now on the __heartbeat__ job
  6. Current heartbeat run finishes → running_at_ms cleared
  7. Timer loop sees next_run_at_ms <= now and running_at_ms == None → fires heartbeat again
  8. Goto step 2

The running_at_ms guard in wake() prevents firing during a run, but the next_run_at_ms = now assignment persists and takes effect the instant the run completes.
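To make the loop concrete, here is a small, dependency-free Rust model of steps 3 to 7. It is a sketch, not Moltis code: `JobState`, `wake` and `simulate` are hypothetical names. The only behaviour it models is the one described above, namely that wake() writes `next_run_at_ms = now` even while a run is in progress. It also assumes each heartbeat run lasts a fixed time and that one exec completes halfway through it.

```rust
/// Simplified model of the `__heartbeat__` job fields involved in the loop.
#[derive(Debug, Default)]
pub struct JobState {
    pub next_run_at_ms: u64,
    pub running_at_ms: Option<u64>,
}

/// Models step 5: wake() schedules the job for "now". Per the report, the
/// running_at_ms guard only prevents an immediate fire; this write still happens.
pub fn wake(job: &mut JobState, now_ms: u64) {
    job.next_run_at_ms = now_ms;
}

/// Runs the timer loop in 1-second ticks for `horizon_ms` and returns how many
/// heartbeat runs started. Each run lasts `run_ms`; when `exec_wakes` is true,
/// an exec completes halfway through each run and calls `wake()`
/// (`run_ms / 2` should be a multiple of the 1 s tick for that to line up).
pub fn simulate(interval_ms: u64, run_ms: u64, horizon_ms: u64, exec_wakes: bool) -> u32 {
    let mut job = JobState::default();
    let mut fires = 0;
    let mut now: u64 = 0;
    while now < horizon_ms {
        let running = job.running_at_ms;
        match running {
            // Step 6: the current run finishes and running_at_ms is cleared.
            Some(started) if now >= started + run_ms => job.running_at_ms = None,
            // Steps 3-5: exec completes mid-run and the callback wakes the heartbeat.
            Some(started) if exec_wakes && now == started + run_ms / 2 => wake(&mut job, now),
            _ => {}
        }
        // Step 7: the timer fires when the job is due and not already running.
        if job.running_at_ms.is_none() && job.next_run_at_ms <= now {
            job.running_at_ms = Some(now);
            job.next_run_at_ms = now + interval_ms;
            fires += 1;
        }
        now += 1_000;
    }
    fires
}
```

With a 30-minute interval, 10-second runs and a 30-minute window, `simulate(1_800_000, 10_000, 1_800_000, false)` starts a single run. The same call with `exec_wakes = true` starts a new run every 10 seconds (180 runs), because every finished run immediately finds `next_run_at_ms <= now`.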

Additionally, there is no moltis.toml config option to disable exec-completion-triggered heartbeat wakes.

Did this happen during a chat session?

Yes

Chat session context (if applicable)

Custom __heartbeat__ job with sessionTarget: { named: "heartbeat" } and an agentTurn payload that runs diagnostic commands via exec.

Error messages / logs

No errors — every run succeeds with lastStatus: "ok". The issue is the excessive frequency.

Is this a regression?

No — this is a long-standing architectural gap.

Moltis version

Built from source (current main)

Component

Core / Gateway, Cron scheduler

Install method

Built from source

Operating system

Debian 12 (bookworm) / Ubuntu host

Proposed fixes

Fix 1: Session-aware wake filter (recommended)

In post_state.rs, skip cs.wake() when the exec completion originates from the heartbeat session (cron:heartbeat). The exec callback already has access to the session context — check whether the session key matches the heartbeat session and, if so, only enqueue the event without calling wake().

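```rust
// In ExecCompletionFn callback (post_state.rs:622-629):
// `session_key` is assumed to be an Option<String> on the completion event.
let is_heartbeat_session = event.session_key.as_deref() == Some("cron:heartbeat");
tokio::spawn(async move {
    // Always record the exec result as a system event.
    eq.enqueue(summary, "exec-event".into()).await;
    // Only wake the cron service for exec calls made outside the heartbeat
    // session, so a heartbeat turn cannot reschedule itself.
    if !is_heartbeat_session {
        cs.wake("exec-event").await;
    }
});
```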
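The decision in Fix 1 can be pulled out into a pure function so it is unit-testable without a Tokio runtime. This is a hedged sketch: `ExecCompletionAction`, `exec_completion_action` and `HEARTBEAT_SESSION_KEY` are hypothetical names. The `"cron:heartbeat"` key is taken from the proposal above rather than verified against the gateway code.

```rust
/// Heartbeat session key as stated in the proposal (assumption, not verified).
pub const HEARTBEAT_SESSION_KEY: &str = "cron:heartbeat";

/// What the exec completion callback should do once a command finishes.
#[derive(Debug, PartialEq, Eq)]
pub enum ExecCompletionAction {
    /// Enqueue the "exec-event" system event and wake the cron service.
    EnqueueAndWake,
    /// Enqueue the event only; leave the heartbeat schedule alone.
    EnqueueOnly,
}

/// Hypothetical helper: picks the action from the exec's originating session key.
pub fn exec_completion_action(session_key: Option<&str>) -> ExecCompletionAction {
    if session_key == Some(HEARTBEAT_SESSION_KEY) {
        ExecCompletionAction::EnqueueOnly
    } else {
        // Non-heartbeat sessions (and events with no session) keep today's behaviour.
        ExecCompletionAction::EnqueueAndWake
    }
}
```

In the callback this would be called as `exec_completion_action(event.session_key.as_deref())`, assuming `session_key` is an `Option<String>`. Because the reporter's job sets `sessionTarget: { named: "heartbeat" }`, it is worth confirming which session key such a job actually produces before hard-coding the string.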

Fix 2: Heartbeat wake cooldown (debounce)

Add a configurable minimum cooldown between heartbeat wakes. Even if wake() is called, ignore it if the heartbeat completed less than N minutes ago. This is a broader defense that protects against any source of rapid re-waking.

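```rust
// In HeartbeatConfig:
pub wake_cooldown_secs: Option<u64>, // e.g. 300 = 5 minutes

// In CronService::wake():
// Convert the cooldown to milliseconds to match the job timestamps.
// (`heartbeat_config` is an illustrative name for wherever the config is held.)
let wake_cooldown_ms = heartbeat_config.wake_cooldown_secs.map(|s| s.saturating_mul(1_000));
// Skip the wake if the heartbeat last ran less than the cooldown ago.
if let (Some(cooldown_ms), Some(last_run)) = (wake_cooldown_ms, job.state.last_run_at_ms) {
    if now.saturating_sub(last_run) < cooldown_ms {
        return; // too soon: leave next_run_at_ms unchanged
    }
}
```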
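A self-contained version of the Fix 2 debounce follows, separated from `CronService` so the cooldown check can be tested on its own. `HeartbeatJob`, `wake_suppressed` and `wake_with_cooldown` are hypothetical names. The sketch assumes `last_run_at_ms` records when the heartbeat last ran. If Moltis stores the start time rather than the completion time, the cooldown is measured from the start of the previous run.

```rust
/// Simplified stand-in for the heartbeat job state (hypothetical; mirrors the
/// `next_run_at_ms` and `last_run_at_ms` fields quoted above).
#[derive(Debug, Default)]
pub struct HeartbeatJob {
    pub next_run_at_ms: u64,
    pub last_run_at_ms: Option<u64>,
}

/// True if a wake at `now_ms` should be ignored because the heartbeat last ran
/// less than `wake_cooldown_secs` ago. `None` disables the cooldown.
pub fn wake_suppressed(job: &HeartbeatJob, now_ms: u64, wake_cooldown_secs: Option<u64>) -> bool {
    match (wake_cooldown_secs, job.last_run_at_ms) {
        // Config is in seconds, timestamps are in milliseconds.
        (Some(secs), Some(last_run)) => now_ms.saturating_sub(last_run) < secs.saturating_mul(1_000),
        _ => false,
    }
}

/// Debounced wake: only pulls `next_run_at_ms` forward outside the cooldown.
/// Returns whether the wake took effect.
pub fn wake_with_cooldown(job: &mut HeartbeatJob, now_ms: u64, wake_cooldown_secs: Option<u64>) -> bool {
    if wake_suppressed(job, now_ms, wake_cooldown_secs) {
        return false; // too soon: keep the existing interval schedule
    }
    job.next_run_at_ms = now_ms;
    true
}
```

Returning early without touching `next_run_at_ms` is what breaks the loop: a wake that arrives during or just after a heartbeat run leaves the regular interval schedule in place instead of pulling it forward to now.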

Both fixes can coexist. Fix 1 is the targeted solution; Fix 2 is a safety net.

Additional context

This was previously worked around by running heartbeat-loop.sh as a no-LLM background process that writes HEARTBEAT.md, then having the heartbeat LLM only read the file (no exec). However, any heartbeat prompt that triggers exec will reintroduce the loop.

Metadata


Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
