centreon-broker RRD min_heartbeat problem with low service check interval #3055

@HHerrgesell

BUG REPORT INFORMATION

may be related to #797

Version

rpm -qa | egrep 'centreon-(broker|web|gorgone)'
centreon-web-24.10.16-1.el8.noarch
centreon-broker-24.10.14-1.el8.x86_64
centreon-broker-cbd-24.10.14-1.el8.x86_64
centreon-broker-core-24.10.14-1.el8.x86_64
centreon-broker-cbmod-24.10.14-1.el8.x86_64
centreon-gorgone-24.10.9-1.el8.noarch
centreon-gorgone-centreon-config-24.10.9-1.el8.noarch

Operating System

AlmaLinux 8.10

Browser used

  • Google Chrome
  • Firefox
  • Internet Explorer IE11
  • Safari

Version: 143.0.7499.170

Additional environment details (AWS, VirtualBox, physical, etc.):
VMware VM with an external database server (also AlmaLinux 8)


Description

RRD files do not retain valid data points for services with check intervals longer than ~10 minutes, despite performance data being correctly stored in centreon_storage.data_bin and broker logs confirming successful RRD update attempts. This affects passive checks (e.g., Greenbone vulnerability scans, external data feeds) and any active checks with extended intervals.

Key finding: Manually modifying an RRD file's minimal_heartbeat from 600 seconds to a value matching 2-3× the actual check interval resolves the issue completely, confirming this parameter as the root cause.
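
As a rough illustration of the 2-3× rule above, the arithmetic can be sketched in shell. The function name `suggest_heartbeat` and its parameters are hypothetical and not part of Centreon or rrdtool; the multiplier is only the rule of thumb from this report, not a documented recommendation.

```shell
#!/usr/bin/env bash
# suggest_heartbeat CHECK_INTERVAL INTERVAL_LENGTH [MULTIPLIER]
# Hypothetical helper: prints a heartbeat (in seconds) that tolerates
# MULTIPLIER missed check intervals. Not a Centreon or rrdtool command.
suggest_heartbeat() {
    local check_interval=$1    # service check interval, in interval units
    local interval_length=$2   # seconds per interval unit (60 in this setup)
    local multiplier=${3:-3}   # safety factor, defaults to 3
    echo $(( check_interval * interval_length * multiplier ))
}

suggest_heartbeat 60 60 2     # 60-minute check, 2x -> 7200
suggest_heartbeat 60 60 3     # 60-minute check, 3x -> 10800
suggest_heartbeat 120 60 3    # 120-minute check, 3x -> 21600
```

All three values are well above the 600 seconds currently written into the RRD files.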


Steps to Reproduce

  1. Create a passive or active service with a check interval of 60 minutes that returns performance data
  2. Wait for multiple check periods (≥3 hours) without manually triggering checks
  3. Verify database contains perfdata: SELECT * FROM centreon_storage.data_bin WHERE id_metric = <metric_id>;
  4. Examine RRD file: rrdtool info /var/lib/centreon/metrics/<metric_id>.rrd
  5. Check for valid data: rrdtool dump /var/lib/centreon/metrics/<metric_id>.rrd | grep -v NaN
  6. View service graph in Centreon web interface
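
Step 5 can be made a little more quantitative by counting rows instead of reading the dump by eye. The sketch below is an assumption-laden helper, not part of rrdtool: `count_rows` is a hypothetical name, and it counts rows across all RRAs in the dump together.

```shell
#!/usr/bin/env bash
# count_rows: reads `rrdtool dump` XML on stdin and prints
# "<valid> <total>", where valid = rows that contain no NaN.
# Hypothetical helper; counts rows of all RRAs together.
count_rows() {
    awk '/<row>/ { total++; if ($0 !~ /NaN/) valid++ }
         END { printf "%d %d\n", valid, total }'
}

# Usage, with <metric_id> as in the steps above:
#   rrdtool dump /var/lib/centreon/metrics/<metric_id>.rrd | count_rows
```

A file affected by this issue should show a valid count that is tiny compared to the total.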

Describe the received result

RRD File Analysis (60-min check interval)

Metric ID 761038 (default broker configuration):

[root@monitoring ~]# rrdtool info /var/lib/centreon/metrics/761038.rrd
filename = "/var/lib/centreon/metrics/761038.rrd"
rrd_version = "0003"
step = 1
last_update = 1767717430
ds[value].type = "GAUGE"
ds[value].minimal_heartbeat = 600     # Only 10 minutes tolerance
ds[value].last_ds = "2.000000"
ds[value].value = NaN                 # No valid current value
rra[0].cf = "AVERAGE"
rra[0].pdp_per_row = 60
rra[0].cdp_prep[0].unknown_datapoints = 10
rra[1].cf = "AVERAGE"
rra[1].pdp_per_row = 3600
rra[1].cdp_prep[0].unknown_datapoints = 2230  # Nearly all data unknown

Data dump showing minimal valid data:

[root@monitoring ~]# rrdtool dump /var/lib/centreon/metrics/761038.rrd | grep -v NaN
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rrd SYSTEM "http://oss.oetiker.ch/rrdtool/rrdtool.dtd">
<rrd>
        <version>0003</version>
        <step>1</step>
        <lastupdate>1767717430</lastupdate> <!-- 2026-01-06 17:37:10 CET -->
        <ds>
                <name> value </name>
                <type> GAUGE </type>
                <minimal_heartbeat>600</minimal_heartbeat>
                <last_ds>2.000000</last_ds>
                <unknown_sec> 0 </unknown_sec>
        </ds>
        <rra>
                <cf>AVERAGE</cf>
                <pdp_per_row>60</pdp_per_row>
                <params><xff>5.0000000000e-01</xff></params>
                <cdp_prep>
                        <ds>
                        <value>0.0000000000e+00</value>
                        <unknown_datapoints>10</unknown_datapoints>
                        </ds>
                </cdp_prep>
                <database>
                        <!-- Only 6 consecutive minutes have data -->
                        <!-- 2026-01-06 13:57:00 --> <row><v>2.0000000000e+00</v></row>
                        <!-- 2026-01-06 13:58:00 --> <row><v>2.0000000000e+00</v></row>
                        <!-- 2026-01-06 13:59:00 --> <row><v>2.0000000000e+00</v></row>
                        <!-- 2026-01-06 14:00:00 --> <row><v>2.0000000000e+00</v></row>
                        <!-- 2026-01-06 14:01:00 --> <row><v>2.0000000000e+00</v></row>
                        <!-- 2026-01-06 14:02:00 --> <row><v>2.0000000000e+00</v></row>
                </database>
        </rra>
</rrd>

Broker Log

The broker log (with metric IDs translated to names after the fact) confirms update attempts:

[2026-01-06T16:47:14.743+01:00] [rrd] [debug] RRD: new pb data for LR44AP006::4_Misc:_Greenbone-Security-Status::medium(761007) (time 1767713834)
[2026-01-06T16:47:14.743+01:00] [rrd] [debug] RRD: updating file '/var/lib/centreon/metrics/761007.rrd' (1767713834:0.000000) [LR44AP006::4_Misc:_Greenbone-Security-Status::medium]

Updates are logged successfully but data is not retained in the RRD.
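
To compare the logged update cadence against the heartbeat, the gaps between consecutive updates per RRD file can be pulled out of the debug log. This is only a sketch that assumes the "RRD: updating file" line format shown above; `update_gaps` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# update_gaps: reads broker debug log lines on stdin and prints
# "<rrd file> <seconds since previous update of that file>".
# Assumes the "RRD: updating file '<path>' (<epoch>:<value>)" format
# shown in the log excerpt; hypothetical helper, not a Centreon tool.
update_gaps() {
    sed -n "s/.*RRD: updating file '\([^']*\)' (\([0-9]*\):.*/\1 \2/p" |
        awk '{ if ($1 in last) print $1, $2 - last[$1]; last[$1] = $2 }'
}

# Usage (the log path is site-specific, so it is passed as an argument):
#   update_gaps < "$1"
```

Any gap larger than the file's minimal_heartbeat corresponds to an interval that rrdtool stores as unknown.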

Describe the expected result

RRD files should contain valid consolidated data points for each check interval, displaying continuous historical graphs matching the service's check frequency (e.g., hourly data points for 60-minute checks).


Workaround / Verification

Manually recreating the RRD file with increased minimal_heartbeat resolves the issue:

Metric ID 761039 (manually created with rrdtool create using heartbeat=36000):

[root@monitoring ~]# rrdtool info /var/lib/centreon/metrics/761039.rrd
filename = "/var/lib/centreon/metrics/761039.rrd"
rrd_version = "0003"
step = 1
last_update = 1767717430
ds[value].type = "GAUGE"
ds[value].minimal_heartbeat = 36000   # 10 hours tolerance
ds[value].last_ds = "9.800000"
ds[value].value = 0.0000000000e+00    # Valid current value present
rra[0].cf = "AVERAGE"
rra[0].pdp_per_row = 3600
rra[0].cdp_prep[0].unknown_datapoints = 0  # All data marked as valid

Data dump shows continuous hourly values:

[root@monitoring ~]# rrdtool dump /var/lib/centreon/metrics/761039.rrd | grep -v NaN
<?xml version="1.0" encoding="utf-8"?>
<rrd>
        <version>0003</version>
        <step>1</step>
        <lastupdate>1767717430</lastupdate>
        <ds>
                <name> value </name>
                <type> GAUGE </type>
                <minimal_heartbeat>36000</minimal_heartbeat>
                <last_ds>9.800000</last_ds>
                <value>0.0000000000e+00</value>
                <unknown_sec> 0 </unknown_sec>
        </ds>
        <rra>
                <cf>AVERAGE</cf>
                <pdp_per_row>3600</pdp_per_row>
                <cdp_prep>
                        <ds>
                        <primary_value>9.8000000000e+00</primary_value>
                        <secondary_value>9.8000000000e+00</secondary_value>
                        <value>2.1854000000e+04</value>
                        <unknown_datapoints>0</unknown_datapoints>
                        </ds>
                </cdp_prep>
                <database>
                        <!-- Continuous hourly data, no NaN gaps -->
                        <!-- 2026-01-06 15:00:00 --> <row><v>9.8000000000e+00</v></row>
                        <!-- 2026-01-06 16:00:00 --> <row><v>9.8000000000e+00</v></row>
                        <!-- 2026-01-06 17:00:00 --> <row><v>9.8000000000e+00</v></row>
                </database>
        </rra>
</rrd>

Result: Graphs display correctly with proper hourly resolution when heartbeat accommodates the check interval.
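
As an alternative to recreating files, rrdtool can change the heartbeat of an existing data source in place with `rrdtool tune --heartbeat`. The sketch below wraps that call; `tune_heartbeat` and the `DRY_RUN` switch are hypothetical names, and the 10800-second value is just 3× a 60-minute interval. Whether the broker keeps a tuned value when it later rebuilds or recreates a file has not been verified here, and tuning does not recover intervals already stored as NaN.

```shell
#!/usr/bin/env bash
# tune_heartbeat FILE SECONDS
# Sets the heartbeat of the data source named "value" (the DS name seen
# in the dumps above). With DRY_RUN=1 the command is only printed.
# Hypothetical wrapper; back up the RRD files before running it for real.
tune_heartbeat() {
    local file=$1 seconds=$2
    local cmd=(rrdtool tune "$file" --heartbeat "value:${seconds}")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Preview the change for two metric IDs mentioned in this report.
DRY_RUN=1
for id in 761038 761007; do
    tune_heartbeat "/var/lib/centreon/metrics/${id}.rrd" 10800
done
```

Running it with `DRY_RUN=1` first makes it easy to review the exact commands before touching production files.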


Additional relevant information

Affected Scope

  • Use cases: Passive vulnerability scans (Greenbone, Nessus), custom monitoring scripts, external passive data feeds
  • Frequency: Hundreds of affected metrics in production
  • Workaround: Manual RRD recreation required per metric (impractical at scale)

Configuration

  • Monitoring Engine Interval Length: 60 seconds (Administration > Parameters > Monitoring)
  • Service check intervals: 60-120 minutes (passive checks)
  • Both broker logs and database confirm correct data flow

Technical Context

The issue appears related to how the broker calculates minimal_heartbeat during RRD creation. Currently, RRDs are created with heartbeat=600 seconds regardless of the service's actual check interval. When checks run every 60+ minutes, rrdtool marks the gap between updates as "unknown," producing NaN values.
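
The heartbeat rule described above can be illustrated without touching any RRD file. The sketch below mirrors rrdtool's documented behavior (an interval between two updates that is longer than the heartbeat is stored as unknown); `classify_updates` is a hypothetical helper, and the timestamps are illustrative values spaced 3600 seconds apart, ending at the last_update shown earlier.

```shell
#!/usr/bin/env bash
# classify_updates HEARTBEAT TS1 TS2 ...
# For each pair of consecutive update timestamps, prints the gap in
# seconds and whether it exceeds the heartbeat ("unknown") or not
# ("known"). Hypothetical helper that only models the heartbeat rule.
classify_updates() {
    local heartbeat=$1; shift
    local prev=$1; shift
    local ts gap
    for ts in "$@"; do
        gap=$(( ts - prev ))
        if [ "$gap" -gt "$heartbeat" ]; then
            echo "$gap unknown"
        else
            echo "$gap known"
        fi
        prev=$ts
    done
}

# Hourly updates against the default heartbeat: every gap is unknown.
classify_updates 600 1767710230 1767713830 1767717430
# Same updates against the manually chosen heartbeat: every gap is known.
classify_updates 36000 1767710230 1767713830 1767717430
```

This matches the two files compared in this report: with heartbeat=600 the hourly updates never produce a known interval, while with heartbeat=36000 they all do.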

Possible locations requiring investigation:

  • RRD creation logic in broker (determining heartbeat calculation)
  • Metric event structure (whether check interval information is available)
  • Configuration options for heartbeat multiplier

Important note: The global "Interval Length" (60s) should remain unchanged to preserve high-resolution graphs for frequent checks (e.g., 1-minute ping checks). The issue specifically affects services whose check interval exceeds the current 600-second heartbeat.

Is the current minimal_heartbeat behavior (appears to be fixed at 600s) intended to support only high-frequency checks?
Should the heartbeat calculation consider the service's actual check interval, or is there a configuration option we're missing?
Are there performance implications of using larger heartbeat values (e.g., 2-6 hours) that should be considered?
