Metrics with NULL values are not processed correctly #178

@PhantomPhreak

Recently we migrated from graphite-web + whisper to graphite-web + graphite-clickhouse + ClickHouse schema, and we have run into issues with how NULL values are processed.

Whisper's *.wsp files have a predefined resolution:

aggregationMethod: average
maxRetention: 157680000
xFilesFactor: 0.5
fileSize: 12614440

Archive 0
offset: 40
secondsPerPoint: 60
points: 525600
retention: 31536000
size: 6307200

This means that every timestamp has a slot whose value is either NULL or an actual value, for example:

# whisper-fetch metric.wsp
1642054380      0.000977
1642054440      0.000977
1642054500      0.000977
1642054560      None
1642054620      0.000977
1642054680      0.000977
1642054740      None

In this case, graphite + whisper always returns a value for each datapoint, even if all of them are NULL. Functions such as transformNull() can then convert the NULLs into specific values and fill the gaps if necessary.
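
To illustrate why per-datapoint NULLs matter, here is a minimal Python sketch of what transformNull() does conceptually. It is not graphite-web's implementation; the function name transform_null and its signature are made up for the example.

```python
def transform_null(values, default=0.0):
    """Replace None entries with `default` (conceptual stand-in for transformNull()).

    Only slots that exist can be filled: an empty series stays empty.
    """
    return [default if v is None else v for v in values]


# A series with gaps, similar to the whisper-fetch output above.
series_with_gaps = [0.000977, None, 0.000977, None]
print(transform_null(series_with_gaps))  # [0.000977, 0.0, 0.000977, 0.0]

# A series with no datapoints at all gives the function nothing to replace.
print(transform_null([]))  # []
```

The second call shows the key point: if the backend returns no datapoints at all, there is nothing for the function to transform.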

graphite-clickhouse + ClickHouse behaves differently.
If at least one datapoint in the interval has a non-NULL value, that value is returned and the other datapoints are NULL. Below are the query result from ClickHouse and the same data as presented in Grafana (exported CSV):

SELECT
    Path,
    groupArray(Time),
    groupArray(Value),
    groupArray(Timestamp)
FROM graphite.data
PREWHERE (Date >= toDate(1642020420)) AND (Date <= toDate(1642023719))
WHERE (Path IN ('metric')) AND ((Time >= 1642020420) AND (Time <= 1642023719))
GROUP BY Path
FORMAT Vertical

Query id: 1eccd72b-9a8d-4825-b7f1-46bc21e5fde7

Row 1:
──────
Path:                  metric
groupArray(Time):      [1642022160]
groupArray(Value):     [0.0009765625]
groupArray(Timestamp): [1642022217]

1 rows in set. Elapsed: 0.020 sec. Processed 40.96 thousand rows, 3.68 MB (2.02 million rows/s., 181.21 MB/s.)

Grafana (exported CSV):

"Time","metric"
2022-01-13 03:47:00,
2022-01-13 03:48:00,
2022-01-13 03:49:00,
2022-01-13 03:50:00,
2022-01-13 03:51:00,
2022-01-13 03:52:00,
2022-01-13 03:53:00,
2022-01-13 03:54:00,
2022-01-13 03:55:00,
2022-01-13 03:56:00,
2022-01-13 03:57:00,
2022-01-13 03:58:00,
2022-01-13 03:59:00,
2022-01-13 04:00:00,
2022-01-13 04:01:00,
2022-01-13 04:02:00,
2022-01-13 04:03:00,
2022-01-13 04:04:00,
2022-01-13 04:05:00,
2022-01-13 04:06:00,
2022-01-13 04:07:00,
2022-01-13 04:08:00,
2022-01-13 04:09:00,
2022-01-13 04:10:00,
2022-01-13 04:11:00,
2022-01-13 04:12:00,
2022-01-13 04:13:00,
2022-01-13 04:14:00,
2022-01-13 04:15:00,
2022-01-13 04:16:00,0.00098
2022-01-13 04:17:00,
2022-01-13 04:18:00,
2022-01-13 04:19:00,
2022-01-13 04:20:00,
2022-01-13 04:21:00,
2022-01-13 04:22:00,
2022-01-13 04:23:00,
2022-01-13 04:24:00,
2022-01-13 04:25:00,
2022-01-13 04:26:00,
2022-01-13 04:27:00,
2022-01-13 04:28:00,
2022-01-13 04:29:00,
2022-01-13 04:30:00,
2022-01-13 04:31:00,
2022-01-13 04:32:00,
2022-01-13 04:33:00,
2022-01-13 04:34:00,
2022-01-13 04:35:00,
2022-01-13 04:36:00,
2022-01-13 04:37:00,
2022-01-13 04:38:00,
2022-01-13 04:39:00,
2022-01-13 04:40:00,
2022-01-13 04:41:00,

If the selected interval contains only NULL values, ClickHouse returns nothing to graphite-clickhouse, and graphite-web returns nothing to Grafana:

SELECT
    'metric' AS Path,
    groupArray(Time),
    groupArray(Value),
    groupArray(Timestamp)
FROM graphite.data
PREWHERE (Date >= toDate(1642020420)) AND (Date <= toDate(1642023719))
WHERE (Path IN ('metric')) AND ((Time >= 1642020420) AND (Time <= 1642023719))
GROUP BY Path
FORMAT Vertical

Query id: c1d5a623-0d5e-4826-9303-f5af306a7e2c

Ok.

0 rows in set. Elapsed: 0.012 sec. 

Grafana displays "No data" for this time period. However, "No data" and NULL values are completely different things, and the expected behavior in this case is to return NULL for all datapoints.

Real-world problem: we have a metric representing request processing quantiles. If there are no requests (night time), no quantile values are calculated. In the monitoring system that tracks these quantiles, we can replace NULL values with 0 using transformNull() and so distinguish a broken metric from a period when simply no requests were served. This logic is broken now, because the metric may have a run of NULL values that exceeds the selected time window (5 minutes in our case).

I suppose graphite-clickhouse could send NULLs for all datapoints even when ClickHouse returns an empty result, provided we're using a rollup config and therefore already know the metric's precision for the selected time interval.
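
A rough sketch of that idea in Python, for illustration only. This is not graphite-clickhouse code: the function name fill_nulls, its signature, and the alignment rule (slots on multiples of the step, starting at the first boundary at or after the range start) are all assumptions.

```python
def fill_nulls(points, start, stop, step):
    """Return one (timestamp, value) pair per step-aligned slot in [start, stop].

    points: iterable of (timestamp, value) rows from the backend (may be empty).
    step:   rollup precision in seconds, assumed known from the rollup config.
    Slots with no row get None, so downstream functions still see a datapoint.
    """
    # Index the returned rows by the slot they fall into.
    by_slot = {ts - ts % step: value for ts, value in points}
    # Round the range start up to the nearest step boundary.
    first = -(-start // step) * step
    return [(ts, by_slot.get(ts)) for ts in range(first, stop + 1, step)]


# Empty backend result for the interval used in the queries above, 60 s precision.
empty = fill_nulls([], 1642020420, 1642023719, 60)
print(len(empty), all(v is None for _, v in empty))  # 55 True

# A single row: only its slot carries a value, the rest are None.
one = fill_nulls([(1642022160, 0.0009765625)], 1642020420, 1642023719, 60)
print(one[29])  # (1642022160, 0.0009765625)
```

With a grid like this, an interval containing only NULLs would still produce 55 datapoints for the queried range, and transformNull() could act on them.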

We're using "dummy" rollup configuration for now:

SELECT *
FROM system.graphite_retentions

Query id: ce6cd594-5498-47fd-a3a4-d54adade8d1f

┌─config_name─────┬─regexp─┬─function─┬─age─┬─precision─┬─priority─┬─is_default─┬─Tables.database─┬─Tables.table───┐
│ graphite_rollup │        │ any      │   0 │        60 │    65535 │          1 │ ['graphite']    │ ['data_local'] │
└─────────────────┴────────┴──────────┴─────┴───────────┴──────────┴────────────┴─────────────────┴────────────────┘

1 rows in set. Elapsed: 1.548 sec.

Or, if there is any other workaround, I'd be happy to know.

And thank you for the amazing go-graphite stack :)
