Merged

Dev #66

2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -66,7 +66,7 @@ Data is accessed via hierarchical selectors:
Config file structure (see `configs/config.json`):
- `main` - Server address, TLS certs, JWT public key, user/group for privilege drop
- `metrics` - Per-metric frequency and aggregation strategy (sum/avg/null)
- `metric-store` - Checkpoints, memory cap, retention, cleanup mode, NATS subscriptions
- `metric-store` - Checkpoints (`file-format` "wal"(default)/"json", `directory`, `max-wal-size`), `checkpoint-interval`, `memory-cap` (GB), `retention-in-memory`, `num-workers`, cleanup (`mode`, `directory`), `nats-subscriptions`
- `nats` - Optional NATS connection for receiving metrics

## Test JWT
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
TARGET = ./cc-metric-store
VAR = ./var/checkpoints/
VERSION = 1.5.0
VERSION = 1.5.3
GIT_HASH := $(shell git rev-parse --short HEAD || echo 'development')
CURRENT_TIME = $(shell date +"%Y-%m-%d:T%H:%M:%S")
LD_FLAGS = '-s -X main.date=${CURRENT_TIME} -X main.version=${VERSION} -X main.commit=${GIT_HASH}'
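
For context on the `-X` entries in `LD_FLAGS`: the Go linker overwrites package-level string variables named on the command line. The following is a minimal sketch of the receiving side, assuming `main` declares `version`, `commit`, and `date` as plain strings; the default values and the `versionString` helper are hypothetical and not taken from this repository.

```go
package main

import "fmt"

// These package-level strings are overwritten at link time by
// -ldflags "-X main.version=... -X main.commit=... -X main.date=...".
// The defaults are assumptions for a plain `go build` without those flags.
var (
	version = "development"
	commit  = "unknown"
	date    = "unknown"
)

// versionString (hypothetical helper) formats the build metadata for display.
func versionString() string {
	return fmt.Sprintf("Version:\t%s\nGit hash:\t%s\nBuild time:\t%s", version, commit, date)
}

func main() {
	fmt.Println(versionString())
}
```

Only string variables can be set this way, which is why the version bump in the Makefile is sufficient to change what `-version` reports.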
20 changes: 9 additions & 11 deletions README.md
@@ -39,6 +39,7 @@ It supports the following targets:
./cc-metric-store -logdate # Add date and time to log messages
./cc-metric-store -version # Show version information and exit
./cc-metric-store -gops # Enable gops agent for debugging
./cc-metric-store -cleanup-checkpoints # Delete/archive old checkpoints per retention settings, then exit
```

## REST API Endpoints
@@ -174,9 +175,9 @@ Per-metric configuration. Each key is the metric name:
}
```

- `checkpoints.file-format`: Checkpoint format: `"json"` (default, human-readable) or `"wal"` (binary WAL, crash-safe). See [Checkpoint formats](#checkpoint-formats) below.
- `checkpoints.file-format`: Checkpoint format: `"wal"` (default, binary WAL, crash-safe) or `"json"` (human-readable). See [Checkpoint formats](#checkpoint-formats) below.
- `checkpoints.directory`: Root directory for checkpoint files (organized as `<dir>/<cluster>/<host>/`)
- `memory-cap`: Approximate memory cap in MB for metric buffers
- `memory-cap`: Memory cap in GB for metric buffers
- `retention-in-memory`: How long to keep data in memory (e.g. `"48h"`)
- `num-workers`: Number of parallel workers for checkpoint/archive I/O (0 = auto, capped at 10)
- `cleanup.mode`: What to do with data older than `retention-in-memory`: `"archive"` (write Parquet) or `"delete"`
@@ -187,12 +188,7 @@ Per-metric configuration. Each key is the metric name:

The `checkpoints.file-format` field controls how in-memory data is persisted to disk.

**`"json"` (default)** — human-readable JSON snapshots written periodically. Each
snapshot is stored as `<dir>/<cluster>/<host>/<timestamp>.json` and contains the
full metric hierarchy. Easy to inspect and recover manually, but larger on disk
and slower to write.

**`"wal"`** — binary Write-Ahead Log format designed for crash safety. Two file
**`"wal"` (default)** — binary Write-Ahead Log format designed for crash safety. Two file
types are used per host:

- `current.wal` — append-only binary log. Every incoming data point is appended
@@ -206,9 +202,11 @@ On startup the most recent `.bin` snapshot is loaded, then any remaining WAL
entries are replayed on top. The WAL is rotated (old file deleted, new one
started) after each successful snapshot.

The `"wal"` option is the default and will be the only supported option in the
future. The `"json"` checkpoint format is still provided to migrate from
previous cc-metric-store version.
**`"json"`** — human-readable JSON snapshots written periodically. Each snapshot
is stored as `<dir>/<cluster>/<host>/<timestamp>.json` and contains the full
metric hierarchy. Easy to inspect and recover manually, but larger on disk and
slower to write. Still provided to migrate from previous installations; `"wal"`
will be the only supported format in a future release.
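
The startup sequence described for the `"wal"` format (load the latest snapshot, then replay the remaining WAL entries on top) can be sketched as follows. This is an illustration only: the `point` type, the `recoverState` function, and the rule of skipping entries not newer than the snapshot are assumptions, and the real binary file layout in `cc-backend/pkg/metricstore` is not modelled here.

```go
package main

import "fmt"

// point is a hypothetical record: one value of one metric at one timestamp.
type point struct {
	Metric string
	TS     int64
	Value  float64
}

// recoverState rebuilds in-memory state from a snapshot plus WAL entries.
// Assumption: entries at or before snapshotTS are already contained in the
// snapshot, so they are skipped to avoid duplicates after a crash that
// happened between writing the snapshot and rotating the WAL.
func recoverState(snapshot map[string][]point, snapshotTS int64, wal []point) map[string][]point {
	state := make(map[string][]point, len(snapshot))
	for metric, pts := range snapshot {
		state[metric] = append([]point(nil), pts...) // copy, keep the snapshot untouched
	}
	for _, p := range wal {
		if p.TS <= snapshotTS {
			continue
		}
		state[p.Metric] = append(state[p.Metric], p)
	}
	return state
}

func main() {
	snapshot := map[string][]point{
		"cpu_load": {{"cpu_load", 100, 1.0}, {"cpu_load", 160, 1.5}},
	}
	wal := []point{
		{"cpu_load", 160, 1.5}, // already covered by the snapshot
		{"cpu_load", 220, 2.0},
		{"mem_used", 220, 42.0},
	}
	state := recoverState(snapshot, 160, wal)
	fmt.Println("cpu_load points:", len(state["cpu_load"]), "mem_used points:", len(state["mem_used"]))
}
```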

### Parquet archive

63 changes: 41 additions & 22 deletions ReleaseNotes.md
@@ -1,11 +1,49 @@
# `cc-metric-store` version 1.5.0
# `cc-metric-store` version 1.5.3

This is a major release of `cc-metric-store`, the metric timeseries cache
This is a bugfix release of `cc-metric-store`, the metric timeseries cache
implementation of ClusterCockpit. Since the storage engine is now part of
`cc-backend` we will follow the version number of `cc-backend`.
For release-specific notes visit the [ClusterCockpit Documentation](https://clusterockpit.org/docs/release/).

## Breaking changes
## Notable changes

- **`-cleanup-checkpoints` CLI flag**: New flag triggers checkpoint cleanup
(delete or archive to Parquet) based on the configured retention and cleanup
settings, then exits. Useful for one-off maintenance without starting the full
server.
- **GC initialised before checkpoint load**: `GOGC=15` is now set before
`metricstore.Init` so the garbage-collector baseline is established prior to
the largest allocation event (loading checkpoints from disk), reducing
unnecessary heap growth at startup.
- **Dependency upgrades**: `cc-backend` updated from v1.5.0 to v1.5.3;
`cc-lib` updated from v2.8.0 to v2.11.0; `nats.go` bumped from v1.49.0 to
v1.50.0; `parquet-go` bumped from v0.28.0 to v0.29.0; various other module
upgrades.

## Metricstore package fixes (cc-backend v1.5.0 → v1.5.3)

The following fixes landed in the upstream `cc-backend/pkg/metricstore` package
and are included via the dependency upgrade:

- **WAL correctness**: Fixed WAL rotation being skipped for all nodes due to a
non-blocking send on a too-small channel; fixed unbounded growth of WAL files
when a checkpointing error occurs; fixed bugs in the WAL journal pipeline.
- **WAL throughput**: Sharded the WAL consumer for higher write throughput; added
buffered I/O to WAL writes.
- **Checkpoint stability**: Paused WAL writes during binary checkpoint creation
to prevent message drops; restructured cleanup archiving to stay within the
32 k row limit of `parquet-go`.
- **Memory**: Fixed a memory explosion caused by broken emergency-free and batch
aborts; reduced memory usage in the Parquet checkpoint archiver; prevented
memory spikes in the Parquet writer during the move/archive policy.
- **NATS**: Fixed blocking `ReceiveNats` call; fixed NATS contention under load.
- **Shutdown**: Increased shutdown timeouts; added WAL flush interval tuning;
added shutdown timing logs.
- **Observability**: Added verbose logs for `DataDoesNotAlign` errors; reduced
noise by demoting a missing-metric warning to debug level.
- **Configuration**: Restored `checkpointInterval` as an optional config key.
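
The WAL rotation bug class mentioned above (a non-blocking send on a too-small channel) can be reproduced in isolation. The sketch below is not the upstream code: `trySend`, `requestRotation`, and the host names are hypothetical, and no consumer is running, which is the worst case for a small buffer.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether the value was queued.
func trySend(ch chan<- string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		// Buffer full and no receiver ready: the request is silently lost
		// unless the caller checks the return value.
		return false
	}
}

// requestRotation (hypothetical) queues one rotation request per host and
// returns the hosts whose request was dropped.
func requestRotation(ch chan<- string, hosts []string) (dropped []string) {
	for _, h := range hosts {
		if !trySend(ch, h) {
			dropped = append(dropped, h)
		}
	}
	return dropped
}

func main() {
	hosts := []string{"node01", "node02", "node03", "node04"}

	small := make(chan string, 1)
	fmt.Println("dropped with capacity 1:", requestRotation(small, hosts))

	sized := make(chan string, len(hosts))
	fmt.Println("dropped with capacity 4:", requestRotation(sized, hosts))
}
```

Sizing the buffer to the number of pending requests, or using a blocking send with a timeout, are two common ways to avoid the silent drop; which one upstream chose is not stated in these notes.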

## Breaking changes (from v1.4.x)

- The internal `memorystore`, `avro`, `resampler`, and `util` packages have been
removed. The storage engine is now provided by the
@@ -14,22 +52,3 @@ For release specific notes visit the [ClusterCockpit Documentation](https://clus
only.
- The configuration schema has changed. Refer to `configs/config.json` for the
updated structure.

## Notable changes

- **Storage engine extracted to `cc-backend` library**: The entire in-memory
time-series storage engine was moved to `cc-backend/pkg/metricstore`. This
reduces duplication in the ClusterCockpit suite and enables shared maintenance
of the storage layer.
- **HealthCheck API endpoint**: New `GET /api/healthcheck/` endpoint reports the
health status of cluster nodes.
- **Dynamic memory management**: Memory limits can now be adjusted at runtime via
a callback from the `cc-backend` library.
- **Configuration schema validation**: The config and metric config JSON schemas
have been updated and are now validated against the structs they describe.
- **Startup refactored**: Application startup has been split into `cli.go` and
`server.go` for clearer separation of concerns.
- **`go fix` applied**: Codebase updated to current Go idioms.
- **Dependency upgrades**: `nats.go` bumped from 1.36.0 to 1.47.0;
`cc-lib` updated to v2.8.0; `cc-backend` updated to v1.5.0; various other
module upgrades.
58 changes: 47 additions & 11 deletions cmd/cc-metric-store/main.go
@@ -7,13 +7,15 @@ package main

import (
"context"
"encoding/json"
"flag"
"fmt"
"os"
"os/signal"
"runtime/debug"
"sync"
"syscall"
"time"

"github.com/ClusterCockpit/cc-backend/pkg/metricstore"
ccconf "github.com/ClusterCockpit/cc-lib/v2/ccConfig"
@@ -36,8 +38,8 @@ var (
)

var (
flagGops, flagVersion, flagDev, flagLogDateTime bool
flagConfigFile, flagLogLevel string
flagGops, flagVersion, flagDev, flagLogDateTime, flagCleanupCheckpoints bool
flagConfigFile, flagLogLevel string
)

func printVersion() {
@@ -60,13 +62,14 @@ func runServer(ctx context.Context) error {
return fmt.Errorf("missing metricstore configuration")
}

metricstore.Init(mscfg, config.GetMetrics(), &wg)

// Set GC percent if not configured
// Set GC percent before loading checkpoints so the GC baseline is established
// with a low target from the start of the largest allocation event.
if os.Getenv(envGOGC) == "" {
debug.SetGCPercent(15)
}

metricstore.Init(mscfg, config.GetMetrics(), &wg)

if config.Keys.BackendURL != "" {
ms := metricstore.GetMemoryStore()
ms.SetNodeProvider(api.NewBackendNodeProvider(config.Keys.BackendURL))
@@ -127,6 +130,7 @@ func run() error {
flag.BoolVar(&flagDev, "dev", false, "Enable development component: Swagger UI")
flag.BoolVar(&flagVersion, "version", false, "Show version information and exit")
flag.BoolVar(&flagLogDateTime, "logdate", false, "Set this flag to add date and time to log messages")
flag.BoolVar(&flagCleanupCheckpoints, "cleanup-checkpoints", false, "Clean up old checkpoint files (delete or archive) based on retention settings, then exit")
flag.StringVar(&flagConfigFile, "config", "./config.json", "Specify alternative path to `config.json`")
flag.StringVar(&flagLogLevel, "loglevel", "warn", "Sets the logging level: `[debug, info, warn (default), err, crit]`")
flag.Parse()
@@ -138,12 +142,6 @@ func run() error {

cclog.Init(flagLogLevel, flagLogDateTime)

if flagGops || config.Keys.Debug.EnableGops {
if err := agent.Listen(agent.Options{}); err != nil {
return fmt.Errorf("starting gops agent: %w", err)
}
}

ccconf.Init(flagConfigFile)

cfg := ccconf.GetPackageConfig("main")
@@ -153,6 +151,44 @@ func run() error {

config.Init(cfg)

if flagGops || config.Keys.Debug.EnableGops {
	if err := agent.Listen(agent.Options{}); err != nil {
		return fmt.Errorf("starting gops agent: %w", err)
	}
}

if flagCleanupCheckpoints {
	mscfg := ccconf.GetPackageConfig("metric-store")
	if mscfg == nil {
		return fmt.Errorf("metric-store configuration required for checkpoint cleanup")
	}
	if err := json.Unmarshal(mscfg, &metricstore.Keys); err != nil {
		return fmt.Errorf("decoding metric-store config: %w", err)
	}
	d, err := time.ParseDuration(metricstore.Keys.RetentionInMemory)
	if err != nil {
		return fmt.Errorf("parsing retention-in-memory: %w", err)
	}
	from := time.Now().Add(-d)
	deleteMode := metricstore.Keys.Cleanup == nil || metricstore.Keys.Cleanup.Mode != "archive"
	cleanupDir := ""
	if !deleteMode {
		cleanupDir = metricstore.Keys.Cleanup.RootDir
	}
	cclog.Infof("Cleaning up checkpoints older than %s...", from.Format(time.RFC3339))
	n, err := metricstore.CleanupCheckpoints(
		metricstore.Keys.Checkpoints.RootDir, cleanupDir, from.Unix(), deleteMode)
	if err != nil {
		return fmt.Errorf("checkpoint cleanup: %w", err)
	}
	if deleteMode {
		cclog.Printf("Cleanup done: %d checkpoint files deleted.", n)
	} else {
		cclog.Printf("Cleanup done: %d checkpoint files archived to parquet.", n)
	}
	return nil
}

natsConfig := ccconf.GetPackageConfig("nats")
if err := nats.Init(natsConfig); err != nil {
cclog.Warnf("initializing (optional) nats client: %s", err.Error())
4 changes: 2 additions & 2 deletions configs/config.json
@@ -12,14 +12,14 @@
},
"metric-store": {
"checkpoints": {
"interval": "12h",
"file-format": "wal",
"directory": "./var/checkpoints"
},
"checkpoint-interval": "12h",
"memory-cap": 100,
"retention-in-memory": "48h",
"cleanup": {
"mode": "archive",
"interval": "48h",
"directory": "./var/archive"
},
"nats-subscriptions": [
66 changes: 33 additions & 33 deletions go.mod
@@ -3,8 +3,8 @@ module github.com/ClusterCockpit/cc-metric-store
go 1.25.0

require (
github.com/ClusterCockpit/cc-backend v1.5.0
github.com/ClusterCockpit/cc-lib/v2 v2.8.0
github.com/ClusterCockpit/cc-backend v1.5.3
github.com/ClusterCockpit/cc-lib/v2 v2.11.0
github.com/ClusterCockpit/cc-line-protocol/v2 v2.4.0
github.com/golang-jwt/jwt/v4 v4.5.2
github.com/google/gops v0.3.29
Expand All @@ -15,25 +15,25 @@ require (

require (
github.com/KyleBanks/depth v1.2.1 // indirect
github.com/andybalholm/brotli v1.2.0 // indirect
github.com/aws/aws-sdk-go-v2 v1.41.3 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.6 // indirect
github.com/aws/aws-sdk-go-v2/config v1.32.11 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.19.11 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.19 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.19 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.19 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.5 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.20 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.6 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.11 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.19 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.19 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.96.4 // indirect
github.com/aws/aws-sdk-go-v2/service/signin v1.0.7 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.30.12 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.16 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.41.8 // indirect
github.com/andybalholm/brotli v1.2.1 // indirect
github.com/aws/aws-sdk-go-v2 v1.41.5 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.8 // indirect
github.com/aws/aws-sdk-go-v2/config v1.32.13 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.19.13 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.21 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.21 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.7.21 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.6 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.4.22 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.13.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.9.13 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.13.21 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.19.21 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.98.0 // indirect
github.com/aws/aws-sdk-go-v2/service/signin v1.0.9 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.30.14 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.18 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.41.10 // indirect
github.com/aws/smithy-go v1.24.2 // indirect
github.com/cpuguy83/go-md2man/v2 v2.0.7 // indirect
github.com/fsnotify/fsnotify v1.9.0 // indirect
@@ -48,28 +48,28 @@ require (
github.com/go-openapi/swag/typeutils v0.25.5 // indirect
github.com/go-openapi/swag/yamlutils v0.25.5 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/klauspost/compress v1.18.4 // indirect
github.com/mattn/go-sqlite3 v1.14.34 // indirect
github.com/nats-io/nats.go v1.49.0 // indirect
github.com/klauspost/compress v1.18.5 // indirect
github.com/mattn/go-sqlite3 v1.14.38 // indirect
github.com/nats-io/nats.go v1.50.0 // indirect
github.com/nats-io/nkeys v0.4.15 // indirect
github.com/nats-io/nuid v1.0.1 // indirect
github.com/parquet-go/bitpack v1.0.0 // indirect
github.com/parquet-go/jsonlite v1.4.0 // indirect
github.com/parquet-go/parquet-go v0.28.0 // indirect
github.com/parquet-go/jsonlite v1.5.0 // indirect
github.com/parquet-go/parquet-go v0.29.0 // indirect
github.com/pierrec/lz4/v4 v4.1.26 // indirect
github.com/russross/blackfriday/v2 v2.1.0 // indirect
github.com/swaggo/files/v2 v2.0.2 // indirect
github.com/twpayne/go-geom v1.6.1 // indirect
github.com/urfave/cli/v2 v2.27.7 // indirect
github.com/xrash/smetrics v0.0.0-20250705151800-55b8f293f342 // indirect
go.yaml.in/yaml/v2 v2.4.3 // indirect
go.yaml.in/yaml/v2 v2.4.4 // indirect
go.yaml.in/yaml/v3 v3.0.4 // indirect
golang.org/x/crypto v0.48.0 // indirect
golang.org/x/mod v0.33.0 // indirect
golang.org/x/sync v0.19.0 // indirect
golang.org/x/sys v0.41.0 // indirect
golang.org/x/text v0.34.0 // indirect
golang.org/x/tools v0.42.0 // indirect
golang.org/x/crypto v0.49.0 // indirect
golang.org/x/mod v0.34.0 // indirect
golang.org/x/sync v0.20.0 // indirect
golang.org/x/sys v0.42.0 // indirect
golang.org/x/text v0.35.0 // indirect
golang.org/x/tools v0.43.0 // indirect
google.golang.org/protobuf v1.36.11 // indirect
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c // indirect
sigs.k8s.io/yaml v1.6.0 // indirect