Skip to content

Add encode-tensogram action for Tensogram message format#260

Open
tlmquintino wants to merge 10 commits intodevelopfrom
feature/encode-tensogram
Open

Add encode-tensogram action for Tensogram message format#260
tlmquintino wants to merge 10 commits intodevelopfrom
feature/encode-tensogram

Conversation

@tlmquintino
Copy link
Copy Markdown
Member

@tlmquintino tlmquintino commented Apr 13, 2026

Summary

  • Add a new encode-tensogram pipeline action that encodes raw field data into the Tensogram N-dimensional tensor message format
  • MARS metadata is preserved on the output Message (for downstream routing) and embedded in the Tensogram payload under base[0].mars (for external tool interoperability)
  • Feature is gated behind ENABLE_TENSOGRAM=OFF (optional dependency)

New files

File Purpose
cmake/FindTensogram.cmake CMake Find module — locates tensogram headers and the Rust FFI static library
src/multio/action/encode-tensogram/EncodeTensogram.h Action class declaration (ChainedAction subclass)
src/multio/action/encode-tensogram/EncodeTensogram.cc Core implementation — metadata JSON builder, precision dispatch, tensogram::encode() call, factory self-registration
src/multio/action/encode-tensogram/CMakeLists.txt Builds multio-action-encode-tensogram as a SHARED plugin
tests/multio/action/encode-tensogram/test_multio_encode_tensogram.cc 6 unit tests (raw round-trip, simple_packing, szip, metadata preservation, routing metadata, flush passthrough)
tests/multio/action/encode-tensogram/testPlan-tensogram.yaml Test YAML plan
tests/multio/action/encode-tensogram/demo-plan.yaml End-to-end demo YAML (packed + raw output)

Modified files

File Change
CMakeLists.txt Added ecbuild_add_option(FEATURE TENSOGRAM DEFAULT OFF ...)
src/multio/action/CMakeLists.txt Added add_subdirectory(encode-tensogram)
src/multio/tools/CMakeLists.txt Link multio-feed against the plugin when HAVE_TENSOGRAM
tests/multio/action/CMakeLists.txt Added test subdirectory

YAML configuration

- type: encode-tensogram
  encoding: simple_packing   # "none" | "simple_packing" (default: simple_packing)
  compression: szip          # "none" | "szip" | "zstd" | "lz4" (default: szip)
  bits-per-value: 16         # default: 16
  decimal-scale-factor: 0    # default: 0
  filter: none               # "none" | "shuffle" (default: none)
  hash: xxh3                 # "xxh3" | "" (default: xxh3)

Build & test

# Pre-requisite: build tensogram (requires Rust toolchain)
cd tensogram && cargo build --release -p tensogram-ffi

# Configure multio with tensogram support
cmake -DENABLE_TENSOGRAM=ON -DTENSOGRAM_ROOT=/path/to/tensogram ...

# Run tests
ctest -R encode_tensogram

Verified

  • All 6 unit tests pass (raw lossless round-trip, simple_packing accuracy, szip compression, MARS metadata in tensogram payload, MARS metadata on output message, flush passthrough)
  • End-to-end demo: multio-feed test.grib --decode --plans=demo-plan.yaml produces valid .tgm files verified with tensogram validate, tensogram info, and the tensogram Python package
  • Packing accuracy: max absolute error of 23µm/s on 10m wind at 16 bits-per-value
  • Compression ratio: 2.6× (6.3KB raw → 2.4KB packed+szip)

🌈🌦️📖🚧 Documentation 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/multio/pull-requests/PR-260

Add a new pipeline action that encodes raw field data into the Tensogram
N-dimensional tensor message format (github.com/ecmwf/tensogram). The
action supports configurable encoding (none/simple_packing), compression
(none/szip/zstd/lz4), and integrity hashing (xxh3).

MARS metadata is preserved on the output Message for downstream routing
and embedded in the Tensogram payload under base[0].mars for external
tool interoperability.

New files:
- cmake/FindTensogram.cmake: CMake Find module for the tensogram library
- src/multio/action/encode-tensogram/: Action plugin (ChainedAction)
- tests/multio/action/encode-tensogram/: 6 unit tests + demo YAML plans

Defaults: simple_packing at 16 bits-per-value with szip compression.
The feature is gated behind ENABLE_TENSOGRAM=OFF (optional dependency).
@jameshawkes
Copy link
Copy Markdown

@claude[agent] this PR needs docs for the changes

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces an optional Tensogram encoding action to the Multio action pipeline, enabling Field messages (raw numeric arrays) to be serialized into the Tensogram N-dimensional tensor message format while preserving MARS metadata for routing and external interoperability.

Changes:

  • Added encode-tensogram action implementation and build integration behind the TENSOGRAM feature flag.
  • Added CMake find-module to locate Tensogram headers and the Rust FFI static library, and wired the plugin into builds/tools when enabled.
  • Added unit tests plus YAML plans for test and demo usage.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
CMakeLists.txt Adds build feature option for Tensogram support.
cmake/FindTensogram.cmake Locates Tensogram headers and FFI library; defines imported CMake targets.
src/multio/action/CMakeLists.txt Adds the new encode-tensogram action subdirectory.
src/multio/action/encode-tensogram/CMakeLists.txt Builds the encode-tensogram plugin when Tensogram is available.
src/multio/action/encode-tensogram/EncodeTensogram.h Declares the new action and its configurable parameters.
src/multio/action/encode-tensogram/EncodeTensogram.cc Implements metadata JSON construction, packing dispatch, and encoding call.
src/multio/tools/CMakeLists.txt Links multio-feed with the plugin when Tensogram is enabled.
tests/multio/action/CMakeLists.txt Registers the new test subdirectory.
tests/multio/action/encode-tensogram/CMakeLists.txt Adds unit test target under HAVE_TENSOGRAM.
tests/multio/action/encode-tensogram/test_multio_encode_tensogram.cc Adds unit tests for raw/simple_packing/szip + metadata preservation and passthrough.
tests/multio/action/encode-tensogram/testPlan-tensogram.yaml Adds a file-output test plan for the new action.
tests/multio/action/encode-tensogram/demo-plan.yaml Adds an end-to-end demo plan for packed and raw Tensogram output.
CLAUDE.md Adds agent guidance documentation (contains several typos and repo-specific mismatches).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +209 to +230
compression_{compConf.parsedConfig().getString("compression", "szip")},
hashAlgo_{compConf.parsedConfig().getString("hash", "xxh3")},
bitsPerValue_{static_cast<uint32_t>(compConf.parsedConfig().getInt("bits-per-value", 16))},
decimalScaleFactor_{static_cast<int32_t>(compConf.parsedConfig().getInt("decimal-scale-factor", 0))} {

// Validate configuration
if (encoding_ != "none" && encoding_ != "simple_packing") {
throw eckit::UserError(
"EncodeTensogram: unsupported encoding '" + encoding_ + "'. Must be 'none' or 'simple_packing'.", Here());
}
if (filter_ != "none" && filter_ != "shuffle") {
throw eckit::UserError("EncodeTensogram: unsupported filter '" + filter_ + "'. Must be 'none' or 'shuffle'.",
Here());
}
if (compression_ != "none" && compression_ != "szip" && compression_ != "zstd" && compression_ != "lz4") {
throw eckit::UserError("EncodeTensogram: unsupported compression '" + compression_
+ "'. Must be 'none', 'szip', 'zstd', or 'lz4'.",
Here());
}
if (encoding_ == "simple_packing" && bitsPerValue_ == 0) {
throw eckit::UserError("EncodeTensogram: bits-per-value must be > 0 for simple_packing encoding.", Here());
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bits-per-value is read via getInt() and immediately cast to uint32_t. If the YAML provides a negative value (e.g., -1), this will wrap to a huge unsigned value and bypass the bitsPerValue_ == 0 check. Validate the original signed value is within an acceptable range (e.g., >= 1 and <= 64 or whatever tensogram supports) before casting, and emit a UserError for out-of-range values.

Copilot uses AI. Check for mistakes.
Comment on lines +205 to +235
EncodeTensogram::EncodeTensogram(const ComponentConfiguration& compConf) :
ChainedAction{compConf},
encoding_{compConf.parsedConfig().getString("encoding", "simple_packing")},
filter_{compConf.parsedConfig().getString("filter", "none")},
compression_{compConf.parsedConfig().getString("compression", "szip")},
hashAlgo_{compConf.parsedConfig().getString("hash", "xxh3")},
bitsPerValue_{static_cast<uint32_t>(compConf.parsedConfig().getInt("bits-per-value", 16))},
decimalScaleFactor_{static_cast<int32_t>(compConf.parsedConfig().getInt("decimal-scale-factor", 0))} {

// Validate configuration
if (encoding_ != "none" && encoding_ != "simple_packing") {
throw eckit::UserError(
"EncodeTensogram: unsupported encoding '" + encoding_ + "'. Must be 'none' or 'simple_packing'.", Here());
}
if (filter_ != "none" && filter_ != "shuffle") {
throw eckit::UserError("EncodeTensogram: unsupported filter '" + filter_ + "'. Must be 'none' or 'shuffle'.",
Here());
}
if (compression_ != "none" && compression_ != "szip" && compression_ != "zstd" && compression_ != "lz4") {
throw eckit::UserError("EncodeTensogram: unsupported compression '" + compression_
+ "'. Must be 'none', 'szip', 'zstd', or 'lz4'.",
Here());
}
if (encoding_ == "simple_packing" && bitsPerValue_ == 0) {
throw eckit::UserError("EncodeTensogram: bits-per-value must be > 0 for simple_packing encoding.", Here());
}

LOG_DEBUG_LIB(LibMultio) << "EncodeTensogram: encoding=" << encoding_ << " filter=" << filter_
<< " compression=" << compression_ << " bits-per-value=" << bitsPerValue_
<< " decimal-scale-factor=" << decimalScaleFactor_ << " hash=" << hashAlgo_ << std::endl;
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config docstring in EncodeTensogram.h says hash is "xxh3" | "" to disable, but the constructor does not validate hashAlgo_. Consider validating allowed values up-front (or at least handling empty/unknown values with a clear UserError) so misconfiguration fails early with a helpful message.

Copilot uses AI. Check for mistakes.
Comment on lines +353 to +354
const auto* values = reinterpret_cast<const Precision*>(msg.payload().data());
const size_t numValues = msg.payload().size() / sizeof(Precision);
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeImpl() uses msg.globalSize() (metadata) for the encoded tensor shape but derives numValues from payload().size(). If these diverge (or if the payload size is not an exact multiple of sizeof(Precision)), the encoder will produce an inconsistent message or read past the intended data. Add explicit validation that payload.size() % sizeof(Precision) == 0 and that numValues == globalSize (and globalSize > 0) before encoding, with a user-facing error.

Suggested change
const auto* values = reinterpret_cast<const Precision*>(msg.payload().data());
const size_t numValues = msg.payload().size() / sizeof(Precision);
const size_t payloadBytes = msg.payload().size();
if (globalSize == 0) {
throw eckit::UserError("EncodeTensogram: Message has invalid globalSize=0 for tensor encoding", Here());
}
if (payloadBytes % sizeof(Precision) != 0) {
std::ostringstream oss;
oss << "EncodeTensogram: Payload size (" << payloadBytes << " bytes) is not a multiple of element size ("
<< sizeof(Precision) << " bytes) for the declared precision";
throw eckit::UserError(oss.str(), Here());
}
const size_t numValues = payloadBytes / sizeof(Precision);
if (numValues != globalSize) {
std::ostringstream oss;
oss << "EncodeTensogram: Payload element count (" << numValues
<< ") does not match message globalSize (" << globalSize << ")";
throw eckit::UserError(oss.str(), Here());
}
const auto* values = reinterpret_cast<const Precision*>(msg.payload().data());

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +86
/// Escape a string for JSON (handles backslash, double-quote, control characters).
std::string jsonEscape(const std::string& s) {
std::string out;
out.reserve(s.size() + 8);
for (char c : s) {
switch (c) {
case '"':
out += "\\\"";
break;
case '\\':
out += "\\\\";
break;
case '\n':
out += "\\n";
break;
case '\r':
out += "\\r";
break;
case '\t':
out += "\\t";
break;
default:
out += c;
break;
}
}
return out;
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says jsonEscape handles control characters, but it only escapes \n, \r, and \t (and not other control characters like \b, \f, or any byte < 0x20). Either extend the escaping to produce valid JSON for all control chars, or adjust the comment and/or use an existing JSON writer (e.g., eckit::JSON) to avoid subtle invalid JSON bugs.

Copilot uses AI. Check for mistakes.
for (size_t i = 0; i < fieldSize; ++i) {
maxError = std::max(maxError, std::abs(decoded[i] - data[i]));
}
// At 16 bits per value with a range of ~20K, tolerance is roughly range / 2^16
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment claims a value range of ~20K, but makeSyntheticField() produces values around 273.15 ± 10 (range ~20). Please correct the comment (or compute the tolerance based on the actual observed range) so the test rationale matches the data being generated.

Suggested change
// At 16 bits per value with a range of ~20K, tolerance is roughly range / 2^16
// At 16 bits per value with this synthetic field's range of ~20 (273.15 ± 10), 1.0 is a loose tolerance.

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md Outdated
- Use it to find definitions, references, and workspace symbols.


- IMPORTANT: when planing and before you do any work:
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in this guideline header: planingplanning.

Suggested change
- IMPORTANT: when planing and before you do any work:
- IMPORTANT: when planning and before you do any work:

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md Outdated
Comment on lines +27 to +34
- in C/C++ NEVER produce undefined behavior and never segfault or stop executiong without returning error or exceptions
- all documentation up-to-date with changes,
- running required formatter/lint/tests

- NOTE: when user asks for 'error handling' checks:
- verify no panic in rust code
- verify how errors are handled across-code base, all languages
- ensure all errors handled and reported correclty with enough information reaching users
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos in this line: executiongexecuting, correcltycorrectly.

Suggested change
- in C/C++ NEVER produce undefined behavior and never segfault or stop executiong without returning error or exceptions
- all documentation up-to-date with changes,
- running required formatter/lint/tests
- NOTE: when user asks for 'error handling' checks:
- verify no panic in rust code
- verify how errors are handled across-code base, all languages
- ensure all errors handled and reported correclty with enough information reaching users
- in C/C++ NEVER produce undefined behavior and never segfault or stop executing without returning error or exceptions
- all documentation up-to-date with changes,
- running required formatter/lint/tests
- NOTE: when user asks for 'error handling' checks:
- verify no panic in rust code
- verify how errors are handled across-code base, all languages
- ensure all errors handled and reported correctly with enough information reaching users

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md Outdated
- NEVER update MAJOR unless users says so.
- Increment MINOR for new features. MICRO for bugfixes and documentation updates.
- NEVER prepend git tag or releases with 'v'
- REMEBER on releases:
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: REMEBERREMEMBER.

Suggested change
- REMEBER on releases:
- REMEMBER on releases:

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md
Comment on lines +67 to +74
- NOTE: When user asks for 'make release' execute:
- check all changes are commited and pushed upstream
- final check everything builds, all languages and all tests pass
- all examples in all languages Rust, Python and C++ compile and run
- all docs build
- if any of the above fails STOP and prompt the user for action
- otherwise, proceed by check the latest version upstream and in VERSION file
- if needed, bump in VERSION file, commit and tag then push to upstream
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: commitedcommitted.

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md Outdated
Comment on lines +92 to +111
- Git project in github.com/ecmwf/tensogram
- IMPORTANT:
- versions are tagged using Semantic Versioning form 'MAJOR.MINOR.MICRO'
- NEVER update MAJOR unless users says so.
- Increment MINOR for new features. MICRO for bugfixes and documentation updates.
- NEVER prepend git tag or releases with 'v'
- REMEBER on releases:
- check all is commited and pushed upstream, otherwise STOP and warn user
- update the VERSION file
- git tag with version
- push and create release in github

- NOTE: SINGLE SOURCE OF TRUTH FOR VERSION — The `VERSION` file at the repo root is the
canonical version for the ENTIRE project. ALL version strings everywhere MUST match it.
When bumping the version (e.g. during a release), you MUST update ALL of these locations
to match the VERSION file:
- `VERSION` (the source of truth)
- `pyproject.toml`
- `CHANGELOG.md` (new release entry header)

Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section states the Git project is github.com/ecmwf/tensogram and includes release/versioning rules for that project. In this repository (Multio), that appears misleading and may confuse contributors; please either remove it or update it to reflect this repo’s actual versioning/release process and canonical version file (if any).

Suggested change
- Git project in github.com/ecmwf/tensogram
- IMPORTANT:
- versions are tagged using Semantic Versioning form 'MAJOR.MINOR.MICRO'
- NEVER update MAJOR unless users says so.
- Increment MINOR for new features. MICRO for bugfixes and documentation updates.
- NEVER prepend git tag or releases with 'v'
- REMEBER on releases:
- check all is commited and pushed upstream, otherwise STOP and warn user
- update the VERSION file
- git tag with version
- push and create release in github
- NOTE: SINGLE SOURCE OF TRUTH FOR VERSION — The `VERSION` file at the repo root is the
canonical version for the ENTIRE project. ALL version strings everywhere MUST match it.
When bumping the version (e.g. during a release), you MUST update ALL of these locations
to match the VERSION file:
- `VERSION` (the source of truth)
- `pyproject.toml`
- `CHANGELOG.md` (new release entry header)
- Use the versioning and release process defined by this repository only.
- Do not reference or follow release rules copied from other projects.
- Before changing any version, tag, or release metadata:
- inspect the files and automation present in this repo to determine the canonical source(s) of version information
- confirm the expected version format and release steps from this repo's existing documentation and tooling
- if the canonical version file or release workflow is unclear, stop and ask for clarification rather than guessing
- When preparing a release:
- ensure all intended changes are committed and pushed as required by the project's workflow
- update only the version-related files that this repository actually uses
- create tags/releases only according to this repository's documented conventions

Copilot uses AI. Check for mistakes.
Agent-Logs-Url: https://github.com/ecmwf/multio/sessions/561d25dd-44c3-4bfc-8164-1997b0d99e46

Co-authored-by: jameshawkes <11695815+jameshawkes@users.noreply.github.com>
@Claude
Copy link
Copy Markdown

Claude AI commented Apr 13, 2026

@claude[agent] this PR needs docs for the changes

Added comprehensive documentation for the encode-tensogram action in docs/content/processing-pipelines.rst, including:

  • Overview and use cases
  • Configuration options table with all parameters
  • Multiple example configurations (high compression, lossless, balanced)
  • Complete pipeline example
  • Usage examples for validation and inspection (CLI and Python)

Commit: 0984d96

@Claude Claude AI requested a review from jameshawkes April 13, 2026 13:51
Copy link
Copy Markdown
Member

@tweska tweska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have no idea what a tensogram is, but still have some comments.

I am mainly concerned we are implementing mostly a JSON string writer here, while that is already implemented in eckit. I believe this action can be made much more compact by using eckit::LocalConfiguration and the JSON writer in eckit.

Comment on lines +46 to +52
const std::vector<std::string> marsKeys = {
"class", "type", "stream", "expver",
"date", "time", "step",
"param", "paramId", "shortName",
"levtype", "levelist", "level",
"domain", "number",
};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a datamod record to extract this from the message metadata as we have done in other new actions?

Comment on lines +55 to +197
//----------------------------------------------------------------------------------------------------------------------
// JSON helpers — minimal escaping for string values
//----------------------------------------------------------------------------------------------------------------------

/// Escape a string for JSON (handles backslash, double-quote, control characters).
std::string jsonEscape(const std::string& s) {
std::string out;
out.reserve(s.size() + 8);
for (char c : s) {
switch (c) {
case '"':
out += "\\\"";
break;
case '\\':
out += "\\\\";
break;
case '\n':
out += "\\n";
break;
case '\r':
out += "\\r";
break;
case '\t':
out += "\\t";
break;
default:
out += c;
break;
}
}
return out;
}

/// Emit a single metadata value into JSON.
/// Dispatches on the MetadataValue type: string → quoted, numeric → raw, bool → true/false.
void emitJsonValue(std::ostringstream& os, const message::MetadataValue& val) {
val.visit(eckit::Overloaded{
[&](std::nullptr_t) { os << "null"; },
[&](bool v) { os << (v ? "true" : "false"); },
[&](std::int64_t v) { os << v; },
[&](double v) {
if (std::isfinite(v)) {
os << v;
}
else {
os << "null";
}
},
[&](const std::string& v) { os << '"' << jsonEscape(v) << '"'; },
// For vectors of integers (e.g., levelist), emit as a JSON array
[&](const std::vector<std::int64_t>& v) {
os << '[';
for (size_t i = 0; i < v.size(); ++i) {
if (i > 0)
os << ',';
os << v[i];
}
os << ']';
},
// For vectors of doubles
[&](const std::vector<double>& v) {
os << '[';
for (size_t i = 0; i < v.size(); ++i) {
if (i > 0)
os << ',';
os << v[i];
}
os << ']';
},
// For nested metadata, emit as nested JSON object
[&](const message::BaseMetadata& nested) {
os << '{';
bool first = true;
for (const auto& [key, v] : nested) {
if (!first)
os << ',';
first = false;
os << '"' << jsonEscape(key) << "\":";
emitJsonValue(os, v);
}
os << '}';
},
// Catch-all for types we don't handle (e.g., vector<string>)
[&](const auto&) { os << "null"; },
});
}

/// Build the "base[0]" metadata object from multio Metadata.
/// MARS keys go under "mars", everything else goes at the top level.
std::string buildBaseEntry(const message::Metadata& md) {
std::ostringstream marsJson;
std::ostringstream extraJson;
bool firstMars = true;
bool firstExtra = true;

// Collect known MARS keys → mars sub-object; everything else → top level
for (const auto& [key, val] : md) {
// The key type is PrehashedKey<std::string>; extract the string for comparisons
const std::string& keyStr = static_cast<const std::string&>(key);

// Skip internal/routing keys that are not meaningful in tensogram metadata
if (keyStr == "misc-globalSize" || keyStr == "misc-precision" || keyStr == "bitmapPresent"
|| keyStr == "missingValue" || keyStr == "encoder-overwrites" || keyStr == "globalSize"
|| keyStr == "precision") {
continue;
}

bool isMarsKey = std::find(marsKeys.begin(), marsKeys.end(), keyStr) != marsKeys.end();

if (isMarsKey) {
if (!firstMars)
marsJson << ',';
firstMars = false;
marsJson << '"' << jsonEscape(keyStr) << "\":";
emitJsonValue(marsJson, val);
}
else {
if (!firstExtra)
extraJson << ',';
firstExtra = false;
extraJson << '"' << jsonEscape(keyStr) << "\":";
emitJsonValue(extraJson, val);
}
}

std::ostringstream entry;
entry << '{';
bool needsComma = false;

if (!firstMars) {
entry << "\"mars\":{" << marsJson.str() << '}';
needsComma = true;
}

if (!firstExtra) {
if (needsComma)
entry << ',';
entry << extraJson.str();
}

entry << '}';
return entry.str();
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The majority of this action seems to be code to spit out a JSON string. Can't we just create an eckit::LocalConfiguraiton and use eckit to write it as JSON?

Comment on lines +241 to +287
std::string EncodeTensogram::buildEncodeJson(const message::Metadata& md, size_t globalSize, const std::string& dtype,
const std::string& byteOrder, double referenceValue,
int32_t binaryScaleFactor) const {
std::ostringstream json;
json << '{';

// Version (required)
json << "\"version\":2,";

// Descriptors array (one object)
json << "\"descriptors\":[{";
json << "\"type\":\"ndarray\",";
json << "\"ndim\":1,";
json << "\"shape\":[" << globalSize << "],";
// Strides: for float64 input to simple_packing, stride = 8 bytes (the input dtype)
json << "\"strides\":[8],";
json << "\"dtype\":\"" << dtype << "\",";
json << "\"byte_order\":\"" << byteOrder << "\",";
json << "\"encoding\":\"" << encoding_ << "\",";
json << "\"filter\":\"" << filter_ << "\",";
json << "\"compression\":\"" << compression_ << "\"";

// Szip compression parameters (required by the szip codec)
if (compression_ == "szip") {
// Reference Sample Interval: must divide the total number of packed values.
// 32 is the most common default (matches GRIB szip usage).
json << ",\"szip_rsi\":32";
json << ",\"szip_block_size\":8";
json << ",\"szip_flags\":4"; // EC (entropy coding)
}

// Simple packing parameters
if (encoding_ == "simple_packing") {
json << ",\"bits_per_value\":" << bitsPerValue_;
json << ",\"decimal_scale_factor\":" << decimalScaleFactor_;
json << ",\"reference_value\":" << referenceValue;
json << ",\"binary_scale_factor\":" << binaryScaleFactor;
}

json << "}],"; // end descriptors

// Base array (per-object metadata — MARS keys + extras)
json << "\"base\":[" << buildBaseEntry(md) << "]";

json << '}';
return json.str();
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would just create an eckit::LocalConfiguration and I believe you can use this https://github.com/ecmwf/eckit/blob/9daf0377f2fd0ddc2ee1bdd47c7064ea037a28c6/src/eckit/log/JSON.h to write it to string.

Comment on lines +342 to +347
// Determine byte order for this platform
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
const std::string byteOrder = "little";
#else
const std::string byteOrder = "big";
#endif
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More efficient and safer way to save this would be a boolean.

}

// Build JSON and encode
std::string json = buildEncodeJson(md, globalSize, "float64", byteOrder, referenceValue, binaryScaleFactor);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess the byteOrder is just injected into the JSON string here.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.28%. Comparing base (6178fe7) to head (f905d01).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #260   +/-   ##
========================================
  Coverage    56.28%   56.28%           
========================================
  Files          321      321           
  Lines        21188    21188           
  Branches      1688     1688           
========================================
  Hits         11925    11925           
  Misses        9263     9263           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Addresses all review comments from tweska and copilot:

- Replace hand-built JSON strings with eckit::JSON (tweska #2, #3, copilot #1)
  eckit::JSON handles all string escaping (including all control characters)
  and provides proper JSON formatting. Removed jsonEscape() and all manual
  string concatenation. Unified buildEncodeJson and buildEncodeJsonRaw into
  a single method.

- Use boolean for simple_packing check (tweska #4)
  Added useSimplePacking_ cached boolean instead of repeated string comparison.

- Validate bits-per-value range (copilot #2)
  Now validates signed value is in [1, 64] before casting to uint32_t,
  preventing silent wraparound on negative input.

- Validate hashAlgo_ (copilot #3)
  Constructor now rejects unknown hash algorithms with a clear UserError.

- Validate globalSize vs payload consistency (copilot #4)
  Added checks: globalSize > 0, payload aligned to element size, and
  element count matches globalSize.

- Fix test comment (copilot #5)
  Corrected range comment from ~20K to ~20 to match actual synthetic data.

- Fix CLAUDE.md (copilot #6-10)
  Fixed typos: planing->planning, executiong->executing, correclty->correctly,
  commited->committed, REMEBER->REMEMBER. Removed tensogram-specific
  versioning rules that don't belong in multio.
- Guard against negative globalSize from int64_t wrapping in static_cast<size_t>
  by checking globalSizeSigned <= 0 before the cast
- Add explicit #include "eckit/utils/Overloaded.h" (was only available via
  transitive include)
- Remove unused #include <algorithm> and #include <numeric>
- Fix YAML usage comments: multio-feed takes positional args, not --source=
- Remove uint32 implementation detail from user-facing bits-per-value comment
Error handling:
- Wrap tensogram::encode() calls in try-catch to add field context
  (globalSize, encoding) to exception messages — users can now identify
  which field failed encoding
- Wrap msg.globalSize() in try-catch to give a clear error when
  misc-globalSize metadata key is missing
- Validate decimal-scale-factor range against int32_t bounds before
  casting — prevents silent integer overflow (undefined behaviour)
- Handle NaN/Inf in vector<double> metadata values by emitting JSON
  null instead of invalid non-JSON tokens

Edge cases:
- Added #include <limits> for std::numeric_limits
New tests:
- Constructor: invalid encoding, invalid compression, invalid hash,
  bits-per-value=0, bits-per-value=65 (5 validation error tests)
- Edge cases: constant-value field (range=0), single-element field,
  hash disabled (empty string), metadata with extra typed values,
  message with no MARS keys (5 edge case tests)

Coverage now exercises:
- All constructor validation branches (encoding, filter, compression,
  hash, bits-per-value range)
- Constant field (simple_packing with zero range)
- Single-element tensor
- Hash disabled mode
- Empty MARS metadata (no mars sub-object in base[0])
- Mixed metadata value types (int64, string)
The decimal-scale-factor key name was split across two lines in the RST
table, causing an 'Inline literal start-string without end-string'
Sphinx warning that fails the CI doc build.
Widen first column from 20 to 24 characters to fit the longest key
name (decimal-scale-factor) without line wrapping, which was causing
Sphinx 'Malformed table' errors.
Comment on lines +47 to +56
const std::unordered_set<std::string> marsKeys = {
"class", "type", "stream", "expver", "date", "time", "step", "param",
"paramId", "shortName", "levtype", "levelist", "level", "domain", "number",
};

// Internal/routing keys that should be excluded from tensogram metadata
const std::unordered_set<std::string> skipKeys = {
"misc-globalSize", "misc-precision", "bitmapPresent", "missingValue",
"encoder-overwrites", "globalSize", "precision",
};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, use datamod like we do in the average-rate and encode-mtg2 actions to describe precisely the metadata that we process in this action. And only access metadata through a parsed datamod record in this new action.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that there are a lot of keys used here that shouldn't be used in new multio plans. For example MARS paramId and shortName are not handled in MTG2 actions. We only use param.

All the keys in skipKeys should get the misc- prefix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants