Add encode-tensogram action for Tensogram message format#260
Add encode-tensogram action for Tensogram message format#260tlmquintino wants to merge 10 commits intodevelopfrom
Conversation
Add a new pipeline action that encodes raw field data into the Tensogram N-dimensional tensor message format (github.com/ecmwf/tensogram). The action supports configurable encoding (none/simple_packing), compression (none/szip/zstd/lz4), and integrity hashing (xxh3). MARS metadata is preserved on the output Message for downstream routing and embedded in the Tensogram payload under base[0].mars for external tool interoperability. New files: - cmake/FindTensogram.cmake: CMake Find module for the tensogram library - src/multio/action/encode-tensogram/: Action plugin (ChainedAction) - tests/multio/action/encode-tensogram/: 6 unit tests + demo YAML plans Defaults: simple_packing at 16 bits-per-value with szip compression. The feature is gated behind ENABLE_TENSOGRAM=OFF (optional dependency).
|
@claude[agent] this PR needs docs for the changes |
There was a problem hiding this comment.
Pull request overview
This pull request introduces an optional Tensogram encoding action to the Multio action pipeline, enabling Field messages (raw numeric arrays) to be serialized into the Tensogram N-dimensional tensor message format while preserving MARS metadata for routing and external interoperability.
Changes:
- Added
encode-tensogramaction implementation and build integration behind theTENSOGRAMfeature flag. - Added CMake find-module to locate Tensogram headers and the Rust FFI static library, and wired the plugin into builds/tools when enabled.
- Added unit tests plus YAML plans for test and demo usage.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
CMakeLists.txt |
Adds build feature option for Tensogram support. |
cmake/FindTensogram.cmake |
Locates Tensogram headers and FFI library; defines imported CMake targets. |
src/multio/action/CMakeLists.txt |
Adds the new encode-tensogram action subdirectory. |
src/multio/action/encode-tensogram/CMakeLists.txt |
Builds the encode-tensogram plugin when Tensogram is available. |
src/multio/action/encode-tensogram/EncodeTensogram.h |
Declares the new action and its configurable parameters. |
src/multio/action/encode-tensogram/EncodeTensogram.cc |
Implements metadata JSON construction, packing dispatch, and encoding call. |
src/multio/tools/CMakeLists.txt |
Links multio-feed with the plugin when Tensogram is enabled. |
tests/multio/action/CMakeLists.txt |
Registers the new test subdirectory. |
tests/multio/action/encode-tensogram/CMakeLists.txt |
Adds unit test target under HAVE_TENSOGRAM. |
tests/multio/action/encode-tensogram/test_multio_encode_tensogram.cc |
Adds unit tests for raw/simple_packing/szip + metadata preservation and passthrough. |
tests/multio/action/encode-tensogram/testPlan-tensogram.yaml |
Adds a file-output test plan for the new action. |
tests/multio/action/encode-tensogram/demo-plan.yaml |
Adds an end-to-end demo plan for packed and raw Tensogram output. |
CLAUDE.md |
Adds agent guidance documentation (contains several typos and repo-specific mismatches). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| compression_{compConf.parsedConfig().getString("compression", "szip")}, | ||
| hashAlgo_{compConf.parsedConfig().getString("hash", "xxh3")}, | ||
| bitsPerValue_{static_cast<uint32_t>(compConf.parsedConfig().getInt("bits-per-value", 16))}, | ||
| decimalScaleFactor_{static_cast<int32_t>(compConf.parsedConfig().getInt("decimal-scale-factor", 0))} { | ||
|
|
||
| // Validate configuration | ||
| if (encoding_ != "none" && encoding_ != "simple_packing") { | ||
| throw eckit::UserError( | ||
| "EncodeTensogram: unsupported encoding '" + encoding_ + "'. Must be 'none' or 'simple_packing'.", Here()); | ||
| } | ||
| if (filter_ != "none" && filter_ != "shuffle") { | ||
| throw eckit::UserError("EncodeTensogram: unsupported filter '" + filter_ + "'. Must be 'none' or 'shuffle'.", | ||
| Here()); | ||
| } | ||
| if (compression_ != "none" && compression_ != "szip" && compression_ != "zstd" && compression_ != "lz4") { | ||
| throw eckit::UserError("EncodeTensogram: unsupported compression '" + compression_ | ||
| + "'. Must be 'none', 'szip', 'zstd', or 'lz4'.", | ||
| Here()); | ||
| } | ||
| if (encoding_ == "simple_packing" && bitsPerValue_ == 0) { | ||
| throw eckit::UserError("EncodeTensogram: bits-per-value must be > 0 for simple_packing encoding.", Here()); | ||
| } |
There was a problem hiding this comment.
bits-per-value is read via getInt() and immediately cast to uint32_t. If the YAML provides a negative value (e.g., -1), this will wrap to a huge unsigned value and bypass the bitsPerValue_ == 0 check. Validate the original signed value is within an acceptable range (e.g., >= 1 and <= 64 or whatever tensogram supports) before casting, and emit a UserError for out-of-range values.
| EncodeTensogram::EncodeTensogram(const ComponentConfiguration& compConf) : | ||
| ChainedAction{compConf}, | ||
| encoding_{compConf.parsedConfig().getString("encoding", "simple_packing")}, | ||
| filter_{compConf.parsedConfig().getString("filter", "none")}, | ||
| compression_{compConf.parsedConfig().getString("compression", "szip")}, | ||
| hashAlgo_{compConf.parsedConfig().getString("hash", "xxh3")}, | ||
| bitsPerValue_{static_cast<uint32_t>(compConf.parsedConfig().getInt("bits-per-value", 16))}, | ||
| decimalScaleFactor_{static_cast<int32_t>(compConf.parsedConfig().getInt("decimal-scale-factor", 0))} { | ||
|
|
||
| // Validate configuration | ||
| if (encoding_ != "none" && encoding_ != "simple_packing") { | ||
| throw eckit::UserError( | ||
| "EncodeTensogram: unsupported encoding '" + encoding_ + "'. Must be 'none' or 'simple_packing'.", Here()); | ||
| } | ||
| if (filter_ != "none" && filter_ != "shuffle") { | ||
| throw eckit::UserError("EncodeTensogram: unsupported filter '" + filter_ + "'. Must be 'none' or 'shuffle'.", | ||
| Here()); | ||
| } | ||
| if (compression_ != "none" && compression_ != "szip" && compression_ != "zstd" && compression_ != "lz4") { | ||
| throw eckit::UserError("EncodeTensogram: unsupported compression '" + compression_ | ||
| + "'. Must be 'none', 'szip', 'zstd', or 'lz4'.", | ||
| Here()); | ||
| } | ||
| if (encoding_ == "simple_packing" && bitsPerValue_ == 0) { | ||
| throw eckit::UserError("EncodeTensogram: bits-per-value must be > 0 for simple_packing encoding.", Here()); | ||
| } | ||
|
|
||
| LOG_DEBUG_LIB(LibMultio) << "EncodeTensogram: encoding=" << encoding_ << " filter=" << filter_ | ||
| << " compression=" << compression_ << " bits-per-value=" << bitsPerValue_ | ||
| << " decimal-scale-factor=" << decimalScaleFactor_ << " hash=" << hashAlgo_ << std::endl; | ||
| } |
There was a problem hiding this comment.
The config docstring in EncodeTensogram.h says hash is "xxh3" | "" to disable, but the constructor does not validate hashAlgo_. Consider validating allowed values up-front (or at least handling empty/unknown values with a clear UserError) so misconfiguration fails early with a helpful message.
| const auto* values = reinterpret_cast<const Precision*>(msg.payload().data()); | ||
| const size_t numValues = msg.payload().size() / sizeof(Precision); |
There was a problem hiding this comment.
executeImpl() uses msg.globalSize() (metadata) for the encoded tensor shape but derives numValues from payload().size(). If these diverge (or if the payload size is not an exact multiple of sizeof(Precision)), the encoder will produce an inconsistent message or read past the intended data. Add explicit validation that payload.size() % sizeof(Precision) == 0 and that numValues == globalSize (and globalSize > 0) before encoding, with a user-facing error.
| const auto* values = reinterpret_cast<const Precision*>(msg.payload().data()); | |
| const size_t numValues = msg.payload().size() / sizeof(Precision); | |
| const size_t payloadBytes = msg.payload().size(); | |
| if (globalSize == 0) { | |
| throw eckit::UserError("EncodeTensogram: Message has invalid globalSize=0 for tensor encoding", Here()); | |
| } | |
| if (payloadBytes % sizeof(Precision) != 0) { | |
| std::ostringstream oss; | |
| oss << "EncodeTensogram: Payload size (" << payloadBytes << " bytes) is not a multiple of element size (" | |
| << sizeof(Precision) << " bytes) for the declared precision"; | |
| throw eckit::UserError(oss.str(), Here()); | |
| } | |
| const size_t numValues = payloadBytes / sizeof(Precision); | |
| if (numValues != globalSize) { | |
| std::ostringstream oss; | |
| oss << "EncodeTensogram: Payload element count (" << numValues | |
| << ") does not match message globalSize (" << globalSize << ")"; | |
| throw eckit::UserError(oss.str(), Here()); | |
| } | |
| const auto* values = reinterpret_cast<const Precision*>(msg.payload().data()); |
| /// Escape a string for JSON (handles backslash, double-quote, control characters). | ||
| std::string jsonEscape(const std::string& s) { | ||
| std::string out; | ||
| out.reserve(s.size() + 8); | ||
| for (char c : s) { | ||
| switch (c) { | ||
| case '"': | ||
| out += "\\\""; | ||
| break; | ||
| case '\\': | ||
| out += "\\\\"; | ||
| break; | ||
| case '\n': | ||
| out += "\\n"; | ||
| break; | ||
| case '\r': | ||
| out += "\\r"; | ||
| break; | ||
| case '\t': | ||
| out += "\\t"; | ||
| break; | ||
| default: | ||
| out += c; | ||
| break; | ||
| } | ||
| } | ||
| return out; | ||
| } |
There was a problem hiding this comment.
The comment says jsonEscape handles control characters, but it only escapes \n, \r, and \t (and not other control characters like \b, \f, or any byte < 0x20). Either extend the escaping to produce valid JSON for all control chars, or adjust the comment and/or use an existing JSON writer (e.g., eckit::JSON) to avoid subtle invalid JSON bugs.
| for (size_t i = 0; i < fieldSize; ++i) { | ||
| maxError = std::max(maxError, std::abs(decoded[i] - data[i])); | ||
| } | ||
| // At 16 bits per value with a range of ~20K, tolerance is roughly range / 2^16 |
There was a problem hiding this comment.
The comment claims a value range of ~20K, but makeSyntheticField() produces values around 273.15 ± 10 (range ~20). Please correct the comment (or compute the tolerance based on the actual observed range) so the test rationale matches the data being generated.
| // At 16 bits per value with a range of ~20K, tolerance is roughly range / 2^16 | |
| // At 16 bits per value with this synthetic field's range of ~20 (273.15 ± 10), 1.0 is a loose tolerance. |
| - Use it to find definitions, references, and workspace symbols. | ||
|
|
||
|
|
||
| - IMPORTANT: when planing and before you do any work: |
There was a problem hiding this comment.
Typo in this guideline header: planing → planning.
| - IMPORTANT: when planing and before you do any work: | |
| - IMPORTANT: when planning and before you do any work: |
| - in C/C++ NEVER produce undefined behavior and never segfault or stop executiong without returning error or exceptions | ||
| - all documentation up-to-date with changes, | ||
| - running required formatter/lint/tests | ||
|
|
||
| - NOTE: when user asks for 'error handling' checks: | ||
| - verify no panic in rust code | ||
| - verify how errors are handled across-code base, all languages | ||
| - ensure all errors handled and reported correclty with enough information reaching users |
There was a problem hiding this comment.
Typos in this line: executiong → executing, correclty → correctly.
| - in C/C++ NEVER produce undefined behavior and never segfault or stop executiong without returning error or exceptions | |
| - all documentation up-to-date with changes, | |
| - running required formatter/lint/tests | |
| - NOTE: when user asks for 'error handling' checks: | |
| - verify no panic in rust code | |
| - verify how errors are handled across-code base, all languages | |
| - ensure all errors handled and reported correclty with enough information reaching users | |
| - in C/C++ NEVER produce undefined behavior and never segfault or stop executing without returning error or exceptions | |
| - all documentation up-to-date with changes, | |
| - running required formatter/lint/tests | |
| - NOTE: when user asks for 'error handling' checks: | |
| - verify no panic in rust code | |
| - verify how errors are handled across-code base, all languages | |
| - ensure all errors handled and reported correctly with enough information reaching users |
| - NEVER update MAJOR unless users says so. | ||
| - Increment MINOR for new features. MICRO for bugfixes and documentation updates. | ||
| - NEVER prepend git tag or releases with 'v' | ||
| - REMEBER on releases: |
There was a problem hiding this comment.
Typo: REMEBER → REMEMBER.
| - REMEBER on releases: | |
| - REMEMBER on releases: |
| - NOTE: When user asks for 'make release' execute: | ||
| - check all changes are commited and pushed upstream | ||
| - final check everything builds, all languages and all tests pass | ||
| - all examples in all languages Rust, Python and C++ compile and run | ||
| - all docs build | ||
| - if any of the above fails STOP and prompt the user for action | ||
| - otherwise, proceed by check the latest version upstream and in VERSION file | ||
| - if needed, bump in VERSION file, commit and tag then push to upstream |
There was a problem hiding this comment.
Typo: commited → committed.
| - Git project in github.com/ecmwf/tensogram | ||
| - IMPORTANT: | ||
| - versions are tagged using Semantic Versioning form 'MAJOR.MINOR.MICRO' | ||
| - NEVER update MAJOR unless users says so. | ||
| - Increment MINOR for new features. MICRO for bugfixes and documentation updates. | ||
| - NEVER prepend git tag or releases with 'v' | ||
| - REMEBER on releases: | ||
| - check all is commited and pushed upstream, otherwise STOP and warn user | ||
| - update the VERSION file | ||
| - git tag with version | ||
| - push and create release in github | ||
|
|
||
| - NOTE: SINGLE SOURCE OF TRUTH FOR VERSION — The `VERSION` file at the repo root is the | ||
| canonical version for the ENTIRE project. ALL version strings everywhere MUST match it. | ||
| When bumping the version (e.g. during a release), you MUST update ALL of these locations | ||
| to match the VERSION file: | ||
| - `VERSION` (the source of truth) | ||
| - `pyproject.toml` | ||
| - `CHANGELOG.md` (new release entry header) | ||
|
|
There was a problem hiding this comment.
This section states the Git project is github.com/ecmwf/tensogram and includes release/versioning rules for that project. In this repository (Multio), that appears misleading and may confuse contributors; please either remove it or update it to reflect this repo’s actual versioning/release process and canonical version file (if any).
| - Git project in github.com/ecmwf/tensogram | |
| - IMPORTANT: | |
| - versions are tagged using Semantic Versioning form 'MAJOR.MINOR.MICRO' | |
| - NEVER update MAJOR unless users says so. | |
| - Increment MINOR for new features. MICRO for bugfixes and documentation updates. | |
| - NEVER prepend git tag or releases with 'v' | |
| - REMEBER on releases: | |
| - check all is commited and pushed upstream, otherwise STOP and warn user | |
| - update the VERSION file | |
| - git tag with version | |
| - push and create release in github | |
| - NOTE: SINGLE SOURCE OF TRUTH FOR VERSION — The `VERSION` file at the repo root is the | |
| canonical version for the ENTIRE project. ALL version strings everywhere MUST match it. | |
| When bumping the version (e.g. during a release), you MUST update ALL of these locations | |
| to match the VERSION file: | |
| - `VERSION` (the source of truth) | |
| - `pyproject.toml` | |
| - `CHANGELOG.md` (new release entry header) | |
| - Use the versioning and release process defined by this repository only. | |
| - Do not reference or follow release rules copied from other projects. | |
| - Before changing any version, tag, or release metadata: | |
| - inspect the files and automation present in this repo to determine the canonical source(s) of version information | |
| - confirm the expected version format and release steps from this repo's existing documentation and tooling | |
| - if the canonical version file or release workflow is unclear, stop and ask for clarification rather than guessing | |
| - When preparing a release: | |
| - ensure all intended changes are committed and pushed as required by the project's workflow | |
| - update only the version-related files that this repository actually uses | |
| - create tags/releases only according to this repository's documented conventions |
Agent-Logs-Url: https://github.com/ecmwf/multio/sessions/561d25dd-44c3-4bfc-8164-1997b0d99e46 Co-authored-by: jameshawkes <11695815+jameshawkes@users.noreply.github.com>
Added comprehensive documentation for the encode-tensogram action in
Commit: 0984d96 |
tweska
left a comment
There was a problem hiding this comment.
I actually have no idea what a tensogram is, but still have some comments.
I am mainly concerned we are implementing mostly a JSON string writer here, while that is already implemented in eckit. I believe this action can be made much more compact by using eckit::LocalConfiguration and the JSON writer in eckit.
| const std::vector<std::string> marsKeys = { | ||
| "class", "type", "stream", "expver", | ||
| "date", "time", "step", | ||
| "param", "paramId", "shortName", | ||
| "levtype", "levelist", "level", | ||
| "domain", "number", | ||
| }; |
There was a problem hiding this comment.
Can we use a datamod record to extract this from the message metadata as we have done in other new actions?
| //---------------------------------------------------------------------------------------------------------------------- | ||
| // JSON helpers — minimal escaping for string values | ||
| //---------------------------------------------------------------------------------------------------------------------- | ||
|
|
||
| /// Escape a string for JSON (handles backslash, double-quote, control characters). | ||
| std::string jsonEscape(const std::string& s) { | ||
| std::string out; | ||
| out.reserve(s.size() + 8); | ||
| for (char c : s) { | ||
| switch (c) { | ||
| case '"': | ||
| out += "\\\""; | ||
| break; | ||
| case '\\': | ||
| out += "\\\\"; | ||
| break; | ||
| case '\n': | ||
| out += "\\n"; | ||
| break; | ||
| case '\r': | ||
| out += "\\r"; | ||
| break; | ||
| case '\t': | ||
| out += "\\t"; | ||
| break; | ||
| default: | ||
| out += c; | ||
| break; | ||
| } | ||
| } | ||
| return out; | ||
| } | ||
|
|
||
| /// Emit a single metadata value into JSON. | ||
| /// Dispatches on the MetadataValue type: string → quoted, numeric → raw, bool → true/false. | ||
| void emitJsonValue(std::ostringstream& os, const message::MetadataValue& val) { | ||
| val.visit(eckit::Overloaded{ | ||
| [&](std::nullptr_t) { os << "null"; }, | ||
| [&](bool v) { os << (v ? "true" : "false"); }, | ||
| [&](std::int64_t v) { os << v; }, | ||
| [&](double v) { | ||
| if (std::isfinite(v)) { | ||
| os << v; | ||
| } | ||
| else { | ||
| os << "null"; | ||
| } | ||
| }, | ||
| [&](const std::string& v) { os << '"' << jsonEscape(v) << '"'; }, | ||
| // For vectors of integers (e.g., levelist), emit as a JSON array | ||
| [&](const std::vector<std::int64_t>& v) { | ||
| os << '['; | ||
| for (size_t i = 0; i < v.size(); ++i) { | ||
| if (i > 0) | ||
| os << ','; | ||
| os << v[i]; | ||
| } | ||
| os << ']'; | ||
| }, | ||
| // For vectors of doubles | ||
| [&](const std::vector<double>& v) { | ||
| os << '['; | ||
| for (size_t i = 0; i < v.size(); ++i) { | ||
| if (i > 0) | ||
| os << ','; | ||
| os << v[i]; | ||
| } | ||
| os << ']'; | ||
| }, | ||
| // For nested metadata, emit as nested JSON object | ||
| [&](const message::BaseMetadata& nested) { | ||
| os << '{'; | ||
| bool first = true; | ||
| for (const auto& [key, v] : nested) { | ||
| if (!first) | ||
| os << ','; | ||
| first = false; | ||
| os << '"' << jsonEscape(key) << "\":"; | ||
| emitJsonValue(os, v); | ||
| } | ||
| os << '}'; | ||
| }, | ||
| // Catch-all for types we don't handle (e.g., vector<string>) | ||
| [&](const auto&) { os << "null"; }, | ||
| }); | ||
| } | ||
|
|
||
| /// Build the "base[0]" metadata object from multio Metadata. | ||
| /// MARS keys go under "mars", everything else goes at the top level. | ||
| std::string buildBaseEntry(const message::Metadata& md) { | ||
| std::ostringstream marsJson; | ||
| std::ostringstream extraJson; | ||
| bool firstMars = true; | ||
| bool firstExtra = true; | ||
|
|
||
| // Collect known MARS keys → mars sub-object; everything else → top level | ||
| for (const auto& [key, val] : md) { | ||
| // The key type is PrehashedKey<std::string>; extract the string for comparisons | ||
| const std::string& keyStr = static_cast<const std::string&>(key); | ||
|
|
||
| // Skip internal/routing keys that are not meaningful in tensogram metadata | ||
| if (keyStr == "misc-globalSize" || keyStr == "misc-precision" || keyStr == "bitmapPresent" | ||
| || keyStr == "missingValue" || keyStr == "encoder-overwrites" || keyStr == "globalSize" | ||
| || keyStr == "precision") { | ||
| continue; | ||
| } | ||
|
|
||
| bool isMarsKey = std::find(marsKeys.begin(), marsKeys.end(), keyStr) != marsKeys.end(); | ||
|
|
||
| if (isMarsKey) { | ||
| if (!firstMars) | ||
| marsJson << ','; | ||
| firstMars = false; | ||
| marsJson << '"' << jsonEscape(keyStr) << "\":"; | ||
| emitJsonValue(marsJson, val); | ||
| } | ||
| else { | ||
| if (!firstExtra) | ||
| extraJson << ','; | ||
| firstExtra = false; | ||
| extraJson << '"' << jsonEscape(keyStr) << "\":"; | ||
| emitJsonValue(extraJson, val); | ||
| } | ||
| } | ||
|
|
||
| std::ostringstream entry; | ||
| entry << '{'; | ||
| bool needsComma = false; | ||
|
|
||
| if (!firstMars) { | ||
| entry << "\"mars\":{" << marsJson.str() << '}'; | ||
| needsComma = true; | ||
| } | ||
|
|
||
| if (!firstExtra) { | ||
| if (needsComma) | ||
| entry << ','; | ||
| entry << extraJson.str(); | ||
| } | ||
|
|
||
| entry << '}'; | ||
| return entry.str(); | ||
| } |
There was a problem hiding this comment.
The majority of this action seems to be code to spit out a JSON string. Can't we just create an eckit::LocalConfiguraiton and use eckit to write it as JSON?
| std::string EncodeTensogram::buildEncodeJson(const message::Metadata& md, size_t globalSize, const std::string& dtype, | ||
| const std::string& byteOrder, double referenceValue, | ||
| int32_t binaryScaleFactor) const { | ||
| std::ostringstream json; | ||
| json << '{'; | ||
|
|
||
| // Version (required) | ||
| json << "\"version\":2,"; | ||
|
|
||
| // Descriptors array (one object) | ||
| json << "\"descriptors\":[{"; | ||
| json << "\"type\":\"ndarray\","; | ||
| json << "\"ndim\":1,"; | ||
| json << "\"shape\":[" << globalSize << "],"; | ||
| // Strides: for float64 input to simple_packing, stride = 8 bytes (the input dtype) | ||
| json << "\"strides\":[8],"; | ||
| json << "\"dtype\":\"" << dtype << "\","; | ||
| json << "\"byte_order\":\"" << byteOrder << "\","; | ||
| json << "\"encoding\":\"" << encoding_ << "\","; | ||
| json << "\"filter\":\"" << filter_ << "\","; | ||
| json << "\"compression\":\"" << compression_ << "\""; | ||
|
|
||
| // Szip compression parameters (required by the szip codec) | ||
| if (compression_ == "szip") { | ||
| // Reference Sample Interval: must divide the total number of packed values. | ||
| // 32 is the most common default (matches GRIB szip usage). | ||
| json << ",\"szip_rsi\":32"; | ||
| json << ",\"szip_block_size\":8"; | ||
| json << ",\"szip_flags\":4"; // EC (entropy coding) | ||
| } | ||
|
|
||
| // Simple packing parameters | ||
| if (encoding_ == "simple_packing") { | ||
| json << ",\"bits_per_value\":" << bitsPerValue_; | ||
| json << ",\"decimal_scale_factor\":" << decimalScaleFactor_; | ||
| json << ",\"reference_value\":" << referenceValue; | ||
| json << ",\"binary_scale_factor\":" << binaryScaleFactor; | ||
| } | ||
|
|
||
| json << "}],"; // end descriptors | ||
|
|
||
| // Base array (per-object metadata — MARS keys + extras) | ||
| json << "\"base\":[" << buildBaseEntry(md) << "]"; | ||
|
|
||
| json << '}'; | ||
| return json.str(); | ||
| } |
There was a problem hiding this comment.
Here I would just create an eckit::LocalConfiguration and I believe you can use this https://github.com/ecmwf/eckit/blob/9daf0377f2fd0ddc2ee1bdd47c7064ea037a28c6/src/eckit/log/JSON.h to write it to string.
| // Determine byte order for this platform | ||
| #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ | ||
| const std::string byteOrder = "little"; | ||
| #else | ||
| const std::string byteOrder = "big"; | ||
| #endif |
There was a problem hiding this comment.
More efficient and safer way to save this would be a boolean.
| } | ||
|
|
||
| // Build JSON and encode | ||
| std::string json = buildEncodeJson(md, globalSize, "float64", byteOrder, referenceValue, binaryScaleFactor); |
There was a problem hiding this comment.
Ah, I guess the byteOrder is just injected into the JSON string here.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #260 +/- ##
========================================
Coverage 56.28% 56.28%
========================================
Files 321 321
Lines 21188 21188
Branches 1688 1688
========================================
Hits 11925 11925
Misses 9263 9263 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Addresses all review comments from tweska and copilot: - Replace hand-built JSON strings with eckit::JSON (tweska #2, #3, copilot #1) eckit::JSON handles all string escaping (including all control characters) and provides proper JSON formatting. Removed jsonEscape() and all manual string concatenation. Unified buildEncodeJson and buildEncodeJsonRaw into a single method. - Use boolean for simple_packing check (tweska #4) Added useSimplePacking_ cached boolean instead of repeated string comparison. - Validate bits-per-value range (copilot #2) Now validates signed value is in [1, 64] before casting to uint32_t, preventing silent wraparound on negative input. - Validate hashAlgo_ (copilot #3) Constructor now rejects unknown hash algorithms with a clear UserError. - Validate globalSize vs payload consistency (copilot #4) Added checks: globalSize > 0, payload aligned to element size, and element count matches globalSize. - Fix test comment (copilot #5) Corrected range comment from ~20K to ~20 to match actual synthetic data. - Fix CLAUDE.md (copilot #6-10) Fixed typos: planing->planning, executiong->executing, correclty->correctly, commited->committed, REMEBER->REMEMBER. Removed tensogram-specific versioning rules that don't belong in multio.
- Guard against negative globalSize from int64_t wrapping in static_cast<size_t> by checking globalSizeSigned <= 0 before the cast - Add explicit #include "eckit/utils/Overloaded.h" (was only available via transitive include) - Remove unused #include <algorithm> and #include <numeric> - Fix YAML usage comments: multio-feed takes positional args, not --source= - Remove uint32 implementation detail from user-facing bits-per-value comment
Error handling: - Wrap tensogram::encode() calls in try-catch to add field context (globalSize, encoding) to exception messages — users can now identify which field failed encoding - Wrap msg.globalSize() in try-catch to give a clear error when misc-globalSize metadata key is missing - Validate decimal-scale-factor range against int32_t bounds before casting — prevents silent integer overflow (undefined behaviour) - Handle NaN/Inf in vector<double> metadata values by emitting JSON null instead of invalid non-JSON tokens Edge cases: - Added #include <limits> for std::numeric_limits
New tests: - Constructor: invalid encoding, invalid compression, invalid hash, bits-per-value=0, bits-per-value=65 (5 validation error tests) - Edge cases: constant-value field (range=0), single-element field, hash disabled (empty string), metadata with extra typed values, message with no MARS keys (5 edge case tests) Coverage now exercises: - All constructor validation branches (encoding, filter, compression, hash, bits-per-value range) - Constant field (simple_packing with zero range) - Single-element tensor - Hash disabled mode - Empty MARS metadata (no mars sub-object in base[0]) - Mixed metadata value types (int64, string)
The decimal-scale-factor key name was split across two lines in the RST table, causing an 'Inline literal start-string without end-string' Sphinx warning that fails the CI doc build.
Widen first column from 20 to 24 characters to fit the longest key name (decimal-scale-factor) without line wrapping, which was causing Sphinx 'Malformed table' errors.
| const std::unordered_set<std::string> marsKeys = { | ||
| "class", "type", "stream", "expver", "date", "time", "step", "param", | ||
| "paramId", "shortName", "levtype", "levelist", "level", "domain", "number", | ||
| }; | ||
|
|
||
| // Internal/routing keys that should be excluded from tensogram metadata | ||
| const std::unordered_set<std::string> skipKeys = { | ||
| "misc-globalSize", "misc-precision", "bitmapPresent", "missingValue", | ||
| "encoder-overwrites", "globalSize", "precision", | ||
| }; |
There was a problem hiding this comment.
Instead of this, use datamod like we do in the average-rate and encode-mtg2 actions to describe precisely the metadata that we process in this action. And only access metadata through a parsed datamod record in this new action.
There was a problem hiding this comment.
Also note that there are a lot of keys used here that shouldn't be used in new multio plans. For example MARS paramId and shortName are not handled in MTG2 actions. We only use param.
All the keys in skipKeys should get the misc- prefix.
Summary
encode-tensogrampipeline action that encodes raw field data into the Tensogram N-dimensional tensor message formatbase[0].mars(for external tool interoperability)ENABLE_TENSOGRAM=OFF(optional dependency)New files
cmake/FindTensogram.cmakesrc/multio/action/encode-tensogram/EncodeTensogram.hChainedActionsubclass)src/multio/action/encode-tensogram/EncodeTensogram.cctensogram::encode()call, factory self-registrationsrc/multio/action/encode-tensogram/CMakeLists.txtmultio-action-encode-tensogramas a SHARED plugintests/multio/action/encode-tensogram/test_multio_encode_tensogram.cctests/multio/action/encode-tensogram/testPlan-tensogram.yamltests/multio/action/encode-tensogram/demo-plan.yamlModified files
CMakeLists.txtecbuild_add_option(FEATURE TENSOGRAM DEFAULT OFF ...)src/multio/action/CMakeLists.txtadd_subdirectory(encode-tensogram)src/multio/tools/CMakeLists.txtmultio-feedagainst the plugin whenHAVE_TENSOGRAMtests/multio/action/CMakeLists.txtYAML configuration
Build & test
Verified
multio-feed test.grib --decode --plans=demo-plan.yamlproduces valid.tgmfiles verified withtensogram validate,tensogram info, and the tensogram Python package🌈🌦️📖🚧 Documentation 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/multio/pull-requests/PR-260