Motivation
Autoware's error identity is currently fragmented across three independent channels with no shared vocabulary or canonical mapping between them:
| Channel | Current state |
| --- | --- |
| ResponseStatus.code (srv/action) | Ad-hoc constants (UNKNOWN=50000, SERVICE_UNREADY=50001, …) defined per-srv, no owner partitioning, no class derivation |
| /diagnostics (DiagnosticStatus) | Free-form name + message strings, OK/WARN/ERROR/STALE level, no structured code identity |
| /rosout (rcl_interfaces/Log) | Free-form msg text, no structured key-value attachment |
A node raising a planning-validator error today writes three unrelated identifiers (a constant in a srv response, a different name string in DiagnosticStatus, and a free-form log line) that a caller cannot correlate programmatically. The caller has no retryability signal; the monitoring stack has no common key to join on; the HMI has no well-typed error class to branch on. This becomes a hard maintenance problem as the number of components grows.
Goals
Unify error identity across all three channels onto a single 16-bit code (code = (domain_byte << 8) | value_byte, 0x0000 = success) that is the same value whether it appears in a srv response, a diagnostic KeyValue, or a log suffix.
Zero wire change during 1.x. The new code rides the existing ResponseStatus.code (uint16) in place. No new field, no ResponseStatusV2, no autoware_adapi_v2_msgs.
Derive error class off-wire. The 17 gRPC canonical status codes (kOk, kUnavailable, kDeadlineExceeded, kFailedPrecondition, …) are derived from the code via a registry table, never carried on the wire. This gives callers a stable, retryability-meaningful class axis without encoding it into the type.
Bake owner partitioning into the domain byte. The upper byte identifies the owner at a glance. New values for a vendor's own byte require no Foundation PR; cross-owner code bytes are opaque (consumed only through canonical).
Minimal new surface. Three new message types, one C++ library, one lint plugin. Every existing topic, service, and toolchain continues to work.
Reuse existing standards end-to-end: gRPC canonical (Google API Error Model), POSIX / AUTOSAR ara::core principles for the platform-imported domain, OpenTelemetry semantic conventions for structured-log keys, logfmt for the wire format.
Non-Goals
Adding new RPC response types or a ResponseStatusV2 message.
Mandatory ErrorCode on INFO/DEBUG logs (mandatory only for ERROR/WARN/FATAL).
HMI / cloud-platform display vocabulary (a higher-layer contract derived from this one).
Communication-middleware transport errors specific to each RMW (absorbed via the platform-imported PLATFORM_COM domain).
The full observability stack (OTel Collector bridge, backends, DiagGraph integration) — covered in the companion Observability and Diagnostics Integration proposal #7176.
Proposed Design
16-bit code space (1.x) and widening at 2.0
1.x (rides ResponseStatus.code = uint16):
code = (domain << 8) | value
domain upper 8 bits owner-partitioned domain byte
value lower 8 bits error value within the domain
2.0 (ResponseStatus.code widened to uint32, 8+24 split):
code = (domain << 24) | value
domain upper 8 bits same partition table — unchanged
value lower 24 bits up to 16,777,216 values per domain
0x0000 success (domain 0x00, value 0x00)
0x0080–0x00FF common-domain warning band (treated as success)
During the 1.x transition period, the wire type never changes: bool success and string message are deprecated in place and kept for backward compatibility; producers dual-fill them alongside the new code via a set_legacy() helper. At MAJOR 2.0, the deprecated fields are removed from the internal envelope and code widens to uint32. The external AD API boundary (autoware_adapi_v1_msgs/ResponseStatus) applies the deprecation but defers field removal to a future package-major.
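The bit layout above can be sketched in a few constexpr helpers. This is a minimal illustration, not the autoware_error API: only has_warning() is named in this proposal, while make_code, domain_of, value_of, is_success, and widen are hypothetical names. The widen() mapping assumes that a value keeps its numeric meaning when the field grows to 8+24 bits, which the proposal does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// 1.x layout: upper 8 bits are the domain byte, lower 8 bits the value.
constexpr std::uint16_t make_code(std::uint8_t domain, std::uint8_t value) {
  return static_cast<std::uint16_t>((static_cast<unsigned>(domain) << 8) | value);
}

constexpr std::uint8_t domain_of(std::uint16_t code) {
  return static_cast<std::uint8_t>(code >> 8);
}

constexpr std::uint8_t value_of(std::uint16_t code) {
  return static_cast<std::uint8_t>(code & 0xFFu);
}

// Common-domain warning band 0x0080-0x00FF.
constexpr bool has_warning(std::uint16_t code) {
  return code >= 0x0080u && code <= 0x00FFu;
}

// 0x0000 is success; the warning band is treated as success as well.
constexpr bool is_success(std::uint16_t code) {
  return code == 0x0000u || has_warning(code);
}

// 2.0 layout (assumed mapping): domain byte moves to bits 31..24,
// the old 8-bit value sits in the low end of the 24-bit value field.
constexpr std::uint32_t widen(std::uint16_t code) {
  return (static_cast<std::uint32_t>(domain_of(code)) << 24) | value_of(code);
}

static_assert(make_code(0x13, 0x02) == 0x1302, "domain 0x13, value 0x02");
static_assert(widen(0x1302) == 0x13000002u, "same domain byte after widening");

}  // namespace sketch
```

Because every helper is constexpr, the layout rules can be checked at compile time, which fits the proposal's preference for mechanical evidence over manual audit.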
Domain-byte owner partition
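| Domain byte | Owner |
| --- | --- |
| 0x00 | Common (0x0000 = success; 0x0001–0x007F generic errors; 0x0080–0x00FF warning band) |
| 0x01–0x0F | Platform-imported (… ara::core semantics; frozen) |
| 0x10–0x7F | OSS domains (… 0x10, Localization 0x11, Perception 0x12, Planning 0x13, Control 0x14, Map 0x15, …) |
| 0x80–0xEF | Vendor pool: one or more bytes allocated per vendor via a Foundation registry PR. No vendor has a privileged block; TIER IV occupies this space as one vendor among others. (0xC3/0xEA frozen until 2.0; they collide with legacy constants 50000–60001) |
| 0xF0–0xFE | Experimental / private (CI deny) |
| 0xFF | Reserved sentinel |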
Two-layer allocation model:
First layer — which byte range belongs to which owner — recorded in a single Foundation domain_registry.yaml. A new vendor byte requires a Foundation registry PR. This prevents all cross-vendor collisions.
Second layer — values within an allocated byte — managed internally by the owner (capability WG for OSS domains; vendor-internal for vendor bytes). No external review required for second-layer additions.
A component MUST NOT branch on a domain byte it does not own. Cross-owner consumption is via the derived canonical only (CI gate: no_foreign_domain_branch).
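A first-layer lookup over the partition could look like the sketch below. The ranges are the ones quoted in this proposal (0x00 common, 0x01–0x0F platform-imported, 0x10–0x7F OSS, 0x80–0xEF vendor pool, 0xF0–0xFE experimental, 0xFF sentinel); the Owner enum and the function names are hypothetical. The two static_asserts work through why 0xC3 and 0xEA are frozen: the legacy constants 50000 and 60001 are 0xC350 and 0xEA61, so their upper bytes land in exactly those vendor-pool slots.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical owner classes for the first allocation layer.
enum class Owner { kCommon, kPlatformImported, kOss, kVendorPool, kExperimental, kReserved };

constexpr Owner owner_of(std::uint8_t domain) {
  if (domain == 0x00) return Owner::kCommon;
  if (domain <= 0x0F) return Owner::kPlatformImported;
  if (domain <= 0x7F) return Owner::kOss;
  if (domain <= 0xEF) return Owner::kVendorPool;
  if (domain <= 0xFE) return Owner::kExperimental;
  return Owner::kReserved;  // 0xFF sentinel
}

// Vendor-pool bytes that stay unallocated until 2.0.
constexpr bool is_frozen_until_2_0(std::uint8_t domain) {
  return domain == 0xC3 || domain == 0xEA;
}

// Legacy constants read as (domain << 8) | value:
// 50000 == 0xC350 and 60001 == 0xEA61.
static_assert((50000 >> 8) == 0xC3, "legacy 50000 falls in domain byte 0xC3");
static_assert((60001 >> 8) == 0xEA, "legacy 60001 falls in domain byte 0xEA");

}  // namespace sketch
```

A lint rule such as no_foreign_domain_branch could build on a lookup like this, although how the real gate is implemented is not described here.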
Error class: gRPC canonical, derived off-wire
The 17 gRPC canonical status codes (Google API Error Model) are the sole contract-surface error class. canonical is not carried on the wire; the autoware_error library derives (domain, value) → canonical from a registry table (canonical-mapping.csv). A code absent from the table resolves to kUnknown.
Representative mappings:
| Errc | canonical |
| --- | --- |
| common::kServiceUnready | kUnavailable |
| common::kServiceTimeout | kDeadlineExceeded |
| common::kTransformError | kFailedPrecondition |
| routing::kPlannerUnready | kFailedPrecondition |
| routing::kPlannerFailed | kInternal |
| routing::kGoalOutOfLanelet | kInvalidArgument |
| operation_mode::kNotAvailable | kFailedPrecondition |
| operation_mode::kInTransition | kAborted |
| platform::posix::kTimedOut | kDeadlineExceeded |
A caller's generic handler — retry on kUnavailable/kDeadlineExceeded, fix the precondition on kFailedPrecondition, escalate on kInternal — can be written directly against these 17 values without knowing any domain byte.
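The off-wire derivation and a generic caller-side policy might be sketched as follows. The Canonical enum lists the 17 gRPC status codes with their standard numeric values. Everything else is an assumption for illustration: the numeric codes in the table are placeholders (the proposal does not give the values behind common::kServiceUnready and the others), the real table lives in canonical-mapping.csv, and mapping success and the warning band to kOk is my reading of "treated as success".

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

namespace sketch {

// The 17 gRPC canonical status codes, with gRPC's numeric values.
enum class Canonical {
  kOk = 0, kCancelled = 1, kUnknown = 2, kInvalidArgument = 3,
  kDeadlineExceeded = 4, kNotFound = 5, kAlreadyExists = 6,
  kPermissionDenied = 7, kResourceExhausted = 8, kFailedPrecondition = 9,
  kAborted = 10, kOutOfRange = 11, kUnimplemented = 12, kInternal = 13,
  kUnavailable = 14, kDataLoss = 15, kUnauthenticated = 16
};

// Placeholder codes only; the real values are owned by each domain.
constexpr std::uint16_t kServiceUnready = 0x0001;  // stands in for common::kServiceUnready
constexpr std::uint16_t kServiceTimeout = 0x0002;  // stands in for common::kServiceTimeout
constexpr std::uint16_t kPlannerUnready = 0x2001;  // stands in for routing::kPlannerUnready
constexpr std::uint16_t kPlannerFailed = 0x2002;   // stands in for routing::kPlannerFailed

// Derive the class from the code; a code absent from the table is kUnknown.
inline Canonical canonical(std::uint16_t code) {
  if (code == 0x0000 || (code >= 0x0080 && code <= 0x00FF)) return Canonical::kOk;
  static const std::unordered_map<std::uint16_t, Canonical> table = {
      {kServiceUnready, Canonical::kUnavailable},
      {kServiceTimeout, Canonical::kDeadlineExceeded},
      {kPlannerUnready, Canonical::kFailedPrecondition},
      {kPlannerFailed, Canonical::kInternal},
  };
  const auto it = table.find(code);
  return it == table.end() ? Canonical::kUnknown : it->second;
}

// Generic retry policy written against the class alone, never the domain byte.
inline bool should_retry(Canonical c) {
  return c == Canonical::kUnavailable || c == Canonical::kDeadlineExceeded;
}

}  // namespace sketch
```

A caller would write `if (sketch::should_retry(sketch::canonical(resp_code))) { /* back off and retry */ }` and stay correct when a foreign domain adds new values, since unmapped codes fall through to kUnknown.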
New message types (3 total)
autoware_common_msgs/msg/ErrorCode.msg — composition type for embedding in state topics and event messages (service responses carry the same code in the existing ResponseStatus.code; this message is not embedded there):
uint16 code # widens to uint32 at 2.0
string detail
autoware_common_msgs/msg/ErrorDomain.msg — domain-byte constants (the 0x10–0x7F OSS range and the 0x00/0x01–0x0F blocks; the vendor pool range is reserved but individual vendor bytes are not enumerated here).
error_domain/<owner>/<Domain>.msg — per-domain value constants, e.g.:
In-place evolution of ResponseStatus
No new response type is introduced; the existing ResponseStatus is evolved in place.
Producers dual-fill via set_legacy(resp.status, ec). Consumers migrate from success/message to code. Source-lint (no_legacy_response_status_field) is the primary evidence of migration completion: it mechanically proves zero references to the old fields across the entire codebase.
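To make the dual-fill step concrete, here is one way a set_legacy() helper could behave. The two structs are plain C++ stand-ins for the generated message types, and the fill rules (success derived from the code, message copied from detail) are assumptions; the proposal names the helper but does not spell out its body.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace sketch {

// Stand-in mirroring ErrorCode.msg (uint16 code, string detail).
struct ErrorCode {
  std::uint16_t code;
  std::string detail;
};

// Stand-in for ResponseStatus during 1.x: the deprecated fields are still present.
struct ResponseStatus {
  bool success = false;      // deprecated in place
  std::uint16_t code = 0;
  std::string message;       // deprecated in place
};

// Assumed behaviour: write the new code and keep the legacy fields consistent
// so that unmigrated consumers continue to read sensible values.
inline void set_legacy(ResponseStatus& status, const ErrorCode& ec) {
  status.code = ec.code;
  status.success = ec.code == 0x0000 || (ec.code >= 0x0080 && ec.code <= 0x00FF);
  status.message = ec.detail;
}

}  // namespace sketch
```

Under these assumptions a legacy consumer that only checks success sees the same outcome as a migrated consumer that checks code, which is what lets producers and consumers migrate independently during 1.x.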
Minimal instrumentation
A node author constructs one ErrorCode and emits it on all three channels:
// Example: planning_validator raises a routing error
using Routing = autoware::error::domain::autoware_::routing::Errc;
auto ec = autoware::error::domain::autoware_::routing::make(
    Routing::kPlannerUnready, "no route found");
// srv response (dual-fills deprecated fields automatically)
throw autoware::error::Exception(ec);
// /rosout: logfmt suffix appended to log text
AUTOWARE_LOG_ERROR_CODE(logger, ec, "Route planning failed");
// /diagnostics: reserved KV keys in DiagnosticStatus.values[]
autoware::error::set(diag_status, ec);
The logfmt suffix follows OpenTelemetry semantic conventions:
The reserved KV keys in DiagnosticStatus.values[] (autoware.error.code, autoware.error.canonical, autoware.error.domain_name, autoware.error.value_name, autoware.error.detail) let existing rqt_robot_monitor / diagnostic_aggregator deployments display structured error information with no change to those tools.
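The sketch below shows how the five reserved keys could be populated and how the same pairs could be rendered as a logfmt suffix with escaping for spaces, '=', quotes, and newlines. The key names come from this proposal; everything else is assumed for illustration: the function names, the hex rendering of the code, the reuse of the diagnostic keys in the log suffix, and the placeholder code 0x2001.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

using KeyValue = std::pair<std::string, std::string>;

// Quote a logfmt value if it is empty or contains a space, '=', '"', '\\', or newline.
inline std::string logfmt_value(const std::string& raw) {
  bool needs_quotes = raw.empty();
  for (char c : raw) {
    if (c == ' ' || c == '=' || c == '"' || c == '\\' || c == '\n') needs_quotes = true;
  }
  if (!needs_quotes) return raw;
  std::string out = "\"";
  for (char c : raw) {
    if (c == '"' || c == '\\') { out += '\\'; out += c; }
    else if (c == '\n') { out += "\\n"; }
    else { out += c; }
  }
  out += '"';
  return out;
}

// Build the reserved key-value pairs for DiagnosticStatus.values[].
inline std::vector<KeyValue> reserved_kv(std::uint16_t code, const std::string& canonical,
                                         const std::string& domain_name,
                                         const std::string& value_name,
                                         const std::string& detail) {
  char hex[8];
  std::snprintf(hex, sizeof(hex), "0x%04X", static_cast<unsigned>(code));
  return {{"autoware.error.code", hex},
          {"autoware.error.canonical", canonical},
          {"autoware.error.domain_name", domain_name},
          {"autoware.error.value_name", value_name},
          {"autoware.error.detail", detail}};
}

// Render the pairs as a space-separated logfmt suffix.
inline std::string logfmt_suffix(const std::vector<KeyValue>& kv) {
  std::string out;
  for (const auto& entry : kv) {
    if (!out.empty()) out += ' ';
    out += entry.first + "=" + logfmt_value(entry.second);
  }
  return out;
}

}  // namespace sketch
```

Keeping one source of key-value pairs for both the diagnostic and the log channel is a design choice of this sketch; it makes the join key identical by construction, which is the property the proposal is after.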
Proof of Concept
A minimal feasibility skeleton is available at https://github.com/youtalk/awf-error-observability-poc (ROS 2 Jazzy; docker compose up to run, docker compose --profile test up to run the 25-test suite). It is not production code; it proves the design claims end-to-end on real ROS 2 machinery.
Claims validated by the PoC:
| Claim | What the PoC runs |
| --- | --- |
| C1 New msgs generate and link cleanly | ErrorCode / ErrorDomain / domain value .msg build in a colcon workspace |
| C2 gRPC canonical derived off-wire from the code | autoware_error::canonical(code) matches the registry table; no canonical field on the wire |
| C3 (keystone) In-place semver = zero wire change | rihs01_zero_wire_change.sh computes the RIHS01 type hash of ResponseStatus before and after in-place deprecation (comments + lint, no field added/removed) and asserts the hashes are identical; then removes success/message at the simulated 2.0 boundary and asserts the hash changes |
| C4 logfmt suffix lands on /rosout | Live node publishes; the subscriber parses and validates each key |
| C5 logfmt suffix survives a rosbag2 round-trip byte-identical | Record /rosout → play → assert the suffixed Log record (with escaped space / = / quote / newline) replays byte-for-byte identical |
| C6 reserved KV keeps a real diagnostic_aggregator working | A full diagnostic_aggregator launch consumes DiagnosticStatus with the reserved keys and emits the expected /diagnostics_agg output |
C3 is the keystone claim: it directly proves that the "zero wire change during 1.x" guarantee is not an assertion but a verifiable RIHS01 property.
Migration
Migration proceeds in three stages, without any absolute date dependency (the OSS track runs at Foundation adoption cadence):
Stage 1 — Foundation (zero wire change). Introduce autoware_common_msgs (ErrorCode / ErrorDomain / error_domain/*) and autoware_error (C++ library, set_legacy, canonical derivation table, lint plugin). Source-lint starts in warn mode. The RIHS01 hash of ResponseStatus is provably unchanged (C3 above).
Stage 2 — Producer/consumer migration. Producers replace legacy filling with make() + set_legacy(); consumers migrate from success/message to code. The producer_first_order CI gate enforces that consumer switches follow producer switches. Source-lint (no_legacy_response_status_field) escalates from warn to error once a package opts in via <export><autoware_error_migrated/></export>. Zero source-lint violations is the exit condition — mechanically provable, not dependent on manual audit.
Stage 3 — MAJOR 2.0. After the deprecated fields are lint-clean across all OSS packages and Humble support has wound down: remove success/message from the internal envelope (autoware_common_msgs/ResponseStatus) and widen code to uint32 (8+24 split). The runtime semver handshake rejects MAJOR-mixed node pairs at startup. Old bags are converted by ros2 bag migrate using the ResponseStatus1xTo2 plugin. The RIHS01 hash of the new layout differs from the old (C3), making residual old endpoints detectable mechanically at the 2.0 boundary.
Source-lint is the lead evidence of migration completion. Because the 1.x period involves zero wire change, observing type hashes cannot distinguish a migrated producer from an unmigrated one. Source-lint (no_legacy_response_status_field) fills this gap by proving there are zero references to .success/.message or the old constants anywhere in the codebase. RIHS01 audits remain the check for bag consistency and for the 2.0 boundary.
Feedback Requested
Domain-byte partition. The 0x80–0xEF vendor-pool range allocates bytes per vendor via a Foundation registry PR, with no privileged block for any single vendor. Does this governance model work for OEMs and integrators who want to define their own error values without a Foundation review cycle for each value addition?
gRPC canonical as the sole error class. The proposal uses the 17 gRPC canonical status codes rather than a coarser 4-class scheme. The benefit is direct gRPC/HTTP binding and well-known retryability semantics; the cost is that autoware_error must maintain the (domain, value) → canonical table in CI. Is this tradeoff acceptable, or should there be a simpler first-pass class axis alongside canonical?
Warning band (0x0080–0x00FF). Several existing call sites return success = true with a warning code (e.g. NO_EFFECT). The proposal absorbs these into a common-domain warning band treated as success, accessible via has_warning(code). Are there call sites where the warning/success distinction is load-bearing enough that a dedicated field (rather than a value-band convention) would be safer?
In-place ResponseStatus evolution vs. a parallel type. The proposal deliberately avoids ResponseStatusV2 in favor of in-place deprecation and RIHS01 evidence. If your team has existing tooling that snapshots type hashes for compatibility checking, does the "zero RIHS01 change during 1.x" guarantee give enough confidence, or would a side-by-side type be easier to gate in CI?
Migration tooling scope. The autoware_error_codemod libclang-based rewriter is the proposed automation path. Have teams here used similar AST-based codemods across autowarefoundation/* at scale? Lessons learned on failure modes (macro expansion, generated code, ROS 2 service stubs) would directly inform the codemod's design.
This proposal is part of a foundation-design series targeting Autoware's API and observability infrastructure. A companion Observability and Diagnostics Integration proposal #7176 covering the full OTel Collector bridge, DiagGraph integration, and structured-log backend wiring follows separately and builds on the emission mechanism defined here.