How to write Rust in ncx-infra-controller-core

The goal of this document is to help keep our codebase consistent and maintainable by outlining best-practices we've learned through experience. It is currently a mix of best practices for this codebase (ie. how we expect code to be organized), and best practices for Rust in general. The latter is mostly motivated by issues we seen enough to warrant writing them down, but otherwise this document not aim to be a "how to write Rust" guide.

Core Principles

Prefer simple, explicit code over clever or heavily abstracted code. Optimize for readability and maintainability first.
Prefer designs that are hard to misuse. The more the compiler can catch bugs, the better.
Abstractions should justify their existence: Do not add abstractions "just in case". Wait until there is a real requirement for them.

Reviewability

PR descriptions should be written as if the audience has no context for the change: Explain why it's happening. Don't assume people are already aware of your feature roadmap.

Prefer to not land unused code if nobody's using it yet, unless not doing so would make for too large of a change to review. For example, a PR that lands protobuf changes but without any code using it yet, makes for a lot of guesswork during review: If we can't see how the code will be used, we are just guessing at what the best API contract will be. Landing both changes together means we can look at it all holistically.

Lints and Warnings

We enable all clippy lints by default, and treat all warnings as errors. If a warning or clippy lint is firing for your code, strongly consider fixing it. Avoid using #[allow(...)] unless you have a strong reason to do so. New code should generally not have to #[allow] any lints or warnings.

A note on dead code

Dead code detection is important to catch mistakes and to avoid unused code building up and hurting maintainability. Strongly avoid using #[allow(dead_code)].

An exception is when a part of the codebase is not finished: If a new feature is too large to land all in one PR, and is being written in phases, code may be merged with nothing calling it yet, and #[allow(dead_code)] is necessary for it to be merged early.

Other common places where we've seen #[allow(dead_code)] that are not necessary:

If a field or function is only used in tests: Use #[cfg(test)] to include it only in test builds.
If a field is written to but never read, but needs to be held so its Drop impl does not run: Name it with an underscore to hint that it's not supposed to be read
If a field is only used if certain crate features are enabled, prefer #[cfg(feature = "feature")] to only include it when that feature is being used.
If a field isn't currently yet, but you want to leave it around as documentation on what fields could exist (like an unused database column, or unused JSON field), comment it out.
Otherwise, strongly consider deleting the code.

gRPC API definitions

APIs to list resources and retrieve resource state should be paginated in order to scale to a high amount of managed resources. Pagination should be achieved in the following fashion:
- An API call with the format FindResourceNameIds (e.g. FindMachineIds) should be used to list the IDs of all resources. It should take a ResourceNameSearchFilter message as argument, that allows to narrow down the amount of returned IDs according to certain criteria. If multiple criteria are provided, the API should search for resources where all criteria apply.
- An API call with the format FindResourceNamesByIds (e.g. FindMachinesByIds) should be used to retrieve the state of the resources.
Each resource object that is configurable by API users should contain the following set of fields:
- An id field that identifies the resource.
- A config field that holds every value that is set by API callers (site admins or tenants).
- A status field which holds every value that is generated by the system (not user-provided)
- A metadata field if the resource has user-changeable metadata (name, description or labels)
- A version field which describes how often the config of the resource was updated and when the last change occured. The version field needs to get incremented every time a tenant or site admin changes the config of a certain resource. This allows the system to identify whether anything changed purely by comparing version numbers.
Example of a complete resource:
```
message AmazingResource {
  common.AmazingResourceId id = 1;
  Metadata metadata = 2;
  AmazingResourceConfig config = 3;
  AmazingResourceStatus status = 4;
  string version = 5;
}
```
If the lifecycle of a resource is managed by a state handler, the resource should contain the following extra fields:
- A state field which shows the lifecycle state of the resource
- A state_version field which gets incremented every time the resource switches between states
- A state_reason field which shows the outcome of the last state handler run
- A state_sla field which shows the SLA for the state, and whether it had been breached.

Metrics

When designing metrics, be careful with cardinality. Do not attach highly unique labels that explode time-series count, like per-machine or per-instance attributes.

Logging

All services should emit logs in "logfmt" syntax. This structured logging format allows administrators to efficiently search logs by certain attributes (key/value pairs).

When writing log messages, prefer placing common fields as attributes passed to tracing function, instead of using string interpolation. For example:

fn avoid(machine_id: MachineId) {
    if let Err(e) = process_machine(machine_id) {
        tracing::error!("process_machine failed for {machine_id}: {e}");
    }
}
fn prefer(machine_id: MachineId) {
    if let Err(e) = process_machine(machine_id) {
        tracing::error!(%machine_id, error=%e, "process_machine failed");
    }
}

This helps in log parsing, especially when we want to find logs corresponding to a given machine_id. error and machine_id are probably the two most important examples, but try to express other relevant data as fields instead of using interpolation if it makes sense.

Core API handlers

Implementations of all gRPC functions exposed by the core service should reside in subdirectories of api/src/handlers.
API handlers should inject the deserialized request arguments into the API logs by calling log_request_data to assist debugging. If the request contains sensitive data (e.g. credentials), the data however needs to be filtered before logging.

Core API handler Errors

Inside API handlers, the CarbideError data type should be used to construct errors. It should then be converted into tonic::Status using .into(). All errors being derived from CarbideError assures that the errors will look uniform to tenants.

The CarbideError variant that is used should be selected based on whether the error gets returned due to the user passing invalid arguments or due to the system not being able to handle the request correctly. Error variants that should be used if the user passing invalid arguments can be InvalidArgument, InvalidConfiguration, NotFoundError or ConcurrentModificationError - these will map to "4xx-like" gRPC error codes. An example of a system-side error would be CarbideError::Internal.

// Avoid — constructing Status directly, bypassing `CarbideError` error mapping
pub async fn create_resource(
    api: &Api,
    request: Request<rpc::Resource>,
) -> Result<Response<()>, Status> {
    let resource = request.into_inner();
    let id = resource
        .id
        .ok_or_else(|| Status::invalid_argument("id is required"))?;
}

// Prefer — uses `CarbideError::InvalidArgument`
pub async fn create_resource(
    api: &Api,
    request: Request<rpc::Resource>,
) -> Result<Response<()>, Status> {
    let resource = request.into_inner();
    let id = resource
        .id
        .ok_or(CarbideError::InvalidArgument("id is required".into()))?;
}

Crate Features

Avoid using crate features unless there is a good reason. Our CI runners only build with the default features you get from cargo build --release, meaning that if certain code breaks under certain combinations of crate features, it might not get caught by CI. If we wanted to support numerous crate features, we would need CI runners to produce checks for each meaningful combination of feature flags we support, which scales exponentially to the feature count.

Cases where features are warranted:

For shared crates when only a subset of dependents need certain code: For example, the carbide_uuid is used by several dependents, but only the carbide_api crate needs the sqlx conversions. We don't want e.g. carbide_admin_cli to take a dependency on sqlx, so the sqlx conversions are behind a sqlx crate feature. But this is covered by CI tests, since CI builds both the admin-cli and the api crate, both sets of features are exercised.
For supporting non-linux builds: The carbide_api crate needs to use types from the tss-esapi crate to support validating secure-boot keys, but tss-esapi only builds on Linux. To support developers running carbide_api on their Mac for testing, the parts which require tss-esapi are carefully carved out into a linux-build feature (which is enabled by default). We do not run CI tests with this feature disabled, so supporting a build without linux-build enabled is best-effort.

Async code

Due to the "virality" of async code, prefer synchronous versions of abstractions if both are available. For instance, prefer a std::sync::Mutex to a tokio::sync::Mutex if either will work for you, so that you don't need to make your interface async just so you can use the tokio Mutex. That way callers can call you without needing to be async themselves. Async work should generally be traceable to some I/O or timer that needs to be used, otherwise code should typically be synchronous.

Database transactions

Transactions should be used to group write operations together such that they can be rolled back on failure. But do not hold a transaction open while doing long-running work. Doing so can exhaust the connection pool if the thing you're awaiting is blocked or slow. We have a custom lint, txn_held_across_await which will catch cases where you're awaiting a future while holding a transaction, which mitigates this. If it happens, your code needs to be fixed, do not #[allow(txn_held_across_await)].

Database wrappers

Type definitions: The code in crates/api-db is intended to wrap database calls, whereas crates/api-model should contain the actual model definitions. In the api-db crate, prefer bare functions that take a model as an argument, to OO-style methods on db-specific types. This allows the model types to live in a separate model crate, without the temptation for an OO-style database type to become a quasi-model unto itself.
Read vs Write: Prefer accepting a impl DbReader as a connection if your database function is read-only. This allows callers to pass a PgPool and avoid needing boilerplate to begin a transaction and commit it just to call a read-only function.

Background tasks

Avoid spawning background tasks without joining them. Any panics that happen in background tasks will not propagate to the rest of the process unless you join them via JoinHandle::join() or add them to a JoinSet which is later awaited with JoinSet::join_all().

For carbide-api, we use a single JoinSet to spawn all background tasks, and call join_all() to block "forever" until the process is shut down. This makes it so any panics in the JoinSet will propagate to the main task, and crash the process (which is what we want.) If you want to spawn background work, prefer accepting a &mut JoinSet and spawn your background task into it. Your task can be constructed it inside carbide::setup::initialize_and_spawn_controllers, which has a JoinSet it can pass to your start() function.

Avoid using oneshot::Sender<()> as a cancellation signal, and prefer tokio_util's CancellationToken, which can be cloned and re-used to cancel sub-tasks.

A note on function naming: start or spawn should mean "spawns work in the background". run should mean "run forever".

General Rust Coding Standards

Mutability

Prefer immutable data when possible. Mutable data can be hard to reason about if it's being reused multiple times, and it's not clear when mutations are supposed to "stop". For example:

fn example(machines: Vec<Machine>) {
    let mut index: HashMap<MachineId, &Machine> = HashMap::new();
    for machine in &machines {
        index[machine.id] = machine;
        do_something_else_with(machine);
    }

    process_machines(&index);

    // Someone comes in later and adds:
    let another_machine = lookup_machine();
    index[another_machine.id] = another_machine;
    // Hmm, do I need to call `process_machines` again? Or will that process the same machines twice?
    process_machines(&index);
}

If data is left mutable (like index above), it's not clear at a given line of code if the data is "done" being built, or still has more writes to go. It's also not clear whether it's safe to use the partially-written index. And interleaving the construction of index with other side-effects (like do_something_else_with(machine)) makes it unclear what the role of certain code is.

When building a Vec or a HashMap, prefer using iterators to building them from a for-loop:

fn example(machines: Vec<Machine>) {
    // index is immutable
    let index: HashMap<MachineId, &Machine> = machines.iter().map(|machine| {
        (machine.id, machine)
    }).collect();

    for machine in &machines {
        do_something_else_with(machine); // it's clear this is unrelated to constructing the index
    }

    // it's clear the index is now fully-built
    process_machines(&index);

    // This will now fail to compile, making it clear you have to move this to the beginning and use
    // `machines.iter().chain(Some(another_machine))` to include it in the original index.
    let another_machine = lookup_machine();
    index[another_machine.id] = another_machine;
}

Initialization

Prefer struct literals for "plain old data", and only add a new() function if your type has fields which need to be non-public. Prefer a Builder pattern only if your new() function is too large or difficult to call.

Reasoning: Struct literals include named fields which aid in readability, versus a new() function which does not have labels for parameters. Builders can be more readable than a large new() function, but sacrifice compile-time checks if any of the fields are required.

Compare:

fn example() {
    let u = User {
        id: "john",
        full_name: "John Smith",
    };
}

to:

fn example() {
    let u = User::new("john", "John Smith");
}

In the former it is clear what each argument is, whereas the latter you have to memorize which positional argument corresponds to what field.

For types that are not simple plain-old-data, for example "services" (like a redfish client), or any other case where you don't want the caller to initialize certain fields, a new() function may be required:

struct RedfishClient {
    // Callers pass this
    url: Url,
    // Callers don't pass this
    inner: HttpClient,
}

impl RedfishClient {
    fn new(url: Url) {
        Self { url, inner: make_http_client(url) }
    }
}

If your type has fields that can all be default values in the common case (like a Config object), prefer implementing Default for the type and let callers call T::default(), instead of a parameterless new().

If, in addition to not wanting callers to initialize certain fields, you also have a large number of fields that can to be passed, consider adding a Builder type.

struct BigService {
    name: String,
    // ... lots of fields
}

struct BigServiceBuilder {
    name: Option<String>, // careful!
    // .. lots of Option<T> fields
}

impl BigServiceBuilder {
    fn name(mut self, name: String) -> Self {
        self.name = Some(name);
        self
    }

    fn build(self) -> BigService {
        BigService {
            name: self.name.expect("caller didn't provide name"), // oops!
            // ...
        }
    }
}

But be aware that this can sacrifice compile-time safety if any of the builder fields are required to construct the object. You can work around this by requiring callers to pass any required fields in order to construct a builder:

impl BigService {
    fn builder(name: String) -> BigServiceBuilder {
        BigServiceBuilder {
            name,
            // ...
        }
    }
}

But as the number of required fields grows, a builder becomes less and less helpful in the first place. Builders are most helpful when all fields are optional or have defaults, and are less helpful if there are a complex mix of required and non-required fields. If you have a large struct with lots of required fields and lots of non-required fields, consider splitting it into two types, one for the required fields, and a Config or Params type for the non-required (defaultable) ones.

Type Conversions

Prefer implementing From or TryFrom for types, rather than writing bespoke .to_foo() methods on objects. This makes your conversion logic more idiomatic and discoverable (e.g. you can write src.into()) than custom methods.

Prefer implementing From<T> over From<&T>. This allows the conversion to move data without cloning. If the caller cannot move the value, they can explicitly call .clone() themselves, which makes the cost more obvious.

An exception is when the conversion doesn't require cloning at all (e.g. it only reads Copy fields from the source.) In this case borrowed conversion can be provided for ergonomics, but it should be provided in addition to the owned conversion, not instead of it.

If you need to convert from a string representation, prefer FromStr to From<String> or From<&str>. This lets callers call .parse(), which can be given a &str slice, which can avoid needless clones.

Fields and getters

Avoid writing getters like .some_field() for a type, and prefer just making that field public.

The reason for this is specific to Rust and its ownership model: Public fields allow partial moves of an object to take ownership of its fields, whereas getters have to pick an ownership model that might not match what the caller needs.

For example, if a type User has a field pub name: String, callers that own a User have several options for reading the name field:

fn example(u: User) {
    foo(&u.name); // borrow `name`
    bar(u.name.clone()); // clone `name`
    baz(u.name); // partial move of `name` out of `u`
}

Whereas if name() were a getter, you have to pick an ownership model:

impl User {
    // By borrow: But callers have to clone if they need an owned string
    fn name(&self) -> &str {
        &self.name
    }

    // By cloned value: If callers only need to borrow, this clone is wasteful
    fn name(&self) -> String {
        self.name.clone()
    }

    // By transferring ownership: Now callers have to move `self` to get the name, and can no longer access other fields
    fn name(self) -> String {}
}

In cases where you don't want a field to be public for other reasons (like not allowing callers to write to it), and you must write a getter, consider making two versions, a borrowed getter and an into_ getter:

impl User {
    // Borrowed version
    fn name(&self) -> &str {
        &self.name
    }

    // Owned/destructured version
    fn into_name(self) -> String {
        self.name
    }
}

or an into_parts function, if you want to return multiple fields at once. But again, pub fields are simplest and can avoid all of this, if you are able to use them.

Avoid needless clones

Seeing .clone() all over is a sign that the ownership model may need some rethinking. Can you borrow the data instead? Can you take ownership of the value you're cloning?

Common usages of clone that have easy fixes:

Borrowing: Sometimes a clone happens because you have a borrow and need an owned value:

fn takes_string(s: String) {
    println!("{s}");
}

fn example(s: &str) {
    takes_string(s.clone()); // takes_string requires ownership so we have to clone
}

But the takes_string function doesn't truly need an owned string, it can be changed to take a &str as well.

Or conversely, example could be changed to take an owned String.

Iterators: You can use .into_iter() instead of .iter() to an an owned version of each value, which you can then move without cloning:

fn takes_string(s: String) {}

fn avoid(v: Vec<String>) {
    v.iter().for_each(|i| takes_string(i.clone())); // avoid: needless clone
}

fn prefer(v: Vec<String>) {
    v.into_iter().for_each(|i| takes_string(i)); // prefer: moves out of v
}

Struct initialization ordering: Sometimes just moving the order of parameters to a struct literal can avoid a clone:

struct Outer {
    inner: Inner,
    id: uint
}

struct Inner {
    name: String,
    id: uint
}

fn avoid(inner: I) -> Outer {
    Outer {
        inner: inner.clone(), // can't move inner yet, still need inner.id?
        id: inner.id,
    }
}

fn prefer(inner: I) -> Outer {
    Outer {
        id: inner.id, // Better: just swap the parameters and we can move inner last
        inner,
    }
}

Making use of Cow<T>: If you might use a borrowed value or might produce your own, consider using Cow to avoid the clone in the borrowed case.

fn avoid(user: Option<&User>) -> User {
    if let Some(u) = user {
        u.clone()
    } else {
        User::default()
    }
}

fn prefer(user: Option<&User>) -> Cow<'_, User> {
    if let Some(u) = user {
        Cow::Borrowed(u)
    } else {
        Cow::Owned(User::default())
    }
}

Error handling

Prefer custom errors for library crates, using the thiserror crate to reduce boilerplate for declaring them. Use automatic conversions to convert between errors, or .map_err() if you have to. Using eyre is acceptable for crates that are used for tests/mocks, or for toplevel binaries where errors are given to the user for informational purposes, and not intended to be inspected by other rust code. (We do not always adhere to this rule.)

Avoid using let _unused = foo(); to discard errors. This is error-prone: If later foo() is refactored to become an async function, assigning the result to _unused silences the compiler warning telling you forgot to call .await. If you don't care about the errors a function produces, prefer using .ok() to convert the error into a (discardable) Option.

fn fails() -> Result<(), Error> {}

fn avoid() {
    // if somebody makes `fails()` async later, the compiler won't complain, and the future will
    // never get run
    let _dontcare = fails();
}


fn prefer() {
    // if somebody makes `fails()` async later, you get a compiler error
    fails().ok();
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to write Rust in ncx-infra-controller-core

Core Principles

Reviewability

Lints and Warnings

A note on dead code

gRPC API definitions

Metrics

Logging

Core API handlers

Core API handler Errors

Crate Features

Async code

Database transactions

Database wrappers

Background tasks

General Rust Coding Standards

Mutability

Initialization

Type Conversions

Fields and getters

Avoid needless clones

Error handling

FilesExpand file tree

STYLE_GUIDE.md

Latest commit

History

STYLE_GUIDE.md

File metadata and controls

How to write Rust in ncx-infra-controller-core

Core Principles

Reviewability

Lints and Warnings

A note on dead code

gRPC API definitions

Metrics

Logging

Core API handlers

Core API handler Errors

Crate Features

Async code

Database transactions

Database wrappers

Background tasks

General Rust Coding Standards

Mutability

Initialization

Type Conversions

Fields and getters

Avoid needless clones

Error handling