Skip to content

Implement firmware upgrade in scout#484

Open
rahmonov wants to merge 1 commit intoNVIDIA:mainfrom
rahmonov:jrakhmonov/scout-firmware-proto
Open

Implement firmware upgrade in scout#484
rahmonov wants to merge 1 commit intoNVIDIA:mainfrom
rahmonov:jrakhmonov/scout-firmware-proto

Conversation

@rahmonov
Copy link
Contributor

@rahmonov rahmonov commented Mar 9, 2026

Description

Adds a handler in scout to handle firmware upgrades. Downloads the necessary files and performs the upgrade. Currently the behaviour of the whole system is not affected. I will follow up with carbide-api changes that initiates this new process and handles the response.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Discussed whether to keep this new handler synchronous or make it async. The main points are the following:

  • Nothing else should happen while the upgrade is ongoing so making it sync is fine.
  • How about healthchecks? They only happen for the machines in the ready state.

That's why keeping it sync which is consistent with the other handlers.

Copilot AI review requested due to automatic review settings March 9, 2026 15:48
@rahmonov rahmonov requested a review from a team as a code owner March 9, 2026 15:48
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Scout stream handler to execute host-based firmware upgrades, including new Forge RPC messages and a Scout implementation that downloads artifacts and runs an upgrade script.

Changes:

  • Extend Scout stream routing to handle ScoutFirmwareUpgradeRequest and return ScoutFirmwareUpgradeResponse.
  • Add crates/scout/src/firmware_upgrade.rs implementing download + script execution flow (with unit tests).
  • Update Forge protobuf definitions to include the new request/response messages and wire them into stream message oneofs.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
crates/scout/src/stream.rs Routes the new stream request type to the firmware upgrade handler (now async).
crates/scout/src/main.rs Registers the new firmware_upgrade module.
crates/scout/src/firmware_upgrade.rs Implements firmware upgrade flow (download + execute) and adds unit tests.
crates/scout/Cargo.toml Adds dependencies needed for firmware upgrade implementation and tests.
crates/rpc/proto/forge.proto Adds ScoutFirmwareUpgrade{Request,Response} and wires them into stream messages; also includes formatting cleanups.
Cargo.lock Locks new dependencies (axum/tempfile/tokio-test).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@kensimon kensimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is a little scary since it's basically a "download this URL and run whatever it says" command. It's called "firmware upgrade" right now but it could be re-used for anything else in the future.

I think I'd prefer that if we wanted to go this route, we'd rename the gRPC message to "RunCommandFromUrl" or similar, since that's literally what it does. Calling it "ScoutFirmwareUpgradeRequest" implies that it's doing something specific to firmware upgrades, and it gives us a false sense of security/restriction when there is none.

(Not to mention that there may be better/more restrictive ways to only install firmware if that's our only goal. Not saying I can think of any at the moment, but it'd be great if we could think of a more restricted way to do this.)

Comment on lines +84 to +85
let stdout = String::from_utf8_lossy(&output.stdout).to_string();
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: This always allocates a new string from the stdout/stderr, even if it's valid utf8... this wouldn't really matter except it's possible the stdout can get quite large, so it's probably best to avoid it in the happy path.

Suggested change
let stdout = String::from_utf8_lossy(&output.stdout).to_string();
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
let stdout = String::from_utf8(output.stdout)
.unwrap_or_else(|e| String::from_utf8_lossy(&e.into_bytes()).into_owned());
let stderr = String::from_utf8(output.stderr)
.unwrap_or_else(|e| String::from_utf8_lossy(&e.into_bytes()).into_owned());

(This works because the error returned by String::from_utf8 gives you back the original bytes, so you can re-use them with e.into_bytes() and build the lossy string.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the name. Not sure how to make it more restrictive though. We could add some checksum verification. But how useful would that be? Everything (firmware files, metadata, script) will be prepared by the "operator" anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be an involved design process, but we could try leveraging fwupd and finding out what kind of knobs it has to specify specific firmware to update. Since scout is debian-based we could just install fwupd it from apt as part of the mkosi process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as being more restrictive goes, in a past job I had similar things set up to be signed, so that only script packages that someone with access to the signing key (indirect in that case, with an internal signing service) created could be installed like this. Might be a bit overbearing for our use case here though. I can go over it in more detail if you're interested.

@rahmonov
Copy link
Contributor Author

rahmonov commented Mar 9, 2026

My editor seems to have added lots of formatting changes that I didn't want to add. Will remove them when handling the feedback.

@ddejong-spec
Copy link
Contributor

As mentioned in the reply to the copilot comment, having a checksum of the files is something you might want to consider. It's more of a deal when transferring over the Internet, when a bad link can cause enough errors to eventually get a TCP/IP checksum false negative; if we're expecting downloads from a local site only, it may be of lower importance. (Until we end up with some That One Site that causes issues, at least.)

@rahmonov rahmonov force-pushed the jrakhmonov/scout-firmware-proto branch 3 times, most recently from 8bb40b6 to c8a7579 Compare March 10, 2026 16:40
@rahmonov
Copy link
Contributor Author

@kensimon @ddejong-spec I have made some changes:

  1. Added checksum verification.
  2. Renamed things to "remote execution".
  3. Smaller changes based on the comments on the previous version.

Let me know what you think.

@rahmonov rahmonov force-pushed the jrakhmonov/scout-firmware-proto branch from c8a7579 to 791b621 Compare March 10, 2026 17:07
string script_url = 3;
uint32 timeout_seconds = 4;
// Files to download before running the script.
// Keys are download URLs, values are expected SHA-256 hex checksums.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this a repeated FileArtifcat file_artifact (or something like this), and define FileArtifact as required. Then it becomes more obvious what the fields are compared to the map, and we can extend it if further fields are required in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, will add.

);

let work_dir = tempfile::tempdir()?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we have a timeout for executing the script, but neither for downloading the script nor downloading the artifacts? Can we please add?

You can decide whether

  • the timeout that is specified is for each step (meaning the total time of execution could be a multiple of it)
  • the timeout is for everything together. In that case each step would need to subtract the already elapsed time for previous steps for calculating the timeout. Or you calculate a deadline once upfront.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, only the script execution phase is using that timeout param. I am thinking we can hardcode the timeout for the script download phase because it is usually a couple of lines of bash script. But the artifact downloading can be different depending on the component so that might have to be another parameter. What do you think?

// Download files and verify checksums.
let download_dir = work_dir.path().join("downloads");
std::fs::create_dir_all(&download_dir)?;
for (url, expected_sha256) in &request.download_files {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to consider downloading everything in parallel. But that could be a future optimization

mlx_device.MlxDeviceConfigSyncRequest mlx_device_config_sync_request = 13;
mlx_device.MlxDeviceConfigCompareRequest mlx_device_config_compare_request = 14;
ScoutStreamAgentPingRequest scout_stream_agent_ping_request = 15;
ScoutRemoteExecRequest scout_remote_exec_request = 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a very long running request. I am not sure if it fits the "scout stream" model best.

My understanding was a bit of:

  • If we do anything in the main state machine, then ForgeAgentControl would be the mechanism to tell scout what to do
  • If things are outside of the state machine and relatively short lived, then scout stream could be used.

Maybe @chet who introduced scout stream can help figuring out where it fits best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I thought we were moving towards the stream model because it seemed much cleaner than ForgeAgentControl. Let's see what @chet has to say.

Ok(format!("{hash:x}"))
}

#[cfg(test)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding tests!

@Matthias247 Matthias247 requested a review from chet March 10, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants