Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Scout stream handler to execute host-based firmware upgrades, including new Forge RPC messages and a Scout implementation that downloads artifacts and runs an upgrade script.
Changes:
- Extend Scout stream routing to handle
ScoutFirmwareUpgradeRequestand returnScoutFirmwareUpgradeResponse. - Add
crates/scout/src/firmware_upgrade.rsimplementing download + script execution flow (with unit tests). - Update Forge protobuf definitions to include the new request/response messages and wire them into stream message oneofs.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/scout/src/stream.rs | Routes the new stream request type to the firmware upgrade handler (now async). |
| crates/scout/src/main.rs | Registers the new firmware_upgrade module. |
| crates/scout/src/firmware_upgrade.rs | Implements firmware upgrade flow (download + execute) and adds unit tests. |
| crates/scout/Cargo.toml | Adds dependencies needed for firmware upgrade implementation and tests. |
| crates/rpc/proto/forge.proto | Adds ScoutFirmwareUpgrade{Request,Response} and wires them into stream messages; also includes formatting cleanups. |
| Cargo.lock | Locks new dependencies (axum/tempfile/tokio-test). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
kensimon
left a comment
There was a problem hiding this comment.
Hmm, this is a little scary since it's basically a "download this URL and run whatever it says" command. It's called "firmware upgrade" right now but it could be re-used for anything else in the future.
I think I'd prefer that if we wanted to go this route, we'd rename the gRPC message to "RunCommandFromUrl" or similar, since that's literally what it does. Calling it "ScoutFirmwareUpgradeRequest" implies that it's doing something specific to firmware upgrades, and it gives us a false sense of security/restriction when there is none.
(Not to mention that there may be better/more restrictive ways to only install firmware if that's our only goal. Not saying I can think of any at the moment, but it'd be great if we could think of a more restricted way to do this.)
crates/scout/src/remote_exec.rs
Outdated
| let stdout = String::from_utf8_lossy(&output.stdout).to_string(); | ||
| let stderr = String::from_utf8_lossy(&output.stderr).to_string(); |
There was a problem hiding this comment.
Minor nit: This always allocates a new string from the stdout/stderr, even if it's valid utf8... this wouldn't really matter except it's possible the stdout can get quite large, so it's probably best to avoid it in the happy path.
| let stdout = String::from_utf8_lossy(&output.stdout).to_string(); | |
| let stderr = String::from_utf8_lossy(&output.stderr).to_string(); | |
| let stdout = String::from_utf8(output.stdout) | |
| .unwrap_or_else(|e| String::from_utf8_lossy(&e.into_bytes()).into_owned()); | |
| let stderr = String::from_utf8(output.stderr) | |
| .unwrap_or_else(|e| String::from_utf8_lossy(&e.into_bytes()).into_owned()); |
(This works because the error returned by String::from_utf8 gives you back the original bytes, so you can re-use them with e.into_bytes() and build the lossy string.)
There was a problem hiding this comment.
Good point on the name. Not sure how to make it more restrictive though. We could add some checksum verification. But how useful would that be? Everything (firmware files, metadata, script) will be prepared by the "operator" anyway
There was a problem hiding this comment.
It would be an involved design process, but we could try leveraging fwupd and finding out what kind of knobs it has to specify specific firmware to update. Since scout is debian-based we could just install fwupd it from apt as part of the mkosi process.
There was a problem hiding this comment.
As far as being more restrictive goes, in a past job I had similar things set up to be signed, so that only script packages that someone with access to the signing key (indirect in that case, with an internal signing service) created could be installed like this. Might be a bit overbearing for our use case here though. I can go over it in more detail if you're interested.
|
My editor seems to have added lots of formatting changes that I didn't want to add. Will remove them when handling the feedback. |
|
As mentioned in the reply to the copilot comment, having a checksum of the files is something you might want to consider. It's more of a deal when transferring over the Internet, when a bad link can cause enough errors to eventually get a TCP/IP checksum false negative; if we're expecting downloads from a local site only, it may be of lower importance. (Until we end up with some That One Site that causes issues, at least.) |
8bb40b6 to
c8a7579
Compare
|
@kensimon @ddejong-spec I have made some changes:
Let me know what you think. |
c8a7579 to
791b621
Compare
| string script_url = 3; | ||
| uint32 timeout_seconds = 4; | ||
| // Files to download before running the script. | ||
| // Keys are download URLs, values are expected SHA-256 hex checksums. |
There was a problem hiding this comment.
I'd make this a repeated FileArtifcat file_artifact (or something like this), and define FileArtifact as required. Then it becomes more obvious what the fields are compared to the map, and we can extend it if further fields are required in the future.
There was a problem hiding this comment.
I like this, will add.
| ); | ||
|
|
||
| let work_dir = tempfile::tempdir()?; | ||
|
|
There was a problem hiding this comment.
It looks like we have a timeout for executing the script, but neither for downloading the script nor downloading the artifacts? Can we please add?
You can decide whether
- the timeout that is specified is for each step (meaning the total time of execution could be a multiple of it)
- the timeout is for everything together. In that case each step would need to subtract the already elapsed time for previous steps for calculating the timeout. Or you calculate a deadline once upfront.
There was a problem hiding this comment.
Right, only the script execution phase is using that timeout param. I am thinking we can hardcode the timeout for the script download phase because it is usually a couple of lines of bash script. But the artifact downloading can be different depending on the component so that might have to be another parameter. What do you think?
| // Download files and verify checksums. | ||
| let download_dir = work_dir.path().join("downloads"); | ||
| std::fs::create_dir_all(&download_dir)?; | ||
| for (url, expected_sha256) in &request.download_files { |
There was a problem hiding this comment.
we might want to consider downloading everything in parallel. But that could be a future optimization
| mlx_device.MlxDeviceConfigSyncRequest mlx_device_config_sync_request = 13; | ||
| mlx_device.MlxDeviceConfigCompareRequest mlx_device_config_compare_request = 14; | ||
| ScoutStreamAgentPingRequest scout_stream_agent_ping_request = 15; | ||
| ScoutRemoteExecRequest scout_remote_exec_request = 16; |
There was a problem hiding this comment.
It's a very long running request. I am not sure if it fits the "scout stream" model best.
My understanding was a bit of:
- If we do anything in the main state machine, then
ForgeAgentControlwould be the mechanism to tell scout what to do - If things are outside of the state machine and relatively short lived, then scout stream could be used.
Maybe @chet who introduced scout stream can help figuring out where it fits best.
There was a problem hiding this comment.
Hm, I thought we were moving towards the stream model because it seemed much cleaner than ForgeAgentControl. Let's see what @chet has to say.
| Ok(format!("{hash:x}")) | ||
| } | ||
|
|
||
| #[cfg(test)] |
There was a problem hiding this comment.
thanks for adding tests!
Description
Adds a handler in scout to handle firmware upgrades. Downloads the necessary files and performs the upgrade. Currently the behaviour of the whole system is not affected. I will follow up with carbide-api changes that initiates this new process and handles the response.
Type of Change
Testing
Additional Notes
Discussed whether to keep this new handler synchronous or make it async. The main points are the following:
That's why keeping it sync which is consistent with the other handlers.