Skip to content

Comments

cloud_broker: avoid script name collisions during injection#29511

Merged
ballard26 merged 1 commit intoredpanda-data:devfrom
ballard26:coll-avoid-cb
Feb 3, 2026
Merged

cloud_broker: avoid script name collisions during injection#29511
ballard26 merged 1 commit intoredpanda-data:devfrom
ballard26:coll-avoid-cb

Conversation

@ballard26
Copy link
Contributor

When multiple CloudBroker instances inject scripts with the same name to the same agent node, they can overwrite each other. Use a UUID prefix for unique script names and clean up from the agent node after copying to the pod.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

When multiple CloudBroker instances inject scripts with the same name
to the same agent node, they can overwrite each other. Use a UUID
prefix for unique script names and clean up from the agent node after
copying to the pod.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents CloudBroker instances from overwriting each other's scripts when multiple instances inject scripts with the same name to the same Kubernetes agent node.

Changes:

  • Add UUID prefix to script names during injection to ensure uniqueness
  • Clean up temporary script files from the agent node after copying to the pod

return

# Remove script from agent node
self._kubeclient._ssh_cmd(["rm", unique_script_name])
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SSH command to remove the script is constructed but never executed. This should be subprocess.check_output(self._kubeclient._ssh_cmd(["rm", unique_script_name])) or similar to actually run the cleanup command.

Suggested change
self._kubeclient._ssh_cmd(["rm", unique_script_name])
_rm_cmd = self._kubeclient._ssh_cmd(["rm", unique_script_name])
subprocess.check_output(_rm_cmd)

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._kubeclient._ssh_cmd(...) does execute the command AFAIK.

Copy link
Member

@StephanDollberg StephanDollberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this copying scripts in the first place? Can't it just run the command via ssh directly?

@ballard26
Copy link
Contributor Author

ballard26 commented Feb 3, 2026

Why is this copying scripts in the first place? Can't it just run the command via ssh directly?

Probably to avoid gotchas with quotes and env var expansions when running complex commands across ssh. Didn't really look into it too much. Just saw a potential issue here while checking if it was safe to run some test setup code in parallel.

@ballard26 ballard26 enabled auto-merge February 3, 2026 23:35
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#80054
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/80054#019c25af-fc0a-48ed-a6d0-0a93938be70f FLAKY 8/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1002, p0=0.2645, reject_threshold=0.0100. adj_baseline=0.2714, p1=0.4625, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

@ballard26 ballard26 merged commit 3f46e87 into redpanda-data:dev Feb 3, 2026
20 checks passed
@travisdowns
Copy link
Member

Why is this copying scripts in the first place? Can't it just run the command via ssh directly?

@StephanDollberg it was for efficiency as running each command has a long fixed overhead of ~5 seconds.

@StephanDollberg
Copy link
Member

@StephanDollberg it was for efficiency as running each command has a long fixed overhead of ~5 seconds.

I don't follow this sorry. What does script vs tsh -- foo have to do with how long the command runs?

@travisdowns
Copy link
Member

I don't follow this sorry. What does script vs tsh -- foo have to do with how long the command runs?

Sorry I was answering the question of why there is a script in the first place: to group many commands together. When you said "why can't it run the command" I thought you mean like running individual commands from the script over tsh.

I guess you are asking why the whole script body isn't sent over tsh to execute immediately in the shell? Not sure. I think the semantics are a bit different doing that (e.g., treatment of failures, comments, etc) but it seems feasible. It's also probably possible to send the script as direct input to bash, e..g, as a heredoc which would probably have the same script execution semantics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants