-
Notifications
You must be signed in to change notification settings - Fork 204
Allow agent to override fleet settings #11524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Allow agent to override fleet settings #11524
Conversation
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
swiatekm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic looks fine to me but I don't like where it's located in the codebase. Ideally, this change should try to make reasoning about config loading easier, not harder.
swiatekm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| return func(change coordinator.ConfigChange) coordinator.ConfigChange { | ||
| newConfig := change.Config() | ||
| if err := newConfig.Merge(rawConfig); err != nil { | ||
| log.Errorf("error merging fleet config into config change: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you just want to log on failure? In the context of agentless, this would leave the pod running without valid configuration wouldn'it it?
How would you alert on that or gate promotion on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see this as non fatal for use cases we have today. we want to allow monitoring of agent, liveness checks... what's more important getting data in or being observable. my take is former, but if you think this should be fatal I can fail hard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking forward a bit I think we can likely use this to inject a rate limit processor required by all agentless integrations, using the OTel processors section that can be used to add global processors with some other not quite done work.
I think if you only look at the liveness issue it isn't fatal, but I can see this being used for things in the future that should be fatal.
In the context of agentless I would rather have incomplete configs be fatal errors so we always know there is a problem, than silently continue to run with a configuration we don't want (e.g. missing rate limit for example).
|
Capabilities is a nice way to control this, good idea 👍 The capabilities.yml file has to be read from the config path, do we do permissions checks on ownership in this directory? This new capability is quite powerful so I think we want to do some extra work to make sure that the file was not created by an arbitrary user. For example if agent is root, the config path and capabilities.yml file need to be owned by root too. I don't see any on the capabilities.yml file itself: elastic-agent/internal/pkg/capabilities/capabilities.go Lines 43 to 58 in 7d6cf59
|
|
@swiatekm i'm reverting to previous implementation. whole patchers is a bit cumbersome, and need more thorough refactoring that i don't want to do in this PR. |
…ow-fleet-config-override
|
@cmacknz i added check for capabilities permissions |
|
Maybe I have code blindness at the end of the day, but I don't see permissions checks, maybe forgotten push? |
…ristas/elastic-agent into feat/allow-fleet-config-override
|
no code blindness, i lost the work |
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
cmacknz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main concern is limiting security implications because this new capability is pretty powerful for a Fleet managed agent.
If we can restrict this to only working for containers that would help a lot IMO, so it's scoped to only where we need it, and we don't to worry about arbitrary users on a machine disabling the policy output and similar things via the new override capability.
| // override retrieved config from Fleet with persisted config from AgentConfig file | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // override retrieved config from Fleet with persisted config from AgentConfig file | |
| // override retrieved config from Fleet with persisted config from AgentConfig file |
| // owner of the file and that the owner of the file is the same as the UID or root. | ||
| func HasStrictExecPerms(path string, uid int) error { | ||
| info, err := os.Stat(path) | ||
| func HasStrictExecPerms(path string, uid int, checkUID bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This godoc reads like it was supposed to be used with checkUID set to true all the time but instead it ignored the UID argument completely which is... interesting.
I almost think we'd be better off with two functions so it's clearer what happens.
HasStrictExecPerms with just the path argument and a new HasStrictUIDExecPerms or similar with the uid argument that always did the UID check, and calls the HasStrictExecPerms function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the new HasStrictUIDExecPerms should have tests (or if you disagree entirely with this suggestion, the there should be tests for the checkUID argument).
| // HasStrictExecPerms ensures that the path is executable by the owner and that the owner of the file | ||
| // is the same as the UID or root. | ||
| func HasStrictExecPerms(path string, uid int) error { | ||
| func HasStrictExecPerms(path string, uid int, _ bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should fix this, an arbitrary user being able to control capabilities.yml now has the ability override agent configurations from Fleet (preventing upgrades also would not be very nice).
Is there a way to scope the capability to only work in containers instead? Then you would have a way to avoid impact on endpoint use cases for example and avoid windows completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern right now are the security implications of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capabilities will check user on linux/darwin. on windows we don't check, we could check permissions like osquery does.
i'm fine with limiting this to containers with uid check kept in place
This pull request introduces support for a new "Fleet override" capability, allowing persisted configuration from a file to override configuration received from Fleet, but only if explicitly enabled by capabilities. The changes include the implementation of the new capability type, its integration into the configuration processing flow, and comprehensive unit tests to ensure correct behavior.
Opted for capabilities rather than FF or agent config option as it is clear this is set on purpose. It is not coming from fleet and cannot override itself.
Fleet override capability implementation and integration:
fleet_override, to the capabilities system, including parsing from YAML, internal representation, and logic to determine if Fleet override is allowed (AllowFleetOverride).capabilitiesManagerto support the newfleetOverrideCapsand wire it into the capability loading process.Configuration override logic:
applyPersistedConfigfunction, which applies persisted configuration from a file on disk if the Fleet override capability is enabled. This function is now called during configuration processing in the coordinator.osto support file operations in the coordinator.Testing:
applyPersistedConfigto verify correct merging of persisted configuration based on the Fleet override capability, including both enabled and disabled scenarios.Manual testing:
For manual testing build and enroll to fleet to a policy with monitoring disabled.
Verify http server is not running with
curl http://localhost:6774/liveness -IUpdate
elastic-agent.ymlwith following contentcurl http://localhost:6774/liveness -I