CloudWatch Agent Bug Report
Describe the bug
The CloudWatch Agent's startup process has a fundamental architectural flaw: the config-translator command generates credential paths based on the runtime user at translation time, not the run_as_user specified in the configuration. This causes permission errors when running the agent as an unprivileged user via systemd.
When systemd starts the service as root (required for the wrapper to chown directories), config-translator sees $USER=root and hardcodes /root/.aws/credentials in the generated TOML. The agent then switches to the unprivileged user (e.g., cwagent) but still tries to read /root/.aws/credentials, resulting in "permission denied" errors.
Steps to reproduce
- Install CloudWatch Agent on a system using bootc/immutable infrastructure
- Configure the agent to `run_as_user: cwagent` in `amazon-cloudwatch-agent.json`
- Create a systemd drop-in to allow the wrapper to start as root (required for chown operations):

      [Service]
      # Don't set User=cwagent - wrapper needs root to chown
      StandardOutput=tty
      StandardError=tty

- Start the service: `systemctl start amazon-cloudwatch-agent`
- Check logs: The agent switches to `cwagent` but fails with: `SharedCredsLoad: failed to load shared credentials file caused by: open /root/.aws/credentials: permission denied`
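For anyone reproducing this, the drop-in step can be scripted. The sketch below assumes the unit is named `amazon-cloudwatch-agent.service` (inferred from the `systemctl start` command above). The file name `override.conf` and the optional root-directory argument are illustrative choices, not something the agent requires.

```shell
# Write the drop-in from the reproduction steps.
# $1 (optional) is a root prefix, useful when building an image or testing.
write_cwagent_dropin() {
    local root=${1:-} dir
    # Assumed unit name: amazon-cloudwatch-agent.service
    dir="$root/etc/systemd/system/amazon-cloudwatch-agent.service.d"
    mkdir -p "$dir" || return 1
    # override.conf is a hypothetical file name; any *.conf in this directory is read.
    cat > "$dir/override.conf" <<'EOF'
[Service]
# Don't set User=cwagent - wrapper needs root to chown
StandardOutput=tty
StandardError=tty
EOF
}
```

On a live system this would be followed by `systemctl daemon-reload` before starting the service.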
What did you expect to see?
The config-translator should respect the run_as_user configuration and generate credential paths relative to that user's home directory, not the user running the translator.
Expected TOML output:
    shared_credential_file = "/home/cwagent/.aws/credentials"  # or /var/opt/aws/amazon-cloudwatch-agent/.aws/credentials

Or better yet: Don't hardcode credential file paths at all. Let the AWS SDK resolve credentials using its standard chain (environment vars → credential file → IAM instance profile).
What did you see instead?
Generated TOML always uses the translator's runtime user:
    shared_credential_file = "/root/.aws/credentials"

This breaks when the agent switches to the unprivileged user.
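To confirm which path the translator actually wrote, the generated TOML can be inspected directly. This is a minimal sketch that assumes the value sits on a single line in the form `shared_credential_file = "..."`. The function name is made up for illustration.

```shell
# Print every shared_credential_file value found in a TOML file.
# Assumes each key/value pair is on one line with a double-quoted value.
toml_credentials_paths() {
    sed -n 's/^[[:space:]]*shared_credential_file[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}
```

For example, `toml_credentials_paths /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml` would print `/root/.aws/credentials` on an affected system.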
What version did you use?
Version: v1.300061.0b1289-1 (latest as of November 2024)
What config did you use?
    {
      "agent": {
        "run_as_user": "cwagent"
      },
      "logs": {
        "logs_collected": {
          "files": {
            "collect_list": [
              {
                "file_path": "/var/log/my.log",
                "log_group_name": "/my/log",
                "log_stream_name": "{instance_id}"
              }
            ]
          }
        }
      }
    }

Environment
OS: Fedora 43 (bootc-based immutable image)
Systemd: 258.2-1.fc43
Deployment: AWS EC2 with IAM instance profile
Additional context
Root Cause Analysis:
The agent's startup flow is fundamentally flawed:
- Systemd starts the `start-amazon-cloudwatch-agent` wrapper as root
- Wrapper runs `config-translator` as root → hardcodes `/root/.aws/credentials`
- Wrapper chowns directories to `cwagent`
- Wrapper switches to the `cwagent` user via `setuid()`
- Agent process (now running as `cwagent`) tries to read `/root/.aws/credentials` → permission denied
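The mismatch in this flow can be detected before translation. The sketch below is not part of the agent, and both function names are hypothetical. It reads `run_as_user` from the JSON with a naive `sed` match (assuming key and value share one line) and compares it with the user doing the translation.

```shell
# Extract "run_as_user" from the agent JSON config.
# Naive parse: assumes the key and its quoted value are on the same line.
configured_run_as_user() {
    sed -n 's/.*"run_as_user"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Report whether translating as the current user (or $2, if given)
# would embed a different user's home directory in the TOML.
check_translation_user() {
    local json=$1 current target
    current=${2:-$(id -un)}
    target=$(configured_run_as_user "$json")
    if [ -n "$target" ] && [ "$target" != "$current" ]; then
        echo "mismatch: translating as $current, agent runs as $target"
        return 1
    fi
    echo "ok"
}
```

Run as root against the config in this report, `check_translation_user /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json` would report the mismatch that leads to the permission error.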
Why This Design is Broken:
- Credential path generation happens at the wrong time - before user switching, not after
- Hardcoded paths instead of runtime resolution - AWS SDK should handle credential discovery
- Unnecessary privilege escalation - the wrapper shouldn't need root at all; systemd can create writable directories via `StateDirectory=`
- Poor separation of concerns - config translation shouldn't embed runtime environment details
There is no clean workaround. Both attempted fixes fail:
- ❌ Setting `Environment="HOME=/var/opt/aws/amazon-cloudwatch-agent"` - `config-translator` uses `$USER`, not `$HOME`
- ❌ Setting `Environment="AWS_SHARED_CREDENTIALS_FILE="` - The TOML's hardcoded `shared_credential_file` path overrides the environment variable
Partial workaround for EC2 deployments:
The bug does NOT manifest on EC2 with IAM instance profiles because:
- The credential file at `/root/.aws/credentials` does not exist
- The AWS SDK gets "file not found" error (not "permission denied")
- SDK falls back to IAM instance profile via IMDS ✅
Why local testing fails:
- The `/root/` directory exists but is inaccessible to the unprivileged user
- SDK gets "permission denied" error
- SDK does NOT fall back to other credential sources on permission errors
- Agent keeps retrying the inaccessible file indefinitely
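Since the outcome depends on whether the agent user sees "file not found" or "permission denied", it can help to check which case applies on a given host. The following is a diagnostic sketch only: the function name is hypothetical, and it approximates what `open()` would return using shell file tests rather than calling the SDK.

```shell
# Print "readable", "missing", or "denied" for a path as seen by the current user.
classify_credentials_path() {
    local path=$1 dir
    if [ -r "$path" ]; then
        echo readable
        return 0
    fi
    # Find the deepest ancestor directory that is visible to this user.
    dir=$(dirname "$path")
    while [ ! -e "$dir" ] && [ "$dir" != "/" ] && [ "$dir" != "." ]; do
        dir=$(dirname "$dir")
    done
    if [ ! -x "$dir" ]; then
        echo denied      # cannot search this directory, so anything below it is inaccessible
    elif [ -e "$path" ]; then
        echo denied      # file exists but is not readable
    else
        echo missing     # corresponds to the "file not found" case
    fi
}
```

To evaluate it as the agent user, one option is `sudo -u cwagent bash -c "$(declare -f classify_credentials_path); classify_credentials_path /root/.aws/credentials"`, assuming `sudo` and `bash` are available on the host.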
Manual workaround for local/non-EC2 environments:
- Pre-generate TOML with correct paths manually (defeats automation):

      # Generate TOML as the cwagent user
      sudo -u cwagent /opt/aws/amazon-cloudwatch-agent/bin/config-translator \
        -input /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
        -output /tmp/agent.toml
      # Manually edit shared_credential_file path in /tmp/agent.toml
      # Copy to /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml

- Or: Make `/root/.aws/` readable by the agent user (security risk):

      mkdir -p /root/.aws
      chmod 755 /root /root/.aws  # Opens root's home directory!

Neither is acceptable for production systems.
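The manual edit in the first workaround can at least be scripted. This is a rough sketch with a hypothetical function name. It assumes the TOML holds the path as a single-line, double-quoted value and that the replacement path contains no `|`, `&`, or backslash characters, which are special to `sed`.

```shell
# Rewrite the shared_credential_file value in a TOML file.
# $1 = TOML file, $2 = new credentials path (must not contain '|', '&' or '\').
patch_credentials_path() {
    local toml=$1 new_path=$2 tmp status
    tmp=$(mktemp) || return 1
    if sed "s|^\([[:space:]]*shared_credential_file[[:space:]]*=[[:space:]]*\)\"[^\"]*\"|\1\"$new_path\"|" "$toml" > "$tmp"; then
        # Overwrite in place so the original file's owner and mode are kept.
        cat "$tmp" > "$toml"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

Usage would look like `patch_credentials_path /tmp/agent.toml /home/cwagent/.aws/credentials`. If the wrapper regenerates the TOML on every start, the patch has to be reapplied, so this remains a stopgap rather than a fix.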
Recommended Fix:
- Remove hardcoded credential paths - Let the AWS SDK use its standard credential chain:
  - Environment variables (`AWS_SHARED_CREDENTIALS_FILE`, `AWS_ACCESS_KEY_ID`, etc.)
  - Credential file (`~/.aws/credentials` resolved at runtime based on effective user)
  - IAM instance profile (IMDS)
- Read `run_as_user` BEFORE generating config - If config says `run_as_user: cwagent`, resolve paths for that user, not the translator's user
- Respect `$HOME` environment variable - If HOME is explicitly set, use it instead of deriving from `$USER`
- Use systemd properly - Don't require root for directory creation:

      [Service]
      User=cwagent
      StateDirectory=amazon-cloudwatch-agent
      LogsDirectory=amazon-cloudwatch-agent

- Eliminate the wrapper - The agent binary should handle everything directly without needing a shell script wrapper that does privilege juggling
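If the translator keeps emitting a path, the resolution order proposed above could look roughly like the sketch below. It is an illustration in shell, not the agent's code. The function names are hypothetical, and the precedence (the `run_as_user` home first, then `$HOME`, then the current user's home) is an assumption, since this report does not fix an order. Reading a passwd-format file directly is a simplification; a real implementation would more likely use `getpwnam` or `getent`.

```shell
# Look up a user's home directory in a passwd-format file (default: /etc/passwd).
home_of_user() {
    awk -F: -v u="$1" '$1 == u { print $6; exit }' "${2:-/etc/passwd}"
}

# Resolve the credentials path for the user the agent will run as.
# $1 = run_as_user from the config (may be empty), $2 = passwd file (optional).
resolve_credentials_path() {
    local run_as_user=$1 passwd_file=${2:-/etc/passwd} home=""
    if [ -n "$run_as_user" ]; then
        home=$(home_of_user "$run_as_user" "$passwd_file")
    fi
    # Assumed fallback order: explicit HOME, then the current user's passwd entry.
    if [ -z "$home" ] && [ -n "${HOME:-}" ]; then
        home=$HOME
    fi
    if [ -z "$home" ]; then
        home=$(home_of_user "$(id -un)" "$passwd_file")
    fi
    [ -n "$home" ] || return 1
    printf '%s/.aws/credentials\n' "$home"
}
```

With this order, translating as root with `run_as_user: cwagent` yields a path under the home directory of `cwagent` instead of `/root`. Omitting `shared_credential_file` entirely would still be the simpler option.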
Impact:
This bug affects:
- Immutable infrastructure deployments (bootc, OSTree)
- Security-hardened systems running agents as unprivileged users
- Container-based deployments where root access is restricted
- Any environment using systemd's `DynamicUser=` or similar features
The current design forces users to either:
- Run the entire agent as root (security risk)
- Use environment variable overrides to skip credential files (workaround)
- Manually pre-generate TOML with correct paths (defeats automation)
None of these are acceptable patterns for production systems with proper security posture.