CloudWatch Agent architecture is garbage #1944

@T0MASD

CloudWatch Agent Bug Report

Describe the bug

The CloudWatch Agent's startup process has a fundamental architectural flaw: the config-translator command generates credential paths based on the runtime user at translation time, not the run_as_user specified in the configuration. This causes permission errors when running the agent as an unprivileged user via systemd.

When systemd starts the service as root (required for the wrapper to chown directories), config-translator sees $USER=root and hardcodes /root/.aws/credentials in the generated TOML. The agent then switches to the unprivileged user (e.g., cwagent) but still tries to read /root/.aws/credentials, resulting in "permission denied" errors.

Steps to reproduce

  1. Install CloudWatch Agent on a system using bootc/immutable infrastructure
  2. Configure the agent to run_as_user: cwagent in amazon-cloudwatch-agent.json
  3. Create a systemd drop-in to allow the wrapper to start as root (required for chown operations):
    [Service]
    # Don't set User=cwagent - wrapper needs root to chown
    StandardOutput=tty
    StandardError=tty
  4. Start the service: systemctl start amazon-cloudwatch-agent
  5. Check logs: The agent switches to cwagent but fails with:
    SharedCredsLoad: failed to load shared credentials file
    caused by: open /root/.aws/credentials: permission denied
    

What did you expect to see?

The config-translator should respect the run_as_user configuration and generate credential paths relative to that user's home directory, not the user running the translator.

Expected TOML output:

shared_credential_file = "/home/cwagent/.aws/credentials"  # or /var/opt/aws/amazon-cloudwatch-agent/.aws/credentials

Or better yet: Don't hardcode credential file paths at all - let the AWS SDK resolve credentials using its standard chain (environment variables → shared credential file → IAM instance profile).

What did you see instead?

Generated TOML always uses the translator's runtime user:

shared_credential_file = "/root/.aws/credentials"

This breaks when the agent switches to the unprivileged user.
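
One way to confirm the condition on a host is to scan the generated TOML for shared_credential_file entries and compare them with the home directory of the intended run_as_user. The following is a minimal sketch: the function names and the sample home directory are illustrative assumptions, and the key is matched textually so the sketch does not assume which TOML table the agent writes it into.

```python
import re
from pathlib import PurePosixPath

# Matches lines such as: shared_credential_file = "/root/.aws/credentials"
_CRED_LINE = re.compile(r'^\s*shared_credential_file\s*=\s*"([^"]*)"', re.MULTILINE)


def find_credential_paths(toml_text: str) -> list[str]:
    """Return every shared_credential_file value found in the TOML text."""
    return _CRED_LINE.findall(toml_text)


def paths_outside_home(toml_text: str, home: str) -> list[str]:
    """Return credential paths that do not live under the given home directory."""
    home_path = PurePosixPath(home)
    return [
        p for p in find_credential_paths(toml_text)
        if home_path not in PurePosixPath(p).parents
    ]


# Sample input mirroring the line quoted in this report.
sample = 'shared_credential_file = "/root/.aws/credentials"\n'
print(paths_outside_home(sample, "/home/cwagent"))
```

A non-empty result means the generated file points at a credentials path that the run_as_user is unlikely to be able to read.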

What version did you use?

Version: v1.300061.0b1289-1 (latest as of November 2024)

What config did you use?

{
  "agent": {
    "run_as_user": "cwagent"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my.log",
            "log_group_name": "/my/log",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Environment

OS: Fedora 43 (bootc-based immutable image)
Systemd: 258.2-1.fc43
Deployment: AWS EC2 with IAM instance profile

Additional context

Root Cause Analysis:

The agent's startup flow is fundamentally flawed:

  1. Systemd starts start-amazon-cloudwatch-agent wrapper as root
  2. Wrapper runs config-translator as root → hardcodes /root/.aws/credentials
  3. Wrapper chowns directories to cwagent
  4. Wrapper switches to cwagent user via setuid()
  5. Agent process (now running as cwagent) tries to read /root/.aws/credentials → permission denied
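
The ordering problem in the steps above can be modelled in a few lines. This is only a sketch of the flow as described in this report, not the wrapper's actual code: the home directories, the Process type, and the simplified permission rule are all assumptions made for illustration, and the chown step is omitted because it does not affect the credentials path.

```python
from dataclasses import dataclass

# Hypothetical home directories standing in for the passwd database.
HOMES = {"root": "/root", "cwagent": "/home/cwagent"}


@dataclass
class Process:
    user: str  # effective user of the process


def translate_config(proc: Process) -> str:
    """Mimic the reported behaviour: derive the path from whoever runs the translator."""
    return f"{HOMES[proc.user]}/.aws/credentials"


def can_read(proc: Process, path: str) -> bool:
    """Simplified permission model: non-root users can only read inside their own home."""
    return proc.user == "root" or path.startswith(HOMES[proc.user] + "/")


def start_agent(run_as_user: str, translate_after_switch: bool) -> bool:
    wrapper = Process(user="root")        # step 1: systemd starts the wrapper as root
    agent = Process(user=run_as_user)     # step 4: process after dropping privileges
    # Step 2 runs the translator as root; the alternative runs it after the switch.
    translator = agent if translate_after_switch else wrapper
    path = translate_config(translator)
    return can_read(agent, path)          # step 5: agent opens the credentials file


print(start_agent("cwagent", translate_after_switch=False))  # ordering described above
print(start_agent("cwagent", translate_after_switch=True))   # translation after the switch
```

In this model the first call fails the read check and the second passes, which is the whole difference between translating before and after the user switch.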

Why This Design is Broken:

  1. Credential path generation happens at the wrong time - before user switching, not after
  2. Hardcoded paths instead of runtime resolution - AWS SDK should handle credential discovery
  3. Unnecessary privilege escalation - the wrapper shouldn't need root at all; systemd can create writable directories via StateDirectory=
  4. Poor separation of concerns - config translation shouldn't embed runtime environment details

There is no clean workaround. Both attempted fixes fail:

  1. ❌ Setting Environment="HOME=/var/opt/aws/amazon-cloudwatch-agent" - config-translator uses $USER, not $HOME
  2. ❌ Setting Environment="AWS_SHARED_CREDENTIALS_FILE=" - The TOML's hardcoded shared_credential_file path overrides the environment variable

Partial workaround for EC2 deployments:

The bug does NOT manifest on EC2 with IAM instance profiles because:

  • The credential file at /root/.aws/credentials does not exist
  • The AWS SDK gets "file not found" error (not "permission denied")
  • SDK falls back to IAM instance profile via IMDS ✅

Why local testing fails:

  • The /root/ directory exists but is inaccessible to the unprivileged user
  • SDK gets "permission denied" error
  • SDK does NOT fall back to other credential sources on permission errors
  • Agent keeps retrying the inaccessible file indefinitely
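
The difference between the EC2 case and the local case comes down to how the error from opening the credentials file is classified. Below is a minimal sketch of the behaviour as described in this report. It is not the AWS SDK's code, and the provider functions are hypothetical stand-ins.

```python
from typing import Callable, Optional


def load_shared_file(open_file: Callable[[], str]) -> Optional[str]:
    """Return the file contents, or None if the file is missing.

    Any other error, including PermissionError, propagates to the caller.
    """
    try:
        return open_file()
    except FileNotFoundError:
        return None  # "file not found": the caller may try the next source


def resolve_credentials(open_file: Callable[[], str],
                        instance_profile: Callable[[], str]) -> str:
    """Try the shared credentials file first, then the instance profile."""
    creds = load_shared_file(open_file)
    if creds is not None:
        return creds
    return instance_profile()


# Hypothetical stand-ins for the two situations described above.
def missing() -> str:
    raise FileNotFoundError("/root/.aws/credentials")


def denied() -> str:
    raise PermissionError("/root/.aws/credentials")


def imds() -> str:
    return "instance-profile-credentials"


print(resolve_credentials(missing, imds))  # EC2 case: falls back to IMDS
try:
    resolve_credentials(denied, imds)      # local case: the error is not swallowed
except PermissionError as exc:
    print("no fallback:", exc)
```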

Manual workaround for local/non-EC2 environments:

  1. Pre-generate TOML with correct paths manually (defeats automation):

    # Generate TOML as the cwagent user (invoked through sudo)
    sudo -u cwagent /opt/aws/amazon-cloudwatch-agent/bin/config-translator \
      -input /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
      -output /tmp/agent.toml
    
    # Manually edit shared_credential_file path in /tmp/agent.toml
    # Copy to /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
  2. Or: Make /root/.aws/ readable by the agent user (security risk):

    mkdir -p /root/.aws
    chmod 755 /root /root/.aws  # Opens root's home directory!

Neither workaround is acceptable for production systems.

Recommended Fix:

  1. Remove hardcoded credential paths - Let the AWS SDK use its standard credential chain:

    • Environment variables (AWS_SHARED_CREDENTIALS_FILE, AWS_ACCESS_KEY_ID, etc.)
    • Credential file (~/.aws/credentials resolved at runtime based on effective user)
    • IAM instance profile (IMDS)
  2. Read run_as_user BEFORE generating config - If config says run_as_user: cwagent, resolve paths for that user, not the translator's user

  3. Respect $HOME environment variable - If HOME is explicitly set, use it instead of deriving from $USER

  4. Use systemd properly - Don't require root for directory creation:

    [Service]
    User=cwagent
    StateDirectory=amazon-cloudwatch-agent
    LogsDirectory=amazon-cloudwatch-agent
  5. Eliminate the wrapper - The agent binary should handle everything directly without needing a shell script wrapper that does privilege juggling
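
Items 2 and 3 of the list above could be combined into a single resolution order. The sketch below is one possible shape for it, not a patch against the translator: the function names are hypothetical, and the precedence (explicit HOME first, then run_as_user, then the current user) is an assumption of this example.

```python
import posixpath
from typing import Callable, Mapping


def passwd_home(user: str) -> str:
    """Look up a user's home directory in the passwd database (Unix only)."""
    import pwd  # imported lazily so the module also loads on non-Unix hosts
    return pwd.getpwnam(user).pw_dir


def credentials_path(
    config: Mapping,
    env: Mapping[str, str],
    current_user: str,
    home_of: Callable[[str], str] = passwd_home,
) -> str:
    """Resolve the credentials path: explicit HOME, then run_as_user, then current user."""
    home = env.get("HOME")
    if not home:
        user = config.get("agent", {}).get("run_as_user") or current_user
        home = home_of(user)
    return posixpath.join(home, ".aws", "credentials")


# Hypothetical home directories used instead of the real passwd database.
homes = {"root": "/root", "cwagent": "/home/cwagent"}
cfg = {"agent": {"run_as_user": "cwagent"}}

print(credentials_path(cfg, {}, "root", homes.__getitem__))
print(credentials_path(cfg, {"HOME": "/var/opt/aws/amazon-cloudwatch-agent"},
                       "root", homes.__getitem__))
```

Passing the lookup function as a parameter keeps the sketch testable without real system users. Item 1 of the list would make this function unnecessary, since no path would be written into the TOML at all.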

Impact:

This bug affects:

  • Immutable infrastructure deployments (bootc, OSTree)
  • Security-hardened systems running agents as unprivileged users
  • Container-based deployments where root access is restricted
  • Any environment using systemd's DynamicUser= or similar features

The current design forces users to either:

  • Run the entire agent as root (security risk)
  • Use environment variable overrides to skip credential files (workaround)
  • Manually pre-generate TOML with correct paths (defeats automation)

None of these are acceptable patterns for production systems with proper security posture.
