Manager instance should NOT be using i4i.4xlarge instance #10539

Open
@mykaul

Description

See https://jenkins.scylladb.com/job/scylla-2025.1/job/alternator/job/longevity-alternator-3h-test/7/pipeline-console/log?nodeId=162 for a run that failed because no i4i.4xlarge capacity was available in us-east-1c. But why would Manager need such a hefty instance size?

[2025-03-27T14:51:15.667Z] [us-east-1] Creating {count} on-demand instances using AMI id 'ami-0ccb7e0118ad5e03c' with following parameters:
[2025-03-27T14:51:15.667Z] {'ImageId': 'ami-0ccb7e0118ad5e03c', 'KeyName': 'scylla_test_id_ed25519', 'InstanceType': 'i4i.4xlarge', 'UserData': 'Content-Type: multipart/mixed; boundary="===============2748462579806779269=="\nMIME-Version: 1.0\n\n--===============2748462579806779269==\nContent-Type: x-scylla/json\nMIME-Version: 1.0\nContent-Disposition: attachment; filename="scylla_machine_image.json"\n\n{\n    "cluster_name": "alternator-3h-2025-1-db-cluster-fd548c9f",\n    "data_device": "instance_store",\n    "raid_level": 0,\n    "scylla_yaml": {\n        "cluster_name": "alternator-3h-2025-1-db-cluster-fd548c9f"\n    },\n    "start_scylla_on_first_boot": false\n}\n--===============2748462579806779269==\nContent-Type: text/cloud-config\nMIME-Version: 1.0\nContent-Disposition: attachment; filename="cloud-config.txt"\n\n\n        #cloud-config\n        cloud_final_modules:\n        - [scripts-user, always]\n        \n--===============2748462579806779269==\nContent-Type: text/x-shellscript\nMIME-Version: 1.0\nContent-Disposition: attachment; filename="user-script.txt"\n\n#!/bin/bash\nset -x\nwhile ! systemctl status cloud-init.service | grep "active (exited)"; do sleep 1; done\n\nwrite_syslog_ng_destination() {\n    disk_buffer_option=""\n    if syslog-ng -V | grep -q disk; then\n        disk_buffer_option="disk-buffer(\n            mem-buf-size(1048576)\n            disk-buf-size(104857600)\n            reliable(yes)\n            dir(\\"/var/log\\")\n        )"\n    fi\n\ncat <<EOF >/etc/syslog-ng/conf.d/remote_sct.conf\ndestination remote_sct {\n    syslog(\n        "10.12.8.220"\n        transport("tcp")\n        port(32768)\n        throttle(10000)\n        $disk_buffer_option\n    );\n};\nEOF\n}\n\nif [ -f /var/lib/sct/cloud-init/done ]; then\n    write_syslog_ng_destination\n    sudo systemctl restart syslog-ng\n    exit 0\nfi\nif apt-get --help >/dev/null 2>&1 ; then\n    if [ ! 
-f /tmp/disable_daily_apt_triggers_done ]; then\n        rm -f /etc/apt/apt.conf.d/*unattended-upgrades /etc/apt/apt.conf.d/*auto-upgrades || true\n        rm -f /etc/apt/apt.conf.d/*periodic /etc/apt/apt.conf.d/*update-notifier || true\n        systemctl stop apt-daily.timer apt-daily-upgrade.timer apt-daily.service apt-daily-upgrade.service || true\n        systemctl disable apt-daily.timer apt-daily-upgrade.timer apt-daily.service apt-daily-upgrade.service || true\n        apt-get remove -o DPkg::Lock::Timeout=300 -y unattended-upgrades update-manager || true\n        touch /tmp/disable_daily_apt_triggers_done\n    fi\nfi\nSYSLOG_NG_INSTALLED=""\nif yum --help 2>/dev/null 1>&2 ; then\n    if rpm -q syslog-ng ; then\n        rm /etc/syslog-ng/syslog-ng.conf  # Make sure we have default syslog-ng.conf\n        yum reinstall -y syslog-ng\n        SYSLOG_NG_INSTALLED=1\n    else\n        yum install -y epel-release\n        for n in 1 2 3 4 5 6 7 8 9; do # cloud-init is running it with set +o braceexpand\n            if yum install -y --downloadonly syslog-ng; then\n                break\n            fi\n        done\n\n        for n in 1 2 3; do # cloud-init is running it with set +o braceexpand\n            if yum install -y syslog-ng; then\n                SYSLOG_NG_INSTALLED=1\n                break\n            fi\n            sleep 10\n        done\n    fi\nelif apt-get --help 2>/dev/null 1>&2 ; then\n    if dpkg-query --show syslog-ng ; then\n        rm /etc/syslog-ng/syslog-ng.conf  # Make sure we have default syslog-ng.conf\n        apt-get purge -o DPkg::Lock::Timeout=300 -y syslog-ng*\n        DPKG_FORCE=confmiss apt-get --reinstall -o DPkg::Lock::Timeout=300 -y install syslog-ng\n        SYSLOG_NG_INSTALLED=1\n    else\n        cat /etc/apt/sources.list\n        for n in 1 2 3 4 5 6 7 8 9; do # cloud-init is running it with set +o braceexpand\n            if apt-get -y update ; then\n                break\n            fi\n            sleep 0.5\n        
done\n\n        for n in 1 2 3; do # cloud-init is running it with set +o braceexpand\n            DEBIAN_FRONTEND=noninteractive apt-get install -o DPkg::Lock::Timeout=300 -y syslog-ng || true\n            if dpkg-query --show syslog-ng ; then\n                SYSLOG_NG_INSTALLED=1\n                break\n            fi\n        done\n    fi\nelse\n    echo "Unsupported distro"\nfi\n\nsource_name=`cat /etc/syslog-ng/syslog-ng.conf | tr -d "\\n" | tr -d "\\r" | sed -r "s/\\};/\\};\\n/g;         s/source /\\nsource /g" | grep -P "^source.*system\\(\\)" | cut -d" " -f2`\n\nif grep -P "keep-timestamp\\([^)]+\\)" /etc/syslog-ng/syslog-ng.conf; then\n    sed -i -r "s/keep-timestamp([ ]*yes[ ]*)/keep-timestamp(no)/g" /etc/syslog-ng/syslog-ng.conf\nelse\n    sed -i -r "s/([ \t]*options[ \t]*\\\\{)/\\\\1\\n  keep-timestamp(no);\\n/g" /etc/syslog-ng/syslog-ng.conf\nfi\n\nwrite_syslog_ng_destination\n\nif ! grep -P "log {.*destination\\\\(remote_sct\\\\)" /etc/syslog-ng/syslog-ng.conf; then\n    echo "\nfilter filter_sct {\n    # filter audit out\n    not program(\\"^audit\\");\n};\n    " >> /etc/syslog-ng/syslog-ng.conf\n    echo "log { source($source_name); filter(filter_sct); destination(remote_sct); };" >> /etc/syslog-ng/syslog-ng.conf\nfi\n\nif [ ! 
-z "" ]; then\n    if grep "rewrite r_host" /etc/syslog-ng/syslog-ng.conf; then\n        sed -i -r "s/rewrite r_host \\{ set\\(\\"[^\\"]+\\"/rewrite r_host { set(\\"\\"/" /etc/syslog-ng/syslog-ng.conf\n    else\n        echo "rewrite r_host { set(\\"\\", value(\\"HOST\\")); };" >>  /etc/syslog-ng/syslog-ng.conf\n        sed -i -r "s/destination\\(remote_sct\\);[ \\t]*\\};/destination\\(remote_sct\\); rewrite\\(r_host\\); \\};/" /etc/syslog-ng/syslog-ng.conf\n    fi\nfi\nsystemctl restart syslog-ng  || true\ncurl -L -O https://github.com/brandond/syslog_ng_exporter/releases/download/0.1.0/syslog_ng_exporter\nchmod +x syslog_ng_exporter\nmv syslog_ng_exporter /usr/local/bin\n\nif [ -e /etc/systemd/system/syslog_ng_exporter.service ]; then\n    rm /etc/systemd/system/syslog_ng_exporter.service\nfi\n\ncat <<EOM >> /etc/systemd/system/syslog_ng_exporter.service\n[Unit]\nDescription=Syslog-ng metrics Exporter\nWants=network.target network-online.target\nAfter=network.target network-online.target\n\n[Service]\nType=simple\nExecStart=/usr/local/bin/syslog_ng_exporter\nStandardOutput=journal\nStandardError=journal\nRestart=on-failure\n\n[Install]\nWantedBy=multi-user.target\nEOM\n\nsystemctl daemon-reload\nsystemctl enable syslog_ng_exporter.service\nsystemctl start syslog_ng_exporter.service\n\nif [ -f "/etc/security/limits.d/20-nproc.conf" ]; then\n    sed -i -e "s/^\\*[[:blank:]]*soft[[:blank:]]*nproc[[:blank:]]*.*/*\t\tsoft\tnproc\t\tunlimited/"     /etc/security/limits.d/20-nproc.conf || true\nelse\n    echo "*    hard    nproc    unlimited" > /etc/security/limits.d/20-nproc.conf || true\nfi\n\nsed -i "s/#MaxSessions \\(.*\\)$/MaxSessions 1000/" /etc/ssh/sshd_config || true\nsed -i "s/#MaxStartups \\(.*\\)$/MaxStartups 60/" /etc/ssh/sshd_config || true\nsed -i "s/#LoginGraceTime \\(.*\\)$/LoginGraceTime 15s/" /etc/ssh/sshd_config || true\nsed -i "s/#ClientAliveInterval \\(.*\\)$/ClientAliveInterval 60/" /etc/ssh/sshd_config || true\nsed -i "s/#ClientAliveCountMax 
\\(.*\\)$/ClientAliveCountMax 10/" /etc/ssh/sshd_config || true\nsystemctl restart sshd || systemctl restart ssh || true\nmkdir -p /var/lib/sct/cloud-init && touch /var/lib/sct/cloud-init/done\n--===============2748462579806779269==--\n', 'NetworkInterfaces': [{'DeviceIndex': 0, 'SubnetId': 'subnet-090ce5c775e0dbc19', 'Groups': ['sg-0feef3370ee8305ac']}], 'IamInstanceProfile': {'Name': 'qa-scylla-manager-backup-instance-profile'}, 'BlockDeviceMappings': [{'DeviceName': '/dev/sda1', 'Ebs': {'VolumeType': 'gp3', 'VolumeSize': 30}}], 'Placement': {'AvailabilityZone': 'us-east-1c'}}
[2025-03-27T14:51:25.381Z] Traceback (most recent call last):
[2025-03-27T14:51:25.381Z]   File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 1890, in <module>
[2025-03-27T14:51:25.381Z]     cli.main(prog_name="hydra")
[2025-03-27T14:51:25.381Z]   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
[2025-03-27T14:51:25.381Z]     rv = self.invoke(ctx)
[2025-03-27T14:51:25.381Z]   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
[2025-03-27T14:51:25.381Z]     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2025-03-27T14:51:25.381Z]   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
[2025-03-27T14:51:25.381Z]     return ctx.invoke(self.callback, **ctx.params)
[2025-03-27T14:51:25.381Z]   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
[2025-03-27T14:51:25.381Z]     return __callback(*args, **kwargs)
[2025-03-27T14:51:25.381Z]   File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 236, in provision_resources
[2025-03-27T14:51:25.381Z]     layout.provision()
[2025-03-27T14:51:25.381Z]   File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_provision/aws/layout.py", line 34, in provision
[2025-03-27T14:51:25.381Z]     self.db_cluster.provision()
[2025-03-27T14:51:25.381Z]   File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_provision/aws/cluster.py", line 334, in provision
[2025-03-27T14:51:25.381Z]     instances = self.provision_plan(region_id, self._azs[az_id]).provision_instances(
[2025-03-27T14:51:25.381Z]   File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/common/provision_plan.py", line 40, in provision_instances
[2025-03-27T14:51:25.381Z]     if instances := self.provisioner.provision(
[2025-03-27T14:51:25.382Z]   File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/aws/provisioner.py", line 74, in provision
[2025-03-27T14:51:25.382Z]     return self._provision_on_demand_instances(
[2025-03-27T14:51:25.382Z]   File "/home/ubuntu/scylla-cluster-tests/sdcm/provision/aws/provisioner.py", line 120, in _provision_on_demand_instances
[2025-03-27T14:51:25.382Z]     instances = ec2_services[provision_parameters.region_name].create_instances(
[2025-03-27T14:51:25.382Z]   File "/usr/local/lib/python3.10/site-packages/boto3/resources/factory.py", line 580, in do_action
[2025-03-27T14:51:25.382Z]     response = action(self, *args, **kwargs)
[2025-03-27T14:51:25.382Z]   File "/usr/local/lib/python3.10/site-packages/boto3/resources/action.py", line 88, in __call__
[2025-03-27T14:51:25.382Z]     response = getattr(parent.meta.client, operation_name)(*args, **params)
[2025-03-27T14:51:25.382Z]   File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
[2025-03-27T14:51:25.382Z]     return self._make_api_call(operation_name, kwargs)
[2025-03-27T14:51:25.382Z]   File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 976, in _make_api_call
[2025-03-27T14:51:25.382Z]     raise error_class(parsed_response, operation_name)
[2025-03-27T14:51:25.382Z] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not have sufficient i4i.4xlarge capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get i4i.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f.
[2025-03-27T14:51:25.382Z] Cleaning SSH agent
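
Separately from the instance-size question, the `InsufficientInstanceCapacity` message above lists Availability Zones that had capacity at the time. As a rough sketch only (the helper name `suggested_azs` is hypothetical and not part of SCT or boto3, and the parsing assumes the message wording shown in this log), those zones could be extracted like this so that a caller can decide whether to retry elsewhere:

```python
import re

# Matches AZ names such as "us-east-1a" or "ap-southeast-2b".
AZ_PATTERN = re.compile(r"\b[a-z]{2}(?:-[a-z]+)+-\d[a-z]\b")


def suggested_azs(error_message: str) -> list[str]:
    """Return the AZs that the error message suggests as alternatives.

    Assumption: the message contains the phrase "or choosing" followed by a
    comma-separated list of AZ names, as in the log above. If the phrase is
    missing, an empty list is returned.
    """
    _, separator, tail = error_message.partition("or choosing")
    if not separator:
        return []
    return AZ_PATTERN.findall(tail)
```

With the message from this run, the helper would return the four zones named by AWS; the requested zone (us-east-1c) is not among them because it appears before the "or choosing" phrase.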
