Skip to content

Conversation

@sudharshanibm
Copy link
Contributor

@sudharshanibm sudharshanibm commented Aug 18, 2025

Add terraform for provisioning s390x build cluster on ibmcloud

  • k8s-s390x-infra-setup: Terraform resources to set up the necessary infrastructure in an IBM Cloud account, which will serve as prerequisites for the k8s-s390x-build-cluster automation.
  • k8s-s390x-build-cluster: Terraform resources to provision the s390x Build Cluster infrastructure in IBM Cloud.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 18, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @sudharshanibm!

It looks like this is your first PR to kubernetes/k8s.io 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/k8s.io has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @sudharshanibm. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/infra Infrastructure management, infrastructure design, code in infra/ labels Aug 18, 2025
@k8s-ci-robot k8s-ci-robot added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Aug 18, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sudharshanibm
Once this PR has been reviewed and has the lgtm label, please assign upodroid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

create_duration = "180s" # Wait 3 minutes for full initialization
}

resource "null_resource" "bastion_setup" {
Copy link
Contributor

@bryantbiggs bryantbiggs Sep 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not very declarative - is this truly required or can the verification be run as a one-off by someone? do we expect things to not be setup correct via Terraform (i.e. - verification here fails) periodically or at all? (if so - that seems problematic and who or what is intended to fix that 😅 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bastion instance’s initial setup and networking configuration are handled through a cloud-init script in user_data. Runtime verification of the bastion’s internal state is treated as an operational concern, and can be performed outside of Terraform via automation/manual checks.

vpc = data.ibm_is_vpc.vpc.id
}

data "ibm_is_security_group" "master_sg" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use an improved alternate terminology? main, primary, control-plane?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated all occurrences of master to control-plane

@@ -0,0 +1,129 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the use of k8s-* throughout the file names, resource names, etc. seems to be redundant (i.e. - its defined in a K8s repo/org so its assumed to be for K8s, no need to re-state that in the names)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bryantbiggs , thanks for pointing that out
I followed the same naming convention since the existing s390x setup is already connected to ArgoCD with k8s-* prefixed names (ref: #8332 ).

See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the names should be viewed as a hierarchy level just below the resource type, but read together. so ibm_is_lb.k8s_load_balancer is essentially IBM load balancer - Kubernetes load balancer

what about something shorter and more descriptive public (ibm_is.lb.public)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated k8s_load_balancer to public

limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
name = "k8s-s390x-lb"
Copy link
Contributor

@bryantbiggs bryantbiggs Sep 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, the name is very redundant. Its in a K8s repo, is this infrastructure deployed in an account that hosts resources for things outside of K8s? If yes = fine, keep the k8s, if not, drop it. Also, the resource is a load balancer so no need to add that to the name (my name is Bryant Biggs not Bryant Biggs human 😅 ). Lastly, I assume this is only for testing on s390x platforms so is that adding anything to the name? Maybe it is, maybe it isn't - I don't have enough context to say. But what about something more descriptive as to its intent and aligned with resources already defined in the repo

kops-infra-ci-name = "kops-infra-ci"

what about s390x-ci? or if you desire the full name k8s-s390x-ci?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this same thinking above applies throughout the change set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated k8s-s390x-lb to k8s-s390x-ci

health_timeout = 2
health_type = "tcp"
health_monitor_url = "/"
health_monitor_port = 6443
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6443 appears in a number of places - good chance it should be a local variable to avoid mis-matches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored to use a variable api_server_port (default 6443)

name = "k8s-s390x-subnet"
}

data "ibm_is_security_group" "bastion_sg" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data "ibm_is_security_group" "bastion_sg" {
data "ibm_is_security_group" "bastion" {

Security group is already in the resource type - don't repeat in the resource name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already have resource ibm_is_instance bastion defined in bastion.tf, so reusing bastion here would cause a naming conflict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - thats a different resource from this data source so no conflict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed bastion_sg --> bastion

}

resource "ibm_is_instance" "control_plane" {
count = var.control_plane.count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a for_each so that we avoid amplifying disruptions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the control-plane to use for_each in nodes.tf

default = "eu-de-1"
}

variable "resource_group_name" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the variable, Thanks!

}
}

variable "bastion" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bastion now generated in locals

}
}

variable "control_plane" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

control-plane now generated in locals

}
}

variable "compute" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compute now generated in locals

@@ -0,0 +1,136 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variables are mostly used in modules - I would argue, are these required? Should they be local variables instead (if they are used in multiple places but not intended to be configured dynamically outside of whats stored in git)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, these values are intended to be configurable by automation (Ansible playbooks) rather than hard-coded in the repo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ansible to deploy Terraform - why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for circling back on this. I realize my earlier explanation may have caused some confusion. What I meant is that while we do have downstream automation that consumes certain outputs for cluster setup, that doesn’t mean all of the Terraform configuration itself needs to be exposed as variables.

After revisiting your feedback, I’ve updated the code so that only true external inputs like API key, region, secrets manager ID, counts, profiles, etc., remain as variables. The structured definitions for bastion, control-plane, and compute nodes are now generated in locals, which keeps the interface cleaner and avoids over-exposing internal implementation details

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't grok why most of these variables need to exist (ignoring the API key for now). for example, we can't simply change the region nor do I believe we intended on doing so. if we need to set the name of the region in multiple locations, thats what a local variable is good for.

variables are public inputs for a public interface
local variables are a private implementation detail

}

resource "ibm_is_instance" "compute" {
count = var.compute.count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, use for_each

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the compute to use for_each in nodes.tf

}
}

resource "null_resource" "node_setup" {
Copy link
Contributor

@bryantbiggs bryantbiggs Sep 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't appear to be doing anything - can it be removed?

@@ -0,0 +1,43 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs are intended to be consumed elsewhere - are all of these intended to be consumed somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these outputs are consumed in our Ansible automation for installing Kubernetes. Specifically, the IPs and load balancer hostname are passed into the Ansible playbook ansible-playbook -v -i examples/k8s-build-cluster/hosts.yml install-k8s-ha.yaml -e @group_vars/bastion_configuration --extra-vars @group_vars/all to configure the bastion, control-plane, and worker nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I don't know I fully follow this. so the plan would be to write these outputs to file or? how do you bridge the gap between an output in Terraform and said output landing in a file on disk that ansible then uses? Is that ansible definition going to live in this repo as well? Should we store these values somewhere that Ansible can readily retrieve - S3/SSM/etc?

@sudharshanibm sudharshanibm force-pushed the k8s-s390x-infra branch 3 times, most recently from 03e9978 to 5af44e7 Compare October 28, 2025 14:52
@sudharshanibm
Copy link
Contributor Author

Hi @bryantbiggs
Hope you're doing well! I wanted to follow up on this PR. I’ve addressed all the review comments you provided earlier, and the PR is ready for your re-review whenever you get a chance.
Thanks so much for your time! 🙂

**2. Export COS Secrets**
<br> Export `access_key` and `secret_key` as environment variables.
```
export AWS_ACCESS_KEY_ID="<HMAC_ACCESS_KEY_ID>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IAM roles and short lived/ephemeral credentials should be used in place of static access and secret keys

```

**4. Check the `variables.tf` file**
<br> Open the `variables.tf` file to review all the available variables. This file lists all customizable inputs for your Terraform configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who or what is intended to modify these variables? variables are only warranted on re-usable artifacts like modules. otherwise things should be declared statically or with local variables if the value needs to be kept in sync in various locations


`ibmcloud_api_key`, `secrets_manager_id` are the only required variables that you must set in order to proceed. You can set this key either by adding it to your `var.tfvars` file or by exporting it as an environment variable.

**Option 1:** Set in `var.tfvars` file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really need to think long and hard about how secrets are handled - I don't think any of these options (nit: why are there options?) are sufficient

}

resource "ibm_is_instance" "bastion" {
for_each = local.bastion_nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems a bit like YAGNI - usually one bastion host is sufficient

See the License for the specific language governing permissions and
limitations under the License.
*/
data "ibm_resource_group" "resource_group" {
Copy link
Contributor

@bryantbiggs bryantbiggs Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these names throughout this file are redundant

if you have to reference a 2nd resource group or a 2nd vpc, what will the data source name be? "resource_group_2"/"vpc2"? use the value provided to name is usually a safe route

Suggested change
data "ibm_resource_group" "resource_group" {
data "ibm_resource_group" "build_cluster" {

security_groups = [data.ibm_is_security_group.control_plane_sg.id]
}

resource "ibm_is_lb_pool" "k8s_api_pool" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid the redundant naming where possible

Suggested change
resource "ibm_is_lb_pool" "k8s_api_pool" {
resource "ibm_is_lb_pool" "k8s_api_server" {

health_monitor_port = var.api_server_port
}

resource "ibm_is_lb_listener" "k8s_api_listener" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
resource "ibm_is_lb_listener" "k8s_api_listener" {
resource "ibm_is_lb_listener" "k8s_api_server" {

default_pool = ibm_is_lb_pool.k8s_api_pool.pool_id
}

resource "ibm_is_lb_pool_member" "k8s_api_members" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
resource "ibm_is_lb_pool_member" "k8s_api_members" {
resource "ibm_is_lb_pool_member" "k8s_api_server" {

locals {
control_plane_nodes = {
for idx in range(1, var.control_plane_node_count + 1) :
"${idx}" => {
Copy link
Contributor

@bryantbiggs bryantbiggs Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same as count #8413 (comment) FYI

I am not familiar with IBM cloud but this feels like they should be instance groups similar to an AWS autoscaling group instead of iterating over individual instances in Terraform. Do you agree? (cattle vs pets)

description = "The instance ID of your secrets manager"
default = ""

validation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a redundant validation - simply drop the default = "" and you achieve the same outcome without needing to specify the validation block

See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_resource_group" "build_resource_group" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant naming

Suggested change
resource "ibm_resource_group" "build_resource_group" {
resource "ibm_resource_group" "build_cluster" {


locals {
secrets_manager_region = "eu-de"
secrets_manager_name = "k8s-s390x-secrets-manager"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears to only be used in one place so just use the value directly; no need for local variable

locals {
secrets_manager_region = "eu-de"
secrets_manager_name = "k8s-s390x-secrets-manager"
z_service_cred_secret_name = "k8s-s390x-sm-service-credentials-secret"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't appear to be used anywhere - should we remove it?

See the License for the specific language governing permissions and
limitations under the License.
*/
variable "resource_group_id" {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description, type, and default (if there is one)

limitations under the License.
*/

terraform {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_is_vpc" "vpc" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, all of these are redundant names. since this is a module, the convention is typically to just call this this for the resource names and only use distinct names where 2 or more resources of the same time are utilized

limitations under the License.
*/
resource "ibm_is_vpc" "vpc" {
name = "k8s-s390x-vpc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop the redundant resource name from the naming scheme throughout

  • remove -vpc here


# VPC
resource "ibm_is_public_gateway" "public_gw" {
name = "k8s-s390x-public-gw"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name = "k8s-s390x-public-gw"
name = "k8s-s390x"


# Subnet
resource "ibm_is_subnet" "subnet" {
name = "k8s-s390x-subnet"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most likely you can create a local variable of

locals {
  name = "k8s-s390x"
}

and for all these resource names, just simple update to use

  name = local.name

Don't add the -gw or -sg or -vpc, etc.


# Security Groups
resource "ibm_is_security_group" "bastion_sg" {
name = "k8s-vpc-s390x-bastion-sg"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would become

Suggested change
name = "k8s-vpc-s390x-bastion-sg"
name = "${local.name}-bastion"

@@ -0,0 +1 @@
ibmcloud_api_key = "<YOUR_API_KEY>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryantbiggs
Copy link
Contributor

FYI - I don't have approval authority on changes such as these in this repo; these are just my suggestions based on working with Terraform in various capacities over the years

the sig-k8s-infra-leads are the final approving authority

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/infra Infrastructure management, infrastructure design, code in infra/ cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants