This Terraform configuration creates a Google Kubernetes Engine (GKE) cluster with two node pools:
- A node pool for PySpark workloads (2 nodes)
- A node pool for TensorFlow workloads (2 nodes)
The configuration creates the following resources:
- A VPC network and subnet with secondary IP ranges for pods and services
- A Cloud Router and NAT for the private GKE cluster
- Firewall rules for internal cluster communication
- A private GKE cluster with Workload Identity enabled
- A service account for the GKE nodes with appropriate IAM roles
- Two node pools with CPU-only machines:
  - PySpark node pool: 2 nodes with e2-standard-4 (4 vCPUs, 16GB memory)
  - TensorFlow node pool: 2 nodes with e2-standard-8 (8 vCPUs, 32GB memory)
Prerequisites:
- Google Cloud SDK installed and configured
- Terraform v1.0.0 or newer
- A Google Cloud project with the following APIs enabled:
  - Compute Engine API
  - Kubernetes Engine API
  - IAM API
  - Resource Manager API
To deploy the cluster:
- Initialize the Terraform configuration:
  terraform init
- Create a terraform.tfvars file with your project ID:
  project_id = "your-project-id"
- Review the execution plan:
  terraform plan
- Apply the configuration:
  terraform apply
| Name | Description | Default |
|---|---|---|
| project_id | The GCP project ID | (required) |
| region | The GCP region for the resources | us-central1 |
| zone | The GCP zone for zonal resources | us-central1-a |
| vpc_name | Name of the VPC network | gke-network |
| subnet_name | Name of the subnet for GKE | gke-subnet |
| subnet_cidr | CIDR range for the subnet | 10.10.0.0/24 |
| pods_cidr | CIDR range for pods | 10.20.0.0/16 |
| services_cidr | CIDR range for services | 10.30.0.0/16 |
| cluster_name | Name of the GKE cluster | ml-cluster |
| pyspark_node_count | Number of nodes in the PySpark node pool | 2 |
| pyspark_machine_type | Machine type for PySpark nodes | e2-standard-4 |
| tensorflow_node_count | Number of nodes in the TensorFlow node pool | 2 |
| tensorflow_machine_type | Machine type for TensorFlow nodes | e2-standard-8 |
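
The subnet's primary range and its secondary ranges for pods and services live in the same VPC, so the three CIDR values must not overlap. As a minimal sketch (not part of this repository; the helper name `find_overlaps` is hypothetical), Python's standard `ipaddress` module can check custom values before running `terraform plan`:

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of variable names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


# Defaults taken from the variables table above.
defaults = {
    "subnet_cidr": "10.10.0.0/24",
    "pods_cidr": "10.20.0.0/16",
    "services_cidr": "10.30.0.0/16",
}

# An empty list means no two ranges collide.
print("overlapping ranges:", find_overlaps(defaults))
```

Replace the values in `defaults` with the ones from your terraform.tfvars to check a customized layout.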
| Name | Description |
|---|---|
| cluster_name | The name of the GKE cluster |
| cluster_location | The location (zone) of the GKE cluster |
| cluster_endpoint | The IP address of the Kubernetes master endpoint |
| cluster_ca_certificate | The public certificate of the cluster's certificate authority |
| pyspark_node_pool_name | Name of the PySpark node pool |
| tensorflow_node_pool_name | Name of the TensorFlow node pool |
| vpc_name | The name of the VPC |
| subnet_name | The name of the subnet |
| service_account_email | The email of the service account used by the GKE nodes |
| kubectl_command | Command to get kubectl credentials for the cluster |
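
To consume these outputs from a script, `terraform output -json` prints every output as a JSON object keyed by name, each entry carrying a `value` field. The sketch below flattens that JSON into a plain dictionary. The function names are hypothetical and not part of this repository, and `read_outputs` assumes Terraform is on the PATH and that the working directory holds the applied configuration.

```python
import json
import subprocess


def parse_outputs(json_text):
    """Flatten `terraform output -json` text into {name: value}."""
    raw = json.loads(json_text)
    return {name: entry["value"] for name, entry in raw.items()}


def read_outputs(workdir="."):
    """Run `terraform output -json` in workdir and return the flattened outputs."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_outputs(result.stdout)
```

For example, `read_outputs()["kubectl_command"]` would return the credentials command listed in the table above.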
You can customize the configuration by modifying the variables in variables.tf or by providing different values in your terraform.tfvars file.
This guide outlines the steps to submit a PySpark script to a Spark cluster deployed on Google Kubernetes Engine (GKE) and how to verify its execution and workload distribution.
- A provisioned GKE cluster with a Spark node pool.
- Spark master and worker deployments and services running in the cluster (e.g., using spark-master-deployment.yaml, spark-master-service.yaml, spark-worker-deployment.yaml).
- A bastion host VM (gke-bastion) configured with gcloud and kubectl to interact with the GKE cluster (as per connection.tf).
- Your PySpark script (e.g., main.py) ready.
- Install Spark on the Bastion VM: Download and configure a compatible Spark version on the gke-bastion VM.
- Ensure Script is on Bastion: Your main.py must be accessible on the bastion.
This method requires an extra Spark installation on the bastion but can be useful for certain workflows.
The primary method described here involves copying your script to the gke-bastion instance and executing it from there.
a. Get the SSH command for the bastion host: If you have Terraform installed and are in your project directory, run:
terraform output ssh_command
This will output a command similar to:
gcloud compute ssh gke-bastion --zone=<your-zone> --project=<your-project-id>
b. SSH into the bastion host: Execute the command obtained in the previous step.
c. Verify kubectl access: Once in the bastion host, confirm kubectl is configured and can communicate with your cluster:
terraform output kubectl_command
Execute the command obtained in the previous step and then execute:
kubectl get nodes
You should see a list of your cluster nodes.
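
To confirm that both node pools registered their nodes, the node list can be grouped by pool. GKE labels each node with `cloud.google.com/gke-nodepool`. The sketch below is a hypothetical helper, not part of this repository, and assumes it is given the text printed by `kubectl get nodes -o json`.

```python
import json
from collections import Counter

# Label that GKE attaches to every node to record its node pool.
NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def count_nodes_by_pool(nodes_json):
    """Count nodes per GKE node pool from `kubectl get nodes -o json` output."""
    items = json.loads(nodes_json).get("items", [])
    return Counter(
        item["metadata"].get("labels", {}).get(NODE_POOL_LABEL, "<no pool label>")
        for item in items
    )
```

With the default variables, the result should show two nodes for each of the two pools.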
a. If your spark-master-deployment.yaml (or another manifest) is on your local machine, you'll first need to copy it to the bastion host. You can use gcloud compute scp:
# From your local machine, not the bastion
gcloud compute scp /path/to/your/local/spark-master-deployment.yaml <your-bastion-user>@gke-bastion:/tmp/spark-master-deployment.yaml --zone=<your-zone> --project=<your-project-id>
Replace placeholders accordingly. /tmp/spark-master-deployment.yaml is a suggested path on the bastion.
b. After copying all the manifests to the bastion instance, apply them:
kubectl apply -f /tmp/spark-master-deployment.yaml
Then list the services to find the address of the Spark master (the deployment is labeled app: spark-master):
kubectl get services
If your main.py (or another script) is on your local machine, you'll first need to copy it to the bastion host. You can use gcloud compute scp:
# From your local machine, not the bastion
gcloud compute scp /path/to/your/local/main.py <your-bastion-user>@gke-bastion:/tmp/main.py --zone=<your-zone> --project=<your-project-id>
Replace placeholders accordingly. /tmp/main.py is a suggested path on the bastion.
Use upload_dataset.sh to upload your dataset to a Google Cloud Storage bucket. Edit the script first so that it matches the actual dataset name.
bash ./upload_dataset.sh
At the end, the script creates a ConfigMap that holds the project ID as a variable for the gcs-connector in the submit command.
This project uses GKE Workload Identity to allow Spark pods to access Google Cloud Storage (GCS) without requiring service account keys. Here's how it works:
Workload Identity is a GKE feature that allows Kubernetes service accounts to act as Google service accounts. This means:
- Pods can access GCP resources using the permissions of the bound Google service account
- No need to download or manage service account keys
- More secure and manageable authentication
The following components are set up for Workload Identity:
- GKE Cluster Configuration: Workload Identity is enabled on the cluster with workload_pool = "${var.project_id}.svc.id.goog"
- Kubernetes Service Account: A service account named spark-sa is created in the Kubernetes cluster
- IAM Binding: The Kubernetes service account is bound to the GKE service account with the role roles/iam.workloadIdentityUser
- Storage Permissions: The GKE service account has roles/storage.objectViewer permission on the GCS bucket
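
The IAM binding identifies the Kubernetes service account through a member string of the form `serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]`, and the Kubernetes service account points back at the Google service account through the `iam.gke.io/gcp-service-account` annotation. The sketch below shows how those two values are composed. The function names, the `default` namespace, and the example email in the usage note are assumptions for illustration; the Terraform code and config.sh in this repository may set them differently.

```python
def workload_identity_member(project_id, ksa_name, namespace="default"):
    """Build the IAM member that receives roles/iam.workloadIdentityUser."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(gsa_email):
    """Build the annotation that links a Kubernetes service account to a Google one."""
    return {"iam.gke.io/gcp-service-account": gsa_email}


# spark-sa is the service account name used in this project.
print(workload_identity_member("your-project-id", "spark-sa"))
```

A call such as `ksa_annotation("gke-nodes@your-project-id.iam.gserviceaccount.com")` returns the annotation to place on spark-sa, where the email is a placeholder for the value of the service_account_email output.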
After deploying the infrastructure with Terraform, run the provided script to apply the Workload Identity configuration:
bash ./config.sh
This script:
- Creates the Kubernetes service account with the proper annotation.
- Applies the ConfigMap with the project ID as variable holder for the gcs-connector in the submit command.
- Restarts the Spark deployments to pick up the new service account.
Once Workload Identity is configured, you can run Spark jobs that access GCS without additional authentication:
spark-submit \
--master spark://<load balancer service ip>:7077 \
--deploy-mode client \
--name health-kmeans-job-standalone \
--packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2 \
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
--conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
main.py
The Spark pods will automatically use the GKE service account's permissions to access GCS.
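
When the submit step is scripted, assembling the arguments as a list avoids shell quoting problems. The following sketch mirrors the command above. `build_spark_submit` is a hypothetical helper, not part of this repository, and the IP address is a documentation placeholder standing in for your load balancer service IP.

```python
import shlex

GCS_CONNECTOR = "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2"


def build_spark_submit(master_ip, script="main.py", name="health-kmeans-job-standalone"):
    """Return the spark-submit argument list matching the command above."""
    conf = {
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }
    args = [
        "spark-submit",
        "--master", f"spark://{master_ip}:7077",
        "--deploy-mode", "client",
        "--name", name,
        "--packages", GCS_CONNECTOR,
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


# Print the command for review only; 203.0.113.10 is a placeholder address.
print(shlex.join(build_spark_submit("203.0.113.10")))
```

To actually run the job, the returned list can be passed to `subprocess.run(args, check=True)` on a machine where spark-submit is installed.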
You can find the MySQL StatefulSet in the infra/mysql-database directory. It is a basic deployment of a MySQL database for a
Kubernetes cluster. It is based on the official MySQL Docker image and a modified version of percona-xtrabackup that
preinstalls the required packages missing from the official image, which is based on the Red Hat minimal image.
This is a work in progress. Multi-node support is not available yet. Taints are not available yet.
Warning
Security isn't set yet.
The workload is created using a temporary container.
For Bash:
kubectl run mysql-client --image=mysql:8.4.0 -i --rm --restart=Never -- \
  mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
EOF
And for PowerShell:
@"
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
"@ | kubectl run mysql-client --image=mysql:8.4.0 -i --rm --restart=Never -- mysql -h mysql-0.mysql
Check the newly created database and table:
For Bash:
kubectl run mysql-client --image=mysql:8.4.0 -i -t --rm --restart=Never -- \
  mysql -h mysql-read -e "SELECT * FROM test.messages"
And for PowerShell:
kubectl run mysql-client --image=mysql:8.4.0 -i -t --rm --restart=Never -- `
mysql -h mysql-read -e "SELECT * FROM test.messages"
- If required, run a port-forward command to map port 3306 of the local cluster. Note that the MySQL StatefulSet configuration already includes a load balancer service (mysql-external, for read-write transactions) that can be used to connect to the database from outside the cluster but within the same subnet.
Also, on Windows 11, the kind cluster maps the ports of load balancer services to localhost on the host machine.
Warning
Only for local development; this may differ on cloud providers. The following command maps the service's port 3306 to the local machine, if required.
kubectl port-forward svc/mysql-external 3306:3306
- Run the Python script (.\infra\local\mysql-database\load_csv.py) to upload the CSV file into the database.
Check the network first. In a local development environment, the database is reached either on the "kind" subnetwork when using the external IP, or on the "host" subnetwork when using the 127.0.0.1 IP.
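
The contents of load_csv.py are not shown here, so the following is only a sketch of what a CSV loader of this kind typically does, not the repository's script. It reads a header row, builds a parameterized INSERT, and sends the rows through any DB-API connection. The function name, the table name, and the assumption that the target table already exists are all illustrative. MySQL drivers such as mysql-connector-python use `%s` placeholders, while the self-check at the bottom uses the standard library's sqlite3 (`?`) so that it runs without a database server.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file, placeholder="%s"):
    """Insert every data row of csv_file into table; return the number of rows sent.

    The table and column names are interpolated into the SQL text, so they
    must come from trusted input. Row values are passed as parameters.
    """
    reader = csv.reader(csv_file)
    header = next(reader)  # first row holds the column names
    columns = ", ".join(header)
    marks = ", ".join([placeholder] * len(header))
    sql = f"INSERT INTO {table} ({columns}) VALUES ({marks})"
    rows = [tuple(row) for row in reader]
    cursor = conn.cursor()
    cursor.executemany(sql, rows)
    conn.commit()
    return len(rows)


# Self-check against an in-memory SQLite database (a stand-in for MySQL).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE messages (message TEXT)")
inserted = load_csv(demo, "messages", io.StringIO("message\nhello\nworld\n"), placeholder="?")
print(inserted)
```

Against the MySQL service, the same function would be called with a connection from your MySQL driver and the default `%s` placeholder, using 127.0.0.1 or the external IP as the host depending on the network in use.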