GKE Cluster for PySpark and TensorFlow

This Terraform configuration creates a Google Kubernetes Engine (GKE) cluster with two node pools:

A node pool for PySpark workloads (2 nodes)
A node pool for TensorFlow workloads (2 nodes)

Architecture

The configuration creates the following resources:

A VPC network and subnet with secondary IP ranges for pods and services
A Cloud Router and NAT for the private GKE cluster
Firewall rules for internal cluster communication
A private GKE cluster with Workload Identity enabled
A service account for the GKE nodes with appropriate IAM roles
Two node pools with CPU-only machines:
- PySpark node pool: 2 nodes with e2-standard-4 (4 vCPUs, 16GB memory)
- TensorFlow node pool: 2 nodes with e2-standard-8 (8 vCPUs, 32GB memory)

Prerequisites

Google Cloud SDK installed and configured
Terraform v1.0.0 or newer
A Google Cloud project with the following APIs enabled:
- Compute Engine API
- Kubernetes Engine API
- IAM API
- Resource Manager API

Usage

Initialize the Terraform configuration:
```
terraform init
```
Create a terraform.tfvars file with your project ID:
```
project_id = "your-project-id"
```
Review the execution plan:
```
terraform plan
```
Apply the configuration:
```
terraform apply
```

Variables

Name	Description	Default
project_id	The GCP project ID	(required)
region	The GCP region for the resources	us-central1
zone	The GCP zone for zonal resources	us-central1-a
vpc_name	Name of the VPC network	gke-network
subnet_name	Name of the subnet for GKE	gke-subnet
subnet_cidr	CIDR range for the subnet	10.10.0.0/24
pods_cidr	CIDR range for pods	10.20.0.0/16
services_cidr	CIDR range for services	10.30.0.0/16
cluster_name	Name of the GKE cluster	ml-cluster
pyspark_node_count	Number of nodes in the PySpark node pool	2
pyspark_machine_type	Machine type for PySpark nodes	e2-standard-4
tensorflow_node_count	Number of nodes in the TensorFlow node pool	2
tensorflow_machine_type	Machine type for TensorFlow nodes	e2-standard-8
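
When you override subnet_cidr, pods_cidr, or services_cidr, the three ranges should not overlap, because the pod and service ranges are secondary ranges on the same subnet. The following is a minimal sketch of a pre-flight check you could run before terraform plan. It uses only the Python standard library, and the helper name find_overlaps is hypothetical and not part of this repository.

```python
import ipaddress
from itertools import combinations

# Default ranges from the variables table; replace with your own values.
ranges = {
    "subnet_cidr": "10.10.0.0/24",
    "pods_cidr": "10.20.0.0/16",
    "services_cidr": "10.30.0.0/16",
}


def find_overlaps(cidrs):
    """Return the pairs of variable names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


overlaps = find_overlaps(ranges)
print("overlapping ranges:", overlaps or "none")
```

With the default values the check reports no overlaps. A non-empty list names the variables to adjust.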

Outputs

Name	Description
cluster_name	The name of the GKE cluster
cluster_location	The location (zone) of the GKE cluster
cluster_endpoint	The IP address of the Kubernetes master endpoint
cluster_ca_certificate	The public certificate of the cluster's certificate authority
pyspark_node_pool_name	Name of the PySpark node pool
tensorflow_node_pool_name	Name of the TensorFlow node pool
vpc_name	The name of the VPC
subnet_name	The name of the subnet
service_account_email	The email of the service account used by the GKE nodes
kubectl_command	Command to get kubectl credentials for the cluster

Customization

You can customize the configuration by modifying the variables in variables.tf or by providing different values in your terraform.tfvars file.

Running PySpark Applications on GKE Spark Cluster

This guide outlines the steps to submit a PySpark script to a Spark cluster deployed on Google Kubernetes Engine (GKE) and how to verify its execution and workload distribution.

Prerequisites

A provisioned GKE cluster with a Spark node pool.
Spark master and worker deployments and services running in the cluster (e.g., using spark-master-deployment.yaml, spark-master-service.yaml, spark-worker-deployment.yaml).
A bastion host VM (gke-bastion) configured with gcloud and kubectl to interact with the GKE cluster (as per connection.tf).
Your PySpark script (e.g., main.py) ready.

Bastion instance requirements

Install Spark on the Bastion VM: Download and configure a compatible Spark version on the gke-bastion VM.
Ensure Script is on Bastion: Your main.py must be accessible on the bastion.

This method requires an extra Spark installation on the bastion but can be useful for certain workflows.

Steps to Submit Your PySpark Script

The primary method described here involves copying your script to the gke-bastion instance and executing it from there.

1. Access Your GKE Cluster via the Bastion Host

a. Get the SSH command for the bastion host: If you have Terraform installed and are in your project directory, run:

terraform output ssh_command

This will output a command similar to:

gcloud compute ssh gke-bastion --zone=<your-zone> --project=<your-project-id>

b. SSH into the bastion host: Execute the command obtained in the previous step.

c. Verify kubectl access: First, get the command that fetches kubectl credentials for the cluster:

terraform output kubectl_command

Once on the bastion host, run the command printed by that output, and then run:

kubectl get nodes

You should see a list of your cluster nodes.
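
To check that both node pools came up with the expected number of nodes, you can group the nodes by pool. The sketch below parses the JSON that kubectl get nodes -o json prints and counts nodes by the cloud.google.com/gke-nodepool label that GKE sets on each node. The sample data and the pool names pyspark-pool and tensorflow-pool are assumptions for illustration; your pool names come from the Terraform outputs.

```python
import json
from collections import Counter

# Label that GKE attaches to every node to identify its node pool.
POOL_LABEL = "cloud.google.com/gke-nodepool"


def nodes_per_pool(kubectl_json):
    """Count nodes per node pool from `kubectl get nodes -o json` output."""
    items = json.loads(kubectl_json)["items"]
    counts = Counter(
        item["metadata"].get("labels", {}).get(POOL_LABEL, "unknown")
        for item in items
    )
    return dict(counts)


# Hypothetical sample standing in for real kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "node-a", "labels": {POOL_LABEL: "pyspark-pool"}}},
        {"metadata": {"name": "node-b", "labels": {POOL_LABEL: "pyspark-pool"}}},
        {"metadata": {"name": "node-c", "labels": {POOL_LABEL: "tensorflow-pool"}}},
        {"metadata": {"name": "node-d", "labels": {POOL_LABEL: "tensorflow-pool"}}},
    ]
})

print(nodes_per_pool(sample))
```

In practice you would replace the sample string with the captured output of kubectl get nodes -o json.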

2. Deploy Spark Manifests into the gke-bastion Instance

a. If your spark-master-deployment.yaml (or another manifest) is on your local machine, first copy it to the bastion host with gcloud compute scp:

# From your local machine, not the bastion
gcloud compute scp /path/to/your/local/spark-master-deployment.yaml <your-bastion-user>@gke-bastion:/tmp/spark-master-deployment.yaml --zone=<your-zone> --project=<your-project-id>

Replace placeholders accordingly. /tmp/spark-master-deployment.yaml is a suggested path on the bastion.

b. After copying all the manifests to the bastion instance, apply each of them:

kubectl apply -f /tmp/spark-master-deployment.yaml

3. Identify the Kubernetes Service IP for the Spark Master deployment

List the services and note the IP of the service that fronts the Spark master pods (label app: spark-master). This IP is the master address used later in the spark-submit command:

kubectl get services
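
If you prefer to extract the address programmatically, the load balancer IP is available under status.loadBalancer.ingress in the Service object. The sketch below reads it from the JSON printed by kubectl get service <name> -o json. The service name spark-master and the IP in the sample are placeholders, not values taken from this repository.

```python
import json


def load_balancer_ip(service_json):
    """Return the first load balancer IP of a Service, or None if not assigned yet."""
    status = json.loads(service_json).get("status", {})
    ingress = status.get("loadBalancer", {}).get("ingress") or []
    return ingress[0].get("ip") if ingress else None


# Hypothetical sample standing in for `kubectl get service spark-master -o json`.
sample = json.dumps({
    "metadata": {"name": "spark-master"},
    "status": {"loadBalancer": {"ingress": [{"ip": "10.10.0.50"}]}},
})

ip = load_balancer_ip(sample)
print(f"spark://{ip}:7077")
```

The function returns None while the load balancer is still being provisioned, so a caller can retry until an IP appears.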

4. Copy Your PySpark Script to the gke-bastion Instance

If your main.py (or another script) is on your local machine, you'll first need to copy it to the bastion host. You can use gcloud compute scp:

# From your local machine, not the bastion
gcloud compute scp /path/to/your/local/main.py <your-bastion-user>@gke-bastion:/tmp/main.py --zone=<your-zone> --project=<your-project-id>

Replace placeholders accordingly. /tmp/main.py is a suggested path on the bastion.

5. Upload Your Dataset to a Google Cloud Storage Bucket

Use the upload_dataset.sh script to upload your dataset to a Google Cloud Storage bucket. Edit the script first so that it matches the actual dataset name.

bash ./upload_dataset.sh

When the script finishes, it creates a ConfigMap that holds the project ID. The gcs-connector reads this value in the submit command.

Using Workload Identity with Spark for GCS Access

This project uses GKE Workload Identity to allow Spark pods to access Google Cloud Storage (GCS) without requiring service account keys. Here's how it works:

1. Understanding Workload Identity

Workload Identity is a GKE feature that allows Kubernetes service accounts to act as Google service accounts. This means:

Pods can access GCP resources using the permissions of the bound Google service account
No need to download or manage service account keys
More secure and manageable authentication

2. Setup Components

The following components are set up for Workload Identity:

GKE Cluster Configuration: Workload Identity is enabled on the cluster with workload_pool = "${var.project_id}.svc.id.goog"
Kubernetes Service Account: A service account named spark-sa is created in the Kubernetes cluster
IAM Binding: The Kubernetes service account is bound to the GKE service account with the role roles/iam.workloadIdentityUser
Storage Permissions: The GKE service account has roles/storage.objectViewer permission on the GCS bucket
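
The IAM binding and the annotation follow fixed formats: the IAM member is written as serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME], and the Kubernetes service account carries the iam.gke.io/gcp-service-account annotation. The sketch below only builds these two strings so you can compare them with what is deployed. The default namespace and the Google service account name gke-nodes are assumptions, since this README does not state them.

```python
def workload_identity_member(project_id, namespace, ksa_name):
    """IAM member string that receives roles/iam.workloadIdentityUser."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(gsa_email):
    """Annotation that links a Kubernetes service account to a Google service account."""
    return {"iam.gke.io/gcp-service-account": gsa_email}


project_id = "your-project-id"
# Assumed values: the namespace and the Google service account name are hypothetical.
member = workload_identity_member(project_id, "default", "spark-sa")
annotation = ksa_annotation(f"gke-nodes@{project_id}.iam.gserviceaccount.com")

print(member)
print(annotation)
```

If pods fail to authenticate, a mismatch between these strings and the deployed binding or annotation is a common place to look first.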

3. Applying Workload Identity Configuration

After deploying the infrastructure with Terraform, run the provided script to apply the Workload Identity configuration:

bash ./config.sh

This script:

Creates the Kubernetes service account with the proper annotation.
Applies the ConfigMap that holds the project ID, which the gcs-connector uses in the submit command.
Restarts the Spark deployments to pick up the new service account.

Running Spark Jobs with GCS Access

Once Workload Identity is configured, you can run Spark jobs that access GCS without additional authentication:

spark-submit \
    --master spark://<load balancer service ip>:7077 \
    --deploy-mode client \
    --name health-kmeans-job-standalone \
    --packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2 \
    --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
    main.py

The Spark pods will automatically use the GKE service account's permissions to access GCS.
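
If you script the submission, it can help to assemble the command as an argument list so that the master IP is filled in without manual editing. This is a minimal sketch that mirrors the flags of the command above. The function name spark_submit_args and the IP 10.10.0.50 are placeholders for illustration.

```python
import shlex

GCS_CONNECTOR = "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2"


def spark_submit_args(master_ip, script="main.py", name="health-kmeans-job-standalone"):
    """Build the spark-submit argument list for the standalone Spark master."""
    return [
        "spark-submit",
        "--master", f"spark://{master_ip}:7077",
        "--deploy-mode", "client",
        "--name", name,
        "--packages", GCS_CONNECTOR,
        "--conf", "spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "--conf", "spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        script,
    ]


args = spark_submit_args("10.10.0.50")  # placeholder IP; use your service IP
print(shlex.join(args))
```

The list can be passed to subprocess.run(args, check=True) on the bastion, where spark-submit is installed.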

MySQL Database StatefulSet (Only tested on local kind cluster)

You can find the MySQL StatefulSet in the infra/mysql-database directory. It is a basic deployment of a MySQL database for a Kubernetes cluster. It is based on the official MySQL Docker image and a modified version of percona-xtrabackup that preinstalls the required packages missing from the official image, which is based on a Red Hat minimal image.

This is a work in progress. Multi-node support and taints are not available yet.

Warning

Security is not configured yet.

Testing the MySQL database

The test workload runs in a temporary container.

For Bash:

kubectl run mysql-client --image=mysql:8.4.0 -i --rm --restart=Never --\
  mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
EOF

And for PowerShell:

@"
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
"@ | kubectl run mysql-client --image=mysql:8.4.0 -i --rm --restart=Never -- mysql -h mysql-0.mysql

Check the newly created database and table:

For Bash:

kubectl run mysql-client --image=mysql:8.4.0 -i -t --rm --restart=Never --\
  mysql -h mysql-read -e "SELECT * FROM test.messages"

And for PowerShell:

kubectl run mysql-client --image=mysql:8.4.0 -i -t --rm --restart=Never -- `
  mysql -h mysql-read -e "SELECT * FROM test.messages"

Load a CSV into a SQL table

The MySQL StatefulSet configuration includes a load balancer service (mysql-external, for read-write transactions) that lets you connect to the database from outside the cluster, as long as you are in the same subnet. If that is not enough, run a port-forward command to map port 3306 of the service to your local machine.

On Windows 11, the kind cluster also maps the ports of load balancer services to localhost on the host machine.

Warning

This applies to local development only and may differ on cloud providers. If required, the following command maps port 3306 of the service to the local machine.

kubectl port-forward svc/mysql-external 3306:3306

Run the Python script (.\infra\local\mysql-database\load_csv.py) to upload the CSV file into the database.

Check which network you are connecting from. In a local development environment, the database is reached on the "kind" subnetwork through the external IP, or on the "host" subnetwork through the 127.0.0.1 IP.
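
The contents of load_csv.py are not shown in this README. As a rough sketch of the general technique, and not the repository's script, the code below turns a CSV with a header row into one parameterized INSERT statement plus a list of row tuples. The %s placeholder style is the one used by common MySQL drivers such as mysql-connector-python and PyMySQL. Which driver the real script uses is an assumption left open here.

```python
import csv
import io


def build_insert(table, csv_text):
    """Return a parameterized INSERT statement and the rows parsed from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row holds the column names
    columns = ", ".join(f"`{name}`" for name in header)
    placeholders = ", ".join(["%s"] * len(header))
    statement = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    rows = [tuple(row) for row in reader if row]  # skip blank lines
    return statement, rows


# Sample CSV matching the test.messages table created earlier.
statement, rows = build_insert("test.messages", "message\nhello\nworld\n")
print(statement)
print(rows)
```

With an open DB-API connection to 127.0.0.1:3306, the result can be passed to cursor.executemany(statement, rows), followed by a commit on the connection.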
