LabLink Infrastructure Template

GitHub Template Repository for deploying LabLink infrastructure to AWS

Deploy your own LabLink infrastructure for cloud-based VM allocation and management. This template uses Terraform and GitHub Actions to automate deployment of the LabLink allocator service to AWS.

📖 Main Documentation: https://talmolab.github.io/lablink/

What is LabLink?

LabLink automates deployment and management of cloud-based VMs for running research software. It provides:

Web interface for requesting and managing VMs
Automatic VM provisioning with your software pre-installed
GPU support for ML/AI workloads
Chrome Remote Desktop access to VM GUI
Flexible configuration for different research needs

Two Deployment Paths

There are two supported ways to stand up a LabLink deployment. Both consume the same Hydra-validated config.yaml, so you can switch between them later without redoing the AWS setup.

Path	When to choose it	How it works
Path A — `lablink-cli` (single terminal)	You want to deploy from your laptop without wiring up GitHub Actions, or you prefer a single-process workflow.	`uv tool install lablink-cli` → `lablink doctor` → `lablink configure` → `lablink setup` → `lablink deploy`. See the lablink-cli docs.
Path B — this template + GitHub Actions (covered below)	You want fork-and-deploy with deployments triggered/audited from GitHub, OIDC-based AWS auth, and state stored in S3.	Click Use this template → run `./scripts/setup.sh` → trigger the `terraform-deploy.yml` workflow.

The rest of this README documents Path B. If you're new and have no preference for GitHub Actions, Path A is the lowest-friction option.

Heads-up: the interactive TUI config wizard (lablink configure) is currently Path A-only. Path B users get the typed Hydra schema and the example YAMLs in lablink-infrastructure/config/ but no interactive wizard inside GitHub Actions. Tracked in lablink#339.

Quick Start

1. Use This Template

Click the "Use this template" button at the top of this repository to create your own deployment repository.

2. Run the Setup Script (One-Time)

The setup script creates AWS infrastructure and GitHub secrets:

./scripts/setup.sh

What the script does:

Checks prerequisites (AWS CLI, GitHub CLI, credentials)
Creates OIDC provider and IAM role for GitHub Actions
Creates S3 bucket (with versioning) and DynamoDB table
Creates Route53 hosted zone (if using custom domain)
Sets GitHub secrets (AWS_ROLE_ARN, AWS_REGION, ADMIN_PASSWORD, DB_PASSWORD)
Calls configure.sh to generate lablink-infrastructure/config/config.yaml
Verifies all resources were created successfully

The script is idempotent — safe to re-run if interrupted.

Updating configuration later: To change settings like instance type, image tags, or DNS options without re-creating infrastructure, run the configuration wizard directly:

./scripts/configure.sh

This can be run as many times as needed. It reads your existing config.yaml values as defaults.

Important: The config file path (lablink-infrastructure/config/config.yaml) is hardcoded in the infrastructure. Do not move or rename this file.

See Configuration Reference for all options, or Manual Setup if you prefer to create resources individually.

3. Deploy

Via GitHub Actions (Recommended):

Go to Actions → "Deploy LabLink Infrastructure"
Click "Run workflow"
Select environment (test, prod, or ci-test)
Click "Run workflow"

Via Local Terraform:

cd lablink-infrastructure
../scripts/init-terraform.sh test
terraform apply -var="resource_suffix=test"

4. Access Your Infrastructure

After deployment completes:

Allocator URL: Check workflow output or Terraform output for the URL/IP
SSH Access: Download the PEM key from workflow artifacts
Web Interface: Navigate to allocator URL in your browser

Prerequisites

Required

AWS Account with permissions to create:
- EC2 instances
- Security Groups
- Elastic IPs
- (Optional) Route 53 records for DNS
GitHub Account with ability to:
- Create repositories from templates
- Configure GitHub Actions secrets
- Run GitHub Actions workflows
Basic Knowledge of:
- Terraform (helpful but not required)
- AWS services

AWS Setup Required

Before deploying, you must set up:

S3 Bucket for Terraform state storage
IAM Role for GitHub Actions OIDC authentication
(Optional) Elastic IP for persistent allocator address
(Optional) Route 53 Hosted Zone for custom domain

See AWS Setup Guide below for detailed instructions.

GitHub Secrets Setup

Why OIDC instead of long-lived AWS keys?

The deploy and destroy workflows authenticate to AWS using OpenID Connect (OIDC) rather than static AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY secrets. The flow at deploy time looks like:

GitHub Actions issues a short-lived JSON Web Token (JWT) for the running workflow, signed by token.actions.githubusercontent.com.
The workflow calls sts:AssumeRoleWithWebIdentity against the IAM role you registered (AWS_ROLE_ARN).
AWS validates the JWT against the OIDC provider trust policy (which restricts which repo:ORG/REPO:* subject can assume the role) and returns temporary credentials.
Terraform uses those temporary credentials for the duration of the job — typically an hour or less — then they expire.

Why this matters:

No long-lived AWS keys ever live in GitHub secrets, so a compromised repository secret cannot be replayed indefinitely.
The trust policy pins the role to your specific repository (and optionally a branch/environment), so other repos in your org can't assume it by accident.
Credentials auto-rotate every workflow run — there is no key to rotate manually.
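
To make the repository pinning concrete, here is a minimal local sketch of how a `StringLike` pattern such as `repo:YOUR_ORG/YOUR_REPO:*` behaves against the `sub` claim of a GitHub Actions token. AWS performs the real check; `sub_allowed` is a hypothetical helper, and shell globbing only approximates IAM wildcard matching.

```shell
#!/usr/bin/env bash
# Illustrative only: approximates the StringLike match on the token's "sub"
# claim with a shell glob. sub_allowed is a hypothetical name.

# sub_allowed PATTERN SUBJECT -> exit 0 if SUBJECT matches PATTERN
sub_allowed() {
  case "$2" in
    $1) return 0 ;;   # unquoted so that "*" in PATTERN acts as a wildcard
    *)  return 1 ;;
  esac
}

pattern='repo:YOUR_ORG/YOUR_REPO:*'

# Typical subject claims: a branch run, an environment run, and another repo.
for subject in \
  'repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main' \
  'repo:YOUR_ORG/YOUR_REPO:environment:prod' \
  'repo:YOUR_ORG/OTHER_REPO:ref:refs/heads/main'
do
  if sub_allowed "$pattern" "$subject"; then
    echo "allow  $subject"
  else
    echo "deny   $subject"
  fi
done
```

Tightening the pattern to something like `repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main` would restrict the role to runs from a single branch.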

./scripts/setup.sh creates the OIDC provider, the IAM role with the correct trust policy, and writes the role ARN to the AWS_ROLE_ARN GitHub secret for you. The manual steps below are for users who prefer to wire this up themselves.

AWS_ROLE_ARN

Create an IAM role with OIDC provider for GitHub Actions:

Create an OIDC provider in IAM (if one does not already exist):
- Provider URL: https://token.actions.githubusercontent.com
- Audience: sts.amazonaws.com

Create IAM role with trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_ORG/YOUR_REPO:*"
        }
      }
    }
  ]
}

Attach permissions:
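- PowerUserAccess (or a custom policy with EC2, VPC, S3, Route53, and IAM permissions). Note that PowerUserAccess excludes most IAM actions, so grant IAM permissions separately if Terraform needs to create roles or instance profiles.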
Copy the Role ARN and add to GitHub secrets
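
The trust policy and role creation steps above could be scripted roughly as follows. This is a sketch, not what `setup.sh` actually runs: `render_trust_policy`, `create_deploy_role`, and the role name in the usage comment are hypothetical, and the OIDC provider is assumed to exist already.

```shell
#!/usr/bin/env bash
# render_trust_policy ACCOUNT_ID ORG REPO -> prints the trust policy JSON
render_trust_policy() {
  local account_id="$1" org="$2" repo="$3"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:${org}/${repo}:*"
        }
      }
    }
  ]
}
EOF
}

# create_deploy_role ROLE_NAME ACCOUNT_ID ORG REPO -> prints the new role ARN
create_deploy_role() {
  local role_name="$1" policy_file arn status
  policy_file="$(mktemp)" || return 1
  render_trust_policy "$2" "$3" "$4" > "$policy_file"
  arn="$(aws iam create-role \
    --role-name "$role_name" \
    --assume-role-policy-document "file://${policy_file}" \
    --query 'Role.Arn' --output text)"
  status=$?
  rm -f "$policy_file"   # always clean up the temporary policy file
  [ "$status" -eq 0 ] && printf '%s\n' "$arn"
}

# Example (role name is a placeholder):
#   arn="$(create_deploy_role lablink-github-actions 123456789012 YOUR_ORG YOUR_REPO)"
#   gh secret set AWS_ROLE_ARN --body "$arn"
```

Permissions still have to be attached to the role afterwards, as described in the step above.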

AWS_REGION

The AWS region where your infrastructure will be deployed. Must match the region in your config.yaml.

Common regions:

us-west-2 (Oregon)
us-east-1 (N. Virginia)
eu-west-1 (Ireland)

Important: AMI IDs are region-specific. If you change regions, update the ami_id in config.yaml.

ADMIN_PASSWORD

Password for accessing the allocator web interface. Choose a strong password (12+ characters, mixed case, numbers, symbols).

This password is used to log in to the admin dashboard where you can:

Create and destroy client VMs
View VM status
Assign VMs to users

DB_PASSWORD

Password for the PostgreSQL database used by the allocator service. Choose a strong password that differs from ADMIN_PASSWORD.

This is stored securely and injected into the configuration at deployment time.
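
If you set ADMIN_PASSWORD and DB_PASSWORD by hand, one way to produce two distinct strong values is sketched below. `gen_password` is a hypothetical helper, not part of `setup.sh`, and the character set is an arbitrary choice; it does not guarantee that every character class appears in each password.

```shell
#!/usr/bin/env bash
# gen_password [LENGTH] -> prints a random password (default 24 characters)
gen_password() {
  local length="${1:-24}" raw
  # Read a fixed number of random bytes and keep only the allowed characters.
  raw="$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9_@%+=')"
  # Fail rather than return a password shorter than requested.
  [ "${#raw}" -ge "$length" ] || return 1
  printf '%s\n' "$raw" | cut -c "1-${length}"
}

# Example: generate each value once, record it somewhere safe, then store it.
#   admin_pw="$(gen_password)"
#   db_pw="$(gen_password)"
#   gh secret set ADMIN_PASSWORD --body "$admin_pw"
#   gh secret set DB_PASSWORD --body "$db_pw"
```

GitHub secrets cannot be read back after they are set, so keep a copy of the admin password in a password manager.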

AWS Setup Guide

Automated Setup (Recommended)

The setup script creates all infrastructure and secrets in one go:

./scripts/setup.sh

This creates all required AWS resources (OIDC provider, IAM role, S3 bucket, DynamoDB table, Route53 hosted zone), sets GitHub secrets, and calls configure.sh to generate config.yaml. It is idempotent and safe to re-run.

To update configuration later (instance types, image tags, DNS/SSL options, etc.), run the config wizard directly:

./scripts/configure.sh

What the script does NOT do:

Does NOT register domain names (you must register via Route53 registrar, CloudFlare, or other registrar)
Does NOT create DNS records (Terraform handles these, or you create manually)

After setup, your DNS/SSL approach is configured based on your wizard choices:

Route53 + Let's Encrypt: Register domain, update nameservers to Route53
CloudFlare DNS + SSL: Manage domain/DNS in CloudFlare, create A record pointing to allocator IP
IP-only (no DNS/SSL): Access via IP address directly

Manual Setup (Alternative)

If you prefer to create resources manually:

1. Create S3 Bucket for Terraform State

# Create bucket (must be globally unique across ALL of AWS)
aws s3 mb s3://tf-state-YOUR-ORG-lablink --region us-west-2

# Enable versioning (recommended)
aws s3api put-bucket-versioning \
  --bucket tf-state-YOUR-ORG-lablink \
  --versioning-configuration Status=Enabled

Update bucket_name in lablink-infrastructure/config/config.yaml to match.
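
S3 bucket names must be lowercase, so the `YOUR-ORG` placeholder above cannot be pasted in with capitals. The sketch below derives a bucket name in the same `tf-state-<org>-lablink` pattern and checks it against the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# state_bucket_name ORG -> prints tf-state-<org>-lablink in S3-safe form
state_bucket_name() {
  local org
  # Lowercase the org name and replace anything outside [a-z0-9-] with "-".
  org="$(printf '%s' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]' | LC_ALL=C tr -c 'a-z0-9-' '-')"
  printf 'tf-state-%s-lablink\n' "$org"
}

# is_valid_bucket_name NAME -> exit 0 if NAME passes the basic S3 naming rules
is_valid_bucket_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

# Example:
#   bucket="$(state_bucket_name "My_Org")"
#   is_valid_bucket_name "$bucket" && aws s3 mb "s3://${bucket}" --region us-west-2
```

The check covers only the basic rules; global uniqueness is still decided by AWS when the bucket is created.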

2. Create DynamoDB Table for State Locking

aws dynamodb create-table \
  --table-name lock-table \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-west-2

3. (Optional) Allocate Elastic IP

For persistent allocator IP address across deployments:

# Allocate EIP
aws ec2 allocate-address --domain vpc --region us-west-2

# Tag it for reuse
aws ec2 create-tags \
  --resources eipalloc-XXXXXXXX \
  --tags Key=Name,Value=lablink-eip

Update eip.tag_name in config.yaml if using a different tag name.
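
The two commands above can be chained so that the allocation ID does not have to be copied by hand in place of `eipalloc-XXXXXXXX`. A minimal sketch, with hypothetical function names, using the CLI's `--query` and `--output text` options:

```shell
#!/usr/bin/env bash
# allocate_tagged_eip REGION TAG_NAME -> allocates an EIP, tags it,
# and prints the allocation ID
allocate_tagged_eip() {
  local region="$1" tag_name="$2" alloc_id
  alloc_id="$(aws ec2 allocate-address --domain vpc --region "$region" \
    --query 'AllocationId' --output text)" || return 1
  aws ec2 create-tags --region "$region" \
    --resources "$alloc_id" \
    --tags "Key=Name,Value=${tag_name}" || return 1
  printf '%s\n' "$alloc_id"
}

# find_eip_by_tag REGION TAG_NAME -> prints allocation ID and public IP
# of every address carrying that Name tag
find_eip_by_tag() {
  aws ec2 describe-addresses --region "$1" \
    --filters "Name=tag:Name,Values=$2" \
    --query 'Addresses[].[AllocationId,PublicIp]' --output text
}

# Example:
#   allocate_tagged_eip us-west-2 lablink-eip
#   find_eip_by_tag us-west-2 lablink-eip
```

Running `find_eip_by_tag` before allocating helps avoid creating a second address with the same tag.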

4. (Optional) Set Up Route 53 for DNS

If using a custom domain:

Create or use existing hosted zone:

aws route53 create-hosted-zone --name your-domain.com --caller-reference "$(date +%s)"

Update your domain's nameservers to point to Route 53 NS records

Update dns section in config.yaml:

dns:
  enabled: true
  domain: "your-domain.com"
  zone_id: "Z..." # Optional - will auto-lookup if empty

5. Set Up OIDC Provider and IAM Role

See GitHub Secrets Setup above for detailed IAM role configuration.

Choosing a Config Flavor

The repo ships several example configs under lablink-infrastructure/config/. Pick the one that matches how you want DNS/SSL set up; ./scripts/setup.sh copies your selection to config.yaml.

Example	DNS provider	SSL provider	When to use
`ip-only.example.yaml`	None (access via IP)	None (HTTP only)	Fastest path; demos, debugging, throwaway dev. No domain needed.
`cloudflare.example.yaml`	CloudFlare (manual A record)	CloudFlare proxy	Frequent redeploys without Let's Encrypt rate limits; you already manage DNS in CloudFlare.
`letsencrypt.example.yaml`	Route 53 (Terraform-managed)	Let's Encrypt via Caddy	Stable production / staging with a Route 53 hosted zone. Limit: 5 certs / domain / 7 days.
`letsencrypt-manual.example.yaml`	Route 53 (manual A record)	Let's Encrypt via Caddy	Same as above but you want to manage the A record yourself (e.g., migrations).
`acm.example.yaml`	Route 53 (Terraform-managed)	AWS ACM via Application Load Balancer	Enterprise production; no Let's Encrypt limits, but ALB adds ~$20/mo.
`dev.example.yaml`	Configurable	Configurable	Local Terraform state (no S3 backend); local prototyping.
`test.example.yaml`	Configurable	Configurable	Staging environment, S3-backed state.
`prod.example.yaml`	Configurable	Configurable	Production environment, S3-backed state.
`ci-test.example.yaml`	Route 53	Let's Encrypt	Template-maintainer CI only — do not use for application deployments.

Decision shortcut:

No domain? → ip-only.example.yaml.
Domain in CloudFlare? → cloudflare.example.yaml.
Domain in Route 53, deploy weekly or less? → letsencrypt.example.yaml.
Domain in Route 53, deploy multiple times per week? → cloudflare.example.yaml (avoids Let's Encrypt rate limits) or acm.example.yaml.
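
The decision shortcut can be written down as a small function, which may be handy in your own tooling. This is only a sketch: `pick_flavor` and its arguments are hypothetical, and for frequent Route 53 deployments it returns the ACM flavor even though the CloudFlare flavor is an equally valid choice.

```shell
#!/usr/bin/env bash
# pick_flavor DOMAIN_HOST [DEPLOYS_PER_WEEK] -> prints an example config name
#   DOMAIN_HOST: none | cloudflare | route53
pick_flavor() {
  local host="$1" deploys="${2:-1}"
  case "$host" in
    none)       echo "ip-only.example.yaml" ;;
    cloudflare) echo "cloudflare.example.yaml" ;;
    route53)
      if [ "$deploys" -le 1 ]; then
        echo "letsencrypt.example.yaml"
      else
        # Frequent redeploys risk the Let's Encrypt limit; the CloudFlare
        # flavor is the other option listed for this case.
        echo "acm.example.yaml"
      fi
      ;;
    *) echo "unknown domain host: $host" >&2; return 1 ;;
  esac
}

# Example (setup.sh normally performs this copy for you):
#   cfg=lablink-infrastructure/config
#   cp "$cfg/$(pick_flavor route53 1)" "$cfg/config.yaml"
```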

See lablink-infrastructure/config/README.md for the full decision tree, per-flavor pros/cons, and rate-limit recovery procedures.

Configuration Reference

All configuration is in lablink-infrastructure/config/config.yaml.

Database Settings

db:
  dbname: "lablink_db"
  user: "lablink"
  password: "PLACEHOLDER_DB_PASSWORD"  # Injected from GitHub secret
  host: "localhost"
  port: 5432

Client VM Settings

machine:
  machine_type: "g4dn.xlarge"  # AWS instance type
  image: "ghcr.io/talmolab/lablink-client-base-image:latest"  # Docker image
  ami_id: "ami-0601752c11b394251"  # Region-specific AMI
  repository: "https://github.com/YOUR_ORG/YOUR_REPO.git"  # Your code/data repo
  software: "your-software"  # Software identifier
  extension: "ext"  # Data file extension

Instance Types:

g4dn.xlarge - GPU instance (NVIDIA T4, good for ML)
t3.large - CPU-only, cheaper
p3.2xlarge - More powerful GPU (NVIDIA V100)

AMI IDs (Ubuntu 24.04 with Docker + Nvidia):

us-west-2: ami-0601752c11b394251
Other regions: Use the AWS Console to find a similar AMI or create a custom one

Application Settings

app:
  admin_user: "admin"
  admin_password: "PLACEHOLDER_ADMIN_PASSWORD"  # Injected from secret
  region: "us-west-2"  # Must match AWS_REGION secret

DNS Settings

dns:
  enabled: false  # true to use DNS, false for IP-only
  terraform_managed: false  # true = Terraform creates records
  domain: "lablink.example.com"  # Full domain name (e.g., test.lablink.example.com)
  zone_id: ""  # Leave empty for auto-lookup

Domain Naming:

Specify the full domain directly (e.g., lablink.example.com or test.lablink.example.com)
No automatic subdomain construction - use exactly what you specify

SSL/TLS Settings

ssl:
  provider: "none"  # "letsencrypt", "cloudflare", "acm", or "none"
  email: "admin@example.com"  # For Let's Encrypt notifications
  certificate_arn: ""  # Required when provider="acm"

SSL Providers:

none: HTTP only (for testing)
letsencrypt: Automatic SSL with Caddy (production certs)
cloudflare: Use CloudFlare proxy for SSL
acm: AWS Certificate Manager via Application Load Balancer

Let's Encrypt Rate Limits

⚠️ Important: When using Let's Encrypt (ssl.provider: "letsencrypt"), be aware of rate limits:

Limit Type	Limit	Lockout Period
Certificates per exact domain	5 per week	7 days
Certificates per registered domain	50 per week	7 days

What this means:

You can only deploy the same domain (e.g., test.lablink.example.com) 5 times in 7 days
If you hit the limit, you must wait 7 days before deploying that domain again
No override available for the per-domain limit
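
To judge whether another deployment of the same domain is safe, you can count how many certificates were issued in the trailing 7 days. The sketch below assumes you keep your own list of issuance times as Unix epoch seconds (for example, from workflow run timestamps) and treats the limit as a simple sliding window of 5, as described above. `issuances_left` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# issuances_left NOW_EPOCH [ISSUED_EPOCH...] -> prints how many of the
# 5 weekly certificates remain for one exact domain
issuances_left() {
  local now="$1" limit=5 window=$((7 * 24 * 3600)) used=0 left t
  shift
  for t in "$@"; do
    # Count only issuances inside the trailing 7-day window.
    if [ "$t" -le "$now" ] && [ $((now - t)) -lt "$window" ]; then
      used=$((used + 1))
    fi
  done
  left=$((limit - used))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}

# Example:
#   issuances_left "$(date +%s)" 1717000000 1717300000
```

A result of 0 means the next Let's Encrypt deployment of that domain is likely to fail, so switch to an IP-only or CloudFlare configuration for further testing.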

Testing Strategies to Avoid Rate Limits:

Strategy	DNS	SSL	Use Case	Rate Limit Risk
IP-only	Disabled	None	Development/debugging	✅ None
CloudFlare	Enabled	CloudFlare	Frequent testing	✅ None
Subdomain rotation	Enabled	Let's Encrypt	SSL testing	⚠️ Low (5 per subdomain)
Production	Enabled	Let's Encrypt	Stable deployment	⚠️ Low (rarely redeploy)

📖 See Testing Best Practices for detailed testing strategies and monitoring certificate usage.

Elastic IP Settings

eip:
  strategy: "persistent"  # "persistent" or "dynamic"
  tag_name: "lablink-eip"  # Tag to find reusable EIP

Deployment Workflows

Deploy LabLink Infrastructure

Deploys or updates your LabLink infrastructure.

Triggers:

Manual: Actions → "Deploy LabLink Infrastructure" → Run workflow
Automatic: Push to test branch

Inputs:

environment: test or prod

What it does:

Configures AWS credentials via OIDC
Injects passwords from GitHub secrets into config
Runs Terraform to create/update infrastructure
Verifies deployment and DNS
Uploads SSH key as artifact
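
The manual trigger can also be dispatched from a terminal with the GitHub CLI. This sketch assumes the workflow file is `terraform-deploy.yml` and that its input is named `environment`, as listed above; `deploy_env` is a hypothetical wrapper that rejects anything other than `test` or `prod`.

```shell
#!/usr/bin/env bash
# deploy_env ENVIRONMENT -> dispatches the deploy workflow for that environment
deploy_env() {
  local env="$1"
  case "$env" in
    test|prod) ;;
    *) echo "unsupported environment: $env" >&2; return 1 ;;
  esac
  gh workflow run terraform-deploy.yml -f "environment=${env}"
}

# Example (run inside your deployment repository):
#   deploy_env test
#   gh run list --workflow terraform-deploy.yml --limit 1
#   gh run watch    # interactively pick the run and follow it
```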

Destroy LabLink Infrastructure

⚠️ WARNING: This destroys all infrastructure and data!

Triggers:

Manual only: Actions → "Destroy LabLink Infrastructure" → Run workflow

Inputs:

confirm_destroy: Must type "yes" to confirm
environment: test or prod

What it does:

Creates a minimal terraform backend configuration
Initializes Terraform with S3 backend to access client VM state
Destroys client VMs directly from the S3 state (for test/prod/ci-test)
Destroys the allocator infrastructure (EC2, security groups, EIP, etc.)

Note: Client VM state is stored in S3 (in the same bucket as the infrastructure state). Terraform can destroy resources using only the state file, so the client VM Terraform configuration files are not needed.

Manual Cleanup and Troubleshooting

If the destroy workflow fails or leaves orphaned resources, see the Manual Cleanup Guide for step-by-step procedures to:

Remove orphaned IAM roles, policies, and instance profiles
Clean up leftover EC2 instances, security groups, and key pairs
Fix Terraform state file issues (checksum mismatches, corrupted state)
Verify complete resource removal

Common scenarios covered:

Destroy workflow failures
"Resource in use" errors
Orphaned client VMs
State lock issues

Customization

For Different Research Software

Update config.yaml:

machine:
  repository: "https://github.com/your-org/your-software-data.git"
  software: "your-software-name"
  extension: "your-file-ext"  # e.g., "h5", "npy", "csv"

(Optional) Use custom Docker image:

machine:
  image: "ghcr.io/your-org/your-custom-image:latest"

For Different AWS Regions

Update config.yaml:

app:
  region: "eu-west-1"  # Your region
machine:
  ami_id: "ami-XXXXXXX"  # Region-specific AMI

Update GitHub secret AWS_REGION
Find appropriate AMI for region (Ubuntu 24.04 with Docker)

For Different Instance Types

machine:
  machine_type: "t3.xlarge"  # No GPU, cheaper
  # or, for a more powerful GPU (use only one machine_type key):
  # machine_type: "p3.2xlarge"

See AWS EC2 Instance Types for options.

Client Startup Script

The client VMs can be configured with a custom startup script. See the LabLink Infrastructure README for more details.

Troubleshooting

Orphaned Resources After Failed Destroy

Cause: Destroy workflow failed or Terraform state is out of sync with AWS resources

Solution: Use the automated cleanup script:

# Dry-run to see what would be deleted
./scripts/cleanup-orphaned-resources.sh <environment> --dry-run

# Actual cleanup
./scripts/cleanup-orphaned-resources.sh <environment>

The script automatically reads configuration from config.yaml, backs up Terraform state files, and deletes resources in the correct dependency order. For detailed manual cleanup procedures, see MANUAL_CLEANUP_GUIDE.md.

Deployment Fails with "InvalidAMI"

Cause: AMI ID doesn't exist in your region

Solution: Update ami_id in config.yaml with a region-appropriate AMI
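
Before redeploying, you can confirm that an AMI ID is visible in the target region. A minimal sketch using `aws ec2 describe-images`; the function names are hypothetical:

```shell
#!/usr/bin/env bash
# ami_name_in_region AMI_ID REGION -> prints the AMI name if it is visible
# in REGION; a non-zero exit means the lookup failed
ami_name_in_region() {
  aws ec2 describe-images --region "$2" --image-ids "$1" \
    --query 'Images[0].Name' --output text
}

# check_ami AMI_ID REGION -> prints a human-readable verdict
check_ami() {
  local name
  if name="$(ami_name_in_region "$1" "$2" 2>/dev/null)" && [ "$name" != "None" ]; then
    echo "OK: $1 exists in $2 ($name)"
  else
    echo "MISSING: $1 not found in $2; update ami_id in config.yaml"
    return 1
  fi
}

# Example:
#   check_ami ami-0601752c11b394251 us-west-2
```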

Cannot Access Allocator Web Interface

Cause: Security group or DNS not configured

Solution:

Check security group allows inbound traffic on port 5000
If using DNS, verify DNS records propagated
Try accessing via public IP first

Terraform State Lock Error

Cause: A previous deployment didn't complete or clean up its state lock

Solution:

# Run from lablink-infrastructure/; the lock ID is printed in the lock error message
terraform force-unlock LOCK_ID

DNS Not Resolving

Cause: DNS propagation delay or Route 53 not configured

Solution:

Wait 5-10 minutes for propagation
Verify Route 53 hosted zone exists
Check nameservers match at domain registrar
Use nslookup your-domain.com to test
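
The checks above can be scripted so they are easy to repeat while waiting for propagation. This sketch assumes `dig` is installed (it is used here instead of `nslookup` because its `+short` output is easier to parse); `dns_points_to` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# dns_points_to DOMAIN EXPECTED_IP -> exit 0 if one of the domain's
# A records equals EXPECTED_IP
dns_points_to() {
  local domain="$1" expected="$2"
  # -F literal match, -x whole line, -q quiet
  dig +short "$domain" A | grep -Fxq "$expected"
}

# Example: poll every 30 seconds until the record resolves to the allocator IP.
#   until dns_points_to test.lablink.example.com 203.0.113.10; do
#     echo "not propagated yet"; sleep 30
#   done
```

With the CloudFlare proxy enabled, the domain resolves to CloudFlare's addresses rather than the allocator IP, so this comparison only applies to the Route 53 and unproxied setups.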

More Help

Main Documentation: https://talmolab.github.io/lablink/
Infrastructure Docs: lablink-infrastructure/README.md
GitHub Issues: https://github.com/talmolab/lablink/issues
Deployment Checklist: DEPLOYMENT_CHECKLIST.md

Project Structure

lablink-template/
├── .github/workflows/                  # GitHub Actions workflows
│   ├── terraform-deploy.yml            # Deploy infrastructure (OIDC → AWS)
│   ├── terraform-destroy.yml           # Destroy infrastructure + client VMs
│   ├── config-validation.yml           # Validate config.yaml on PR
│   └── startup-script-validation.yml   # Lint custom-startup.sh on PR
├── lablink-infrastructure/             # Terraform infrastructure
│   ├── main.tf                         # Core Terraform config (EC2, EIP, IAM, Route53)
│   ├── alb.tf                          # ALB resources (only when ssl.provider="acm")
│   ├── backend.tf                      # Backend configuration
│   ├── backend-*.hcl                   # Per-environment backend overrides (dev/test/prod/ci-test)
│   ├── user_data.sh                    # EC2 initialization script (templated by Terraform)
│   ├── config/
│   │   ├── config.yaml                 # Your active configuration
│   │   ├── *.example.yaml              # Per-flavor templates (ip-only, cloudflare, letsencrypt, acm, dev/test/prod, ci-test)
│   │   ├── custom-startup.sh           # Optional per-client-VM startup hook
│   │   └── README.md                   # Detailed config selection guide
│   └── README.md                       # Infrastructure documentation
├── scripts/                            # Helper scripts
│   ├── setup.sh                        # One-time setup: OIDC, IAM, S3, DynamoDB, GitHub secrets
│   ├── configure.sh                    # Interactive config.yaml wizard (re-runnable)
│   ├── init-terraform.sh               # Terraform init helper (reads bucket from config)
│   ├── verify-deployment.sh            # Post-deploy DNS/HTTP/SSL checks
│   ├── estimate-costs.sh               # Daily AWS cost estimate for a given config
│   ├── cleanup-orphaned-resources.sh   # Recover from failed `terraform destroy`
│   └── validate-all-configs.{sh,ps1}   # Validate every *.example.yaml against the schema
├── MANUAL_CLEANUP_GUIDE.md             # Manual cleanup procedures
├── DEPLOYMENT_CHECKLIST.md             # Pre-deployment checklist
├── README.md                           # This file
└── LICENSE

Contributing

Found an issue with the template or want to suggest improvements?

Open an issue: https://github.com/talmolab/lablink-template/issues
For LabLink core issues: https://github.com/talmolab/lablink/issues

License

BSD 2-Clause License - see LICENSE file for details.

Links

Main LabLink Repository: https://github.com/talmolab/lablink
Documentation: https://talmolab.github.io/lablink/
Template Repository: https://github.com/talmolab/lablink-template
Example Deployment: https://github.com/talmolab/sleap-lablink (SLEAP-specific configuration)

Need Help? Check the Deployment Checklist or Troubleshooting section above.
