GitHub Template Repository for deploying LabLink infrastructure to AWS
Deploy your own LabLink infrastructure for cloud-based VM allocation and management. This template uses Terraform and GitHub Actions to automate deployment of the LabLink allocator service to AWS.
📖 Main Documentation: https://talmolab.github.io/lablink/
LabLink automates deployment and management of cloud-based VMs for running research software. It provides:
- Web interface for requesting and managing VMs
- Automatic VM provisioning with your software pre-installed
- GPU support for ML/AI workloads
- Chrome Remote Desktop access to VM GUI
- Flexible configuration for different research needs
There are two supported ways to stand up a LabLink deployment. Both consume the same Hydra-validated config.yaml, so you can switch between them later without redoing the AWS setup.
| Path | When to choose it | How it works |
|---|---|---|
| Path A: lablink-cli (single terminal) | You want to deploy from your laptop without wiring up GitHub Actions, or you prefer a single-process workflow. | uv tool install lablink-cli → lablink doctor → lablink configure → lablink setup → lablink deploy. See the lablink-cli docs. |
| Path B: this template + GitHub Actions (covered below) | You want fork-and-deploy with deployments triggered/audited from GitHub, OIDC-based AWS auth, and state stored in S3. | Click Use this template → run ./scripts/setup.sh → trigger the terraform-deploy.yml workflow. |
The rest of this README documents Path B. If you're new and have no preference for GitHub Actions, Path A is the lowest-friction option.
Heads-up: the interactive TUI config wizard (lablink configure) is currently Path A-only. Path B users get the typed Hydra schema and the example YAMLs in lablink-infrastructure/config/ but no interactive wizard inside GitHub Actions. Tracked in lablink#339.
Click the "Use this template" button at the top of this repository to create your own deployment repository.
The setup script creates AWS infrastructure and GitHub secrets:
./scripts/setup.sh
What the script does:
- Checks prerequisites (AWS CLI, GitHub CLI, credentials)
- Creates OIDC provider and IAM role for GitHub Actions
- Creates S3 bucket (with versioning) and DynamoDB table
- Creates Route53 hosted zone (if using custom domain)
- Sets GitHub secrets (AWS_ROLE_ARN, AWS_REGION, ADMIN_PASSWORD, DB_PASSWORD)
- Calls configure.sh to generate lablink-infrastructure/config/config.yaml
- Verifies all resources were created successfully
The script is idempotent — safe to re-run if interrupted.
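The prerequisite check in the first step can be approximated as below. This is a minimal sketch, not the contents of setup.sh: the function name `missing_tools` is hypothetical, and it only verifies that the `aws` and `gh` binaries are on `PATH` (credential checks are left out).

```bash
# Hypothetical helper: print each named command that is not on PATH.
# Returns 0 when every command is present, 1 otherwise.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# The AWS CLI and GitHub CLI are the tools named in the list above.
if missing=$(missing_tools aws gh); then
  echo "all prerequisites found"
else
  # Unquoted on purpose so the names are joined onto one line.
  echo "missing:" $missing
fi
```

Because the check has no side effects, it fits the re-runnable behaviour described for the setup script.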
Updating configuration later: To change settings like instance type, image tags, or DNS options without re-creating infrastructure, run the configuration wizard directly:
./scripts/configure.sh
This can be run as many times as needed. It reads your existing config.yaml values as defaults.
Important: The config file path (lablink-infrastructure/config/config.yaml) is hardcoded in the infrastructure. Do not move or rename this file.
See Configuration Reference for all options, or Manual Setup if you prefer to create resources individually.
Via GitHub Actions (Recommended):
- Go to Actions → "Deploy LabLink Infrastructure"
- Click "Run workflow"
- Select environment (test, prod, or ci-test)
- Click "Run workflow"
Via Local Terraform:
cd lablink-infrastructure
../scripts/init-terraform.sh test
terraform apply -var="resource_suffix=test"
After deployment completes:
- Allocator URL: Check workflow output or Terraform output for the URL/IP
- SSH Access: Download the PEM key from workflow artifacts
- Web Interface: Navigate to allocator URL in your browser
- AWS Account with permissions to create:
  - EC2 instances
  - Security Groups
  - Elastic IPs
  - (Optional) Route 53 records for DNS
- GitHub Account with ability to:
  - Create repositories from templates
  - Configure GitHub Actions secrets
  - Run GitHub Actions workflows
- Basic Knowledge of:
  - Terraform (helpful but not required)
  - AWS services
Before deploying, you must set up:
- S3 Bucket for Terraform state storage
- IAM Role for GitHub Actions OIDC authentication
- (Optional) Elastic IP for persistent allocator address
- (Optional) Route 53 Hosted Zone for custom domain
See AWS Setup Guide below for detailed instructions.
The deploy and destroy workflows authenticate to AWS using OpenID Connect (OIDC) rather than static AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY secrets. The flow at deploy time looks like:
- GitHub Actions issues a short-lived JSON Web Token (JWT) for the running workflow, signed by token.actions.githubusercontent.com.
- The workflow calls sts:AssumeRoleWithWebIdentity against the IAM role you registered (AWS_ROLE_ARN).
- AWS validates the JWT against the OIDC provider trust policy (which restricts which repo:ORG/REPO:* subject can assume the role) and returns temporary credentials.
- Terraform uses those temporary credentials for the duration of the job (typically an hour or less), after which they expire.
Why this matters:
- No long-lived AWS keys ever live in GitHub secrets, so a compromised repository secret cannot be replayed indefinitely.
- The trust policy pins the role to your specific repository (and optionally a branch/environment), so other repos in your org can't assume it by accident.
- Credentials auto-rotate every workflow run — there is no key to rotate manually.
./scripts/setup.sh creates the OIDC provider, the IAM role with the correct trust policy, and writes the role ARN to the AWS_ROLE_ARN GitHub secret for you. The manual steps below are for users who prefer to wire this up themselves.
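To make the repository pinning concrete, the sketch below mimics how a `StringLike` condition compares a token's `sub` claim with the `repo:ORG/REPO:*` pattern. It is an illustration only: `sub_matches` is a hypothetical helper, shell globbing merely stands in for IAM's `*` and `?` wildcards, and the real evaluation happens inside AWS.

```bash
# Hypothetical helper: does a token's `sub` claim match a trust-policy pattern?
# The unquoted $pattern in `case` is treated as a glob, approximating StringLike.
sub_matches() {
  pattern=$1
  subject=$2
  case "$subject" in
    $pattern) return 0 ;;
    *) return 1 ;;
  esac
}

pattern='repo:YOUR_ORG/YOUR_REPO:*'

# A workflow run on the main branch of the pinned repository is accepted.
sub_matches "$pattern" 'repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main' && echo "allowed"

# The same workflow in another repository of the same org is rejected.
sub_matches "$pattern" 'repo:YOUR_ORG/other-repo:ref:refs/heads/main' || echo "denied"
```

Narrowing the pattern, for example to `repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main`, is how the optional branch restriction mentioned above would be expressed.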
Create an IAM role with OIDC provider for GitHub Actions:
- Create OIDC provider in IAM (if not exists):
  - Provider URL: https://token.actions.githubusercontent.com
  - Audience: sts.amazonaws.com
- Create IAM role with trust policy:
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
          "StringLike": {
            "token.actions.githubusercontent.com:sub": "repo:YOUR_ORG/YOUR_REPO:*"
          }
        }
      }
    ]
  }
- Attach permissions: PowerUserAccess (or custom policy with EC2, VPC, S3, Route53, IAM permissions)
- Copy the Role ARN and add to GitHub secrets
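If you script the manual steps, the trust policy can be rendered from three values instead of hand-editing the JSON. The sketch below assumes nothing beyond the policy shown above; `render_trust_policy` and the sample account ID are hypothetical.

```bash
# Hypothetical helper: print the trust policy for one account and repository.
#   $1 = AWS account ID, $2 = GitHub organization, $3 = repository name
render_trust_policy() {
  account_id=$1
  org=$2
  repo=$3
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:${org}/${repo}:*"
        }
      }
    }
  ]
}
EOF
}

# Placeholder values; substitute your own account ID, organization, and repository.
render_trust_policy 123456789012 YOUR_ORG YOUR_REPO
```

Redirect the output to a file such as trust-policy.json and pass it to `aws iam create-role` via `--assume-role-policy-document file://trust-policy.json`, with a `--role-name` of your choosing.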
The AWS region where your infrastructure will be deployed. Must match the region in your config.yaml.
Common regions:
- us-west-2 (Oregon)
- us-east-1 (N. Virginia)
- eu-west-1 (Ireland)
Important: AMI IDs are region-specific. If you change regions, update the ami_id in config.yaml.
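A mismatch between the AWS_REGION secret and config.yaml is easy to catch before deploying. The following is a minimal sketch with hypothetical helper names (`config_region`, `check_region`); it assumes the indented two-level layout where `region` sits under `app:` and is not a general YAML parser.

```bash
# Hypothetical helper: print the value of app.region from a config.yaml.
config_region() {
  awk '
    /^[^ #]/ { in_app = ($1 == "app:") }   # track the current top-level key
    in_app && $1 == "region:" {
      gsub(/"/, "", $2)                    # strip surrounding quotes
      print $2
      exit
    }
  ' "$1"
}

# Hypothetical helper: fail when the secret and the config disagree.
#   $1 = path to config.yaml, $2 = value of the AWS_REGION secret
check_region() {
  found=$(config_region "$1")
  if [ "$found" != "$2" ]; then
    echo "region mismatch: config.yaml has '$found', AWS_REGION is '$2'" >&2
    return 1
  fi
}
```

A typical call would be `check_region lablink-infrastructure/config/config.yaml "$AWS_REGION"`.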
Password for accessing the allocator web interface. Choose a strong password (12+ characters, mixed case, numbers, symbols).
This password is used to log in to the admin dashboard where you can:
- Create and destroy client VMs
- View VM status
- Assign VMs to users
Password for the PostgreSQL database used by the allocator service. Choose a different strong password than ADMIN_PASSWORD.
This is stored securely and injected into the configuration at deployment time.
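The password guidance above (12+ characters, mixed case, numbers, symbols, and a DB_PASSWORD that differs from ADMIN_PASSWORD) can be checked locally before the secrets are set. This is a sketch with hypothetical function names, not something the template is known to enforce.

```bash
# Hypothetical helper: succeed only if the password has 12+ characters and at
# least one lowercase letter, uppercase letter, digit, and symbol.
is_strong_password() {
  pw=$1
  [ "${#pw}" -ge 12 ] || return 1
  case "$pw" in *[[:lower:]]*) ;; *) return 1 ;; esac
  case "$pw" in *[[:upper:]]*) ;; *) return 1 ;; esac
  case "$pw" in *[[:digit:]]*) ;; *) return 1 ;; esac
  case "$pw" in *[![:alnum:]]*) ;; *) return 1 ;; esac
  return 0
}

# Hypothetical helper: both passwords must be strong and must differ.
#   $1 = admin password, $2 = database password
check_passwords() {
  is_strong_password "$1" || { echo "ADMIN_PASSWORD is too weak" >&2; return 1; }
  is_strong_password "$2" || { echo "DB_PASSWORD is too weak" >&2; return 1; }
  [ "$1" != "$2" ] || { echo "passwords must differ" >&2; return 1; }
}
```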
The setup script creates all infrastructure and secrets in one go:
./scripts/setup.sh
This creates all required AWS resources (OIDC provider, IAM role, S3 bucket, DynamoDB table, Route53 hosted zone), sets GitHub secrets, and calls configure.sh to generate config.yaml. It is idempotent and safe to re-run.
To update configuration later (instance types, image tags, DNS/SSL options, etc.), run the config wizard directly:
./scripts/configure.sh
What the script does NOT do:
- Does NOT register domain names (you must register via Route53 registrar, CloudFlare, or other registrar)
- Does NOT create DNS records (Terraform handles these, or you create manually)
After setup, your DNS/SSL approach is configured based on your wizard choices:
- Route53 + Let's Encrypt: Register domain, update nameservers to Route53
- CloudFlare DNS + SSL: Manage domain/DNS in CloudFlare, create A record pointing to allocator IP
- IP-only (no DNS/SSL): Access via IP address directly
If you prefer to create resources manually:
# Create bucket (must be globally unique across ALL of AWS)
aws s3 mb s3://tf-state-YOUR-ORG-lablink --region us-west-2
# Enable versioning (recommended)
aws s3api put-bucket-versioning \
--bucket tf-state-YOUR-ORG-lablink \
--versioning-configuration Status=Enabled
Update bucket_name in lablink-infrastructure/config/config.yaml to match.
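When substituting your own value for YOUR-ORG, keep in mind that S3 bucket names cannot contain uppercase letters. The sketch below checks a subset of the S3 naming rules before you run the commands above; `is_valid_bucket_name` is a hypothetical helper, and it does not test global uniqueness.

```bash
# Hypothetical helper: check a subset of the S3 bucket naming rules
# (3-63 characters; lowercase letters, digits, dots, hyphens; first and last
# character must be a letter or digit). Runs in a subshell so that forcing the
# C locale for the character ranges does not leak into the caller.
is_valid_bucket_name() (
  LC_ALL=C
  name=$1
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || return 1
  case "$name" in
    *[!a-z0-9.-]*) return 1 ;;          # disallowed character
    [!a-z0-9]*|*[!a-z0-9]) return 1 ;;  # bad first or last character
  esac
  return 0
)

is_valid_bucket_name tf-state-acme-lablink && echo "ok"
is_valid_bucket_name tf-state-YOUR-ORG-lablink || echo "rejected: uppercase placeholder"
```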
aws dynamodb create-table \
--table-name lock-table \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-west-2
For persistent allocator IP address across deployments:
# Allocate EIP
aws ec2 allocate-address --domain vpc --region us-west-2
# Tag it for reuse
aws ec2 create-tags \
--resources eipalloc-XXXXXXXX \
--tags Key=Name,Value=lablink-eip
Update eip.tag_name in config.yaml if using a different tag name.
If using a custom domain:
- Create or use existing hosted zone:
  aws route53 create-hosted-zone --name your-domain.com --caller-reference $(date +%s)
- Update your domain's nameservers to point to Route 53 NS records
- Update dns section in config.yaml:
  dns:
    enabled: true
    domain: "your-domain.com"
    zone_id: "Z..." # Optional - will auto-lookup if empty
See GitHub Secrets Setup above for detailed IAM role configuration.
The repo ships several example configs under lablink-infrastructure/config/. Pick the one that matches how you want DNS/SSL set up; ./scripts/setup.sh copies your selection to config.yaml.
| Example | DNS provider | SSL provider | When to use |
|---|---|---|---|
| ip-only.example.yaml | None (access via IP) | None (HTTP only) | Fastest path; demos, debugging, throwaway dev. No domain needed. |
| cloudflare.example.yaml | CloudFlare (manual A record) | CloudFlare proxy | Frequent redeploys without Let's Encrypt rate limits; you already manage DNS in CloudFlare. |
| letsencrypt.example.yaml | Route 53 (Terraform-managed) | Let's Encrypt via Caddy | Stable production / staging with a Route 53 hosted zone. Limit: 5 certs / domain / 7 days. |
| letsencrypt-manual.example.yaml | Route 53 (manual A record) | Let's Encrypt via Caddy | Same as above but you want to manage the A record yourself (e.g., migrations). |
| acm.example.yaml | Route 53 (Terraform-managed) | AWS ACM via Application Load Balancer | Enterprise production; no Let's Encrypt limits, but ALB adds ~$20/mo. |
| dev.example.yaml | Configurable | Configurable | Local Terraform state (no S3 backend); local prototyping. |
| test.example.yaml | Configurable | Configurable | Staging environment, S3-backed state. |
| prod.example.yaml | Configurable | Configurable | Production environment, S3-backed state. |
| ci-test.example.yaml | Route 53 | Let's Encrypt | Template-maintainer CI only; do not use for application deployments. |
Decision shortcut:
- No domain? → ip-only.example.yaml.
- Domain in CloudFlare? → cloudflare.example.yaml.
- Domain in Route 53, deploy weekly or less? → letsencrypt.example.yaml.
- Domain in Route 53, deploy multiple times per week? → cloudflare.example.yaml (avoids Let's Encrypt rate limits) or acm.example.yaml.
See lablink-infrastructure/config/README.md for the full decision tree, per-flavor pros/cons, and rate-limit recovery procedures.
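The decision shortcut maps directly onto a small lookup. The sketch below is illustrative only: `pick_example` and its argument vocabulary are hypothetical, and it encodes just the four rules listed above, not the full decision tree in the config README.

```bash
# Hypothetical helper: map the decision shortcut onto an example file name.
#   $1 = where the domain lives: none | cloudflare | route53
#   $2 = deploy frequency (route53 only): weekly-or-less | multiple-per-week
pick_example() {
  case "$1:$2" in
    none:*)                    echo "ip-only.example.yaml" ;;
    cloudflare:*)              echo "cloudflare.example.yaml" ;;
    route53:weekly-or-less)    echo "letsencrypt.example.yaml" ;;
    route53:multiple-per-week) echo "cloudflare.example.yaml" ;;  # or acm.example.yaml
    *) echo "unknown combination: $1 $2" >&2; return 1 ;;
  esac
}

pick_example route53 weekly-or-less
```

Copying the chosen file to lablink-infrastructure/config/config.yaml by hand has the same effect as the selection step that ./scripts/setup.sh is described as performing.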
All configuration is in lablink-infrastructure/config/config.yaml.
db:
  dbname: "lablink_db"
  user: "lablink"
  password: "PLACEHOLDER_DB_PASSWORD" # Injected from GitHub secret
  host: "localhost"
  port: 5432

machine:
  machine_type: "g4dn.xlarge" # AWS instance type
  image: "ghcr.io/talmolab/lablink-client-base-image:latest" # Docker image
  ami_id: "ami-0601752c11b394251" # Region-specific AMI
  repository: "https://github.com/YOUR_ORG/YOUR_REPO.git" # Your code/data repo
  software: "your-software" # Software identifier
  extension: "ext" # Data file extension

Instance Types:
- g4dn.xlarge - GPU instance (NVIDIA T4, good for ML)
- t3.large - CPU-only, cheaper
- p3.2xlarge - More powerful GPU (NVIDIA V100)
AMI IDs (Ubuntu 24.04 with Docker + Nvidia):
- us-west-2: ami-0601752c11b394251
- Other regions: Use AWS Console to find similar AMI or create custom
app:
  admin_user: "admin"
  admin_password: "PLACEHOLDER_ADMIN_PASSWORD" # Injected from secret
  region: "us-west-2" # Must match AWS_REGION secret

dns:
  enabled: false # true to use DNS, false for IP-only
  terraform_managed: false # true = Terraform creates records
  domain: "lablink.example.com" # Full domain name (e.g., test.lablink.example.com)
  zone_id: "" # Leave empty for auto-lookup

Domain Naming:
- Specify the full domain directly (e.g., lablink.example.com or test.lablink.example.com)
- No automatic subdomain construction - use exactly what you specify
ssl:
  provider: "none" # "letsencrypt", "cloudflare", "acm", or "none"
  email: "admin@example.com" # For Let's Encrypt notifications
  certificate_arn: "" # Required when provider="acm"

SSL Providers:
- none: HTTP only (for testing)
- letsencrypt: Automatic SSL with Caddy (production certs)
- cloudflare: Use CloudFlare proxy for SSL
- acm: AWS Certificate Manager via Application Load Balancer
When using Let's Encrypt (ssl.provider: "letsencrypt"), be aware of rate limits:
| Limit Type | Limit | Lockout Period |
|---|---|---|
| Certificates per exact domain | 5 per week | 7 days |
| Certificates per registered domain | 50 per week | 7 days |
What this means:
- You can only deploy the same domain (e.g., test.lablink.example.com) 5 times in 7 days
- If you hit the limit, you must wait 7 days before deploying that domain again
- No override available for the per-domain limit
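If you keep a record of when certificates were issued for a domain, you can check the remaining budget before redeploying. The sketch below models the limit exactly as the table describes it (at most 5 issuances per exact domain in any 7-day window); the helper names are hypothetical, and Let's Encrypt's own accounting may differ in detail.

```bash
# Hypothetical helper: count issuances that fall inside the last 7 days.
#   $1 = current time (epoch seconds); remaining args = issuance times (epoch seconds)
issuances_in_window() {
  now=$1
  shift
  window=$((7 * 24 * 60 * 60))
  count=0
  for ts in "$@"; do
    if [ "$ts" -le "$now" ] && [ $((now - ts)) -lt "$window" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Hypothetical helper: succeed while fewer than 5 issuances are in the window.
can_deploy() {
  [ "$(issuances_in_window "$@")" -lt 5 ]
}
```

For example, `can_deploy "$(date +%s)" 1700000000 1700100000` succeeds as long as fewer than five of the listed timestamps are less than seven days old.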
Testing Strategies to Avoid Rate Limits:
| Strategy | DNS | SSL | Use Case | Rate Limit Risk |
|---|---|---|---|---|
| IP-only | Disabled | None | Development/debugging | ✅ None |
| CloudFlare | Enabled | CloudFlare | Frequent testing | ✅ None |
| Subdomain rotation | Enabled | Let's Encrypt | SSL testing | |
| Production | Enabled | Let's Encrypt | Stable deployment | |
📖 See Testing Best Practices for detailed testing strategies and monitoring certificate usage.
eip:
  strategy: "persistent" # "persistent" or "dynamic"
  tag_name: "lablink-eip" # Tag to find reusable EIP

Deploys or updates your LabLink infrastructure.
Triggers:
- Manual: Actions → "Deploy LabLink Infrastructure" → Run workflow
- Automatic: Push to test branch
Inputs:
- environment: test or prod
What it does:
- Configures AWS credentials via OIDC
- Injects passwords from GitHub secrets into config
- Runs Terraform to create/update infrastructure
- Verifies deployment and DNS
- Uploads SSH key as artifact
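The password-injection step amounts to replacing the PLACEHOLDER_ADMIN_PASSWORD and PLACEHOLDER_DB_PASSWORD strings in config.yaml with the secret values. How the workflow actually does this is not shown in this README; the sketch below is one possible approach, with a hypothetical `inject_secret` helper that uses literal string replacement so that special characters in a password need no escaping.

```bash
# Hypothetical helper: replace every occurrence of a placeholder in a file.
#   $1 = placeholder text, $2 = replacement value, $3 = file to edit in place
# awk index()/substr() treat both strings literally, so characters such as
# `&`, `/`, or `\` in a password are safe (unlike a sed replacement).
inject_secret() {
  [ -n "$1" ] || return 1
  PLACEHOLDER=$1 VALUE=$2 awk '
    {
      line = $0
      out = ""
      while ((i = index(line, ENVIRON["PLACEHOLDER"])) > 0) {
        out = out substr(line, 1, i - 1) ENVIRON["VALUE"]
        line = substr(line, i + length(ENVIRON["PLACEHOLDER"]))
      }
      print out line
    }
  ' "$3" > "$3.tmp" && mv "$3.tmp" "$3"
}
```

A call such as `inject_secret PLACEHOLDER_DB_PASSWORD "$DB_PASSWORD" lablink-infrastructure/config/config.yaml` would fill in the database password; passing the values through the environment also keeps them out of the awk command line.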
Triggers:
- Manual only: Actions → "Destroy LabLink Infrastructure" → Run workflow
Inputs:
- confirm_destroy: Must type "yes" to confirm
- environment: test or prod
What it does:
- Creates a minimal terraform backend configuration
- Initializes Terraform with S3 backend to access client VM state
- Destroys client VMs directly from the S3 state (for test/prod/ci-test)
- Destroys the allocator infrastructure (EC2, security groups, EIP, etc.)
Note: Client VM state is stored in S3 (same bucket as infrastructure state). Terraform can destroy resources using only the state file - no terraform configuration files needed!
If the destroy workflow fails or leaves orphaned resources, see the Manual Cleanup Guide for step-by-step procedures to:
- Remove orphaned IAM roles, policies, and instance profiles
- Clean up leftover EC2 instances, security groups, and key pairs
- Fix Terraform state file issues (checksum mismatches, corrupted state)
- Verify complete resource removal
Common scenarios covered:
- Destroy workflow failures
- "Resource in use" errors
- Orphaned client VMs
- State lock issues
- Update config.yaml:
  machine:
    repository: "https://github.com/your-org/your-software-data.git"
    software: "your-software-name"
    extension: "your-file-ext" # e.g., "h5", "npy", "csv"
- (Optional) Use custom Docker image:
  machine:
    image: "ghcr.io/your-org/your-custom-image:latest"
- Update config.yaml:
  app:
    region: "eu-west-1" # Your region
  machine:
    ami_id: "ami-XXXXXXX" # Region-specific AMI
- Update GitHub secret AWS_REGION
- Find appropriate AMI for region (Ubuntu 24.04 with Docker)
machine:
  machine_type: "t3.xlarge" # No GPU, cheaper
  # or
  machine_type: "p3.2xlarge" # More powerful GPU

See AWS EC2 Instance Types for options.
The client VMs can be configured with a custom startup script. See the LabLink Infrastructure README for more details.
Cause: Destroy workflow failed or Terraform state is out of sync with AWS resources
Solution: Use the automated cleanup script:
# Dry-run to see what would be deleted
./scripts/cleanup-orphaned-resources.sh <environment> --dry-run
# Actual cleanup
./scripts/cleanup-orphaned-resources.sh <environment>
The script automatically reads configuration from config.yaml, backs up Terraform state files, and deletes resources in the correct dependency order. For detailed manual cleanup procedures, see MANUAL_CLEANUP_GUIDE.md.
Cause: AMI ID doesn't exist in your region
Solution: Update ami_id in config.yaml with a region-appropriate AMI
Cause: Security group or DNS not configured
Solution:
- Check security group allows inbound traffic on port 5000
- If using DNS, verify DNS records propagated
- Try accessing via public IP first
Cause: Previous deployment didn't complete or cleanup
Solution:
# In lablink-infrastructure/
terraform force-unlock LOCK_ID
Cause: DNS propagation delay or Route 53 not configured
Solution:
- Wait 5-10 minutes for propagation
- Verify Route 53 hosted zone exists
- Check nameservers match at domain registrar
- Use nslookup your-domain.com to test
- Main Documentation: https://talmolab.github.io/lablink/
- Infrastructure Docs: lablink-infrastructure/README.md
- GitHub Issues: https://github.com/talmolab/lablink/issues
- Deployment Checklist: DEPLOYMENT_CHECKLIST.md
lablink-template/
├── .github/workflows/ # GitHub Actions workflows
│ ├── terraform-deploy.yml # Deploy infrastructure (OIDC → AWS)
│ ├── terraform-destroy.yml # Destroy infrastructure + client VMs
│ ├── config-validation.yml # Validate config.yaml on PR
│ └── startup-script-validation.yml # Lint custom-startup.sh on PR
├── lablink-infrastructure/ # Terraform infrastructure
│ ├── main.tf # Core Terraform config (EC2, EIP, IAM, Route53)
│ ├── alb.tf # ALB resources (only when ssl.provider="acm")
│ ├── backend.tf # Backend configuration
│ ├── backend-*.hcl # Per-environment backend overrides (dev/test/prod/ci-test)
│ ├── user_data.sh # EC2 initialization script (templated by Terraform)
│ ├── config/
│ │ ├── config.yaml # Your active configuration
│ │ ├── *.example.yaml # Per-flavor templates (ip-only, cloudflare, letsencrypt, acm, dev/test/prod, ci-test)
│ │ ├── custom-startup.sh # Optional per-client-VM startup hook
│ │ └── README.md # Detailed config selection guide
│ └── README.md # Infrastructure documentation
├── scripts/ # Helper scripts
│ ├── setup.sh # One-time setup: OIDC, IAM, S3, DynamoDB, GitHub secrets
│ ├── configure.sh # Interactive config.yaml wizard (re-runnable)
│ ├── init-terraform.sh # Terraform init helper (reads bucket from config)
│ ├── verify-deployment.sh # Post-deploy DNS/HTTP/SSL checks
│ ├── estimate-costs.sh # Daily AWS cost estimate for a given config
│ ├── cleanup-orphaned-resources.sh # Recover from failed `terraform destroy`
│ └── validate-all-configs.{sh,ps1} # Validate every *.example.yaml against the schema
├── MANUAL_CLEANUP_GUIDE.md # Manual cleanup procedures
├── DEPLOYMENT_CHECKLIST.md # Pre-deployment checklist
├── README.md # This file
└── LICENSE
Found an issue with the template or want to suggest improvements?
- Open an issue: https://github.com/talmolab/lablink-template/issues
- For LabLink core issues: https://github.com/talmolab/lablink/issues
BSD 2-Clause License - see LICENSE file for details.
- Main LabLink Repository: https://github.com/talmolab/lablink
- Documentation: https://talmolab.github.io/lablink/
- Template Repository: https://github.com/talmolab/lablink-template
- Example Deployment: https://github.com/talmolab/sleap-lablink (SLEAP-specific configuration)
Need Help? Check the Deployment Checklist or Troubleshooting section above.