Reference LLMOps implementation for fine-tuning and deploying HuggingFace language models on Amazon SageMaker AI
This repo contains a reusable accelerator that demonstrates LLM lifecycle orchestration from model fine-tuning to deployment, with automated GitHub repository creation, CI/CD pipelines, and monitoring capabilities. It serves as a foundation that can be customized and hardened for your specific production requirements.
This diagram showcases an LLMOps architecture that integrates GitHub Actions with SageMaker services for automated data preprocessing, model training, model evaluation and registration.
- Create SageMaker Project → Platform detects project creation
- Auto-Generate Repositories → Creates build & deploy repos with seed code
- Train Models → Push code triggers automated training pipelines
- Register Models → Successful training registers models in Model Registry
- Deploy Models → Model approval triggers automated deployment workflows
- ✅ Streamlined Setup - Simplified deployment with make deploy
- ✅ End-to-End Automation - From code commit to model deployment
- ✅ MLflow Integration - Experiment tracking and model registry
- ✅ Monitoring Capabilities - CloudWatch dashboards and metrics
- ✅ Security Foundations - IAM roles, VPC isolation, secrets management
- ✅ Resource Configuration - Environment-specific resource sizing
Note: This is a reference implementation. Additional hardening, security reviews, and customization are recommended before deploying to production environments.
- SageMaker Pipelines: Orchestrated ML workflows for data processing, training, and evaluation
- SageMaker Model Registry: Centralized model versioning and approval workflow for trained ESG models
- S3 Storage: Artifact storage for datasets, models, and pipeline outputs
- IAM Roles: Secure cross-service authentication using OIDC and execution roles
The platform uses a two-layer architecture to separate reusable infrastructure from project-specific templates:
LLMOps Platform (Shared)
- SageMaker Domain Stack
  - Studio Domain & User Profiles
  - Execution Roles & Permissions
  - Shared S3 Buckets
- Infrastructure Stack
  - EventBridge Rules (project creation, model approval)
  - Lambda Functions (GitHub repo management)
  - Step Functions (orchestration workflows)
  - GitHub OIDC Provider & IAM Roles
  - Service Catalog Portfolio
- Observability Stack
  - MLflow Tracking Server (ECS Fargate)
  - CloudWatch Dashboards
  - SNS/Slack Notifications

Project Template (Per Use Case)
- CloudFormation Template
  - Model Package Group (for model versioning)
  - Project-specific S3 Buckets
  - Resource Tags (project-id, project-name)
- Event-Driven Workflow (Triggered by EventBridge)
  - Step Functions: Orchestrates repo creation
  - Lambda: Creates GitHub repositories
  - Lambda: Populates repos with seed code
  - Lambda: Configures GitHub Actions variables
- Seed Code (Copied to GitHub Repos)
  - model_build/ - Training pipeline
    - SageMaker Pipelines definition
    - Training scripts
    - GitHub Actions workflows
  - model_deploy/ - Deployment pipeline
    - CDK stack for endpoints
    - GitHub Actions workflows
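The model-approval trigger in the architecture above can be illustrated with a small event filter. This is a minimal sketch, assuming the EventBridge rule matches SageMaker's "SageMaker Model Package State Change" events and that deployment should start only when `ModelApprovalStatus` is `Approved`. The pattern, the function names, and the sample event are illustrative assumptions, not code taken from this repository.

```python
from typing import Any, Dict

# Assumed event pattern; the rule actually deployed by the platform may differ.
MODEL_APPROVAL_PATTERN: Dict[str, Any] = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["Approved"]},
}


def matches_pattern(event: Dict[str, Any], pattern: Dict[str, Any]) -> bool:
    """Return True if the event satisfies every key in the pattern.

    Only the subset of EventBridge matching needed here is supported:
    lists of exact values and nested dictionaries.
    """
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches_pattern(actual, expected):
                return False
        elif actual not in expected:
            return False
    return True


def should_trigger_deployment(event: Dict[str, Any]) -> bool:
    """Hypothetical helper: decide whether an event should start a deployment."""
    return matches_pattern(event, MODEL_APPROVAL_PATTERN)
```

A handler could call `should_trigger_deployment(event)` first and return early for any event that is not an approval, so rejected or pending model packages never reach the deployment workflow.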
This is a generic LLMOps platform that can be used for any HuggingFace language model fine-tuning project. It provides:
- SageMaker Studio Domain - Development environment for data scientists
- GitHub Integration - Automated repository creation with OIDC authentication
- MLflow Tracking - Centralized experiment tracking and model registry
- Event-Driven Automation - EventBridge + Lambda + Step Functions orchestration
- Observability Stack - CloudWatch dashboards, SNS notifications, Slack integration
- CloudFormation Template - Creates project-specific resources (Model Package Group, S3 buckets)
- Seed Code Repositories - Pre-configured training and deployment pipelines
- GitHub Actions Workflows - CI/CD automation for model lifecycle
The platform includes a complete example demonstrating fine-tuning Mistral-7B for ESG sustainability report summarization. This serves as a reference implementation showing how to use the platform for your own use cases.
Before running make deploy, ensure you have:
- AWS Account with appropriate permissions
- Docker Desktop running
- GitHub Organization where repositories will be created
- GitHub Personal Access Token (see detailed permissions below)
Deploy the complete LLMOps platform with just a few commands:
# 1. Set required environment variables
export TARGET_GITHUB_ORG="your-github-org"
export GITHUB_TOKEN="ghp_your_github_token_here"
# 2. (Optional) Set AWS region - defaults to your configured AWS CLI region
aws configure set region us-west-2
# 3. Ensure Docker Desktop is running
docker info
# 4. Deploy everything (handles CDK bootstrap, Docker images, and all 3 stacks)
make deploy

The make deploy command automatically handles CDK bootstrap, pulls required Docker images, validates environment variables, and deploys all infrastructure in the correct order.
Before deploying, you must configure your target GitHub organization where project repositories will be created:
Important: This platform is designed as a reference implementation and accelerator. Review and customize the configuration for your specific security, compliance, and operational requirements before use.
# REQUIRED: Set your target GitHub organization
export TARGET_GITHUB_ORG="your-github-org"
# REQUIRED: Set your GitHub Personal Access Token
export GITHUB_TOKEN="ghp_your_github_token_here"

Why is TARGET_GITHUB_ORG required?
- This is where the platform will create repositories for each SageMaker project
- There is no sensible default - it must be your organization
- The platform will fail to deploy without this configuration
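Because deployment fails without these variables, it can help to check them before invoking make deploy. The following is a minimal sketch of such a pre-flight check, not the validation the Makefile itself performs. The helper names are hypothetical; only the two variable names come from this README.

```python
import os
from typing import List, Mapping

# Variable names taken from the README; everything else is illustrative.
REQUIRED_VARS = ("TARGET_GITHUB_ORG", "GITHUB_TOKEN")


def missing_required_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_env(env: Mapping[str, str] = os.environ) -> None:
    """Raise a descriptive error if any required variable is missing."""
    missing = missing_required_vars(env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `require_env()` at the top of a wrapper script surfaces a missing organization or token immediately, before any CDK or Docker work starts.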
The platform uses sensible defaults, but you can customize if needed:
# Edit sm-cdk/llmops_sm/config.py
GitConfig(
    # Template source (defaults to aws-samples for production)
    template_github_org="aws-samples",  # Where to get seed code templates
    template_github_repo="llmops-finetuning-foundation",
    template_code_folder="seed-code",
    # Target organization (REQUIRED - set via environment variable)
    target_github_org=None,  # Set via TARGET_GITHUB_ORG env var
    # GitHub token (stored in AWS Secrets Manager)
    github_token_secret_name="llmops-sm-github-token",
)

Parameter Definitions:
- template_github_org: Organization hosting the seed code templates (default: aws-samples for reference templates)
- template_github_repo: Repository containing seed code templates
- template_code_folder: Folder within template repo containing seed code
- target_github_org: YOUR organization where project repos will be created (REQUIRED)
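To make the relationship between the environment variables and these parameters concrete, here is a minimal sketch of how such a configuration might be resolved. `GitConfigSketch` and `load_git_config` are hypothetical stand-ins written for illustration; the real `GitConfig` class in `sm-cdk/llmops_sm/config.py` may be structured differently. Field names and defaults are copied from the snippet above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GitConfigSketch:
    """Hypothetical stand-in for GitConfig, using the documented field names."""

    template_github_org: str = "aws-samples"
    template_github_repo: str = "llmops-finetuning-foundation"
    template_code_folder: str = "seed-code"
    target_github_org: Optional[str] = None
    github_token_secret_name: str = "llmops-sm-github-token"


def load_git_config(env: Mapping[str, str] = os.environ) -> GitConfigSketch:
    """Build a config from environment variables, failing fast when the
    required target organization is missing (it has no default)."""
    target = env.get("TARGET_GITHUB_ORG", "").strip()
    if not target:
        raise ValueError("TARGET_GITHUB_ORG must be set; there is no default.")
    return GitConfigSketch(
        # Optional override for custom template forks.
        template_github_org=env.get("TEMPLATE_GITHUB_ORG", "aws-samples"),
        target_github_org=target,
    )
```

The required value is validated while the optional one falls back to its default, which mirrors the split between required and optional configuration described here.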
For Custom Templates: If you've forked this repository to customize templates for your organization:
export TEMPLATE_GITHUB_ORG="your-custom-org"

Q: Why split the template between CloudFormation and Step Functions?
A: CloudFormation templates in SageMaker Projects have limitations:
- Cannot directly interact with external APIs (like GitHub)
- Limited to AWS resource creation
- No built-in retry logic for complex workflows
The Step Functions + Lambda approach provides:
- ✅ GitHub API integration for repository management
- ✅ Asynchronous processing with automatic retries
- ✅ Complex orchestration logic (check project status, create repos, sync code)
- ✅ Extensibility for future integrations (GitLab, Bitbucket, etc.)
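As a rough illustration of the GitHub API integration that CloudFormation cannot perform directly, the sketch below builds the REST request a Lambda function might send to create a private repository in an organization. It uses GitHub's documented `POST /orgs/{org}/repos` endpoint. The function name and arguments are assumptions for illustration, and the platform's actual Lambda code may differ.

```python
import json
import urllib.request


def build_create_repo_request(org: str, repo_name: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that creates a private repository."""
    payload = {"name": repo_name, "private": True}
    return urllib.request.Request(
        url=f"https://api.github.com/orgs/{org}/repos",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)`. Keeping request construction separate makes the function easy to test, and leaves retry behavior to the orchestrating workflow rather than the function itself.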
Q: What's shared vs. project-specific?
Shared (Platform Layer):
- SageMaker Domain, MLflow server, EventBridge rules, Lambda functions
- Deployed once, used by all projects
- Changes require platform redeployment
Project-Specific (Template Layer):
- Model Package Group, project S3 buckets, GitHub repositories
- Created per project instantiation
- Each project is isolated
Q: How do I add a new template (e.g., for computer vision)?
- Create new CloudFormation template in sm-cdk/templates/
- Add new seed code folder in seed-code/
- Register template in Service Catalog (in main_stack.py)
- Platform infrastructure remains unchanged
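Before registering a new template, a quick layout check can catch missing files. This is a minimal sketch, assuming a new template follows the same layout as the ESG example (a template file under `sm-cdk/templates/` and `model_build`/`model_deploy` folders under `seed-code/`). The function name and the example file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import List


def missing_template_parts(repo_root: Path, template_file: str, seed_folder: str) -> List[str]:
    """Return repo-relative paths that are expected for a new template but absent.

    The expected layout mirrors the ESG example and is an assumption.
    """
    expected = [
        repo_root / "sm-cdk" / "templates" / template_file,
        repo_root / "seed-code" / seed_folder / "model_build",
        repo_root / "seed-code" / seed_folder / "model_deploy",
    ]
    return [p.relative_to(repo_root).as_posix() for p in expected if not p.exists()]
```

For example, `missing_template_parts(Path("."), "cv-project-template.yaml", "cv-demo")` lists whatever still needs to be created. The Service Catalog registration in `main_stack.py` is not covered by this check and still has to be done by hand.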
llmops-finetuning-foundation/
├── README.md                           # Platform overview (this file)
├── Makefile                            # One-command deployment
│
├── seed-code/                          # Template code for new projects
│   └── esg-benchmarking/               # Example: ESG report summarization
│       ├── model_build/                # Training pipeline template
│       │   ├── ml_pipelines/           # SageMaker Pipelines definitions
│       │   └── source_scripts/         # Training, preprocessing, evaluation
│       └── model_deploy/               # Deployment pipeline template
│           ├── deploy_endpoint/        # CDK stack for SageMaker endpoints
│           └── src/                    # Inference code
│
└── sm-cdk/                             # Platform infrastructure (CDK)
    ├── app.py                          # CDK application entry point
    ├── lambda/                         # Lambda function implementations
    │   ├── check-project-status/       # Monitors project creation
    │   ├── create-deploy-repository/   # Creates GitHub repos
    │   ├── sync-repositories/          # Populates repos with seed code
    │   └── model-approval-trigger/     # Triggers deployment on approval
    ├── llmops_sm/                      # Core platform implementation
    │   ├── config.py                   # Platform configuration
    │   ├── constructs/                 # Reusable CDK constructs
    │   └── stacks/                     # Infrastructure stacks
    │       ├── sagemaker_domain_stack.py   # SageMaker Domain
    │       ├── main_stack.py               # EventBridge + Lambda
    │       └── observability_stack.py      # MLflow + Monitoring
    └── templates/                      # Service Catalog templates
        └── esg-project-template.yaml   # Example template
Key Directories:
- seed-code/ - Template code copied to new GitHub repositories
- sm-cdk/ - Platform infrastructure (deployed once)
- sm-cdk/templates/ - Project templates (instantiated per project)
The platform requires a GitHub Personal Access Token with fine-grained permissions to automate repository creation and management. You can create a token at: https://github.com/settings/tokens
When creating a Fine-grained personal access token, configure the following permissions:
| Permission | Access Level | Purpose |
|---|---|---|
| Contents | Read and Write | Download template repository contents, create/update files in generated repositories |
| Metadata | Read-only | Access repository metadata (automatically included) |
| Administration | Read and Write | Create new repositories in your organization or user account |
| Variables | Read and Write | Create and update GitHub Actions repository variables for pipeline configuration |

For repositories created within an organization, also configure:

| Permission | Access Level | Purpose |
|---|---|---|
| Administration | Read and Write | Create repositories within the organization |
| Members | Read-only | Verify organization membership (optional but recommended) |
- Token Name: llmops-platform-automation (or your preferred name)
- Expiration: Choose based on your security policy (90 days recommended)
- Repository Access:
  - All repositories (if you want the platform to access any repo), OR
  - Only select repositories (specify your template repository)
- Permissions: Set the permissions listed above
If using a Classic Personal Access Token (not recommended for production), select these scopes:
- ✅ repo (Full control of private repositories)
  - Includes: repo:status, repo_deployment, public_repo, repo:invite, security_events
- ✅ workflow (Update GitHub Action workflows)
- ✅ admin:org → write:org (Create repositories in organization)
- ✅ admin:repo_hook → write:repo_hook (Write repository hooks)
The Lambda functions use your GitHub token to:
- Repository Operations:
  - Create new private repositories for each SageMaker project
  - Download template code from the source repository
  - Upload seed code files to newly created repositories
  - Create and update files via GitHub API
- Configuration Management:
  - Create GitHub Actions repository variables (not secrets)
  - Configure pipeline parameters (S3 paths, AWS regions, role ARNs, etc.)
  - Set up project-specific environment configuration
- Content Synchronization:
  - Fetch repository tree structures
  - Create Git blobs, trees, and commits
  - Update branch references
  - Sync template folder contents to target repositories
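The content synchronization steps above map onto GitHub's Git database API. The sketch below only plans the sequence of calls for committing a single file; it sends nothing. The endpoints are GitHub's documented blob, tree, commit, and ref routes. The function name, the placeholder SHA strings, and the assumption that the target branch already exists are illustrative and not taken from the platform's Lambda code.

```python
from typing import Any, Dict, List, Tuple

# (HTTP method, API path, JSON payload)
ApiCall = Tuple[str, str, Dict[str, Any]]


def plan_single_file_commit(
    owner: str, repo: str, branch: str, path: str, content: str, message: str
) -> List[ApiCall]:
    """Return the ordered API calls needed to commit one file.

    Each "<...>" placeholder stands for a SHA returned by an earlier call.
    """
    base = f"/repos/{owner}/{repo}/git"
    return [
        # 1. Upload the file content as a blob.
        ("POST", f"{base}/blobs", {"content": content, "encoding": "utf-8"}),
        # 2. Create a tree that places the blob at the given path.
        ("POST", f"{base}/trees", {
            "base_tree": "<parent tree sha>",
            "tree": [{"path": path, "mode": "100644", "type": "blob", "sha": "<blob sha>"}],
        }),
        # 3. Create a commit that points at the new tree.
        ("POST", f"{base}/commits", {
            "message": message,
            "tree": "<new tree sha>",
            "parents": ["<parent commit sha>"],
        }),
        # 4. Move the branch reference to the new commit.
        ("PATCH", f"{base}/refs/heads/{branch}", {"sha": "<new commit sha>"}),
    ]
```

Syncing a whole template folder would follow the same four stages, with one blob per file and a single tree entry list covering all of them.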
- Store Securely: The token is stored in AWS Secrets Manager, never in code
- Rotate Regularly: Set token expiration and rotate before expiry
- Monitor Usage: Review token usage in GitHub settings regularly
- Least Privilege: Use fine-grained tokens with minimum required permissions
- Never Commit: Never commit tokens to version control
- Audit Trail: GitHub logs all API operations performed with the token
Note: Additional security hardening may be required for your organization's specific compliance and security requirements.
| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Token invalid or expired | Regenerate token and update in Secrets Manager |
| 403 Forbidden | Insufficient permissions | Add missing permissions to token |
| 404 Not Found | Token can't access repository | Grant token access to template repository |
| 422 Unprocessable Entity | Repository already exists | Delete existing repository or use different name |
make setup # Complete project setup
make bootstrap # Bootstrap CDK environment
make deploy # Deploy all stacks
make deploy-domain # Deploy SageMaker Domain only

To completely remove the platform:
# Clean up all resources
make clean-all
# Or step by step
make clean-endpoints # Remove SageMaker endpoints
make clean-projects # Remove SageMaker projects
make destroy-stack # Remove CDK stacks

- Seed Code Templates - Template repositories and use cases
- Model Build Pipeline - Training pipeline documentation
- Model Deploy Pipeline - Deployment pipeline documentation
