| 1 | +# ML Migration Skill Rules |
| 2 | + |
| 3 | +## ⛔ Required Reads Tracking (MANDATORY) |
| 4 | + |
| 5 | +**This is the MOST IMPORTANT rule.** You MUST track all file reads in `migration-config.yaml`. |
| 6 | + |
| 7 | +### Why This Exists |
| 8 | + |
| 9 | +Agents often skip reading reference files even when instructed to read them. This tracking system ensures: |
| 10 | +1. Every required file is actually read (using the Read tool) |
| 11 | +2. Progress is visible and verifiable |
| 12 | +3. You cannot proceed to the next phase without completing reads |
| 13 | + |
| 14 | +### How to Track Reads |
| 15 | + |
| 16 | +1. **Before each phase**, check `required_reads` in config for files needed |
| 17 | +2. **Add any missing file** to `required_reads` with `status: pending` |
| 18 | +3. **Actually read the file** using the Read tool |
| 19 | +4. **Update status to `read`** in config ONLY AFTER reading |
| 20 | +5. **Verify all phase reads** show `status: read` before proceeding |
| 21 | + |
| 22 | +### Required Reads Format |
| 23 | + |
| 24 | +```yaml |
| 25 | +required_reads: |
| 26 | + - file: "path/to/file.md" |
| 27 | + phase: "I3" |
| 28 | + status: pending # → read (after using Read tool) |
| 29 | +``` |
| 30 | + |
| 31 | +### Gate Check Before Each Phase |
| 32 | + |
| 33 | +**MANDATORY:** Before transitioning to any phase, output this check: |
| 34 | + |
| 35 | +``` |
| 36 | +⛔ PHASE [X] GATE CHECK: |
| 37 | +Required reads: |
| 38 | +- [x] file1.md (status: read) |
| 39 | +- [ ] file2.md (status: pending) ← BLOCKED |
| 40 | + |
| 41 | +Status: BLOCKED - Must read file2.md first |
| 42 | +``` |
| 43 | + |
| 44 | +**NEVER proceed with any `status: pending` reads for the current or earlier phases.** |
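The gate check above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the skill itself: the `gate_check` helper and the `phase_order` list are hypothetical, and the entries mirror the `required_reads` format as plain dictionaries (loading them from `migration-config.yaml` is left out).

```python
def gate_check(required_reads, phase, phase_order):
    """Return (ok, report) for entering `phase`.

    required_reads: list of dicts with "file", "phase", and "status" keys.
    phase_order: hypothetical list giving the order in which phases run.
    """
    cutoff = phase_order.index(phase)
    # Only reads for the current or earlier phases can block the transition.
    relevant = [r for r in required_reads
                if phase_order.index(r["phase"]) <= cutoff]
    pending = [r["file"] for r in relevant if r["status"] != "read"]

    lines = [f"PHASE {phase} GATE CHECK:", "Required reads:"]
    for r in relevant:
        mark = "x" if r["status"] == "read" else " "
        lines.append(f"- [{mark}] {r['file']} (status: {r['status']})")
    if pending:
        lines.append("Status: BLOCKED - Must read " + ", ".join(pending) + " first")
    else:
        lines.append("Status: OK")
    return (not pending, "\n".join(lines))


reads = [
    {"file": "file1.md", "phase": "I3", "status": "read"},
    {"file": "file2.md", "phase": "I7", "status": "pending"},
]
ok_i3, report_i3 = gate_check(reads, "I3", ["I3", "I7"])
ok_i7, report_i7 = gate_check(reads, "I7", ["I3", "I7"])
print(report_i7)
```

Anything other than `status: read` is treated as blocking, so a typo in the status field fails closed rather than open.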
| 45 | + |
| 46 | +### What Happens If You Skip Reads |
| 47 | + |
| 48 | +- Wrong CLI commands for the detected platform |
| 49 | +- Failed authentication patterns |
| 50 | +- Incorrect model registration code |
| 51 | +- Broken SPCS deployments |
| 52 | +- User frustration and wasted time |
| 53 | + |
| 54 | +### ⚠️ Sub-Skill Files ARE Required Reads |
| 55 | + |
| 56 | +**CRITICAL:** Sub-skill files (`SKILL.md` from other skills) are tracked the same way as reference files. |
| 57 | + |
| 58 | +| Phase | Required Sub-Skill | Why | |
| 59 | +|-------|-------------------|-----| |
| 60 | +| I7 | `../model-registry/SKILL.md` | Contains actual registration workflow | |
| 61 | +| I7 | `../spcs-inference/SKILL.md` | Contains actual SPCS deployment workflow | |
| 62 | +| T3 | `../ml-jobs/SKILL.md` | Contains actual job submission workflow | |
| 63 | + |
| 64 | +**Common mistake:** Reading a reference file (like `xgboost-booster.md`) and skipping the sub-skill file. Reference files provide context, but sub-skill files provide the **workflow you must execute**. |
| 65 | + |
| 66 | +``` |
| 67 | +❌ WRONG: Read xgboost-booster.md → Skip model-registry/SKILL.md → Guess at registration |
| 68 | +✅ RIGHT: Read xgboost-booster.md → Read model-registry/SKILL.md → Follow its workflow |
| 69 | +``` |
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## Universal Rules |
| 74 | + |
| 75 | +### Resource Selection |
| 76 | + |
| 77 | +- **NEVER assume** which role, database, schema, warehouse, compute pool, or image repository to use |
| 78 | +- **ALWAYS list available options** and ask the user to select |
| 79 | +- Even if only one option exists, confirm with the user before proceeding |
| 80 | +- Present options with brief descriptions (e.g., instance size, purpose) |
| 81 | +- Run every command with the role the user specified |
| 82 | + |
| 83 | +### Authentication |
| 84 | + |
| 85 | +- **Use programmatic authentication** when possible instead of asking users to run commands manually |
| 86 | +- **Snowflake image registry:** |
| 87 | + - **NEVER use username/password** - only token-based authentication |
| 88 | + - Use: `snow spcs image-registry token --format=JSON | $CONTAINER_CMD login <url> -u 0sessiontoken --password-stdin` |
| 89 | + - Get URL with: `snow spcs image-registry url --connection <conn>` |
| 90 | +- For AWS: check `aws configure list-profiles` and ask user to select a profile |
| 91 | +- For Azure: check `az account list` for available subscriptions |
| 92 | +- For GCP: check `gcloud auth list` for authenticated accounts |
| 93 | +- For Databricks: check `databricks auth profiles` for available profiles |
| 94 | +- Only fall back to interactive login if programmatic methods fail |
| 95 | + |
| 96 | +### Config File |
| 97 | + |
| 98 | +- **Generate `migration-config.yaml`** after collecting all user decisions |
| 99 | +- **Only include fields relevant** to the detected migration type - do not include all possible fields |
| 100 | +- **Stop and wait** for user to review/edit the config before execution |
| 101 | +- **Read from config** during execution - never re-prompt for values already in config |
| 102 | +- Write the config file to the current working directory, not `/tmp/` |
| 103 | + |
| 104 | +### Communication |
| 105 | + |
| 106 | +- **Explain assumptions** when you make them - let user know what you detected and decided |
| 107 | +- **Present trade-offs** when multiple approaches exist |
| 108 | +- **Stop at defined checkpoints** - don't proceed through multiple phases without user confirmation |
| 109 | +- When errors occur, explain what went wrong and what alternatives exist |
| 110 | + |
| 111 | +### Error Recovery - NO FALLBACKS |
| 112 | + |
| 113 | +**This is a critical rule across all workflows.** |
| 114 | + |
| 115 | +When a user specifies resources in their config (role, database, schema, compute pool, stage, warehouse), you MUST: |
| 116 | + |
| 117 | +1. **ONLY use those exact resources** - no substitutions |
| 118 | +2. **NEVER try alternatives** if the specified resource fails |
| 119 | +3. **STOP and report the error** if access is denied |
| 120 | +4. **Ask user to update config** with valid resources |
| 121 | + |
| 122 | +**WHY:** The config-driven approach exists so users control exactly what resources are used. Trying alternatives: |
| 123 | +- May use resources the user doesn't want to use |
| 124 | +- May incur unexpected costs |
| 125 | +- May write data to wrong locations |
| 126 | +- Violates user trust and expectations |
| 127 | + |
| 128 | +**WRONG:** |
| 129 | +``` |
| 130 | +User config specifies: compute_pool: MY_POOL |
| 131 | +Error: Permission denied on MY_POOL |
| 132 | +Agent: "Let me try ANOTHER_POOL instead..." ❌ NEVER DO THIS |
| 133 | +``` |
| 134 | + |
| 135 | +**RIGHT:** |
| 136 | +``` |
| 137 | +User config specifies: compute_pool: MY_POOL |
| 138 | +Error: Permission denied on MY_POOL |
| 139 | +Agent: "Permission denied on MY_POOL. Please either: |
| 140 | +1. Update your config to use a different compute pool |
| 141 | +2. Ask your admin to grant USAGE on MY_POOL to your role |
| 142 | +Run: SHOW COMPUTE POOLS; to see available pools." ✅ CORRECT |
| 143 | +``` |
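The no-fallback rule can be expressed as a small wrapper. This is only a sketch: `use_configured_resource`, `ResourceAccessError`, and the `try_use` callable are hypothetical names introduced for illustration, and `PermissionError` stands in for whatever access-denied error the real client raises.

```python
class ResourceAccessError(Exception):
    """Raised when a resource named in the config cannot be used."""


def use_configured_resource(config, key, try_use):
    """Use exactly the resource named under `key`; never substitute another.

    try_use: hypothetical callable that raises PermissionError on access denial.
    """
    name = config[key]
    try:
        return try_use(name)
    except PermissionError as exc:
        # Stop and report. Do not look up or try any alternative resource.
        raise ResourceAccessError(
            f"Permission denied on {name}. Update `{key}` in the config "
            f"or ask an admin to grant access to {name}."
        ) from exc


# Demonstration with a stand-in that always denies access.
attempted = []

def fake_use(name):
    attempted.append(name)
    raise PermissionError(name)

try:
    use_configured_resource({"compute_pool": "MY_POOL"}, "compute_pool", fake_use)
except ResourceAccessError as err:
    message = str(err)

print(message)
```

The point of the design is that the failure path has exactly one exit: an error surfaced to the user. Only the configured resource is ever attempted.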
| 144 | + |
| 145 | +### Migration Rules File |
| 146 | + |
| 147 | +- **ALWAYS create `rules/migration-rule.md` FIRST** before any other files (Phase 0) |
| 148 | +- Create the `rules/` directory in the current working directory if it doesn't exist |
| 149 | +- The rules file guides the agent throughout the migration |
| 150 | + |
| 151 | +--- |
| 152 | + |
| 153 | +## Inference-Specific Rules |
| 154 | + |
| 155 | +### Docker Operations |
| 156 | + |
| 157 | +- Prefer **pulling existing images** over building new ones for lift-and-shift migrations |
| 158 | +- Check for available container runtimes: `docker`, `podman`, `nerdctl` in that order |
| 159 | +- Use `--platform linux/amd64` when pulling images for Snowflake SPCS |
| 160 | +- For model artifacts stored separately (e.g., S3), prefer mounting from Snowflake stage over baking into image |
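The runtime check above could look like the following sketch. `find_container_cmd` is a hypothetical helper; it relies only on the standard library's `shutil.which`, and the `which` parameter exists purely so the lookup can be swapped out in tests.

```python
import shutil


def find_container_cmd(candidates=("docker", "podman", "nerdctl"), which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name):
            return name
    return None


container_cmd = find_container_cmd()
if container_cmd is None:
    print("No container runtime found (looked for docker, podman, nerdctl)")
else:
    print(f"Using container runtime: {container_cmd}")
```

This only confirms that the binary is on `PATH`. It does not verify that a daemon is running or that the user can pull images, so a follow-up command may still fail.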
| 161 | + |
| 162 | +### SPCS Service Deployment |
| 163 | + |
| 164 | +- **Ask user for ingress preference FIRST** before checking privileges: |
| 165 | + - Public ingress (HTTP access from outside Snowflake) |
| 166 | + - Internal-only (SQL/Python within Snowflake only) |
| 167 | +- **If user chose Public ingress, CHECK privilege before proceeding:** |
| 168 | + ```sql |
| 169 | + SHOW GRANTS TO ROLE <user_role>; |
| 170 | + -- Look for: BIND SERVICE ENDPOINT on ACCOUNT |
| 171 | + ``` |
| 172 | +- **⛔ BLOCKING RULE: If user chose Public but lacks BIND SERVICE ENDPOINT privilege:** |
| 173 | + - **DO NOT create an internal-only endpoint as a fallback** |
| 174 | + - **STOP and inform user** they must either: |
| 175 | + 1. Get the privilege granted: `GRANT BIND SERVICE ENDPOINT ON ACCOUNT TO ROLE <role>;` |
| 176 | + 2. Switch to a role that has the privilege |
| 177 | + 3. Explicitly choose internal-only access (restart the choice) |
| 178 | + - **Never silently downgrade** from public to internal-only |
| 179 | +- Only set `ingress_enabled=False` if user **explicitly chose** internal-only access |
| 180 | + |
| 181 | +### Framework Support |
| 182 | + |
| 183 | +- **Known built-in supported types** (direct `log_model()`): |
| 184 | + - scikit-learn, XGBoost (sklearn API), LightGBM, CatBoost, Prophet |
| 185 | + - PyTorch, TensorFlow, Keras |
| 186 | + - Sentence Transformers, Hugging Face pipeline, MLflow PyFunc |
| 187 | +- **Known exceptions** requiring CustomModel: |
| 188 | + - `xgb.core.Booster` (raw Booster lacks sklearn interface) |
| 189 | +- **If model type not in either list above:** |
| 190 | + 1. Check official docs for current support |
| 191 | + 2. If supported → direct `log_model()` |
| 192 | + 3. If not supported → CustomModel required |
| 193 | +- **Do NOT assume** an unknown type requires CustomModel without checking docs first |
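The decision rule above can be sketched as a lookup with an explicit "check the docs" outcome. The string labels (`"xgboost-sklearn"`, `"xgboost-booster"`, and so on) and the `registration_path` helper are hypothetical; they are shorthand for the frameworks listed above, not identifiers from any Snowflake API.

```python
# Hypothetical labels for the built-in supported types listed above.
BUILT_IN = {
    "scikit-learn", "xgboost-sklearn", "lightgbm", "catboost", "prophet",
    "pytorch", "tensorflow", "keras",
    "sentence-transformers", "huggingface-pipeline", "mlflow-pyfunc",
}
# Known exception: a raw Booster needs a CustomModel wrapper.
CUSTOM_MODEL_REQUIRED = {"xgboost-booster"}


def registration_path(model_type):
    """Map a model type label to 'log_model', 'custom_model', or 'check_docs'."""
    key = model_type.lower()
    if key in BUILT_IN:
        return "log_model"
    if key in CUSTOM_MODEL_REQUIRED:
        return "custom_model"
    # Unknown types are never assumed to need CustomModel.
    return "check_docs"


print(registration_path("LightGBM"))
print(registration_path("xgboost-booster"))
print(registration_path("some-new-framework"))
```

Returning `check_docs` for unknown types keeps the third branch of the rule visible instead of silently defaulting to CustomModel.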
| 194 | + |
| 195 | +### SageMaker-Specific |
| 196 | + |
| 197 | +- SageMaker endpoints separate container image from model artifacts |
| 198 | +- Model artifacts are typically in S3, mounted at `/opt/ml/model` at runtime |
| 199 | +- Entry point is specified via `SAGEMAKER_PROGRAM` environment variable |
| 200 | +- AWS Deep Learning Container images require ECR login before pulling |
| 201 | + |
| 202 | +--- |
| 203 | + |
| 204 | +## Training-Specific Rules |
| 205 | + |
| 206 | +### Execution Model (Local vs Container Runtime) |
| 207 | + |
| 208 | +**⚠️ CRITICAL: Understand where code runs.** |
| 209 | + |
| 210 | +There are TWO execution contexts - never confuse them: |
| 211 | + |
| 212 | +| Context | Where it runs | What APIs are available | |
| 213 | +|---------|---------------|------------------------| |
| 214 | +| **Launcher script** | Your local machine | `snowflake.ml.jobs` (submit_file, remote, etc.) | |
| 215 | +| **Training script** | Container Runtime | `snowflake.ml.modeling.tune` (Tuner), `snowflake.ml.data` (DataConnector), etc. | |
| 216 | + |
| 217 | +**Container Runtime APIs** (Tuner, PyTorchDistributor, etc.) are **ONLY available inside Container Runtime**. They do NOT exist in the pip-installed `snowflake-ml-python` package. |
| 218 | + |
| 219 | +```python |
| 220 | +# ❌ WRONG - This will fail locally with ModuleNotFoundError |
| 221 | +from snowflake.ml.modeling.tune import Tuner # NOT available locally! |
| 222 | + |
| 223 | +# ✅ CORRECT - Launcher script (runs locally) |
| 224 | +from snowflake.ml.jobs import submit_file |
| 225 | +job = submit_file("train_hpo.py", "COMPUTE_POOL", stage_name="STAGE") |
| 226 | + |
| 227 | +# ✅ CORRECT - Training script (runs in Container Runtime) |
| 228 | +# train_hpo.py - this file is submitted and runs remotely |
| 229 | +from snowflake.ml.modeling.tune import Tuner, TunerConfig # Available here! |
| 230 | +``` |
| 231 | + |
| 232 | +### Default Approach: submit_file() |
| 233 | + |
| 234 | +**Use `submit_file()` as the default approach** for all training migrations because: |
| 235 | +- More robust across Python versions (avoids serialization issues) |
| 236 | +- Container Runtime uses Python 3.10 - if user's local Python differs, @remote will fail |
| 237 | +- Better for multi-file projects |
| 238 | +- Clearer separation of training code |
| 239 | +- Easier to debug and iterate |
| 240 | + |
| 241 | +```python |
| 242 | +from snowflake.ml.jobs import submit_file |
| 243 | + |
| 244 | +job = submit_file( |
| 245 | + "train.py", |
| 246 | + "<COMPUTE_POOL_FROM_CONFIG>", |
| 247 | + stage_name="<STAGE_FROM_CONFIG>", |
| 248 | + pip_requirements=["scikit-learn", "pandas"] |
| 249 | +) |
| 250 | +``` |
| 251 | + |
| 252 | +### @remote Decorator (Use Only When) |
| 253 | + |
| 254 | +Only use `@remote` when ALL of these conditions are met: |
| 255 | +1. User explicitly requests it |
| 256 | +2. User confirms local Python version is 3.10 (matches Container Runtime) |
| 257 | +3. Single-function training with no external file dependencies |
| 258 | +4. Simple serializable return values |
| 259 | + |
| 260 | +### Model Saving - MANDATORY |
| 261 | + |
| 262 | +**Model persistence is REQUIRED, not optional.** With `submit_file()`, return values are NOT accessible. |
| 263 | + |
| 264 | +Every training script MUST include model registration: |
| 265 | +```python |
| 266 | +from snowflake.ml.registry import Registry |
| 267 | +registry = Registry(session, database_name="<DB>", schema_name="<SCHEMA>")  # session: an active Snowpark Session |
| 268 | +mv = registry.log_model(model, model_name="<MODEL_NAME>", version_name="v1") |
| 269 | +``` |
| 270 | + |
| 271 | +- **NEVER skip model persistence** - user will lose their trained model |
| 272 | +- **Use resources from config** - database, schema, stage must come from user's config |
| 273 | + |
| 274 | +### Code Conversion |
| 275 | + |
| 276 | +- **DO NOT modify** the core training logic (model architecture, loss functions, optimizers) |
| 277 | +- **DO modify** data loading, model saving, and environment variable usage |
| 278 | +- **Preserve** hyperparameter handling but convert to function arguments |
| 279 | +- **Keep** the original file as a reference (`original_train.py.bak`) |
| 280 | + |
| 281 | +### Data Loading |
| 282 | + |
| 283 | +- **NEVER assume** data is in a specific location |
| 284 | +- **ASK** which Snowflake table contains the training data |
| 285 | +- **Use DataConnector** for large datasets that don't fit in memory |
| 286 | +- For small datasets, `session.table().to_pandas()` is sufficient |
| 287 | + |
| 288 | +### Dependencies |
| 289 | + |
| 290 | +- **Extract** all dependencies from source (requirements.txt, environment.yaml, setup.py) |
| 291 | +- **Verify** packages are available in Container Runtime before assuming they need installation |
| 292 | +- **List** any packages that need to be added via the `pip_requirements` parameter |
| 293 | + |
| 294 | +### Hyperparameter Optimization (HPO) |
| 295 | + |
| 296 | +**⚠️ REMINDER: Tuner API runs in Container Runtime, NOT locally.** |
| 297 | + |
| 298 | +**MANDATORY: Follow these rules when writing ANY HPO code:** |
| 299 | + |
| 300 | +1. Use ONLY the native Snowflake Tuner API (in submitted training script): |
| 301 | + ```python |
| 302 | + from snowflake.ml.modeling.tune import Tuner, TunerConfig, uniform, loguniform, randint, choice |
| 303 | + from snowflake.ml.modeling.tune.search import BayesOpt, RandomSearch, GridSearch |
| 304 | + ``` |
| 305 | + |
| 306 | +2. **DO NOT use Optuna, Ray Tune, or Hyperopt patterns** - use native Snowflake APIs only |
| 307 | + |
| 308 | +3. **Understand search algorithm limitations:** |
| 309 | + - `BayesOpt()` only supports `uniform()` and `loguniform()` - NO integer or categorical params |
| 310 | + - `RandomSearch()` supports ALL parameter types including `randint()` and `choice()` |
| 311 | + - `GridSearch()` requires explicit value lists |
| 312 | + |
| 313 | +4. **If migrating from a platform that uses Bayesian optimization with integer parameters:** |
| 314 | + - Either switch to `RandomSearch()` in Snowflake |
| 315 | + - Or use `uniform()` and cast to `int()` inside the training function |
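The second option (sample with `uniform()` and convert inside the training function) can be sketched without any Snowflake imports. `cast_int_params` is a hypothetical helper, and the `sampled` dictionary stands in for whatever values the tuner passes to the training function.

```python
def cast_int_params(params, int_keys):
    """Return a copy of `params` with the named keys converted to int."""
    out = dict(params)
    for key in int_keys:
        # Round before casting; a bare int() would truncate toward zero.
        out[key] = int(round(out[key]))
    return out


# Stand-in for float values sampled from uniform() search spaces.
sampled = {"learning_rate": 0.05, "max_depth": 6.7, "n_estimators": 249.2}
hparams = cast_int_params(sampled, ["max_depth", "n_estimators"])
print(hparams)
```

Rounding is a design choice rather than a requirement. A bare `int()` cast also works, but it truncates, so the upper bound of the range is almost never selected.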
| 316 | + |
| 317 | +### Validation |
| 318 | + |
| 319 | +- **ALWAYS validate** that converted code compiles: `python -m py_compile <script.py>` |
| 320 | +- **Suggest a test run** with limited data before full training |
| 321 | +- **Compare outputs** if possible - metrics should be similar between platforms |
| 322 | +- **Document any differences** in behavior between source and target |