Commit 0196c00

Promote ml-migration with v1.1.1 audit pass
Re-staged with v1.1.1 holistic prompt that adds stopping-point markers, correct INSTRUCTIONS.md sub-flow cross-refs, and drops invalid tool snowflake_object_search. Note: tdd.discipline.has_red_flags residual (pre-existing, not v1.1.1 regression).
1 parent be318e8 commit 0196c00

23 files changed

Lines changed: 3967 additions & 0 deletions

skills/ml-migration/LICENSE

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
Snowflake Skills License
2+
3+
© 2026 Snowflake Inc. All rights reserved.
4+
5+
LICENSE: Use of these materials (including all code, prompts, assets, files, and other components of these skills (collectively, “Skills”)) is governed by your agreement with Snowflake for the Service. If no separate agreement exists, use is governed by Snowflake’s Terms of Service (available at: https://www.snowflake.com/en/legal/terms-of-service/).
6+
7+
Your applicable agreement is referred to as the "Agreement." "Service" is as defined in the Agreement.
8+
9+
ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the contrary, you may not:
10+
11+
* Extract from the Service or retain copies of the Skills outside use with the Service;
12+
* Reproduce or copy the Skills, except for temporary copies created automatically during authorized use of the Service;
13+
* Create derivative works based on the Skills;
14+
* Distribute, sublicense, or transfer the Skills to any third party;
15+
* Make, offer to sell, sell, or import any inventions embodied in the Skills; nor,
16+
* Reverse engineer, decompile, or disassemble the Skills.
17+
18+
The receipt, viewing, or possession of the Skills does not convey or imply any license or right beyond those expressly granted above.
19+
20+
Snowflake retains all rights, title, and interest in the Skills, including all copyrights, trademarks, patents, and all other applicable intellectual property rights.
21+
22+
THE SKILLS ARE PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SKILLS OR THE USE OR OTHER DEALINGS IN THE SKILLS.

skills/ml-migration/RULES.md

Lines changed: 322 additions & 0 deletions
@@ -0,0 +1,322 @@
1+
# ML Migration Skill Rules
2+
3+
## ⛔ Required Reads Tracking (MANDATORY)
4+
5+
**This is the MOST IMPORTANT rule.** You MUST track all file reads in `migration-config.yaml`.
6+
7+
### Why This Exists
8+
9+
Agents often skip reading reference files even when instructed to read them. This tracking system ensures:
10+
1. Every required file is actually read (using the Read tool)
11+
2. Progress is visible and verifiable
12+
3. You cannot proceed to the next phase without completing reads
13+
14+
### How to Track Reads
15+
16+
1. **Before each phase**, check `required_reads` in config for files needed
17+
2. **Add the file** to `required_reads` with `status: pending`
18+
3. **Actually read the file** using the Read tool
19+
4. **Update status to `read`** in config ONLY AFTER reading
20+
5. **Verify all phase reads** show `status: read` before proceeding
21+
22+
### Required Reads Format
23+
24+
```yaml
required_reads:
  - file: "path/to/file.md"
    phase: "I3"
    status: pending  # → read (after using Read tool)
```
30+
31+
### Gate Check Before Each Phase
32+
33+
**MANDATORY:** Before transitioning to any phase, output this check:
34+
35+
```
⛔ PHASE [X] GATE CHECK:
Required reads:
- [x] file1.md (status: read)
- [ ] file2.md (status: pending) ← BLOCKED

Status: BLOCKED - Must read file2.md first
```
43+
44+
**NEVER proceed with any `status: pending` reads for the current or earlier phases.**
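The gate check above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the skill: the function name `gate_check`, the `phases_in_scope` argument (the current and earlier phases, supplied by the caller), and the "Status: OK" wording are all hypothetical. The entries mirror the `required_reads` format shown earlier.

```python
def gate_check(phase, required_reads, phases_in_scope):
    """Return (ok, report) for the reads that belong to the phases in scope.

    required_reads: list of dicts with "file", "phase", and "status" keys.
    phases_in_scope: the current phase plus all earlier ones (caller decides).
    """
    lines = [f"⛔ PHASE [{phase}] GATE CHECK:", "Required reads:"]
    pending = []
    for entry in required_reads:
        if entry["phase"] not in phases_in_scope:
            continue  # reads for later phases do not block this gate
        done = entry["status"] == "read"
        mark = "x" if done else " "
        suffix = "" if done else " ← BLOCKED"
        lines.append(f"- [{mark}] {entry['file']} (status: {entry['status']}){suffix}")
        if not done:
            pending.append(entry["file"])
    lines.append("")
    if pending:
        lines.append("Status: BLOCKED - Must read " + ", ".join(pending) + " first")
    else:
        # Hypothetical wording; the rules only define the BLOCKED form.
        lines.append("Status: OK - All required reads complete")
    return (not pending, "\n".join(lines))
```

Calling `gate_check("I3", reads, {"I3"})` with one pending I3 entry returns `False` and a report ending in the BLOCKED status line, so the caller can refuse to transition.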
45+
46+
### What Happens If You Skip Reads
47+
48+
- Wrong CLI commands for the detected platform
49+
- Failed authentication patterns
50+
- Incorrect model registration code
51+
- Broken SPCS deployments
52+
- User frustration and wasted time
53+
54+
### ⚠️ Sub-Skill Files ARE Required Reads
55+
56+
**CRITICAL:** Sub-skill files (`SKILL.md` from other skills) are tracked the same way as reference files.
57+
58+
| Phase | Required Sub-Skill | Why |
|-------|-------------------|-----|
| I7 | `../model-registry/SKILL.md` | Contains actual registration workflow |
| I7 | `../spcs-inference/SKILL.md` | Contains actual SPCS deployment workflow |
| T3 | `../ml-jobs/SKILL.md` | Contains actual job submission workflow |
63+
64+
**Common mistake:** Reading a reference file (like `xgboost-booster.md`) and skipping the sub-skill file. Reference files provide context, but sub-skill files provide the **workflow you must execute**.
65+
66+
```
❌ WRONG: Read xgboost-booster.md → Skip model-registry/SKILL.md → Guess at registration
✅ RIGHT: Read xgboost-booster.md → Read model-registry/SKILL.md → Follow its workflow
```
70+
71+
---
72+
73+
## Universal Rules
74+
75+
### Resource Selection
76+
77+
- **NEVER assume** which role, database, schema, warehouse, compute pool, or image repository to use
78+
- **ALWAYS list available options** and ask the user to select
79+
- Even if only one option exists, confirm with the user before proceeding
80+
- Present options with brief descriptions (e.g., instance size, purpose)
81+
- Make sure you are running all commands with the role the user specified
82+
83+
### Authentication
84+
85+
- **Use programmatic authentication** when possible instead of asking users to run commands manually
86+
- **Snowflake image registry:**
87+
- **NEVER use username/password** - only token-based authentication
88+
- Use: `snow spcs image-registry token --format=JSON | $CONTAINER_CMD login <url> -u 0sessiontoken --password-stdin`
89+
- Get URL with: `snow spcs image-registry url --connection <conn>`
90+
- For AWS: check `aws configure list-profiles` and ask user to select a profile
91+
- For Azure: check `az account list` for available subscriptions
92+
- For GCP: check `gcloud auth list` for authenticated accounts
93+
- For Databricks: check `databricks auth profiles` for available profiles
94+
- Only fall back to interactive login if programmatic methods fail
95+
96+
### Config File
97+
98+
- **Generate `migration-config.yaml`** after collecting all user decisions
99+
- **Only include fields relevant** to the detected migration type - do not include all possible fields
100+
- **Stop and wait** for user to review/edit the config before execution
101+
- **Read from config** during execution - never re-prompt for values already in config
102+
- Config file should be in the current working directory, not /tmp/
103+
104+
### Communication
105+
106+
- **Explain assumptions** when you make them - let user know what you detected and decided
107+
- **Present trade-offs** when multiple approaches exist
108+
- **Stop at defined checkpoints** - don't proceed through multiple phases without user confirmation
109+
- When errors occur, explain what went wrong and what alternatives exist
110+
111+
### Error Recovery - NO FALLBACKS
112+
113+
**This is a critical rule across all workflows.**
114+
115+
When a user specifies resources in their config (role, database, schema, compute pool, stage, warehouse), you MUST:
116+
117+
1. **ONLY use those exact resources** - no substitutions
118+
2. **NEVER try alternatives** if the specified resource fails
119+
3. **STOP and report the error** if access is denied
120+
4. **Ask user to update config** with valid resources
121+
122+
**WHY:** The config-driven approach exists so users control exactly what resources are used. Trying alternatives:
123+
- May use resources the user doesn't want to use
124+
- May incur unexpected costs
125+
- May write data to wrong locations
126+
- Violates user trust and expectations
127+
128+
**WRONG:**
129+
```
User config specifies: compute_pool: MY_POOL
Error: Permission denied on MY_POOL
Agent: "Let me try ANOTHER_POOL instead..." ❌ NEVER DO THIS
```
134+
135+
**RIGHT:**
136+
```
User config specifies: compute_pool: MY_POOL
Error: Permission denied on MY_POOL
Agent: "Permission denied on MY_POOL. Please either:
1. Update your config to use a different compute pool
2. Ask your admin to grant USAGE on MY_POOL to your role
Run: SHOW COMPUTE POOLS; to see available pools." ✅ CORRECT
```
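The no-fallback rule can be expressed as a small guard. The sketch below is illustrative only: `ResourceAccessError`, `require_configured`, and the `accessible` set (standing in for whatever access check the agent actually performs) are hypothetical names, not part of the skill.

```python
class ResourceAccessError(RuntimeError):
    """Raised when a resource named in the config cannot be used."""


def require_configured(config, kind, accessible):
    """Return the configured resource name, or stop with an error.

    Never substitutes another entry from `accessible`; the only
    outcomes are the exact configured name or an exception.
    """
    name = config[kind]
    if name not in accessible:
        raise ResourceAccessError(
            f"Permission denied on {name}. Update '{kind}' in the config "
            f"or ask an admin to grant access to {name}."
        )
    return name
```

The design choice is that the function has no code path that returns anything other than the configured name, so a silent substitution cannot happen by accident.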
144+
145+
### Migration Rules File
146+
147+
- **ALWAYS create `rules/migration-rule.md` FIRST** before any other files (Phase 0)
148+
- Create the `rules/` directory in the current working directory if it doesn't exist
149+
- The rules file guides the agent throughout the migration
150+
151+
---
152+
153+
## Inference-Specific Rules
154+
155+
### Docker Operations
156+
157+
- Prefer **pulling existing images** over building new ones for lift-and-shift migrations
158+
- Check for available container runtimes: `docker`, `podman`, `nerdctl` in that order
159+
- Use `--platform linux/amd64` when pulling images for Snowflake SPCS
160+
- For model artifacts stored separately (e.g., S3), prefer mounting from Snowflake stage over baking into image
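The runtime check order above (`docker`, then `podman`, then `nerdctl`) could be implemented roughly as follows. This is a sketch, not the skill's own code: `detect_container_cmd` is a hypothetical helper, and the injectable `which` argument exists only to make the lookup testable.

```python
import shutil


def detect_container_cmd(which=shutil.which):
    """Return the first available container runtime, or None.

    Checks docker, podman, nerdctl in that order, as the rule requires.
    """
    for name in ("docker", "podman", "nerdctl"):
        if which(name):
            return name
    return None
```

The result can then be used as the value of `$CONTAINER_CMD` in the registry login command; a `None` result means no runtime was found and the user should be told.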
161+
162+
### SPCS Service Deployment
163+
164+
- **Ask user for ingress preference FIRST** before checking privileges:
165+
- Public ingress (HTTP access from outside Snowflake)
166+
- Internal-only (SQL/Python within Snowflake only)
167+
- **If user chose Public ingress, CHECK privilege before proceeding:**
168+
```sql
SHOW GRANTS TO ROLE <user_role>;
-- Look for: BIND SERVICE ENDPOINT on ACCOUNT
```
172+
- **⛔ BLOCKING RULE: If user chose Public but lacks BIND SERVICE ENDPOINT privilege:**
173+
- **DO NOT create an internal-only endpoint as a fallback**
174+
- **STOP and inform user** they must either:
175+
1. Get the privilege granted: `GRANT BIND SERVICE ENDPOINT ON ACCOUNT TO ROLE <role>;`
176+
2. Switch to a role that has the privilege
177+
3. Explicitly choose internal-only access (restart the choice)
178+
- **Never silently downgrade** from public to internal-only
179+
- Only set `ingress_enabled=False` if user **explicitly chose** internal-only access
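The ingress rules above amount to a decision with no downgrade path. A minimal sketch, assuming the user's choice is recorded as the string `"public"` or `"internal"` and the privilege check result is already known (both the function name `resolve_ingress` and these string values are hypothetical):

```python
def resolve_ingress(choice, has_bind_service_endpoint):
    """Return the value to use for ingress_enabled, or stop with an error.

    choice: "public" or "internal", exactly as the user selected it.
    has_bind_service_endpoint: result of inspecting SHOW GRANTS for the role.
    """
    if choice == "internal":
        return False  # only reached when the user explicitly chose internal-only
    if choice == "public":
        if not has_bind_service_endpoint:
            # Blocking rule: never fall back to an internal-only endpoint.
            raise PermissionError(
                "Role lacks BIND SERVICE ENDPOINT on ACCOUNT. Get the privilege "
                "granted, switch roles, or explicitly choose internal-only access."
            )
        return True
    raise ValueError(f"Unknown ingress choice: {choice!r}")
```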
180+
181+
### Framework Support
182+
183+
- **Known built-in supported types** (direct `log_model()`):
184+
- scikit-learn, XGBoost (sklearn API), LightGBM, CatBoost, Prophet
185+
- PyTorch, TensorFlow, Keras
186+
- Sentence Transformers, Hugging Face pipeline, MLflow PyFunc
187+
- **Known exceptions** requiring CustomModel:
188+
- `xgb.core.Booster` (raw Booster lacks sklearn interface)
189+
- **If model type not in either list above:**
190+
1. Check official docs for current support
191+
2. If supported → direct `log_model()`
192+
3. If not supported → CustomModel required
193+
- **Do NOT assume** an unknown type requires CustomModel without checking docs first
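The three-way decision above can be sketched as a lookup. The string labels below are hypothetical identifiers chosen for illustration (only `xgb.core.Booster` is taken verbatim from the rules), and the sets should be treated as a snapshot of the two lists, not an authoritative support matrix.

```python
# Labels are illustrative stand-ins for the "known built-in" list above.
BUILTIN_TYPES = {
    "scikit-learn", "xgboost-sklearn", "lightgbm", "catboost", "prophet",
    "pytorch", "tensorflow", "keras",
    "sentence-transformers", "huggingface-pipeline", "mlflow-pyfunc",
}
# Known exceptions that require CustomModel.
CUSTOM_MODEL_TYPES = {"xgb.core.Booster"}


def registration_path(model_type):
    """Return "log_model", "custom_model", or "check_docs"."""
    if model_type in CUSTOM_MODEL_TYPES:
        return "custom_model"
    if model_type in BUILTIN_TYPES:
        return "log_model"
    # Unknown type: do not assume CustomModel; consult the official docs first.
    return "check_docs"
```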
194+
195+
### SageMaker-Specific
196+
197+
- SageMaker endpoints separate container image from model artifacts
198+
- Model artifacts are typically in S3, mounted at `/opt/ml/model` at runtime
199+
- Entry point is specified via `SAGEMAKER_PROGRAM` environment variable
200+
- AWS Deep Learning Container images require ECR login before pulling
201+
202+
---
203+
204+
## Training-Specific Rules
205+
206+
### Execution Model (Local vs Container Runtime)
207+
208+
**⚠️ CRITICAL: Understand where code runs.**
209+
210+
There are TWO execution contexts - never confuse them:
211+
212+
| Context | Where it runs | What APIs are available |
|---------|---------------|------------------------|
| **Launcher script** | Your local machine | `snowflake.ml.jobs` (submit_file, remote, etc.) |
| **Training script** | Container Runtime | `snowflake.ml.modeling.tune` (Tuner), `snowflake.ml.data` (DataConnector), etc. |
216+
217+
**Container Runtime APIs** (Tuner, PyTorchDistributor, etc.) are **ONLY available inside Container Runtime**. They do NOT exist in the pip-installed `snowflake-ml-python` package.
218+
219+
```python
# ❌ WRONG - This will fail locally with ModuleNotFoundError
from snowflake.ml.modeling.tune import Tuner  # NOT available locally!

# ✅ CORRECT - Launcher script (runs locally)
from snowflake.ml.jobs import submit_file
job = submit_file("train_hpo.py", "COMPUTE_POOL", stage_name="STAGE")

# ✅ CORRECT - Training script (runs in Container Runtime)
# train_hpo.py - this file is submitted and runs remotely
from snowflake.ml.modeling.tune import Tuner, TunerConfig  # Available here!
```
231+
232+
### Default Approach: submit_file()
233+
234+
**Use `submit_file()` as the default approach** for all training migrations because:
235+
- More robust across Python versions (avoids serialization issues)
236+
- Container Runtime uses Python 3.10 - if the user's local Python version differs, `@remote` will fail
237+
- Better for multi-file projects
238+
- Clearer separation of training code
239+
- Easier to debug and iterate
240+
241+
```python
from snowflake.ml.jobs import submit_file

job = submit_file(
    "train.py",
    "<COMPUTE_POOL_FROM_CONFIG>",
    stage_name="<STAGE_FROM_CONFIG>",
    pip_requirements=["scikit-learn", "pandas"],
)
```
251+
252+
### @remote Decorator (Use Only When)
253+
254+
Only use `@remote` when ALL of these conditions are met:
255+
1. User explicitly requests it
256+
2. User confirms local Python version is 3.10 (matches Container Runtime)
257+
3. Single-function training with no external file dependencies
258+
4. Simple serializable return values
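The four conditions can be checked together before choosing a submission style. This is a sketch under assumptions: the `RemoteCheck` dataclass, its field names, and `choose_submission` are hypothetical, and the Python version is taken from the user's confirmation, as the rule requires.

```python
from dataclasses import dataclass


@dataclass
class RemoteCheck:
    user_requested: bool          # condition 1
    local_python: tuple           # condition 2, e.g. (3, 10) as confirmed by the user
    single_function: bool         # condition 3, no external file dependencies
    serializable_return: bool     # condition 4


def choose_submission(check, runtime_python=(3, 10)):
    """Return "remote" only when ALL conditions hold, else "submit_file"."""
    all_met = (
        check.user_requested
        and tuple(check.local_python[:2]) == runtime_python
        and check.single_function
        and check.serializable_return
    )
    return "remote" if all_met else "submit_file"
```

Any single failing condition falls back to `submit_file()`, which matches its role as the default approach.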
259+
260+
### Model Saving - MANDATORY
261+
262+
**Model persistence is REQUIRED, not optional.** With `submit_file()`, return values are NOT accessible.
263+
264+
Every training script MUST include model registration:
265+
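```python
from snowflake.ml.registry import Registry

# `session` is an existing Snowpark Session; `model` is the trained model object.
# <DB>, <SCHEMA>, and <MODEL_NAME> must come from the user's config.
registry = Registry(session, database_name="<DB>", schema_name="<SCHEMA>")
mv = registry.log_model(model, model_name="<MODEL_NAME>", version_name="v1")
```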
270+
271+
- **NEVER skip model persistence** - user will lose their trained model
272+
- **Use resources from config** - database, schema, stage must come from user's config
273+
274+
### Code Conversion
275+
276+
- **DO NOT modify** the core training logic (model architecture, loss functions, optimizers)
277+
- **DO modify** data loading, model saving, and environment variable usage
278+
- **Preserve** hyperparameter handling but convert to function arguments
279+
- **Keep** the original file as a reference (`original_train.py.bak`)
280+
281+
### Data Loading
282+
283+
- **NEVER assume** data is in a specific location
284+
- **ASK** which Snowflake table contains the training data
285+
- **Use DataConnector** for large datasets that don't fit in memory
286+
- For small datasets, `session.table().to_pandas()` is sufficient
287+
288+
### Dependencies
289+
290+
- **Extract** all dependencies from source (requirements.txt, environment.yaml, setup.py)
291+
- **Verify** packages are available in Container Runtime before assuming they need installation
292+
- **List** any packages that need to be added via the `pip_requirements` parameter
293+
294+
### Hyperparameter Optimization (HPO)
295+
296+
**⚠️ REMINDER: Tuner API runs in Container Runtime, NOT locally.**
297+
298+
**MANDATORY: Before writing ANY HPO code:**
299+
300+
1. Use ONLY the native Snowflake Tuner API (in submitted training script):
301+
```python
from snowflake.ml.modeling.tune import Tuner, TunerConfig, uniform, loguniform, randint, choice
from snowflake.ml.modeling.tune.search import BayesOpt, RandomSearch, GridSearch
```
305+
306+
2. **DO NOT use Optuna, Ray Tune, or Hyperopt patterns** - use native Snowflake APIs only
307+
308+
3. **Understand search algorithm limitations:**
309+
- `BayesOpt()` only supports `uniform()` and `loguniform()` - NO integer or categorical params
310+
- `RandomSearch()` supports ALL parameter types including `randint()` and `choice()`
311+
- `GridSearch()` requires explicit value lists
312+
313+
4. **If migrating from a platform that uses Bayesian optimization with integer parameters:**
314+
- Either switch to `RandomSearch()` in Snowflake
315+
- Or use `uniform()` and cast to `int()` inside the training function
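The search-algorithm limits above can be checked before any Tuner code is written. The sketch below is plain Python and deliberately does not import the Tuner API; `bayesopt_blockers` and `cast_integer_params` are hypothetical helpers, and the parameter kinds are represented as strings named after the sampling functions.

```python
CONTINUOUS_KINDS = {"uniform", "loguniform"}  # the only kinds BayesOpt() supports


def bayesopt_blockers(param_kinds):
    """Return the sorted names of parameters that rule out BayesOpt().

    param_kinds maps parameter name -> "uniform", "loguniform",
    "randint", or "choice".
    """
    return sorted(
        name for name, kind in param_kinds.items()
        if kind not in CONTINUOUS_KINDS
    )


def cast_integer_params(sampled, integer_names):
    """Workaround for integer parameters: sample with uniform(), cast with int().

    Intended to run inside the training function on the sampled values.
    Note that int() truncates toward zero.
    """
    return {
        name: int(value) if name in integer_names else value
        for name, value in sampled.items()
    }
```

An empty result from `bayesopt_blockers` means `BayesOpt()` is an option; otherwise either switch to `RandomSearch()` or redefine the listed integer parameters with `uniform()` and apply `cast_integer_params`. Categorical (`choice`) parameters have no such cast and still require `RandomSearch()`.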
316+
317+
### Validation
318+
319+
- **ALWAYS validate** that converted code compiles: `python -m py_compile <converted_script>.py`
320+
- **Suggest a test run** with limited data before full training
321+
- **Compare outputs** if possible - metrics should be similar between platforms
322+
- **Document any differences** in behavior between source and target
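The compile check can also be run from Python using the standard-library `py_compile` module, which is what the CLI form invokes. In this sketch the helper name `compiles` is hypothetical; `py_compile.compile(..., doraise=True)` and `PyCompileError` are part of the standard library.

```python
import py_compile


def compiles(path):
    """Return (ok, message) after byte-compiling the file at `path`.

    Catches syntax errors only; it does not import or execute the script,
    so missing packages or runtime bugs are not detected.
    """
    try:
        py_compile.compile(path, doraise=True)
        return True, ""
    except py_compile.PyCompileError as exc:
        return False, str(exc)
```

Because this only verifies syntax, it complements rather than replaces the suggested test run with limited data.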
