Snowflake-Labs
diff --git a/skills/snowpark-connect/LICENSE b/skills/snowpark-connect/LICENSE
Lines changed: 22 additions & 0 deletions
diff --git a/skills/snowpark-connect/SKILL.md b/skills/snowpark-connect/SKILL.md
Lines changed: 100 additions & 0 deletions
diff --git a/skills/snowpark-connect/examples/README.md b/skills/snowpark-connect/examples/README.md
Lines changed: 143 additions & 0 deletions
diff --git a/skills/snowpark-connect/examples/analysis.json b/skills/snowpark-connect/examples/analysis.json
Lines changed: 82 additions & 0 deletions
diff --git a/skills/snowpark-connect/examples/data/applications.parquet b/skills/snowpark-connect/examples/data/applications.parquet
8.44 MB
diff --git a/skills/snowpark-connect/examples/data/companies.parquet b/skills/snowpark-connect/examples/data/companies.parquet
8.22 KB
diff --git a/skills/snowpark-connect/examples/data/jobs.parquet b/skills/snowpark-connect/examples/data/jobs.parquet
1.11 MB
@@ -0,0 +1,22 @@
+Snowflake Skills License 
+
+© 2026 Snowflake Inc. All rights reserved.
+
+LICENSE: Use of these materials (including all code, prompts, assets, files, and other components of these skills (collectively, “Skills”)) is governed by your agreement with Snowflake for the Service. If no separate agreement exists, use is governed by Snowflake’s Terms of Service (available at: https://www.snowflake.com/en/legal/terms-of-service/). 
+
+Your applicable agreement is referred to as the "Agreement." "Service" is as defined in the Agreement.
+
+ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the contrary, you may not:
+
+* Extract from the Service or retain copies of the Skills outside use with the Service;
+* Reproduce or copy the Skills, except for temporary copies created automatically during authorized use of the Service;
+* Create derivative works based on the Skills; 
+* Distribute, sublicense, or transfer the Skills to any third party;
+* Make, offer to sell, sell, or import any inventions embodied in the Skills; nor, 
+* Reverse engineer, decompile, or disassemble the Skills. 
+
+The receipt, viewing, or possession of the Skills does not convey or imply any license or right beyond those expressly granted above.
+
+Snowflake retains all rights, title, and interest in the Skills, including all copyrights, trademarks, patents, and all other applicable intellectual property rights.
+
+THE SKILLS ARE PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SKILLS OR THE USE OR OTHER DEALINGS IN THE SKILLS.
@@ -0,0 +1,100 @@
+---
+name: snowpark-connect
+title: Migrate to Snowpark Connect
+summary: Migrate, validate, optimize, and deploy PySpark workloads on Snowflake using Snowpark Connect (SCOS).
+description: |
+  Use when migrating PySpark to Snowpark Connect, validating SCOS migrations, analyzing Spark
+  compatibility, optimizing SCOS pipeline performance, or deploying PySpark jobs to Snowflake
+  compute pools via snowpark-submit.
+  Triggers: snowpark connect, scos, pyspark migration, spark connect, validate migration, pyspark compatibility, snowpark-submit.
+tools:
+  - Bash
+  - Read
+  - Write
+  - Edit
+  - Glob
+  - Grep
+  - snowflake_sql_execute
+prompt: Help me migrate this PySpark job to Snowpark Connect and validate it runs on Snowflake.
+language: en
+status: Published
+author: Snowflake Solutions Team
+type: snowflake
+---
+
+# Migrate to Snowpark Connect (SCOS)
+
+## Overview
+
+Snowpark Connect for Spark (SCOS) lets you run PySpark code on Snowflake compute with minimal changes. Most workloads only need to swap the `SparkSession` builder for `snowpark_connect.init_spark_session()`. This skill routes you through the lifecycle: set up a local dev environment, migrate code, validate behavior against the real SCOS runtime, tune performance, and deploy to production via `snowpark-submit` on SPCS compute pools.
+
+## Prerequisites
+
+- Snowflake account with an active warehouse
+- `spark-connect` connection configured in `~/.snowflake/config.toml`
+- Python 3.11 (conda recommended)
+
+## Quick Reference
+
+| Mode | Compute | Command | Use Case |
+|------|---------|---------|----------|
+| SCOS Local | Warehouse | `python script.py` | Development, testing |
+| Snowpark Submit | SPCS Compute Pool | `snowpark-submit` | Production |
+
+### Key code change
+
+```python
+# Standard PySpark
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.appName("App").getOrCreate()
+
+# SCOS
+from snowflake import snowpark_connect
+spark = snowpark_connect.init_spark_session()
+```
+
+## Workflow
+
+Recommended order: **Setup → Migrate → Validate → Optimize → Deploy**
+
+### Step 1: Detect intent
+
+Ask the user which phase they need:
+
+```
+What would you like to do with Snowpark Connect?
+
+1. Set up a local SCOS testing environment
+2. Migrate PySpark code to SCOS
+3. Validate a completed SCOS migration
+4. Optimize SCOS pipeline performance
+5. Deploy a Spark job to Snowflake via snowpark-submit
+```
+
+Wait for user selection before proceeding.
+
+### Step 2: Route to sub-skill
+
+| # | Phase | Triggers | Load |
+|---|-------|----------|------|
+| 1 | Setup | "setup", "local testing", "configure" | `scos-local-testing/SKILL.md` |
+| 2 | Migrate | "migrate", "convert", "port" | `migrate-pyspark-to-snowpark-connect/SKILL.md` |
+| 3 | Validate | "validate", "verify", "smoke test" | `validate-pyspark-to-snowpark-connect/SKILL.md` |
+| 4 | Optimize | "slow", "performance", "cross join", "memory" | `scos-performance/SKILL.md` |
+| 5 | Deploy | "snowpark-submit", "production", "compute pool" | `snowpark-submit/SKILL.md` |
+
+If intent is ambiguous, clarify before routing. If the user is new to SCOS, recommend starting with Phase 1.
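
The routing table above can be sketched as a small keyword lookup. This is a hypothetical illustration rather than the skill's actual dispatcher: `ROUTES` and `route_phase` are assumed names, and only the trigger phrases and sub-skill paths come from the table.

```python
# Hypothetical sketch of the Step 2 routing table; ROUTES and route_phase
# are illustrative names, not part of the skill itself.
ROUTES = [
    ("Setup", ("setup", "local testing", "configure"), "scos-local-testing/SKILL.md"),
    ("Migrate", ("migrate", "convert", "port"), "migrate-pyspark-to-snowpark-connect/SKILL.md"),
    ("Validate", ("validate", "verify", "smoke test"), "validate-pyspark-to-snowpark-connect/SKILL.md"),
    ("Optimize", ("slow", "performance", "cross join", "memory"), "scos-performance/SKILL.md"),
    ("Deploy", ("snowpark-submit", "production", "compute pool"), "snowpark-submit/SKILL.md"),
]


def route_phase(request: str):
    """Return (phase, sub_skill_path), or None when intent is ambiguous."""
    text = request.lower()
    matches = [
        (phase, path)
        for phase, triggers, path in ROUTES
        if any(trigger in text for trigger in triggers)
    ]
    # Exactly one matching phase is unambiguous; zero or several means
    # the caller should ask the user to clarify before routing.
    return matches[0] if len(matches) == 1 else None
```

Plain substring matching is deliberately crude here (for instance, "port" also matches "report"), so a request that hits several phases returns `None` and falls back to asking the user.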
+
+## Common Mistakes
+
+- **Skipping local setup.** Trying to migrate without a working `spark-connect` connection in `~/.snowflake/config.toml` produces opaque auth errors. Verify the connection first.
+- **Mixing PySpark and SCOS sessions.** Don't keep a `SparkSession.builder` call alongside `snowpark_connect.init_spark_session()`. Replace it fully.
+- **Assuming 1:1 API parity.** Some PySpark APIs (RDDs, certain UDFs, Hive-specific features) aren't supported. Run the validation phase against real SCOS before declaring done.
+- **Using a Python version other than 3.11.** SCOS pins to 3.11; mismatched envs cause import failures.
+- **Deploying before validating.** `snowpark-submit` runs on SPCS compute pools — debug locally first to avoid burning compute on broken jobs.
+- **Cross joins on large tables.** SCOS will execute them, but they'll be slow. Use the optimize phase to detect and rewrite.
+- **Hardcoding warehouse names.** Pull warehouse and role from `config.toml` so the same code runs in dev and prod.
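
The last point can be sketched with the standard-library `tomllib` module (Python 3.11+). This assumes `config.toml` keeps each connection under a `[connections.<name>]` table with `warehouse` and `role` keys. That layout and the helper names `connection_settings` and `load_connection_settings` are assumptions made for illustration.

```python
from pathlib import Path


def connection_settings(config: dict, name: str = "spark-connect") -> dict:
    """Pick warehouse and role for one connection from a parsed config.

    Assumes the layout [connections.<name>] with 'warehouse' and 'role'
    keys; adjust if your config.toml is organized differently.
    """
    conn = config.get("connections", {}).get(name)
    if conn is None:
        raise KeyError(f"connection {name!r} not found in config")
    return {"warehouse": conn.get("warehouse"), "role": conn.get("role")}


def load_connection_settings(name: str = "spark-connect") -> dict:
    """Read ~/.snowflake/config.toml and return settings for `name`."""
    import tomllib  # standard library in Python 3.11+

    path = Path.home() / ".snowflake" / "config.toml"
    with path.open("rb") as fh:
        return connection_settings(tomllib.load(fh), name)
```

Keeping the lookup separate from the file read makes it easy to check the logic against an in-memory dictionary before pointing it at a real config file.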
+
+## Output
+
+The user is routed to the appropriate sub-skill, which handles the detailed workflow for that phase.
@@ -0,0 +1,143 @@
+# PySpark to Snowpark Connect (SCOS) Migration with Cortex Code
+
+This document captures the end-to-end workflow for migrating a PySpark workload to Snowflake SCOS using the Cortex Code **Snowpark Connect** skill.
+
+## Quick Reference
+
+### Setup
+```bash
+conda run -n scos python -c "from snowflake import snowpark_connect; print('OK')"  # verify runtime
+snow sql -q "SELECT 1" -c snowpark-connect  # verify Snowflake connection
+```
+
+### Migrate
+```bash
+# In Cortex Code: activate snowpark-connect skill → select "Migrate"
+# Produces pyspark_transform_scos.py + analysis.json from pyspark_transform.py
+```
+
+### Validate
+```bash
+cd pyspark_transform_scos_test && conda run -n scos --no-capture-output python entrypoint.py
+```
+
+### Optimize
+```bash
+# In Cortex Code: activate snowpark-connect skill → select "Optimize"
+# Converts Python UDFs to native SQL expressions, adds case sensitivity guard
+```
+
+### Deploy
+```bash
+snow sql -q "CREATE STAGE IF NOT EXISTS SCOS_APPS_STAGE DIRECTORY=(ENABLE=TRUE)" -c snowpark-connect
+snow sql -q "CREATE STAGE IF NOT EXISTS SCOS_DATA_STAGE DIRECTORY=(ENABLE=TRUE)" -c snowpark-connect
+snow sql -q "PUT file://pyspark_transform_scos.py @SCOS_APPS_STAGE/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+snow sql -q "PUT file://data/jobs.parquet @SCOS_DATA_STAGE/data/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+snow sql -q "PUT file://data/companies.parquet @SCOS_DATA_STAGE/data/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+snow sql -q "PUT file://data/applications.parquet @SCOS_DATA_STAGE/data/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+conda run -n scos --no-capture-output snowpark-submit \
+    --snowflake-stage=@DEMO.SPCONN.SCOS_APPS_STAGE \
+    --snowflake-workload-name=scos_job_analytics \
+    --snowflake-connection-name=snowpark-connect \
+    --compute-pool=SNOWPARK_SUBMIT_POOL_XS \
+    pyspark_transform_scos.py
+```
+
+---
+
+## Project Structure
+
+```
+example/
+├── pyspark_transform.py              # Original PySpark workload
+├── pyspark_transform_scos.py         # Migrated SCOS workload
+├── analysis.json                     # Compatibility analysis results
+├── data/                             # Source parquet files
+├── output/                           # Pipeline output
+├── pyspark_transform_scos_test/      # Validation test directory
+│   ├── entrypoint.py                 # Test entrypoint with synthetic data
+│   ├── pyspark_transform_scos.py     # Copy of migrated workload
+│   ├── data/                         # Synthetic test data
+│   ├── output/                       # Test output
+│   └── output.log                    # Validation run log
+└── README.md
+```
+
+---
+
+## Detailed Steps
+
+### 1. Local Testing Environment Setup
+
+**Prerequisites:** conda env `scos` with `snowpark-connect`, Snowflake connection `snowpark-connect` in `~/.snowflake/config.toml`, Python 3.11.
+
+```bash
+conda run -n scos python -c "from snowflake import snowpark_connect; print('OK')"
+```
+
+The migration analyzer uses a RAG-based Cortex Search Service (`SCOS_MIGRATION.PUBLIC.SCOS_COMPAT_ISSUES_SERVICE`), which is initialized automatically on first use.
+
+---
+
+### 2. Migration
+
+Activate the Snowpark Connect skill in Cortex Code and select **Migrate**. The workflow has six steps: analyze → copy → apply fixes → update imports → add header → verify.
+
+**Analysis found 8 issues** in `pyspark_transform.py`:
+
+| Lines | Risk | Issue | Action |
+|-------|------|-------|--------|
+| 46 | **1.0** | `spark.sparkContext.setLogLevel()` - `sparkContext` (RDD API) not supported | Removed |
+| 49-51 | 0.2 | Local parquet file reads | Added stage performance tip |
+| 100-104 | 0.2 | `coalesce(1)` is a no-op in SCOS | Commented as no-op |
+| 57-80 | 0.1-0.15 | Window/filter/groupBy patterns | Reviewed, safe |
+
+**Key change** — session initialization:
+```python
+# BEFORE (standard PySpark)
+from pyspark.sql import SparkSession
+spark = SparkSession.builder \
+    .master("local[*]").getOrCreate()
+
+# AFTER (SCOS)
+from snowflake import snowpark_connect
+spark = snowpark_connect.init_spark_session()
+```
+
+---
+
+### 3. Validation
+
+This phase runs a smoke test with synthetic data on the real SCOS runtime. The entrypoint writes 5 jobs, 3 companies, and 5 applications as Parquet files, then calls the real `main()`.
+
+```bash
+cd pyspark_transform_scos_test && conda run -n scos --no-capture-output python entrypoint.py
+```
+
+**Result:** All pipeline stages passed — parquet reads, window functions, joins, aggregations, parquet write.
+
+---
+
+### 4. Optimization
+
+Activate the Snowpark Connect skill and select **Optimize**. Changes applied:
+
+- **Python UDFs → native SQL expressions**: Replaced `@F.udf` functions with `F.when/otherwise` chains, which eliminates the serialization/deserialization overhead of Python UDFs
+- **Case sensitivity**: Added `spark.conf.set("spark.sql.caseSensitive", "true")` to prevent column uppercasing
+- **Array indexing**: Replaced `parts[Column]` (indexing an array with a Column expression) with `F.element_at(parts, -1)`, which returns the last element (required for Spark Connect mode)
+
+---
+
+### 5. Deployment
+
+Activate the Snowpark Connect skill and select **Deploy**. Uses `snowpark-submit` to run on SPCS compute pools.
+
+**Key pattern** — dual-mode session for local dev vs. snowpark-submit:
+```python
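+import os
+
+from pyspark.sql import SparkSession
+
+
+def create_session():
+    # SPARK_REMOTE is set: attach to the existing Spark Connect session.
+    if os.environ.get("SPARK_REMOTE"):
+        return SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()
+    else:
+        # Local development: start an SCOS session instead.
+        from snowflake import snowpark_connect
+        return snowpark_connect.init_spark_session()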
+
+Without this check, the job fails under `snowpark-submit` with `RuntimeError: Snowpark Connect cannot be run inside of a Spark environment`, because `snowpark-submit` already provides a Spark Connect session.
+
+**Deployment result:** 51K jobs + 200K applications processed, output written to `@SCOS_DATA_STAGE/output/job_analytics/` (893 KB).
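
The `snowpark-submit` invocation from the Quick Reference can also be assembled programmatically, for example from a CI script. The sketch below only builds the argument list from the flags shown in this README. `build_submit_command` is a hypothetical helper, and no flags beyond those shown are assumed to exist.

```python
import shlex


def build_submit_command(script, stage, workload_name, connection_name, compute_pool):
    """Build the snowpark-submit argv using only flags shown in this README."""
    return [
        "snowpark-submit",
        f"--snowflake-stage={stage}",
        f"--snowflake-workload-name={workload_name}",
        f"--snowflake-connection-name={connection_name}",
        f"--compute-pool={compute_pool}",
        script,
    ]


cmd = build_submit_command(
    script="pyspark_transform_scos.py",
    stage="@DEMO.SPCONN.SCOS_APPS_STAGE",
    workload_name="scos_job_analytics",
    connection_name="snowpark-connect",
    compute_pool="SNOWPARK_SUBMIT_POOL_XS",
)
# Print a copy-pasteable command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building a list instead of a shell string avoids quoting problems if a stage or workload name ever contains characters the shell treats specially.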
@@ -0,0 +1,82 @@
+[
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "46-46",
+    "code": "spark.sparkContext.setLogLevel(\"WARN\")",
+    "final_risk": 1.0,
+    "root_cause": "Uses '.sparkContext' which is not supported in SCOS",
+    "explanation": "RDD operations are not supported in SCOS.",
+    "fix": "Convert to DataFrame operations. RDD operations are not supported in SCOS.",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "50-50",
+    "code": "companies = spark.read.parquet(os.path.join(DATA_DIR, \"companies.parquet\"))",
+    "final_risk": 0.2,
+    "root_cause": "We don't support partitioned write in local files. 4th argument i.e numPartitions in range function is a no-op in Snowpark Connect. ",
+    "explanation": "Reading parquet files is supported in SCOS. The preliminary assessment notes a potential performance concern when reading from external paths rather than Snowflake stages, but this is not a compatibility failure.",
+    "fix": "For better performance, consider uploading files to a Snowflake stage first using session.file.put().",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "49-49",
+    "code": "jobs = spark.read.parquet(os.path.join(DATA_DIR, \"jobs.parquet\"))",
+    "final_risk": 0.2,
+    "root_cause": "We don't support partitioned write in local files. 4th argument i.e numPartitions in range function is a no-op in Snowpark Connect. ",
+    "explanation": "Reading parquet files is supported in SCOS. The warning is about potential performance differences when reading from external paths, not a compatibility failure.",
+    "fix": "For better performance, consider uploading files to a Snowflake stage first using session.file.put().",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "51-51",
+    "code": "applications = spark.read.parquet(os.path.join(DATA_DIR, \"applications.parquet\"))",
+    "final_risk": 0.2,
+    "root_cause": "We don't support partitioned write in local files. 4th argument i.e numPartitions in range function is a no-op in Snowpark Connect. ",
+    "explanation": "Reading parquet files is supported in SCOS. The warning relates to potential performance differences when reading from external paths, not a compatibility failure.",
+    "fix": "For better performance, consider uploading files to a Snowflake stage first using session.file.put().",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "100-104",
+    "code": "final.select(\n        \"job_id\", \"company_name\", \"industry\", \"title\", \"state\",\n        \"salary_bucket\", \"salary_midpoint\", \"posted_month\",\n        \"total_applications\", \"unique_applicants\", \"hires\",\n    ).coalesce(1).write.mode(\"overwrite\").parquet(os.path.join(OUTPUT_DIR, \"job_analytics\"))",
+    "final_risk": 0.2,
+    "root_cause": "coalesce() is a no-op in SCOS - the code will run but may produce multiple output files instead of the intended single file",
+    "explanation": "The coalesce(1) call is a no-op in SCOS, meaning the code will execute successfully but may not produce a single output file as intended. This is a behavioral difference rather than a failure.",
+    "fix": "If single-file output is required, consider post-processing to merge files or use Snowflake-native methods for file consolidation. Otherwise, the code will work but with potentially multiple output files.",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "57-59",
+    "code": "jobs_deduped = jobs.withColumn(\"_rn\", F.row_number().over(w)) \\\n        .filter(F.col(\"_rn\") == 1) \\\n        .drop(\"_rn\")",
+    "final_risk": 0.15,
+    "root_cause": "Cannot filter using original DataFrame columns after transformation operations (drop, select, withColumn, etc.)",
+    "explanation": "The code uses standard window functions and filtering on a newly created column. Unlike the similar test cases which fail when referencing original DataFrame columns after transformations, this code filters on '_rn' which is created in the same transformation chain.",
+    "fix": null,
+    "confidence": "MEDIUM"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "60-62",
+    "code": "jobs_clean = jobs_deduped \\\n        .filter(F.col(\"salary_min\").isNotNull()) \\\n        .filter(F.col(\"salary_max\") > F.col(\"salary_min\"))",
+    "final_risk": 0.1,
+    "root_cause": "Cannot filter using original DataFrame columns after transformation operations (drop, select, withColumn, etc.)",
+    "explanation": "The input code performs standard filtering on existing columns. The similar test cases fail due to filtering on columns after drop() operations, which doesn't apply here since we're filtering directly on the DataFrame's own columns.",
+    "fix": null,
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "76-80",
+    "code": "app_stats = applications.groupBy(\"job_id\").agg(\n        F.count(\"*\").alias(\"total_applications\"),\n        F.countDistinct(\"applicant_id\").alias(\"unique_applicants\"),\n        F.sum(F.when(F.col(\"status\") == \"hired\", 1).otherwise(0)).alias(\"hires\"),\n    )",
+    "final_risk": 0.1,
+    "root_cause": "Ambiguous column reference in select after groupBy and agg on the same column",
+    "explanation": "The input code uses proper aliasing for all aggregations, avoiding the ambiguous column reference issue from the similar test cases. The first/last non-determinism issue doesn't apply since those functions aren't used.",
+    "fix": null,
+    "confidence": "HIGH"
+  }
+]
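
The findings above can be triaged with a few lines of standard-library Python. This sketch relies only on fields visible in the file (`lines`, `final_risk`, `root_cause`). The `blocking_issues` helper and the 0.5 threshold are hypothetical choices, not part of the analyzer.

```python
import json


def blocking_issues(findings, threshold=0.5):
    """Return findings at or above `threshold`, highest risk first."""
    flagged = [f for f in findings if f["final_risk"] >= threshold]
    return sorted(flagged, key=lambda f: f["final_risk"], reverse=True)


def load_findings(path="analysis.json"):
    """Load the analyzer output, a JSON array of finding objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Two entries abridged from the file above, used here as sample input.
sample = [
    {"lines": "46-46", "final_risk": 1.0, "root_cause": "Uses '.sparkContext' which is not supported in SCOS"},
    {"lines": "100-104", "final_risk": 0.2, "root_cause": "coalesce() is a no-op in SCOS"},
]
for finding in blocking_issues(sample):
    print(finding["lines"], finding["final_risk"], finding["root_cause"])
```

With the full file, `blocking_issues(load_findings())` would surface the `sparkContext` finding first, since it is the only entry with a risk of 1.0.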