Snowflake-Labs
diff --git a/skills/snowpark-connect/LICENSE b/skills/snowpark-connect/LICENSE
Lines changed: 22 additions & 0 deletions
diff --git a/skills/snowpark-connect/SKILL.md b/skills/snowpark-connect/SKILL.md
Lines changed: 98 additions & 0 deletions
diff --git a/skills/snowpark-connect/examples/README.md b/skills/snowpark-connect/examples/README.md
Lines changed: 143 additions & 0 deletions
diff --git a/skills/snowpark-connect/examples/analysis.json b/skills/snowpark-connect/examples/analysis.json
Lines changed: 82 additions & 0 deletions
diff --git a/skills/snowpark-connect/examples/data/applications.parquet b/skills/snowpark-connect/examples/data/applications.parquet
8.44 MB
diff --git a/skills/snowpark-connect/examples/data/companies.parquet b/skills/snowpark-connect/examples/data/companies.parquet
8.22 KB
diff --git a/skills/snowpark-connect/examples/data/jobs.parquet b/skills/snowpark-connect/examples/data/jobs.parquet
1.11 MB
@@ -0,0 +1,22 @@
+Snowflake Skills License 
+
+© 2026 Snowflake Inc. All rights reserved.
+
+LICENSE: Use of these materials (including all code, prompts, assets, files, and other components of these skills (collectively, “Skills”)) is governed by your agreement with Snowflake for the Service. If no separate agreement exists, use is governed by Snowflake’s Terms of Service (available at: https://www.snowflake.com/en/legal/terms-of-service/). 
+
+Your applicable agreement is referred to as the "Agreement." "Service" is as defined in the Agreement.
+
+ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the contrary, you may not:
+
+* Extract from the Service or retain copies of the Skills outside use with the Service;
+* Reproduce or copy the Skills, except for temporary copies created automatically during authorized use of the Service;
+* Create derivative works based on the Skills; 
+* Distribute, sublicense, or transfer the Skills to any third party;
+* Make, offer to sell, sell, or import any inventions embodied in the Skills; nor, 
+* Reverse engineer, decompile, or disassemble the Skills. 
+
+The receipt, viewing, or possession of the Skills does not convey or imply any license or right beyond those expressly granted above.
+
+Snowflake retains all rights, title, and interest in the Skills, including all copyrights, trademarks, patents, and all other applicable intellectual property rights.
+
+THE SKILLS ARE PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SKILLS OR THE USE OR OTHER DEALINGS IN THE SKILLS.
@@ -0,0 +1,98 @@
+---
+name: snowpark-connect
+title: Snowpark Connect for Spark
+summary: Route PySpark migration, validation, and deployment work to the right Snowpark Connect (SCOS) sub-flow.
+description: |
+  Use when migrating PySpark code to Snowpark Connect (SCOS), setting up a local SCOS testing
+  environment, validating an SCOS migration, tuning SCOS pipeline performance, or deploying a
+  PySpark job to Snowflake compute pools via snowpark-submit. This umbrella skill detects intent
+  and routes to the matching sub-flow.
+  Triggers: snowpark connect, scos, pyspark migration, spark connect, validate migration, pyspark compatibility, snowpark-submit
+tools:
+  - Bash
+  - Read
+  - Write
+  - Edit
+  - Glob
+  - Grep
+prompt: Help me migrate my PySpark job to Snowpark Connect.
+language: en
+status: Published
+author: Snowflake Solutions Team
+type: snowflake
+---
+
+# Snowpark Connect for Spark (SCOS)
+
+## Overview
+
+Snowpark Connect for Spark (SCOS) lets you run PySpark code against Snowflake compute. This skill is an umbrella that routes you to the right sub-flow based on what you need: setting up a local dev loop, migrating existing PySpark, validating a migration, tuning performance, or deploying to production.
+
+In the simplest case, the only code change needed to switch a PySpark job to SCOS is the session bootstrap:
+
+```python
+# Standard PySpark
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.appName("App").getOrCreate()
+
+# SCOS
+from snowflake import snowpark_connect
+spark = snowpark_connect.init_spark_session()
+```
+
+## Prerequisites
+
+- Snowflake account with an active warehouse
+- A `spark-connect` connection configured in `~/.snowflake/config.toml`
+- Python 3.11 (conda recommended)
+
+## Run modes
+
+| Mode | Compute | Command | Use case |
+|------|---------|---------|----------|
+| SCOS Local | Warehouse | `python script.py` | Development, testing |
+| Snowpark Submit | SPCS Compute Pool | `snowpark-submit` | Production |
+
+## Workflow
+
+Recommended order: Setup → Migrate → Validate → Optimize → Deploy.
+
+### Step 1: Detect intent
+
+Ask the user which sub-flow they need:
+
+1. Set up local SCOS testing environment
+2. Migrate PySpark code to SCOS
+3. Validate a completed SCOS migration
+4. Optimize SCOS pipeline performance
+5. Deploy a Spark job via `snowpark-submit`
+
+⚠️ STOPPING POINT: Wait for the user to pick a sub-flow before loading any sub-skill. If the request is ambiguous, ask one clarifying question first. If the user is new to SCOS, recommend starting with sub-flow 1 (Setup).
+
+### Step 2: Route to sub-flow
+
+| # | Phase | Trigger keywords | Load |
+|---|-------|------------------|------|
+| 1 | Setup | setup, local testing, dev environment, configure | `scos-local-testing/INSTRUCTIONS.md` |
+| 2 | Migrate | migrate, convert, port, rewrite for SCOS | `migrate-pyspark-to-snowpark-connect/INSTRUCTIONS.md` |
+| 3 | Validate | validate, verify, test migration, smoke test | `validate-pyspark-to-snowpark-connect/INSTRUCTIONS.md` |
+| 4 | Optimize | slow, performance, cross join, memory, optimize | `scos-performance/INSTRUCTIONS.md` |
+| 5 | Deploy | snowpark-submit, deploy, production, compute pool | `snowpark-submit/INSTRUCTIONS.md` |
+
+Each sub-flow contains its own multi-step workflow, code diffs, and verification commands.
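Review note: the routing table plus the Step 1 stopping point can be read as a small keyword matcher. The sketch below is one possible reading, not the skill's actual dispatch logic: the keywords and paths are copied from the table, while the names `ROUTES` and `route` and the whole-word matching rule are assumptions. It returns `None` when zero or several phases match, which corresponds to asking one clarifying question.

```python
import re

# Phase -> (trigger keywords, sub-flow file), copied from the routing table.
ROUTES = {
    "Setup": (("setup", "local testing", "dev environment", "configure"),
              "scos-local-testing/INSTRUCTIONS.md"),
    "Migrate": (("migrate", "convert", "port", "rewrite for scos"),
                "migrate-pyspark-to-snowpark-connect/INSTRUCTIONS.md"),
    "Validate": (("validate", "verify", "test migration", "smoke test"),
                 "validate-pyspark-to-snowpark-connect/INSTRUCTIONS.md"),
    "Optimize": (("slow", "performance", "cross join", "memory", "optimize"),
                 "scos-performance/INSTRUCTIONS.md"),
    "Deploy": (("snowpark-submit", "deploy", "production", "compute pool"),
               "snowpark-submit/INSTRUCTIONS.md"),
}


def _mentions(text, keyword):
    # Whole-word match so that "port" does not fire on "export" or "report".
    return re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None


def route(request):
    """Return (phase, path) if exactly one phase matches, else None."""
    text = request.lower()
    hits = [(phase, path)
            for phase, (keywords, path) in ROUTES.items()
            if any(_mentions(text, k) for k in keywords)]
    return hits[0] if len(hits) == 1 else None


print(route("Help me migrate this ETL job"))
print(route("validate and deploy"))  # ambiguous, so None
```

Returning `None` rather than guessing keeps the matcher consistent with the stopping point: no sub-skill is loaded until a single sub-flow is identified.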
+
+## Common Mistakes
+
+- Skipping Setup and trying to migrate first — the local dev loop catches issues fast; production runs do not.
+- Editing more than the session bootstrap during migration. Start by changing only `SparkSession.builder...` to `snowpark_connect.init_spark_session()`, run, then fix what actually breaks.
+- Mixing run modes mid-flow. Use SCOS Local for iteration; switch to `snowpark-submit` only when the job is stable.
+- Tuning performance before the job runs end-to-end. Validate correctness first, optimize second.
+- Hardcoding credentials. Configure `spark-connect` in `~/.snowflake/config.toml` and let the SDK resolve auth.
+
+## Stopping Points
+
+- Step 1 — wait for the user to pick a sub-flow before loading any sub-skill content or running commands.
+
+## Output
+
+The user is routed into the matching sub-flow, which then drives the rest of the work.
@@ -0,0 +1,143 @@
+# PySpark to Snowpark Connect (SCOS) Migration with Cortex Code
+
+This document captures the end-to-end workflow for migrating a PySpark workload to Snowflake SCOS using the Cortex Code **Snowpark Connect** skill.
+
+## Quick Reference
+
+### Setup
+```bash
+conda run -n scos python -c "from snowflake import snowpark_connect; print('OK')"  # verify runtime
+snow sql -q "SELECT 1" -c snowpark-connect  # verify Snowflake connection
+```
+
+### Migrate
+```bash
+# In Cortex Code: activate snowpark-connect skill → select "Migrate"
+# Produces pyspark_transform_scos.py + analysis.json from pyspark_transform.py
+```
+
+### Validate
+```bash
+cd pyspark_transform_scos_test && conda run -n scos --no-capture-output python entrypoint.py
+```
+
+### Optimize
+```bash
+# In Cortex Code: activate snowpark-connect skill → select "Optimize"
+# Converts Python UDFs to native SQL expressions, adds case sensitivity guard
+```
+
+### Deploy
+```bash
+snow sql -q "CREATE STAGE IF NOT EXISTS SCOS_APPS_STAGE DIRECTORY=(ENABLE=TRUE)" -c snowpark-connect
+snow sql -q "CREATE STAGE IF NOT EXISTS SCOS_DATA_STAGE DIRECTORY=(ENABLE=TRUE)" -c snowpark-connect
+snow sql -q "PUT file://pyspark_transform_scos.py @SCOS_APPS_STAGE/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+snow sql -q "PUT file://data/jobs.parquet @SCOS_DATA_STAGE/data/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+snow sql -q "PUT file://data/companies.parquet @SCOS_DATA_STAGE/data/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+snow sql -q "PUT file://data/applications.parquet @SCOS_DATA_STAGE/data/ AUTO_COMPRESS=FALSE OVERWRITE=TRUE" -c snowpark-connect
+conda run -n scos --no-capture-output snowpark-submit \
+    --snowflake-stage=@DEMO.SPCONN.SCOS_APPS_STAGE \
+    --snowflake-workload-name=scos_job_analytics \
+    --snowflake-connection-name=snowpark-connect \
+    --compute-pool=SNOWPARK_SUBMIT_POOL_XS \
+    pyspark_transform_scos.py
+```
+
+---
+
+## Project Structure
+
+```
+example/
+├── pyspark_transform.py              # Original PySpark workload
+├── pyspark_transform_scos.py         # Migrated SCOS workload
+├── analysis.json                     # Compatibility analysis results
+├── data/                             # Source parquet files
+├── output/                           # Pipeline output
+├── pyspark_transform_scos_test/      # Validation test directory
+│   ├── entrypoint.py                 # Test entrypoint with synthetic data
+│   ├── pyspark_transform_scos.py     # Copy of migrated workload
+│   ├── data/                         # Synthetic test data
+│   ├── output/                       # Test output
+│   └── output.log                    # Validation run log
+└── README.md
+```
+
+---
+
+## Detailed Steps
+
+### 1. Local Testing Environment Setup
+
+**Prerequisites:** conda env `scos` with `snowpark-connect`, Snowflake connection `snowpark-connect` in `~/.snowflake/config.toml`, Python 3.11.
+
+```bash
+conda run -n scos python -c "from snowflake import snowpark_connect; print('OK')"
+```
+
+The migration analyzer uses a RAG-based Cortex Search Service (`SCOS_MIGRATION.PUBLIC.SCOS_COMPAT_ISSUES_SERVICE`), which is initialized automatically on first use.
+
+---
+
+### 2. Migration
+
+Activate the Snowpark Connect skill in Cortex Code and select **Migrate**. The workflow has six steps: analyze → copy → apply fixes → update imports → add header → verify.
+
+**Analysis found 8 issues** in `pyspark_transform.py`:
+
+| Lines | Risk | Issue | Action |
+|-------|------|-------|--------|
+| 46 | **1.0** | `spark.sparkContext.setLogLevel()` - `.sparkContext` (RDD API) is not supported | Removed |
+| 49-51 | 0.2 | Local parquet file reads | Added stage performance tip |
+| 100-104 | 0.2 | `coalesce(1)` is a no-op in SCOS | Commented as no-op |
+| 57-80 | 0.1-0.15 | Window/filter/groupBy patterns | Reviewed, safe |
+
+**Key change** — session initialization:
+```python
+# BEFORE
+from pyspark.sql import SparkSession
+spark = SparkSession.builder \
+    .master("local[*]").getOrCreate()
+
+# AFTER
+from snowflake import snowpark_connect
+spark = snowpark_connect.init_spark_session()
+```
+
+---
+
+### 3. Validation
+
+Smoke test using synthetic data on the real SCOS runtime. The entrypoint creates 5 jobs, 3 companies, 5 applications as parquet, then calls the real `main()`.
+
+```bash
+cd pyspark_transform_scos_test && conda run -n scos --no-capture-output python entrypoint.py
+```
+
+**Result:** All pipeline stages passed — parquet reads, window functions, joins, aggregations, parquet write.
+
+---
+
+### 4. Optimization
+
+Activate the Snowpark Connect skill and select **Optimize**. Changes applied:
+
+- **Python UDFs → native SQL expressions**: Replaced `@F.udf` functions with `F.when/otherwise` chains (eliminates serde overhead)
+- **Case sensitivity**: Added `spark.conf.set("spark.sql.caseSensitive", "true")` to prevent column uppercasing
+- **Array indexing**: Replaced `parts[Column]` with `F.element_at(parts, -1)` (required for Spark Connect mode)
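Review note: the README does not show the UDF that was replaced, so the following is only a sketch of the UDF-to-native pattern from the first bullet. The bucket boundaries and the names `BUCKETS`, `DEFAULT`, `salary_bucket_py`, and `salary_bucket_expr` are hypothetical. The first function is the row-at-a-time logic a `@F.udf` would typically wrap; the second builds the same rules as a `F.when(...).otherwise(...)` column expression.

```python
# Hypothetical rules: (exclusive upper bound, label), checked in order.
BUCKETS = [(50_000, "low"), (100_000, "mid"), (150_000, "high")]
DEFAULT = "very_high"


def salary_bucket_py(midpoint):
    """Plain-Python version: the kind of body a Python UDF would run per row."""
    if midpoint is None:
        return None
    for upper, label in BUCKETS:
        if midpoint < upper:
            return label
    return DEFAULT


def salary_bucket_expr(col_name):
    """Same rules as a native Column expression (no Python UDF involved)."""
    # Imported lazily so the plain-Python half works without PySpark installed.
    from pyspark.sql import functions as F

    c = F.col(col_name)
    # Handle NULL first; otherwise a NULL comparison would fall through
    # to the default label instead of staying NULL.
    expr = F.when(c.isNull(), F.lit(None))
    for upper, label in BUCKETS:
        expr = expr.when(c < upper, F.lit(label))
    return expr.otherwise(F.lit(DEFAULT))


print(salary_bucket_py(72_000))
```

Usage would look like `df.withColumn("salary_bucket", salary_bucket_expr("salary_midpoint"))`, where both column names appear in the pipeline's final `select`.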
+
+---
+
+### 5. Deployment
+
+Activate the Snowpark Connect skill and select **Deploy**. Uses `snowpark-submit` to run on SPCS compute pools.
+
+**Key pattern** — dual-mode session for local dev vs. snowpark-submit:
+```python
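+import os
+
+from pyspark.sql import SparkSession
+
+def create_session():
+    # SPARK_REMOTE set: a Spark Connect endpoint is already provided, so reuse it.
+    if os.environ.get("SPARK_REMOTE"):
+        return SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()
+    else:
+        from snowflake import snowpark_connect
+        return snowpark_connect.init_spark_session()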
+
+Without this, `snowpark-submit` fails with `RuntimeError: Snowpark Connect cannot be run inside of a Spark environment` because it already provides a Spark Connect session.
+
+**Deployment result:** 51K jobs + 200K applications processed, output written to `@SCOS_DATA_STAGE/output/job_analytics/` (893 KB).
@@ -0,0 +1,82 @@
+[
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "46-46",
+    "code": "spark.sparkContext.setLogLevel(\"WARN\")",
+    "final_risk": 1.0,
+    "root_cause": "Uses '.sparkContext' which is not supported in SCOS",
+    "explanation": "RDD operations are not supported in SCOS.",
+    "fix": "Convert to DataFrame operations. RDD operations are not supported in SCOS.",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "50-50",
+    "code": "companies = spark.read.parquet(os.path.join(DATA_DIR, \"companies.parquet\"))",
+    "final_risk": 0.2,
+    "root_cause": "We don't support partitioned write in local files. 4th argument i.e numPartitions in range function is a no-op in Snowpark Connect. ",
+    "explanation": "Reading parquet files is supported in SCOS. The preliminary assessment notes a potential performance concern when reading from external paths rather than Snowflake stages, but this is not a compatibility failure.",
+    "fix": "For better performance, consider uploading files to a Snowflake stage first using session.file.put().",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "49-49",
+    "code": "jobs = spark.read.parquet(os.path.join(DATA_DIR, \"jobs.parquet\"))",
+    "final_risk": 0.2,
+    "root_cause": "We don't support partitioned write in local files. 4th argument i.e numPartitions in range function is a no-op in Snowpark Connect. ",
+    "explanation": "Reading parquet files is supported in SCOS. The warning is about potential performance differences when reading from external paths, not a compatibility failure.",
+    "fix": "For better performance, consider uploading files to a Snowflake stage first using session.file.put().",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "51-51",
+    "code": "applications = spark.read.parquet(os.path.join(DATA_DIR, \"applications.parquet\"))",
+    "final_risk": 0.2,
+    "root_cause": "We don't support partitioned write in local files. 4th argument i.e numPartitions in range function is a no-op in Snowpark Connect. ",
+    "explanation": "Reading parquet files is supported in SCOS. The warning relates to potential performance differences when reading from external paths, not a compatibility failure.",
+    "fix": "For better performance, consider uploading files to a Snowflake stage first using session.file.put().",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "100-104",
+    "code": "final.select(\n        \"job_id\", \"company_name\", \"industry\", \"title\", \"state\",\n        \"salary_bucket\", \"salary_midpoint\", \"posted_month\",\n        \"total_applications\", \"unique_applicants\", \"hires\",\n    ).coalesce(1).write.mode(\"overwrite\").parquet(os.path.join(OUTPUT_DIR, \"job_analytics\"))",
+    "final_risk": 0.2,
+    "root_cause": "coalesce() is a no-op in SCOS - the code will run but may produce multiple output files instead of the intended single file",
+    "explanation": "The coalesce(1) call is a no-op in SCOS, meaning the code will execute successfully but may not produce a single output file as intended. This is a behavioral difference rather than a failure.",
+    "fix": "If single-file output is required, consider post-processing to merge files or use Snowflake-native methods for file consolidation. Otherwise, the code will work but with potentially multiple output files.",
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "57-59",
+    "code": "jobs_deduped = jobs.withColumn(\"_rn\", F.row_number().over(w)) \\\n        .filter(F.col(\"_rn\") == 1) \\\n        .drop(\"_rn\")",
+    "final_risk": 0.15,
+    "root_cause": "Cannot filter using original DataFrame columns after transformation operations (drop, select, withColumn, etc.)",
+    "explanation": "The code uses standard window functions and filtering on a newly created column. Unlike the similar test cases which fail when referencing original DataFrame columns after transformations, this code filters on '_rn' which is created in the same transformation chain.",
+    "fix": null,
+    "confidence": "MEDIUM"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "60-62",
+    "code": "jobs_clean = jobs_deduped \\\n        .filter(F.col(\"salary_min\").isNotNull()) \\\n        .filter(F.col(\"salary_max\") > F.col(\"salary_min\"))",
+    "final_risk": 0.1,
+    "root_cause": "Cannot filter using original DataFrame columns after transformation operations (drop, select, withColumn, etc.)",
+    "explanation": "The input code performs standard filtering on existing columns. The similar test cases fail due to filtering on columns after drop() operations, which doesn't apply here since we're filtering directly on the DataFrame's own columns.",
+    "fix": null,
+    "confidence": "HIGH"
+  },
+  {
+    "file": "/Users/pjain/git/coco-work/test_scos_migration/example/pyspark_transform.py",
+    "lines": "76-80",
+    "code": "app_stats = applications.groupBy(\"job_id\").agg(\n        F.count(\"*\").alias(\"total_applications\"),\n        F.countDistinct(\"applicant_id\").alias(\"unique_applicants\"),\n        F.sum(F.when(F.col(\"status\") == \"hired\", 1).otherwise(0)).alias(\"hires\"),\n    )",
+    "final_risk": 0.1,
+    "root_cause": "Ambiguous column reference in select after groupBy and agg on the same column",
+    "explanation": "The input code uses proper aliasing for all aggregations, avoiding the ambiguous column reference issue from the similar test cases. The first/last non-determinism issue doesn't apply since those functions aren't used.",
+    "fix": null,
+    "confidence": "HIGH"
+  }
+]
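Review note: the eight findings above map onto the README's issue table (removed, tip added, reviewed as safe). The sketch below shows one way to triage them. The records are trimmed to three fields with `fix` shortened to its first sentence, and the 0.5 threshold, the bucket names, and the function name `triage` are assumptions rather than part of the analyzer.

```python
UPLOAD_TIP = ("For better performance, consider uploading files to a "
              "Snowflake stage first using session.file.put().")

# Trimmed copies of the eight records in analysis.json.
FINDINGS = [
    {"lines": "46-46", "final_risk": 1.0,
     "fix": "Convert to DataFrame operations."},
    {"lines": "50-50", "final_risk": 0.2, "fix": UPLOAD_TIP},
    {"lines": "49-49", "final_risk": 0.2, "fix": UPLOAD_TIP},
    {"lines": "51-51", "final_risk": 0.2, "fix": UPLOAD_TIP},
    {"lines": "100-104", "final_risk": 0.2,
     "fix": "If single-file output is required, consider post-processing "
            "to merge files or use Snowflake-native methods for file "
            "consolidation."},
    {"lines": "57-59", "final_risk": 0.15, "fix": None},
    {"lines": "60-62", "final_risk": 0.1, "fix": None},
    {"lines": "76-80", "final_risk": 0.1, "fix": None},
]


def triage(findings, blocking_threshold=0.5):
    """Group findings by what the migrator has to do about them."""
    buckets = {"blocking": [], "review": [], "no_action": []}
    # Highest risk first; sorted() is stable, so ties keep file order.
    for f in sorted(findings, key=lambda f: f["final_risk"], reverse=True):
        if f["final_risk"] >= blocking_threshold:
            key = "blocking"      # must change before the job can run
        elif f.get("fix"):
            key = "review"        # runs, but a suggestion is attached
        else:
            key = "no_action"     # flagged by similarity, judged safe
        buckets[key].append(f["lines"])
    return buckets


for name, lines in triage(FINDINGS).items():
    print(name, lines)
```

Under these assumptions only line 46 is blocking, which matches the single `1.0` row in the README table.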