Add BLOCK column clustering for Snowflake tables #181

cayetanobv · 2025-11-27T15:20:23Z

Summary

This PR adds automatic clustering by the BLOCK column when creating new Snowflake tables, bringing Snowflake in line with BigQuery's existing clustering behavior for improved query performance.

Background

BigQuery tables are already created with explicit clustering on the BLOCK column (via clustering_fields=["block"] in LoadJobConfig), which optimizes spatial queries. Snowflake tables have been missing this optimization, leading to inconsistent query performance between the two platforms.

Changes

Added an add_clustering() method to the SnowflakeConnection class; it applies CLUSTER BY (BLOCK) to a table
Clustering is applied once at the end of upload_raster(), after all data and metadata are written
Clustering only applies to new tables (not when appending to existing tables via append_records)
Errors are handled gracefully if the table is already clustered or the clustering statement fails
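
As a rough illustration of the first and last points, the method could look something like the sketch below. This is not the code from the PR: it is written as a standalone function rather than a SnowflakeConnection method, and build_cluster_sql, the cursor argument, and the boolean return value are assumptions made for the example. It only assumes the cursor exposes an execute() method, as Snowflake connector cursors do.

```python
import logging

logger = logging.getLogger(__name__)


def build_cluster_sql(fqn_table: str, column: str = "BLOCK") -> str:
    """Return the ALTER TABLE statement that sets the clustering key.

    Hypothetical helper, not part of the PR.
    """
    return f"ALTER TABLE {fqn_table} CLUSTER BY ({column})"


def add_clustering(cursor, fqn_table: str, column: str = "BLOCK") -> bool:
    """Try to cluster the table; log and return False instead of raising."""
    try:
        cursor.execute(build_cluster_sql(fqn_table, column))
        return True
    except Exception as exc:
        # Clustering is an optimization, so a failure here should not
        # abort an upload that has otherwise succeeded.
        logger.warning("Could not cluster %s by %s: %s", fqn_table, column, exc)
        return False
```

Returning a flag rather than raising keeps the upload result independent of the optimization step; the real method may report failures differently.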

Technical Details

Why at the end of upload_raster()?

The clustering statement is intentionally placed after all data upload operations because:

upload_records() can be called multiple times when using chunk_size (batched uploads)
Only the first batch has overwrite=True; subsequent batches append data
Applying clustering multiple times would be inefficient and could cause errors
Clustering should only be applied once when creating a new table
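
The ordering described above can be sketched as follows. This is a simplified stand-in for upload_raster(), not its actual body: upload_in_chunks and its callable parameters are hypothetical, and the real function also writes metadata before clustering.

```python
from typing import Callable, Iterable


def upload_in_chunks(
    chunks: Iterable[list],
    upload_records: Callable[[list, bool], None],
    add_clustering: Callable[[], None],
    append_records: bool = False,
) -> None:
    """Upload every chunk, then cluster once if a new table was created."""
    first = True
    for chunk in chunks:
        # Only the first batch of a new table overwrites; later batches append.
        overwrite = first and not append_records
        upload_records(chunk, overwrite)
        first = False

    # (The real upload_raster() writes metadata at this point.)

    # Cluster exactly once, and only when the table was newly created.
    if not append_records:
        add_clustering()
```

Keeping the clustering call outside the loop means the number of ALTER TABLE statements does not depend on chunk_size.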

Snowflake Clustering Background

Unlike BigQuery, where clustering is configured at table creation time via the API, Snowflake's write_pandas() function does not support a clustering parameter. The only way to add clustering in Snowflake is through SQL:

CREATE TABLE ... CLUSTER BY (column) at creation time
ALTER TABLE ... CLUSTER BY (column) after creation

Since we use auto_create_table=True in write_pandas(), we apply clustering via ALTER TABLE after the initial data load.
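
A minimal sketch of that load-then-cluster sequence, assuming an open Snowflake connection and a pandas DataFrame. The function name load_then_cluster and the writer parameter are hypothetical and exist only so the example can be exercised without the connector installed; by default it calls write_pandas from snowflake.connector.pandas_tools. It also assumes the table name is upper-case, so the quoted name created by write_pandas matches the unquoted name in the ALTER TABLE statement.

```python
def load_then_cluster(conn, df, table_name, column="BLOCK", writer=None):
    """Create the table from a DataFrame, then set its clustering key."""
    if writer is None:
        # Imported lazily so the sketch can be read without the connector installed.
        from snowflake.connector.pandas_tools import write_pandas as writer

    # write_pandas creates the table but offers no way to set a clustering key.
    writer(conn, df, table_name, auto_create_table=True, overwrite=True)

    # So the key is added afterwards with plain SQL.
    cursor = conn.cursor()
    try:
        cursor.execute(f"ALTER TABLE {table_name} CLUSTER BY ({column})")
    finally:
        cursor.close()
```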

Testing

Verify clustering is applied on new table creation
Verify clustering is NOT applied when appending to existing tables
Verify query performance improvements on BLOCK column queries
Test with both chunked and non-chunked uploads

This change adds automatic clustering by the BLOCK column when creating new Snowflake tables, bringing Snowflake in line with BigQuery's existing clustering behavior for improved query performance on spatial lookups.

Changes:
- Added add_clustering() method to apply CLUSTER BY (BLOCK) to tables
- Clustering is applied once at the end of upload_raster() after all data and metadata are written
- Only applies to new tables (not when appending to existing tables)
- Gracefully handles errors if clustering already exists

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Alvarohf

LGTM

cayetanobv requested a review from vdelacruzb November 27, 2025 15:20

vdelacruzb approved these changes Nov 27, 2025


cayetanobv requested a review from Alvarohf November 27, 2025 16:37

cayetanobv marked this pull request as ready for review November 27, 2025 17:09

Alvarohf approved these changes Dec 1, 2025


cayetanobv merged commit 5087dee into main Dec 1, 2025
4 checks passed
