@cayetanobv

Summary

This PR adds automatic clustering by the BLOCK column when creating new Snowflake tables, bringing Snowflake in line with BigQuery's existing clustering behavior for improved query performance.

Background

Currently, BigQuery tables are created with explicit clustering on the BLOCK column (via clustering_fields=["block"] in LoadJobConfig), which optimizes spatial queries. Snowflake tables were missing this optimization, leading to inconsistent query performance between the two platforms.

Changes

  • Added add_clustering() method to SnowflakeConnection class that applies CLUSTER BY (BLOCK) to tables
  • Clustering is applied once at the end of upload_raster() after all data and metadata are written
  • Only applies to new tables (not when appending to existing tables via append_records)
  • Graceful error handling if clustering already exists or fails
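
As a rough illustration of the behavior described above (not the code merged in this PR), a method like this could be sketched as follows. It assumes a DB-API style cursor with an execute() method, as the Snowflake Python connector provides. The signature, the logging, and the choice to catch a broad Exception are assumptions made for the example.

```python
import logging

logger = logging.getLogger(__name__)


def add_clustering(cursor, fqn, column="BLOCK"):
    """Apply CLUSTER BY (<column>) to an existing table.

    Hypothetical signature: `cursor` is any object with execute(sql) and
    `fqn` is the fully qualified table name. Returns True on success.
    """
    sql = f"ALTER TABLE {fqn} CLUSTER BY ({column})"
    try:
        cursor.execute(sql)
        return True
    except Exception as exc:
        # Assumption: a clustering failure should be logged, not abort the upload.
        logger.warning("Could not cluster %s by %s: %s", fqn, column, exc)
        return False
```

Returning a boolean instead of raising keeps a failed ALTER TABLE from undoing an otherwise successful upload, which is one way to read "graceful error handling" here.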

Technical Details

Why at the end of upload_raster()?

The clustering statement is intentionally placed after all data upload operations because:

  1. upload_records() can be called multiple times when using chunk_size (batched uploads)
  2. Only the first batch has overwrite=True; subsequent batches append data
  3. Applying clustering multiple times would be inefficient and could cause errors
  4. Clustering should only be applied once when creating a new table
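
The ordering argued for in the list above can be sketched as a small control-flow example. Every name here is a hypothetical stand-in: upload_records(batch, overwrite) represents one batched write and add_clustering() represents the ALTER TABLE call. The real upload_raster() does more than this.

```python
def upload_in_chunks(records, chunk_size, upload_records, add_clustering, append=False):
    """Upload `records` in batches, then cluster once if the table is new.

    Hypothetical helper: `upload_records(batch, overwrite)` writes one batch
    and `add_clustering()` issues the clustering statement.
    """
    for start in range(0, len(records), chunk_size):
        batch = records[start:start + chunk_size]
        # Only the first batch of a fresh table overwrites; the rest append.
        overwrite = (start == 0) and not append
        upload_records(batch, overwrite)
    # Metadata would be written here, before clustering.
    if not append:
        # Runs exactly once, after all data is in place.
        add_clustering()
```

With five records and chunk_size=2, this performs three uploads (the first with overwrite=True) followed by a single clustering call. With append=True it performs no clustering call at all.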

Snowflake Clustering Background

In BigQuery, clustering is configured at table creation time via the API. Snowflake's write_pandas() function has no clustering parameter, so the only way to add clustering in Snowflake is through SQL:

  • CREATE TABLE ... CLUSTER BY (column) at creation time
  • ALTER TABLE ... CLUSTER BY (column) after creation

Since we use auto_create_table=True in write_pandas(), we apply clustering via ALTER TABLE after the initial data load.
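
A minimal sketch of that load-then-alter sequence, assuming a Snowflake connection object with a cursor() method. The helper name load_then_cluster and its write_fn parameter are hypothetical and exist only so the sketch can be exercised without a live connection. The example also assumes table_name is an unquoted, uppercase identifier so that write_pandas() and the ALTER TABLE statement resolve to the same table.

```python
def load_then_cluster(conn, df, table_name, write_fn=None):
    """Create and fill a table with write_pandas, then add a clustering key."""
    if write_fn is None:
        # Imported lazily so the sketch can be read without the connector installed.
        from snowflake.connector.pandas_tools import write_pandas
        write_fn = write_pandas
    # auto_create_table=True lets write_pandas issue the CREATE TABLE itself,
    # which leaves no place to put a CLUSTER BY clause.
    write_fn(conn, df, table_name, auto_create_table=True, overwrite=True)
    # The clustering key is therefore added afterwards with ALTER TABLE.
    cur = conn.cursor()
    try:
        cur.execute(f"ALTER TABLE {table_name} CLUSTER BY (BLOCK)")
    finally:
        cur.close()
```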

Testing

  • Verify clustering is applied on new table creation
  • Verify clustering is NOT applied when appending to existing tables
  • Verify query performance improvements on BLOCK column queries
  • Test with both chunked and non-chunked uploads
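
One possible way to check the first two items is to ask Snowflake which clustering key it reports for the table. The helper below is a hypothetical sketch, not part of this PR. It relies on SYSTEM$CLUSTERING_INFORMATION, which returns a JSON string; treating the cluster_by_keys field as a string that contains the column name is an assumption about the output format.

```python
import json


def clustering_keys(cursor, fqn):
    """Return the clustering key string Snowflake reports for `fqn`.

    Hypothetical helper. Called without a column list, the function describes
    the table's own clustering key; for a table with no clustering key the
    query is expected to fail instead of returning a row.
    """
    cursor.execute(f"SELECT SYSTEM$CLUSTERING_INFORMATION('{fqn}')")
    row = cursor.fetchone()
    info = json.loads(row[0])
    return info.get("cluster_by_keys", "")
```

A test for a newly created table could then assert that "BLOCK" appears in the returned string, and a test for the append path could compare the value before and after the append.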

Related

This aligns Snowflake's behavior with BigQuery's implementation in raster_loader/io/bigquery.py:91.

🤖 Generated with Claude Code

This change adds automatic clustering by the BLOCK column when creating
new Snowflake tables, bringing Snowflake in line with BigQuery's existing
clustering behavior for improved query performance on spatial lookups.

Changes:
- Added add_clustering() method to apply CLUSTER BY (BLOCK) to tables
- Clustering is applied once at the end of upload_raster() after all data
  and metadata are written
- Only applies to new tables (not when appending to existing tables)
- Gracefully handles errors if clustering already exists

Co-Authored-By: Claude <[email protected]>
@cayetanobv cayetanobv requested a review from Alvarohf November 27, 2025 16:37
@cayetanobv cayetanobv marked this pull request as ready for review November 27, 2025 17:09
@Alvarohf Alvarohf left a comment

LGTM

@cayetanobv cayetanobv merged commit 5087dee into main Dec 1, 2025
4 checks passed