Skip to content

Make use of clustering #6

@bernard-wagner

Description

@bernard-wagner

Was playing around with the BigQuery public-datasets populated using ETL. Was wondering if it has been investigated whether it will be possible to add clustering in addition to partitioning. Would really reduce the processing costs if queries are structured properly.

https://cloud.google.com/bigquery/docs/clustered-tables

Although it possible to add clustering to an existing table, only new data will be clustered and have reduced processing costs:

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Instead, the data can be copied to a new table that has clustering enabled.

Example

As an example, the blocks table can get migrated using the following SQL script:

CREATE TABLE
   `bigquery-public-data.crypto_ethereum.blocks-cluster`
 PARTITION BY
   TIMESTAMP_TRUNC(`timestamp`, DAY)
 CLUSTER BY
   `number`, `hash` AS
 SELECT
   *
 FROM
   `bigquery-public-data.crypto_ethereum.blocks`

Thereafter the tables can be renamed to swap out the tables and delete the originals.

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks` RENAME TO `blocks-legacy`

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks-cluster` RENAME TO `blocks`

Benchmark

As a comparison, I created a copy of the blocks table and queried by number on the original table and the copy with clustering:

Original

Screenshot 2022-02-27 at 15 58 49

Clustering

Screenshot 2022-02-27 at 15 58 25

Comparison

Looking at the results, the overall amount of data processed was reduced from 14.7 GB to 6.9 MB.

Structure

The major design decision that will need to be made is which columns to cluster on and in what order. It is recommended to keep the order of clustered columns the same as the schema as it simplifies writing queries. Clustering only works if filters (WHERE) are applied in order of clustering.

Clustering has a minor storage overhead, but the benefits generally exceed the disadvantages and as many columns as possible should make use of clustering.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions