Make use of clustering

Was playing around with the BigQuery public-datasets populated using ETL. Was wondering if it has been investigated whether it will be possible to add clustering in addition to partitioning. Would really reduce the processing costs if queries are structured properly. 

https://cloud.google.com/bigquery/docs/clustered-tables

Although it possible to add clustering to an existing table, only new data will be clustered and have reduced processing costs:

https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec

Instead, the data can be copied to a new table that has clustering enabled. 

## Example

As an example, the `blocks` table can get migrated using the following SQL script:

```
CREATE TABLE
   `bigquery-public-data.crypto_ethereum.blocks-cluster`
 PARTITION BY
   TIMESTAMP_TRUNC(`timestamp`, DAY)
 CLUSTER BY
   `number`, `hash` AS
 SELECT
   *
 FROM
   `bigquery-public-data.crypto_ethereum.blocks`
```

Thereafter the tables can be renamed to swap out the tables and delete the originals.

```
ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks` RENAME TO `blocks-legacy`

ALTER TABLE `bigquery-public-data:crypto_ethereum.blocks-cluster` RENAME TO `blocks`
```

## Benchmark

As a comparison, I created a copy of the `blocks` table and queried by `number` on the original table and the copy with clustering:

### Original

<img width="756" alt="Screenshot 2022-02-27 at 15 58 49" src="https://user-images.githubusercontent.com/4208881/155885421-b658fb7a-1a9f-4a69-a36f-31c2ffe04eaa.png">

### Clustering

<img width="753" alt="Screenshot 2022-02-27 at 15 58 25" src="https://user-images.githubusercontent.com/4208881/155885434-f8498bf2-2fec-483c-833b-d0bfc5b49b8f.png">

### Comparison

Looking at the results, the overall amount of data processed was reduced from 14.7 GB to 6.9 MB. 

## Structure

The major design decision that will need to be made is which columns to cluster on and in what order. It is recommended to keep the order of clustered columns the same as the schema as it simplifies writing queries. Clustering only works if filters (`WHERE`) are applied in order of clustering. 

Clustering has a minor storage overhead, but the benefits generally exceed the disadvantages and as many columns as possible should make use of clustering.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make use of clustering #6

Example

Benchmark

Original

Clustering

Comparison

Structure

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Make use of clustering #6

Description

Example

Benchmark

Original

Clustering

Comparison

Structure

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions