In general, Snowflake produces well-clustered data in tables. However, over time, particularly as DML occurs on very large tables (as defined by the amount of data in the table, not the number of rows), the data in some table rows might no longer cluster optimally on desired dimensions.
To improve the clustering of the underlying table micro-partitions, you can always manually sort rows on key table columns and re-insert them into the table; however, performing these tasks can be cumbersome and expensive.
Instead, Snowflake supports automating these tasks by designating one or more table columns/expressions as a clustering key for the table. A table with a clustering key defined is considered to be clustered.
What is a Clustering Key?
A clustering key is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the table's data in the same micro-partitions. This is useful for very large tables whose natural clustering has degraded due to heavy DML, or whose ordering was not ideal at the time the data was loaded or inserted.
Some general indicators that can help determine whether to define a clustering key for a table include:
- Queries on the table are running slower than expected or have noticeably degraded over time.
- The clustering depth for the table is large.
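Clustering depth can be checked with the SYSTEM$CLUSTERING_DEPTH system function; the table and column names below are placeholders:

```sql
-- Average clustering depth across the table's micro-partitions
SELECT SYSTEM$CLUSTERING_DEPTH('my_table');

-- Clustering depth calculated for specific (hypothetical) columns
SELECT SYSTEM$CLUSTERING_DEPTH('my_table', '(invoice_date, region)');
```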
A clustering key can be defined at table creation (using the CREATE TABLE command) or afterward (using the ALTER TABLE command). The clustering key for a table can also be altered or dropped at any time.
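For example, each of these operations might look like the following (table and column names are illustrative):

```sql
-- Define a clustering key at table creation
CREATE TABLE sales (invoice_date DATE, region VARCHAR, amount NUMBER)
  CLUSTER BY (invoice_date, region);

-- Change the clustering key on an existing table
ALTER TABLE sales CLUSTER BY (invoice_date);

-- Drop the clustering key
ALTER TABLE sales DROP CLUSTERING KEY;
```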
Benefits of Defining Clustering Keys (for Very Large Tables)
Using a clustering key to co-locate similar rows in the same micro-partitions enables several benefits for very large tables, including:
- Improved scan efficiency in queries by skipping data that does not match filtering predicates.
- Better column compression than in tables with no clustering. This is especially true when other columns are strongly correlated with the columns that comprise the clustering key.
- After a key has been defined on a table, no additional administration is required, unless you choose to drop or modify the key. All future maintenance on the rows in the table (to ensure optimal clustering) is performed automatically by Snowflake.
Although clustering can substantially improve the performance and reduce the cost of some queries, the compute resources used to perform clustering consume credits. As a result, you should cluster only when doing so will significantly improve the performance of your queries.
Typically, queries benefit from clustering when the queries filter or sort on the clustering key for the table. Sorting is commonly done for ORDER BY operations, for GROUP BY operations, and for some joins. For example, the following join would likely cause Snowflake to perform a sort operation:
SELECT ... FROM my_table INNER JOIN my_materialized_view ON my_materialized_view.col1 = my_table.col1 ...
In this pseudo-example, Snowflake is likely to sort the values in either my_materialized_view.col1 or my_table.col1. For example, if the values in my_table.col1 are sorted, then as the materialized view is being scanned, Snowflake can quickly find the corresponding row in my_table.
The more frequently a table is queried, the more benefit clustering provides. However, the more frequently a table changes, the more expensive it will be to keep it clustered. Therefore, clustering is generally most cost-effective for tables that are queried frequently and do not change frequently.
Considerations for Choosing Clustering for a Table
Whether you want faster response times or lower overall costs, clustering is best for a table that meets all of the following criteria:
- The table contains a large number of micro-partitions. Typically, this means that the table contains multiple terabytes (TB) of data.
- The queries can take advantage of clustering. Typically, this means that one or both of the following are true:
- The queries are selective. In other words, the queries need to read only a small percentage of rows (and thus usually a small percentage of micro-partitions) in the table.
- The queries sort the data. (For example, the query contains an ORDER BY clause on the table.)
Before choosing to cluster a table, Snowflake strongly recommends that you test a representative set of queries on the table to establish some performance baselines.
Strategies for Selecting Clustering Keys
A single clustering key can contain one or more columns or expressions. For most tables, Snowflake recommends a maximum of 3 or 4 columns (or expressions) per key. Adding more than 3–4 columns tends to increase costs more than benefits.
Snowflake recommends prioritizing keys in the order below:
- Cluster columns that are most actively used in selective filters. For many fact tables involved in date-based queries (for example, “WHERE invoice_date > x AND invoice_date <= y”), choosing the date column is a good idea. For event tables, event type might be a good choice, if there are a large number of different event types. (If your table has only a small number of different event types, then see the comments on cardinality below before choosing an event column as a clustering key.)
- If there is room for additional cluster keys, then consider columns frequently used in join predicates, for example “FROM table1 JOIN table2 ON table2.column_A = table1.column_B”.
If you typically filter queries by two dimensions (e.g. application_id and user_status columns), then clustering on both columns can improve performance.
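For instance, a hypothetical table queried on both of those dimensions could be clustered as follows (names are illustrative):

```sql
-- Cluster on both columns used in the filter predicates
CREATE TABLE events (application_id NUMBER, user_status VARCHAR, payload VARIANT)
  CLUSTER BY (application_id, user_status);

-- A query that can prune micro-partitions using both clustering columns
SELECT * FROM events
  WHERE application_id = 101 AND user_status = 'active';
```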
The number of distinct values (i.e. cardinality) in a column/expression is a critical aspect of selecting it as a clustering key. It is important to choose a clustering key that has:
- A large enough number of distinct values to enable effective pruning on the table.
- A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.
A column with very low cardinality (such as one that stores only a person’s gender) might yield only minimal pruning. At the other extreme, a column with extremely high cardinality (such as one storing nanosecond timestamp values) is typically also not a good candidate to use as a clustering key directly.
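A common way to handle an extremely high cardinality column is to define the clustering key as an expression on the column that reduces the number of distinct values. For example, for a hypothetical table with a timestamp column:

```sql
-- Cluster on the date portion of the timestamp rather than the raw
-- nanosecond values, reducing cardinality to one value per day
CREATE TABLE events (ts TIMESTAMP_NTZ, payload VARIANT)
  CLUSTER BY (TO_DATE(ts));
```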
The data in a clustered table may become less clustered as DML operations (INSERT, UPDATE, DELETE, MERGE, COPY) are performed on it. Periodic/regular reclustering of the table is required to maintain optimal clustering.
During reclustering, Snowflake uses the clustering key for a clustered table to reorganize the column data, so that related records are relocated to the same micro-partition. Using the clustering key, the affected records are deleted and then re-inserted, grouped according to the key.
Credit and Storage Impact of Reclustering
Similar to all DML operations in Snowflake, reclustering consumes credits. The number of credits consumed depends on the size of the table and the amount of data that needs to be reclustered.
Reclustering also results in storage costs. Each time data is reclustered, the rows are physically grouped based on the clustering key for the table, which results in Snowflake generating new micro-partitions for the table. Adding even a small number of rows to a table can cause all micro-partitions that contain those values to be recreated.
This process can create significant data turnover because the original micro-partitions are marked as deleted, but retained in the system to enable Time Travel and Fail-safe. The original micro-partitions are purged only after both the Time Travel retention period and the subsequent Fail-safe period have passed (i.e. minimum of 8 days and up to 97 days for extended Time Travel, if you are using Snowflake Enterprise Edition (or higher)). This typically results in increased storage costs.
Building on the clustering diagram from the previous topic, this diagram illustrates how reclustering a table can help reduce scanning of micro-partitions to improve query performance:
- table t1 is naturally clustered by date across micro-partitions 1-4.
- The query (in the diagram) requires scanning micro-partitions 1, 2, and 3.
- date and type are defined as the clustering key. When the table is reclustered, new micro-partitions (5-8) are created.
- After reclustering, the same query only scans micro-partition 5.
In addition, after reclustering:
- Micro-partition 5 has reached a constant state (i.e. it cannot be improved by reclustering) and is therefore excluded when computing depth and overlap for future maintenance. In a well-clustered large table, most micro-partitions will fall into this category.
- The original micro-partitions (1-4) are marked as deleted, but are not purged from the system; they are retained for Time Travel and Fail-safe.
Defining Clustered Tables
Calculating the Clustering Information for a Table
Use the system function SYSTEM$CLUSTERING_INFORMATION to calculate clustering details, including clustering depth, for a given table. This function can be run on any column(s) in any table, regardless of whether the table has an explicit clustering key:
- If a table has an explicit clustering key, the function doesn’t require any input arguments other than the name of the table.
- If a table doesn’t have an explicit clustering key (or a table has a clustering key, but you want to calculate clustering information on other columns in the table), the function takes the desired column(s) as an additional input argument.
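Both usages look like the following (the table and column names are placeholders):

```sql
-- Table with an explicit clustering key: only the table name is needed
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table');

-- No clustering key (or checking other columns): pass the columns explicitly
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(col1, col3)');
```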
Defining a Clustering Key for a Table
A clustering key can be defined when a table is created by appending a CLUSTER BY clause to CREATE TABLE:
CREATE TABLE <name> ... CLUSTER BY ( <expr1> [ , <expr2> ... ] )
Where each clustering key consists of one or more table columns/expressions, which can be of any data type, except GEOGRAPHY, VARIANT, OBJECT, or ARRAY. A clustering key can contain any of the following:
- Base columns.
- Expressions on base columns.
- Expressions on paths in VARIANT columns.
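Illustrative examples of each (table and column names are hypothetical):

```sql
-- Base column
CREATE TABLE t1 (c1 DATE, c2 STRING, c3 NUMBER) CLUSTER BY (c1);

-- Expression on a base column
CREATE TABLE t2 (c1 TIMESTAMP, c2 STRING) CLUSTER BY (TO_DATE(c1));

-- Expression on a path in a VARIANT column
CREATE TABLE t3 (v VARIANT) CLUSTER BY (v:"Data":id::NUMBER);
```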