Traditional data warehouses statically partition large tables in order to scale more effectively and deliver acceptable performance. In these systems, a partition is a unit of management that can be manipulated independently using specialised DDL and syntax. Static partitioning, however, has a number of well-known drawbacks, including maintenance overhead and data skew, which can produce partitions of disproportionate size.
In contrast to a data warehouse, the Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that delivers all the advantages of static partitioning without the known limitations, as well as providing additional significant benefits.
What are Micro-partitions?
Snowflake tables automatically partition all of their data into contiguous storage units called micro-partitions. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (the actual size in Snowflake is smaller because data is always stored compressed). Rows in tables are mapped into individual micro-partitions, organised in a columnar fashion. This size and structure enables extremely granular pruning of very large tables, which can comprise millions, or even hundreds of millions, of micro-partitions.
Snowflake stores metadata about all rows stored in a micro-partition, including:
- The range of values for each of the columns in the micro-partition.
- The number of distinct values.
- Additional properties used for both optimization and efficient query processing.
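To make the role of this metadata concrete, the following sketch models the per-column statistics a micro-partition might carry and how a range predicate can skip a partition without reading its data. The structure and field names are purely illustrative assumptions, not Snowflake's internal format:

```python
from dataclasses import dataclass

# Hypothetical sketch of per-column micro-partition metadata;
# field names are illustrative, not Snowflake's internals.
@dataclass
class ColumnMetadata:
    min_value: object    # lowest value of the column in this micro-partition
    max_value: object    # highest value of the column in this micro-partition
    distinct_count: int  # number of distinct values of the column

def can_prune(meta: ColumnMetadata, lo, hi) -> bool:
    """Return True if a predicate `lo <= col <= hi` cannot match any row
    in the micro-partition, so the whole partition can be skipped."""
    return hi < meta.min_value or lo > meta.max_value

date_meta = ColumnMetadata(min_value="2024-03-01",
                           max_value="2024-03-07",
                           distinct_count=7)
print(can_prune(date_meta, "2024-06-01", "2024-06-30"))  # True: disjoint range, skip
print(can_prune(date_meta, "2024-03-05", "2024-03-10"))  # False: ranges overlap, must scan
```

The key point is that the decision uses only the small metadata record, never the partition's actual data.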
Benefits of Micro-partitioning
The benefits of Snowflake’s approach to partitioning table data include:
- In contrast to traditional static partitioning, Snowflake micro-partitions are derived automatically; users are not required to create or maintain them.
- Micro-partitions are small in size (50 to 500 MB, before compression), which enables extremely efficient DML and fine-grained pruning for faster queries.
- Micro-partitions are uniformly small in size and can overlap in their range of values, which helps prevent skew.
- Columns are stored independently within micro-partitions, often referred to as columnar storage. This enables efficient scanning of individual columns, because only the columns a query references are actually scanned.
- Columns are also compressed individually within micro-partitions. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.
Impact of Micro-partitions
All DML operations (e.g. DELETE, UPDATE, MERGE) take advantage of the underlying micro-partition metadata to facilitate and simplify table maintenance. For example, some operations, such as deleting all rows from a table, are metadata-only operations.
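A minimal sketch of why deleting all rows can be metadata-only, assuming (as an illustration, not Snowflake's actual catalog design) that a table is just a list of references to immutable micro-partition files:

```python
# Hypothetical model: a table is a catalog entry pointing at immutable
# micro-partition files. Deleting every row only drops the references;
# no partition file is read or rewritten.
table = {"name": "events", "partition_refs": ["mp_001", "mp_002", "mp_003"]}

def delete_all_rows(tbl):
    # Metadata-only operation: update the catalog entry, touch no data.
    tbl["partition_refs"] = []
    return tbl

delete_all_rows(table)
print(table["partition_refs"])  # []
```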
The micro-partition metadata maintained by Snowflake enables precise pruning of columns in micro-partitions at query run-time, including columns containing semi-structured data. In other words, a query that specifies a filter predicate on a range of values that accesses 10% of the values in the range should ideally only scan 10% of the micro-partitions.
For example, assume a large table contains one year of historical data with date and hour columns. Assuming uniform distribution of the data, a query targeting a particular hour would ideally scan 1/8760th of the micro-partitions in the table and then only scan the portion of the micro-partitions that contain the data for the hour column; Snowflake uses columnar scanning of partitions so that an entire partition is not scanned if a query only filters by one column.
In other words, the closer the ratio of scanned micro-partitions and columnar data is to the ratio of the data actually selected, the more efficient the pruning performed on the table.
For time-series data, this level of pruning enables potentially sub-second response times for queries within ranges (i.e. “slices”) as fine-grained as one hour or even less.
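The arithmetic behind the one-year example above can be sketched as follows, under the idealized assumption that rows arrive in time order so each hour's data is concentrated in a contiguous set of micro-partitions:

```python
# Back-of-the-envelope pruning math for the one-year, hour-filter example.
hours_per_year = 365 * 24             # = 8760 candidate "slices"
fraction_scanned = 1 / hours_per_year # ideal fraction of partitions touched
print(hours_per_year)                 # 8760

# With, say, 1,000,000 micro-partitions (a hypothetical table size),
# an ideal one-hour filter touches only:
num_partitions = 1_000_000
print(round(num_partitions * fraction_scanned))  # 114 micro-partitions
```

Real tables rarely achieve this ideal, but it shows why pruning scales so well: the scanned fraction depends on the selectivity of the filter, not the total table size.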
Not all predicate expressions can be used to prune. For example, Snowflake does not prune micro-partitions based on a predicate with a subquery, even if the subquery results in a constant.
What is Data Clustering?
Table data is typically sorted or organised along natural dimensions (such as dates and/or regions). This “clustering” is important because queries can perform worse when table data is incompletely or incorrectly sorted, especially for very large tables.
In Snowflake, as data is inserted or loaded into a table, clustering metadata is collected and recorded for each micro-partition created during the process. Snowflake then leverages this clustering information to avoid unnecessary scanning of micro-partitions during querying, significantly improving the performance of queries that reference these columns.
As an example, consider a table of 24 rows stored across four micro-partitions, with the rows divided equally among them. Within each micro-partition, the data is sorted and stored by column, which enables Snowflake to perform the following actions for queries on the table:
- First, prune micro-partitions that are not needed for the query.
- Then, prune by column within the remaining micro-partitions.
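The two steps above can be simulated on the illustrative 24-row, four-micro-partition layout. The data values and structures below are hypothetical (chosen only to make the example concrete), not Snowflake's storage format:

```python
# Toy columnar layout: each micro-partition stores its columns independently.
partitions = [
    {"id":   [1, 2, 3, 4, 5, 6],
     "date": ["11/2", "11/2", "11/2", "11/3", "11/3", "11/3"]},
    {"id":   [7, 8, 9, 10, 11, 12],
     "date": ["11/3", "11/3", "11/4", "11/4", "11/4", "11/4"]},
    {"id":   [13, 14, 15, 16, 17, 18],
     "date": ["11/4", "11/5", "11/5", "11/5", "11/5", "11/5"]},
    {"id":   [19, 20, 21, 22, 23, 24],
     "date": ["11/5", "11/5", "11/5", "11/5", "11/5", "11/5"]},
]

def query(predicate_col, value, select_col):
    # Step 1: prune micro-partitions whose min/max range excludes the value.
    survivors = [p for p in partitions
                 if min(p[predicate_col]) <= value <= max(p[predicate_col])]
    # Step 2: within the survivors, scan only the referenced columns.
    rows = []
    for p in survivors:
        for pred, sel in zip(p[predicate_col], p[select_col]):
            if pred == value:
                rows.append(sel)
    return len(survivors), rows

scanned, ids = query("date", "11/2", "id")
print(scanned, ids)  # 1 [1, 2, 3] -- only one of four partitions is read
```

Because the dates are well clustered, a filter on `"11/2"` reads one micro-partition instead of four, and within it only the two referenced columns.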
The clustering depth for a populated table measures the average depth (1 or greater) of the overlapping micro-partitions for specified columns in a table. The smaller the average depth, the better clustered the table is with regard to the specified columns.
Clustering depth can be used for a variety of purposes, including:
- Monitoring the clustering “health” of a large table, particularly over time as DML is performed on the table.
- Determining whether a large table would benefit from explicitly defining a clustering key.
A table with no micro-partitions (i.e. an unpopulated/empty table) has a clustering depth of 0.
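A simplified model of clustering depth can be sketched as follows, assuming each micro-partition is summarized by the (min, max) range of the clustering column. Snowflake computes depth internally; this counts, for each partition, how many partitions' ranges overlap its own (itself included) and averages the result, which is only an illustration of the concept:

```python
def clustering_depth(ranges):
    """Illustrative clustering depth over a list of (min, max) ranges.
    An empty table has depth 0; perfectly disjoint ranges have depth 1."""
    if not ranges:
        return 0.0
    depths = []
    for lo, hi in ranges:
        # Count partitions whose range overlaps [lo, hi], including itself.
        overlapping = sum(1 for lo2, hi2 in ranges if lo2 <= hi and lo <= hi2)
        depths.append(overlapping)
    return sum(depths) / len(depths)

# Five partitions that all span A-Z overlap one another completely: depth 5.
print(clustering_depth([("A", "Z")] * 5))                      # 5.0
# Disjoint ranges never overlap: depth 1, the "constant" state.
print(clustering_depth([("A", "E"), ("F", "J"), ("K", "Z")]))  # 1.0
# An unpopulated table has no micro-partitions: depth 0.
print(clustering_depth([]))                                    # 0.0
```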
Clustering Depth Illustrated
The following diagram provides a conceptual example of a table consisting of five micro-partitions with values ranging from A to Z, and illustrates how overlap affects clustering depth:
This diagram illustrates:
- At the beginning, the range of values in all the micro-partitions overlap.
- As the number of overlapping micro-partitions decreases, the overlap depth decreases.
- When there is no overlap in the range of values across all micro-partitions, the micro-partitions are considered to be in a constant state (i.e. they cannot be improved by further clustering).
Monitoring Clustering Information for Tables
To view/monitor the clustering metadata for a table, Snowflake provides the following system functions:
- SYSTEM$CLUSTERING_INFORMATION (including clustering depth)