Announcing Data Clustering in Fabric Data Warehouse (Preview)
Data Clustering in Microsoft Fabric Data Warehouse, is a game-changing feature designed to supercharge your queries. Whether you are a data engineer, analyst, or IT professional, Data Clustering is here to help you unlock faster insights, optimize resource usage, and maximize the value of your data.
What is Data Clustering?
Data Clustering in Microsoft Fabric Data Warehouse is a technique that organizes and stores your data based on similarity. By grouping similar records together at the storage level, Data Clustering makes queries run faster and more efficiently.
The Data Warehouse engine maintains your Data Clustering tables automatically, just like it does with regular tables. As a result, queries that consume tables with Data Clustering will run much more efficiently, requiring less resources, reducing consumption usage, and making your queries run faster.
How Data Clustering Improves Query Efficiency & Speed
Retrieving data from cold storage is one of the most expensive (if not the most expensive) tasks for a database query. Large queries often require scanning massive amounts of data, so if the engine cannot prune irrelevant data early, it ends up reading far more data than necessary.
Imagine your queries scanning only the data that matters, with storage optimized for quick retrieval. Data Clustering makes this possible by keeping related rows close together, allowing the engine to skip irrelevant files and row groups for queries that use WHERE predicates. The result? Faster performance, lower compute costs, and reduced storage access, all thanks to more efficient storage.
The following illustrates an example of a query that uses a regular table (left) compared to a query that uses a table with Data Clustering (right). Note that, in this example, both tables have the exact same data, but the table on the right is ordered for better efficiency.

Data Clustering introduces a new syntax option to the CREATE TABLE statement: CLUSTER BY.
CREATE TABLE Bands ( ID BIGINT, Name VARCHAR(MAX), Genre VARCHAR(75), Country VARCHAR(56), DateFounded DATE ) WITH (CLUSTER BY (ID, DateFounded))
During data ingestion, Data Clustering applies a sophisticated algorithm to arrange rows with similar values in adjacent storage locations. This method offers advantages over typical clustering implementations that use standard lexicographical sorting, which arranges rows sequentially based on dictionary order beginning with Column A, followed by Column B, and continuing in that manner. When multiple columns are used in Data Clustering (you can use up to four columns), our algorithm considers the value of all columns involved to determine how to store a particular row in storage, keeping rows with similar values close together.
When the Warehouse engine receives a query, it uses column metadata to make smart decisions about which files to access for each query. The process happens automatically, with no extra steps required. By skipping files that do not contain data needed for your query, queries consume less compute resources, retrieve less data from storage, and run faster – especially for highly selective queries.
Example
As an example, consider a simple table Order items table that uses Data Clustering on the primary key column. We can observe how Data Clustering optimizes queries by comparing how it performs on a table that uses Data Clustering versus a query that uses an identical table, but does not use Data Clustering. This comparison can be made by investigating the exec_requests_history view of Query Insights. Using a query such as the following, you can check resource usage for the ‘Clustered’ and ‘Regular’ versions of your queries:
SELECT
label,
row_count,
total_elapsed_time_ms,
allocated_cpu_time_ms,
data_scanned_disk_mb + data_scanned_memory_mb + data_scanned_remote_storage_mb AS total_data_scanned_mb
FROM
queryinsights.exec_requests_history
WHERE
label IN ('Clustered','Regular')
In this example, this query produces the following results:

Note: both tables in this example (the Clustered and the Regular version) have the exact same data with almost sixty billion rows and approximately 1TB.
From these results, we can observe the following:
- Elapsed time: The clustered version of the query was twenty-four times faster than the regular version.
- Allocated CPU time: The clustered version used sixty-five times less CPU. This represents a huge cost reduction for this query.
- Total data scanned: The clustered version read only about 3.7% of the data the regular table had to scan.

The query we used in this example uses predicates that align well with the clustering column, and it is also a highly selective query – which is how it was able to yield the results observed on this chart.
How is this possible?
The key here is the total data scanned. Because of how the data is organized in the table that uses Data Clustering, how this particular query aligns its query predicate with the clustering column, and finally, due to the fact that this query is highly selective, the query engine was able to resolve this query by scanning just a fraction of the total data in the table. In this particular case, these factors helped achieve a massive reduction in the time to process the query as well as its cost.
What’s next?
Storage in Fabric Data Warehouse automatically maintains data efficiently in storage to improve performance, reliability, and scalability. We are working to make it even better and more automated every day. Data Clustering is yet another step we are taking to optimizing data layout in disk to align with your workload, and we will continue to invest in innovations to make these optimizations more and more automated, allowing you to focus on your data – not on managing it.
Learn More & Get Started!
Ready to experience the benefits of Data Clustering for yourself? Check out our in-depth documentation and hands-on tutorial to get started today. Dive into practical examples, explore best practices, and see how Data Clustering can help make your queries run faster and more efficiently.
Your feedback is important to us. As this feature continues to evolve in Preview, we invite you to share your experiences, report any issues, and suggest improvements. Help us make Fabric Data Warehouse even better for everyone!