Microsoft Fabric Updates Blog

Announcing Data Clustering in Fabric Data Warehouse (Preview)

Data Clustering in Microsoft Fabric Data Warehouse, is a game-changing feature designed to supercharge your queries. Whether you are a data engineer, analyst, or IT professional, Data Clustering is here to help you unlock faster insights, optimize resource usage, and maximize the value of your data.

What is Data Clustering?

Data Clustering in Microsoft Fabric Data Warehouse is a technique that organizes and stores your data based on similarity. By grouping similar records together at the storage level, Data Clustering makes queries run faster and more efficiently.

The Data Warehouse engine maintains your Data Clustering tables automatically, just like it does with regular tables. As a result, queries that consume tables with Data Clustering will run much more efficiently, requiring less resources, reducing consumption usage, and making your queries run faster.

How Data Clustering Improves Query Efficiency & Speed

Retrieving data from cold storage is one of the most expensive (if not the most expensive) tasks for a database query. Large queries often require scanning massive amounts of data, so if the engine cannot prune irrelevant data early, it ends up reading far more data than necessary.

Imagine your queries scanning only the data that matters, with storage optimized for quick retrieval. Data Clustering makes this possible by keeping related rows close together, allowing the engine to skip irrelevant files and row groups for queries that use WHERE predicates. The result? Faster performance, lower compute costs, and reduced storage access, all thanks to more efficient storage.

The following illustrates an example of a query that uses a regular table (left) compared to a query that uses a table with Data Clustering (right). Note that, in this example, both tables have the exact same data, but the table on the right is ordered for better efficiency.

Side-by-side comparison of two Microsoft Fabric query windows. The left panel, labeled “Regular table,” shows a SQL query running against a standard table. The right panel, labeled “Data Clustering,” displays a similar query running against a clustered table. Both panels include the Explorer navigation with schemas and queries listed. A timer reading “00:00.00” appears below each panel, indicating the start of query execution for performance comparison. The Regular query ends at 26.64 seconds and the Data Clustering query ends at 6.66 seconds
Comparing execution of a query with a regular table versus a table with Data Clustering

Data Clustering introduces a new syntax option to the CREATE TABLE statement: CLUSTER BY.

CREATE TABLE Bands ( 
     ID BIGINT, 
     Name VARCHAR(MAX), 
     Genre VARCHAR(75), 
     Country VARCHAR(56), 
     DateFounded DATE 
) WITH (CLUSTER BY (ID, DateFounded))

During data ingestion, Data Clustering applies a sophisticated algorithm to arrange rows with similar values in adjacent storage locations. This method offers advantages over typical clustering implementations that use standard lexicographical sorting, which arranges rows sequentially based on dictionary order beginning with Column A, followed by Column B, and continuing in that manner. When multiple columns are used in Data Clustering (you can use up to four columns), our algorithm considers the value of all columns involved to determine how to store a particular row in storage, keeping rows with similar values close together.

When the Warehouse engine receives a query, it uses column metadata to make smart decisions about which files to access for each query. The process happens automatically, with no extra steps required. By skipping files that do not contain data needed for your query, queries consume less compute resources, retrieve less data from storage, and run faster – especially for highly selective queries.

Example

As an example, consider a simple table Order items table that uses Data Clustering on the primary key column. We can observe how Data Clustering optimizes queries by comparing how it performs on a table that uses Data Clustering versus a query that uses an identical table, but does not use Data Clustering. This comparison can be made by investigating the exec_requests_history view of Query Insights. Using a query such as the following, you can check resource usage for the ‘Clustered’ and ‘Regular’ versions of your queries:

SELECT 
     label, 
     row_count,
     total_elapsed_time_ms,
     allocated_cpu_time_ms,
     data_scanned_disk_mb + data_scanned_memory_mb + data_scanned_remote_storage_mb AS total_data_scanned_mb 
FROM 
     queryinsights.exec_requests_history 
WHERE 
     label IN ('Clustered','Regular')

In this example, this query produces the following results:

Screenshot of a query results table with two rows labeled “Clustered” and “Regular.” Both rows have the same row count of 2,526. Three key performance columns are highlighted in red:
total_elapsed_time_ms: Clustered = 1,909 ms; Regular = 45,742 ms. allocated_cpu_time_ms: Clustered = 39,278 ms; Regular = 2,551,199 ms. total_data_scanned_mb: Clustered = 3,840.909 MB; Regular = 106,794.899 MB.
The image illustrates that clustered tables dramatically reduce elapsed time, CPU time, and data scanned compared to regular tables.
Results from exec_requests_history

Note: both tables in this example (the Clustered and the Regular version) have the exact same data with almost sixty billion rows and approximately 1TB.

From these results, we can observe the following:

  • Elapsed time: The clustered version of the query was twenty-four times faster than the regular version.
  • Allocated CPU time: The clustered version used sixty-five times less CPU. This represents a huge cost reduction for this query.
  • Total data scanned: The clustered version read only about 3.7% of the data the regular table had to scan.
Bar chart comparing query performance for a regular table versus a Data Clustering table in Microsoft Fabric. Three metrics are shown: Elapsed Time: Regular table bar is much taller; green text indicates “24x faster” for clustered table. Allocated CPU Time: Regular table bar is significantly taller; green text reads “65x less CPU” for clustered table. Total Data Scanned: Regular table bar reaches 100%; clustered table bar is very small at 3.7%. Legend shows gray for Regular and green for Clustered.
Comparing query execution of a regular table versus a data clustering table

The query we used in this example uses predicates that align well with the clustering column, and it is also a highly selective query – which is how it was able to yield the results observed on this chart.

How is this possible?

The key here is the total data scanned. Because of how the data is organized in the table that uses Data Clustering, how this particular query aligns its query predicate with the clustering column, and finally, due to the fact that this query is highly selective, the query engine was able to resolve this query by scanning just a fraction of the total data in the table. In this particular case, these factors helped achieve a massive reduction in the time to process the query as well as its cost.

What’s next?

Storage in Fabric Data Warehouse automatically maintains data efficiently in storage to improve performance, reliability, and scalability. We are working to make it even better and more automated every day. Data Clustering is yet another step we are taking to optimizing data layout in disk to align with your workload, and we will continue to invest in innovations to make these optimizations more and more automated, allowing you to focus on your data – not on managing it.

Learn More & Get Started!

Ready to experience the benefits of Data Clustering for yourself? Check out our in-depth documentation and hands-on tutorial to get started today. Dive into practical examples, explore best practices, and see how Data Clustering can help make your queries run faster and more efficiently.

Your feedback is important to us. As this feature continues to evolve in Preview, we invite you to share your experiences, report any issues, and suggest improvements. Help us make Fabric Data Warehouse even better for everyone!

Related blog posts

Announcing Data Clustering in Fabric Data Warehouse (Preview)

December 18, 2025 by Jovan Popovic

Unlock Flexible Time-Based Reporting with DATE_BUCKET() in Microsoft Fabric DW! Microsoft Fabric Data Warehouse continues to evolve with powerful features that make analytics easier and more adaptable. One of the latest additions is the DATE_BUCKET() function—a game-changer for time-based reporting.

December 16, 2025 by Alex Powers

As 2025 ends, we’re taking a moment to reflect on Microsoft Fabric’s second year in the market and the collective progress made alongside our community, customers, and partners. What began as a unified vision for data and AI has grown into a platform adopted by more than 28,000 organizations worldwide, anchored by OneLake and shaped … Continue reading “Microsoft Fabric 2025 holiday recap: Unified Data and AI Innovation”