Boost performance effortlessly with Automated Table Statistics in Microsoft Fabric
We’re thrilled to introduce Automated Table Statistics in Microsoft Fabric Data Engineering — a major upgrade that helps you get blazing-fast query performance with zero manual effort.
Whether you’re running complex joins, large aggregations, or heavy filtering workloads, Fabric’s new automated statistics will help Spark make smarter decisions, saving you time, compute, and money.
What are Automated Table Statistics?
In simple terms, table statistics are summary metrics about your data that Spark uses to optimize queries. These include:
- Total row counts.
- Null counts per column.
- Minimum and maximum values per column.
- Distinct value counts per column.
- Average and maximum column lengths.
Until now, generating these statistics required running manual commands (ANALYZE TABLE) or setting up custom pipelines. With this release, Fabric collects them automatically when you create a new Delta table, with no setup or tuning required.
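To see the kind of routine this replaces, here is roughly what the manual approach looked like in a notebook (Scala; the sales table and total_amount column are hypothetical):
// The old way: recompute column statistics by hand after each load...
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS total_amount")
// ...then inspect the results (min, max, num_nulls, distinct_count, avg/max length).
spark.sql("DESCRIBE EXTENDED sales total_amount").show(false)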
Why does it matter?
Spark’s Cost-Based Optimizer (CBO) relies on good statistics to:
- Choose the best join strategy (e.g., broadcast vs. shuffle joins).
- Prune partitions effectively.
- Reduce unnecessary data shuffles.
- Optimize aggregations.
Without accurate stats, Spark can only guess, often leading to slow or expensive query plans. With automated stats, Spark becomes much smarter — and in our internal benchmarks, we’ve seen up to ~45% performance gains on complex workloads.
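You can inspect these estimates yourself: in Scala, explain("cost") prints the optimized plan annotated with the size and row-count figures the CBO works from (the orders and customers tables are hypothetical):
// Join two tables and print the plan with cost estimates
// (sizeInBytes and, when statistics are available, rowCount).
val joined = spark.table("orders").join(spark.table("customers"), "customer_id")
joined.explain("cost")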
How does it work?
- Enabled by default on all new Delta tables (see the sketch after this list).
- Tracks the first 32 columns (including nested columns).
- Stores stats separately in Parquet format to avoid bloating data files.
- Fully integrated with Fabric Spark 3.5 and later.
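In practice that means no extra steps. A minimal sketch, assuming a Fabric notebook session (the orders table name is hypothetical), is just an ordinary Delta write:
// Writing a new Delta table is all it takes; statistics for the
// first 32 columns are gathered automatically as part of the write.
spark.range(1000).toDF("order_id")
  .write.format("delta")
  .saveAsTable("orders")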
You can also fine-tune behavior with configurations:
Enable/disable stats collection:
spark.conf.set("spark.microsoft.delta.stats.collect.extended", "true")
Enable/disable optimizer injection:
spark.conf.set("spark.microsoft.delta.stats.injection.enabled", "true")
How to Check or Recompute Statistics
Want to see what’s under the hood? Fabric lets you inspect collected stats and recompute them if needed:
Check current stats (Scala):
println(spark.read.table("tableName").queryExecution.optimizedPlan.stats)
Recompute after schema changes:
StatisticsStore.recomputeStatisticsWithCompaction(spark, "tableName")
For full control, you can also use the familiar ANALYZE TABLE command.
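For example, to recompute statistics for every column on demand (tableName is a placeholder):
spark.sql("ANALYZE TABLE tableName COMPUTE STATISTICS FOR ALL COLUMNS")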
Important limitations to know
As with any advanced feature, it’s important to understand the current boundaries:
- Stats are collected only at write time.
- Updates written by other engines won't be reflected in the statistics.
- Only the first 32 columns are tracked.
- Recomputing stats may require rewriting the table.
- No fallback: in rare cases, the collected stats may degrade query performance.
We’re actively working on expanding support — including improvements for existing tables and better recompute workflows.
Learn More
To learn more about automated statistics for tables, please visit the Configure and manage Automated Table Statistics in Fabric Spark documentation.