Microsoft Fabric Updates Blog

Introducing Optimized Compaction in Fabric Spark

End Write Amplification and Automate Your Table Maintenance

Compaction is one of the most necessary but also most challenging aspects of managing a Lakehouse architecture. As with file systems and even relational databases, unless closely managed, data becomes fragmented over time, which can lead to excessive compute costs. The OPTIMIZE command exists to solve this challenge: small files are grouped into bins targeting an ideal file size and then rewritten to blob storage. The result is the same data, contained in fewer, larger files.
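For context, this is the command most teams run today; the sketch below is a minimal illustration using a hypothetical table and partition column.

# A minimal sketch of on-demand compaction; the table and partition column are illustrative
spark.sql("OPTIMIZE lakehouse_demo.sales")
# Compaction can also be scoped to specific partitions to limit the rewrite
spark.sql("OPTIMIZE lakehouse_demo.sales WHERE ingest_date >= '2025-10-01'")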

However, imagine this scenario: you have a nightly OPTIMIZE job that keeps your tables, all under 1GB, nicely compacted. Upon inspecting the Delta table transaction log, you find that most of your data is being rewritten after every ELT cycle, leading to expensive OPTIMIZE jobs, even though you are only changing a small portion of the overall data each night. Meanwhile, as business requirements drive more frequent Delta table updates between ELT cycles, jobs appear to get slower and slower until the next scheduled OPTIMIZE job runs. Sound familiar?

If you’ve felt like OPTIMIZE is too slow, rewrites too much data, or in general should be automatically triggered, you’re not alone. We’re introducing three features that will transform the efficiency, efficacy, and performance impact of compaction jobs: Fast Optimize, File Level Compaction Targets, and Auto Compaction.

The Hidden Costs of Traditional Compaction

Traditional Delta table maintenance carries hidden costs that tend to compound over time:

Write Amplification: Files can get recompacted repeatedly as target file size configs change or as OPTIMIZE jobs produce suboptimal files that still qualify as ‘uncompacted’. Tables with files smaller than 1GB might be recompacted hundreds or even thousands of times over their lifetime, wasting compute resources and storage I/O.

Manual Intervention Required: Teams spend valuable time scheduling, monitoring, and troubleshooting compaction jobs instead of focusing on business logic.

Performance Degradation: Small files accumulate between maintenance windows, causing query performance to degrade until the next scheduled optimization.

Unpredictable Costs: Without intelligent short-circuiting, users must write their own logic to evaluate whether compaction is likely to be beneficial; otherwise, compaction jobs can run longer than expected, impacting both performance and cost predictability.

Fast Optimize: Skip the Suboptimal Work

Diagram showing that fast optimize adds additional checks to evaluate if a bin of files should be compacted.

Fast Optimize intelligently analyzes your Delta table’s files and short-circuits compaction operations that aren’t estimated to meaningfully improve performance.

Instead of blindly compacting files whenever more than one small file exists, Fast Optimize evaluates whether each candidate bin (group of small files) is estimated to meet your compaction goals or whether too many small files have accumulated. If the compaction job isn’t estimated to produce compacted files that meet the defined minimum target file size (spark.databricks.delta.optimize.minFileSize) and the bin doesn’t contain too many small files (spark.microsoft.delta.optimize.fast.minNumFiles), the operation short-circuits or reduces the compaction scope.
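As a rough mental model, the per-bin decision looks something like the sketch below. This is a simplified illustration of the behavior described above, not the actual implementation; the function and parameter names are hypothetical.

# Simplified illustration of the per-bin short-circuit decision (not the actual Fast Optimize code)
def should_compact_bin(file_sizes_in_bytes, min_file_size, min_num_files):
    estimated_output_size = sum(file_sizes_in_bytes)
    # Compact if the rewrite is expected to produce a file of meaningful size...
    if estimated_output_size >= min_file_size:
        return True
    # ...or if so many small files have piled up that skipping would hurt reads
    if len(file_sizes_in_bytes) >= min_num_files:
        return True
    # Otherwise short-circuit: the rewrite would just produce another small file
    return False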

While Fast Optimize is disabled by default in Runtime 1.3, Microsoft recommends enabling it at the session level:

spark.conf.set('spark.microsoft.delta.optimize.fast.enabled', True)
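Once the setting is in place, OPTIMIZE is run exactly as before; a minimal end-to-end sketch, with an illustrative table name:

# With Fast Optimize enabled for the session, run OPTIMIZE as usual
spark.conf.set('spark.microsoft.delta.optimize.fast.enabled', True)
spark.sql("OPTIMIZE lakehouse_demo.sales")   # short-circuits when compaction wouldn't help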

With the Fast Optimize session configuration enabled, all existing `OPTIMIZE` code paths are supported, with the following limitations:

  1. Liquid Clustering and Z-Order are not impacted by Fast Optimize.
  2. Auto Compaction has its own internal calculation and is not impacted by Fast Optimize, to prevent compounding logic that can make it difficult to discern why compaction was or was not run.

In a study mimicking a real-world scenario where `OPTIMIZE` was run at the end of every ELT cycle, Fast Optimize reduced the time spent doing compaction by 80% over 200 ELT cycles without even the slightest regression in performance.

Fast optimize resulted in 5x faster compaction over 200 ELT iterations.

The magic of Fast Optimize is in the long-term avoidance of write amplification.

Example: the diagram below illustrates how suboptimal bins are skipped:

Diagram showing how fast optimize resulted in suboptimal bins being skipped.

File Level Compaction Target: Remember What’s Already Compacted

What it does: This feature tags files with the compaction target used when they were created, preventing already-optimized files from being unnecessarily recompacted if the target file size changes over time.

The problem it solves: Imagine you compact a table with a 128MB target, then later as the table gets bigger, you change your target to 512MB. Without this feature, those perfectly good 128MB files would be recompacted again, despite being well-sized when they were originally compacted.

How it works: Delta automatically stores metadata about the compaction target alongside file statistics (OPTIMIZE_TARGET_SIZE). Future OPTIMIZE operations use this tag to determine whether a file is already compacted. If a file’s size is at least half of its OPTIMIZE_TARGET_SIZE value, it is considered compacted.
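In other words, the check behaves roughly like the sketch below. This is a simplified illustration of the rule just described, not the actual implementation.

# Simplified sketch of the file-level "already compacted" check described above
def is_already_compacted(file_size_bytes, optimize_target_size_bytes):
    # A file counts as compacted if it reached at least half of the
    # OPTIMIZE_TARGET_SIZE recorded when it was written, regardless of any
    # larger target configured later
    return file_size_bytes >= optimize_target_size_bytes / 2

# e.g. a 120 MB file written under a 128 MB target stays "compacted"
# even after the session target is later raised to 512 MB
is_already_compacted(120 * 1024 * 1024, 128 * 1024 * 1024)   # True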

The result: A dramatic reduction in write amplification and more predictable compaction job performance as the target file size changes over time.

While disabled by default in Runtime 1.3, Microsoft recommends enabling file level targets at the session level:

spark.conf.set('spark.microsoft.delta.optimize.fileLevelTarget.enabled', True)
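For example, enabling the flag and then raising the target for future compactions might look like the following sketch. It assumes the standard Delta optimize target setting (spark.databricks.delta.optimize.maxFileSize) and an illustrative table name.

# Enable file-level compaction targets for the session
spark.conf.set('spark.microsoft.delta.optimize.fileLevelTarget.enabled', True)
# Raise the target file size for future compactions
spark.conf.set('spark.databricks.delta.optimize.maxFileSize', 512 * 1024 * 1024)
# Files already compacted against the old, smaller target are left alone
spark.sql("OPTIMIZE lakehouse_demo.sales")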

Auto Compaction: Fix Small File Problems Before They Hurt

While this feature isn’t new, we recently revamped the OSS Delta implementation and now recommend Auto Compaction for Spark customers wanting a hands-off approach to table maintenance.

What it does: Auto Compaction monitors your table’s file distribution as part of every write operation and automatically triggers compaction when small file accumulation crosses defined thresholds.

Why it matters: Instead of waiting for scheduled maintenance windows, your tables maintain optimal performance automatically. Small files get compacted before they impact query performance.

Smart triggering: The feature uses table-specific heuristics to determine when a table is bordering on having too many small files, eliminating the need to manually trigger or schedule compaction jobs.

How to Enable Auto Compaction

Session level:

spark.conf.set('spark.databricks.delta.autoCompact.enabled', True)

Table level:

CREATE TABLE dbo.ac_enabled_table (id INT, payload STRING) -- example columns
TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')

It can also be enabled on existing tables with:

ALTER TABLE dbo.ac_enabled_table
SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
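Once enabled, no extra orchestration is needed: ordinary writes trigger compaction when the small-file thresholds are crossed. A minimal sketch, where small_batch_df is an illustrative DataFrame:

# Ordinary writes are all that's required; compaction triggers synchronously
# when small files accumulate past the threshold
small_batch_df.write.format("delta").mode("append").saveAsTable("dbo.ac_enabled_table")

# Inspect the table history to see whether compaction ran after recent writes
spark.sql("DESCRIBE HISTORY dbo.ac_enabled_table").show(truncate=False)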

Performance benefit: Queries maintain consistent performance without the traditional sawtooth pattern of degradation between maintenance windows.

Cost benefit: Auto compaction uses the same amount of compute as scheduled OPTIMIZE operations, but it runs automatically at precisely the right intervals. For most customers, who do not schedule OPTIMIZE at the optimal frequency for each table, auto compaction can lead to substantial cost savings by reducing unnecessary compute usage and minimizing manual intervention.

In a study comparing the performance impact of running 200 small-file-generating MERGE operations into a table, the merge was 5x faster by the last iteration with auto compaction enabled. This was the result of a growing ‘small-file problem’ being mitigated by the automatically triggered synchronous compaction operations.

Chart showing that after 200 iterations, merge executed 5x faster when auto compaction was enabled.

Putting It All Together: Approaching a Maintenance-Free Future

These features work together to create a compaction strategy that’s both intelligent and hands-off:

  1. Auto Compaction prevents small file accumulation without dedicated maintenance jobs.
  2. Fast Optimize makes your ad-hoc or scheduled OPTIMIZE operations complete quickly by avoiding suboptimal compaction work.
  3. File Level Compaction Target prevents redundant work when you adjust optimization strategies over time.

The result? Tables that maintain optimal performance with minimal operational overhead and predictable resource usage.
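To adopt all three at once, the session-level settings shown earlier can simply be combined; a sketch of what that might look like:

# Combined session-level configuration for the three features discussed above
spark.conf.set('spark.microsoft.delta.optimize.fast.enabled', True)             # Fast Optimize
spark.conf.set('spark.microsoft.delta.optimize.fileLevelTarget.enabled', True)  # File-level targets
spark.conf.set('spark.databricks.delta.autoCompact.enabled', True)              # Auto Compaction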

All features mentioned are compatible with tables created with other Delta writers.

Ready to eliminate write amplification and automate your compaction strategy? These features are available now in Microsoft Fabric Spark. Check out the Compacting Delta tables documentation for detailed configuration options and best practices.
