Microsoft Fabric Updates Blog

Optimizing Spark Compute for Medallion Architectures in Microsoft Fabric

Guidance to Maximizing Productivity and Efficiency for your Data Engineering Workloads

Data engineering teams often grapple with the complexities of planning and configuring compute resources for their data platforms. This is especially true when working with large-scale, complex datasets and demanding downstream SLAs. A one-size-fits-all approach is rarely effective, as different data layers and datasets have unique requirements that necessitate tailored compute configurations.

To address these challenges, organizations often adopt a layered data architecture pattern. This pattern emphasizes modularity and scalability, allowing for better data management and governance. One popular approach is the Medallion architecture.

What is a Medallion Architecture?

Medallion architecture is a popular data engineering approach that emphasizes modularity. It organizes the data platform into three distinct layers: Bronze, Silver, and Gold. This layered structure aims to improve data quality and accessibility as data moves from left to right across the layers.

Enterprise data platforms often employ similar layered structures to ingest data from various on-premises and cloud sources, build entities, and ultimately serve these data assets through reports or applications for broader organizational insights. This common approach aligns well with the modular philosophy of the Medallion pattern.

Tailoring Compute for Medallion Layers

Its crucial to make sure that the compute configurations are custom fit for these different layers for achieving better performance and efficiency in your data engineering workloads.

Bronze Layer: The Bronze layer serves as the raw data landing zone, where unstructured data is ingested “as-is” from various sources (on-premises or cloud). This layer typically handles full or delta loads and acts as an archive of the original data. To minimize delays in subsequent processing stages, it’s essential to maintain a write-optimized configuration for the Bronze layer so the data ingestion jobs complete within your sla requirements.

Spark Configurations for Bronze Layer:

spark.sql("ALTER TABLE $tableName SET TBLPROPERTIES ('delta.checkpointInterval' = '25')"

delta.checkpointInterval: Checkpointing helps Delta Lake maintain data consistency during writes. Setting a higher interval reduces the frequency of checkpoints, allowing for faster data ingestion.

spark.databricks.delta.collect.stats: This collects statistics on Delta tables for faster future queries. However, for the bronze layer with frequent writes with potentially changing data and scenarios where there are no join, filter or aggregate queries on these tables, disabling stats would make the jobs more optimized.

spark.conf.set("spark.sql.parquet.vorder.enabled","false")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled","false")
spark.conf.set("spark.databricks.delta.collect.stats","false")

spark.sql.parquet.vorder.enabled and spark.databricks.delta.optimizeWrite.enabled configurations are primarily beneficial for sorting and data analysis (more relevant to Gold layer which is which is read-heavy). Disabling them in the Bronze layer avoids unnecessary overhead during writes (bypassing steps to reorganize data), which translates to faster data landing.

Silver Layer: In the Silver layer, data is cleaned, enriched, and subjected to data quality checks. While not directly consumable for business reporting, this layer is vital for applying business requirements and standardizing data into a common format.

The Silver layer serves as a valuable resource for data engineers, data scientists, and analysts who create data entities and business aggregates for the Gold layer. However, in the silver layer, where frequent data writes occur alongside occasional sorting, vorder can introduce some overhead due to additional steps involved to reorganize data during writes.

Disabling vorder prioritizes write performance for the silver layer. This is crucial because data transformation and cleaning in the silver layer often involve numerous write operations.

Spark Configurations for Silver Layer:

spark.conf.set("spark.sql.parquet.vorder.enabled","false")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled","false") 
                                  or 
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize","157286400")

Disabling delta.optimizeWrite helps eliminate pre-shuffle step, bypassing the effort to reorganize data during write leading to quicker write operations. Alternatively, you can fine-tune the optimization for Silver layers that are used for querying and data exploration, by enabling the optimizeWrite property and setting the binSize value between 150-250 MB can enhance parallelism for Spark jobs that create aggregates in the Gold layer.

Gold Layer: The Gold layer houses aggregated and curated data entities designed for business consumption. These entities are primarily accessed by business users and analysts to derive key insights. To ensure a low-latency experience for concurrent queries from multiple users, the Gold layer should be read-optimized.

Spark Configurations for Gold Layer:

spark.conf.set("spark.sql.parquet.vorder.enabled","true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled","true")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize","1073741824")

Gold layer tables are configured with vorder and optimizeWrite enabled, and a binSize of 1GB to minimize the number of files and optimize for read performance. Since the Gold layer is heavily queried, vorder can dramatically improve query execution times, especially for complex queries involving sorting and with optimizeWrite enabled Delta Lake automatically reorganizes data during writes, making it more efficient for queries to scan and filter data in the Gold layer to provide faster reads for Power BI reports accessed by business users.

You can configure all the above-mentioned properties using the environment item and attach them to your notebook or spark job definitions.

You can also use magic commands (%%configure) for notebook jobs in your Fabric workspace.

We hope these tips help you tailor your Spark compute configurations to meet the needs of your Medallion architecture and achieve optimal performance.

To learn more about how to create, configure and use an environment, please refer to our documentation Create, configure, and use an environment in Fabric – Microsoft Fabric | Microsoft Learn

To learn more about how to configure your sessions using magic commands, please refer to our documentation Develop, execute, and manage notebooks – Microsoft Fabric | Microsoft Learn

Related blog posts

Optimizing Spark Compute for Medallion Architectures in Microsoft Fabric

October 9, 2024 by Misha Desai

At Fabric, we’re passionate about contributing to the open-source community, particularly in areas that advance the usability and scalability of machine learning tools. One of our recent endeavors has been making substantial contributions back to the FLAML (Fast and Lightweight AutoML) project, a robust library designed to automate the tedious and complex process of machine … Continue reading “Enhancing Open Source: Fabric’s Contributions to FLAML for Scalable AutoML”

September 25, 2024 by Santhosh Kumar Ravindran

We’re excited to introduce high concurrency mode for notebooks in pipelines, bringing session sharing to one of the most popular orchestration mechanisms for enterprise data ingestion and transformation. Notebooks will now automatically be packed into an active high concurrency session without compromising performance or security, while paying for a single session. Key Benefits: Why Use … Continue reading “Introducing High Concurrency Mode for Notebooks in Pipelines for Fabric Spark”