Microsoft Fabric Updates Blog

Data Pipeline Performance Improvement Part 3: Gaining more than 50% improvement for Historical Loads

Introduction / Recap

Welcome to the final entry of our 3-part series on improving performance for historical data loads! In the first two entries we dove deep into the technical weeds to demonstrate the capabilities of Data Pipeline Expression Language.

Part 1: Data Pipeline Performance Improvements Part 1: How to convert a time interval (dd.hh:mm:ss) into seconds | Microsoft Fabric Blog | Microsoft Fabric

Part 2: Data Pipeline Performance Improvements Part 2: Creating an Array of JSONs | Microsoft Fabric Blog | Microsoft Fabric

Overview

The central principles this series stresses are:

  1. Fabric Data pipelines are much more than a tool to copy data from point A to point B.
  2. For optimal performance, you should first understand the limits and capabilities of your source and destination then design around them (e.g. concurrent connections).
  3. Consider how you can convert a pipeline with a single Copy Activity into a pipeline that can logically partition your source data and run multiple Copy Activities in parallel for each partition.

Details

Let’s take a look back at a previous blog post on ingesting files into the Fabric Lakehouse from a REST API with pagination: Data Pipelines Tutorial: Ingest files into a Lakehouse from a REST API with pagination ft. AVEVA Data Hub | Microsoft Fabric Blog | Microsoft Fabric.

In this blog, I discussed how to account for pagination given the current state of Fabric Data Pipelines in Public Preview.

[Figure: Complete architecture of the design pattern for retrieving files from a REST API and landing them into Microsoft Fabric]

This pipeline is performant when the number of paginated pages isn’t too large. However, if we want to process 10 years of data at a 1-second interval, with each page holding only a few thousand records, this pipeline will take a considerable amount of time, because the Until activity iterates sequentially. The first trick we can leverage to improve performance is to disable the Wait on completion setting within the Invoke Pipeline activity, which lets us call the child pipeline and instantly move forward. Even so, this Data pipeline isn’t suited for large historical loads.

By creating another Data pipeline that takes the desired date range for the historical load as input parameters, we can leverage what we learned in Parts 1 and 2 to logically partition that range into smaller buckets. Then, for each bucket, we call the main pipeline.
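To make this concrete, here is a minimal sketch of the partitioning expressions. The parameter names (rangeStart, bucketSeconds, bucketCount) and the child parameter names (windowStart, windowEnd) are illustrative assumptions rather than names from the original pipeline; the functions themselves (range, addSeconds, mul, add) come from the Data Pipeline Expression Language covered in Parts 1 and 2.

```
For Each Items (one index per bucket):
  @range(0, pipeline().parameters.bucketCount)

Child parameter windowStart (bucket start = rangeStart + index * bucketSeconds):
  @addSeconds(pipeline().parameters.rangeStart, mul(item(), pipeline().parameters.bucketSeconds))

Child parameter windowEnd (bucket end = rangeStart + (index + 1) * bucketSeconds):
  @addSeconds(pipeline().parameters.rangeStart, mul(add(item(), 1), pipeline().parameters.bucketSeconds))
```

For example, 10 years at daily granularity would be roughly bucketCount = 3650 with bucketSeconds = 86400, producing one child pipeline invocation per day.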

The magic lives in the For Each and the Invoke Pipeline activities.

The For Each activity, as mentioned in Part 2, has a few settings to leverage.

  • Sequential (Boolean)
  • Batch count (Integer)

To leverage parallel executions, we will want to disable Sequential. Disabling it exposes the Batch count setting. By default, without a specified value, the For Each activity runs 20 iterations in parallel; the maximum value is 50, allowing 50 iterations to run in parallel. All iterations within a single batch must complete before the For Each activity starts the next batch. Overall performance can therefore worsen when one iteration in a batch runs much longer than the others: most iterations finish while a few remain in a Running state, reducing the total number of concurrent iterations at any given time.
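As a rough sketch, these two settings live in the For Each activity’s type properties. The JSON below follows the pipeline definition schema as I understand it from the Azure Data Factory lineage (isSequential, batchCount); treat the property names as an assumption and confirm them in your own pipeline’s JSON view. The partitions parameter is hypothetical.

```json
{
  "name": "ForEach_Partitions",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 50,
    "items": {
      "value": "@pipeline().parameters.partitions",
      "type": "Expression"
    },
    "activities": []
  }
}
```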

The recommendation to combat this behavior is to sort your iterations (the input array) to group processes that should take about the same time to complete. For example, if we have 100 SQL tables and a Batch count of 10, we will want to sort the input list of SQL table names so that similarly sized tables land in the same batch, allowing each iteration within a batch to complete around the same time and reducing idle time.
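For SQL sources, one hedged way to produce that sorted input is to have a Lookup activity order the table list by approximate row count before the For Each consumes it. The snippet below is an illustrative sketch, not the original pipeline’s definition; the activity and source type names follow the Azure Data Factory lineage, and the system views give approximate sizes and require appropriate permissions.

```json
{
  "name": "Lookup_TablesBySize",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT t.name FROM sys.tables t JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1) GROUP BY t.name ORDER BY SUM(p.rows);"
    },
    "firstRowOnly": false
  }
}
```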

The second magic setting is within the Invoke Pipeline activity. By default, the Wait on completion setting is enabled, meaning the parent pipeline will not move forward until the child pipelines report that they completed. With it enabled, the For Each activity runs a batch of Invoke Pipeline activities in parallel but must wait for every child pipeline in the batch to finish before starting the next batch. By disabling Wait on completion, the Invoke Pipeline activity fires and instantly moves forward, letting each batch complete almost instantly and allowing a much higher degree of parallelism.
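Sketched in JSON, and assuming the For Each iterates over an array of windowStart/windowEnd objects like the one built in Part 2, the activity might look as follows. The type name shown is the Azure Data Factory lineage name (ExecutePipeline); Fabric’s JSON may label the Invoke Pipeline activity differently, so treat this shape as an assumption.

```json
{
  "name": "Invoke_ChildLoad",
  "type": "ExecutePipeline",
  "typeProperties": {
    "waitOnCompletion": false,
    "parameters": {
      "windowStart": { "value": "@item().windowStart", "type": "Expression" },
      "windowEnd": { "value": "@item().windowEnd", "type": "Expression" }
    }
  }
}
```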

Example

If we have 200 items in our input array, a For Each activity with Batch count = 50, and an inner Invoke Pipeline activity with Wait on completion = False, our pipeline will run the first 50 in parallel, then almost instantaneously move on to the next 50, and so on, essentially running all 200 items at the same time.

By implementing this design using a Parent Pipeline to determine logical partitions and running multiple child pipelines in parallel, I was able to reduce the total duration of historical loads by more than half.

However, this is where Principle 2 (for optimal performance, you should first understand the limits and capabilities of your source and destination, then design around them) comes into play. With this design, you can easily overwhelm both your source and destination systems. For example, I implemented this design against Azure SQL Database and quickly received the error: The request limit for the database is 900 and has been reached.

There are a few methods you can use to throttle this design:

  1. Increase the range of your logical partitions, reducing the total number of partitions, which also reduces the total number of parallel processes (connections made to source/sink).
  2. Enable Wait on completion for the Invoke Pipeline activity. In combination with the For Each Batch count, this lets you control how many parallel processes can run, as shown in the sketch after this list.
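As a sketch of option 2, re-enabling the wait and capping the batch gives a hard ceiling on concurrent child pipelines. The numbers are illustrative; size them from your source’s limits (for example, keep batchCount times the connections each child opens comfortably under the 900-request limit above).

```json
{
  "ForEach_Partitions": { "typeProperties": { "isSequential": false, "batchCount": 10 } },
  "Invoke_ChildLoad": { "typeProperties": { "waitOnCompletion": true } }
}
```

With these settings, at most 10 child pipelines run at any moment, and the next batch starts only after the current one completes.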

Challenges

While this design does improve the performance of historical loads by more than 50%, there are a few tradeoffs I want to be transparent about:

  • Additional development overhead – this design is much more difficult to develop, and only your scenario can tell whether a 50% improvement in load time outweighs the implementation effort.
  • Supportability – more complexity makes it harder to enhance and change the pipelines without causing regression issues.
  • Monitoring – in the current state of Fabric Data pipelines, it is difficult to monitor child pipelines, especially when you spawn many of them and disable Wait on completion (the activity reports success to the parent as soon as the invocation succeeds, which does not mean the child pipeline itself succeeded).

However, even with these tradeoffs, I have worked with many customers on implementing a similar design pattern with positive results, which is why I wanted to share this series with the community.

Conclusion

This wraps up this short series on how to leverage the Expression Language to create logical partitions and run multiple Copy Activities in parallel to improve overall performance of your historical loads.

As always, please feel free to leave a comment or question and thank you for going on this journey with me as we worked through this topic!
