Data Pipeline Performance Improvement Part 3: Gaining more than 50% improvement for Historical Loads
Introduction / Recap
Welcome to the final entry of our 3-part series on improving performance for historical data loads! In the first two entries we dove deep into the technical weeds to demonstrate the capabilities of Data Pipeline Expression Language.
Overview
The central principle of this series is to stress:
- Fabric Data pipelines are much more than a tool to copy data from point A to point B.
- For optimal performance, you should first understand the limits and capabilities of your source and destination then design around them (e.g. concurrent connections).
- Consider how you can convert a pipeline with a single Copy Activity into a pipeline that can logically partition your source data and run multiple Copy Activities in parallel for each partition.
Details
Let’s take a look back at a previous blog post on ingesting files into the Fabric Lakehouse from a REST API with pagination: Data Pipelines Tutorial: Ingest files into a Lakehouse from a REST API with pagination ft. AVEVA Data Hub | Microsoft Fabric Blog | Microsoft Fabric.
In this blog, I discussed how to account for pagination given the current state of Fabric Data Pipelines in Public Preview.
This pipeline is performant when the number of paginated pages isn’t too large. However, if we want to process 10 years of data at a 1 second interval with each page only having a few thousand records, this pipeline will take a considerable amount of time. This is because our Until Loop activity is sequential. The first trick we can leverage to improve performance is by disabling the Wait on Completion setting within the Invoke Pipeline activity. This allows us to call the child pipeline and instantly move forward. Even so, this Data pipeline isn’t suited for large historical loads.
By creating another Data pipeline to take the input parameters of the desired date range for the historical load, we can leverage what we learned in Part 1 and 2 to logically partition this date range into smaller buckets. Then for each bucket, we call the main pipeline.
The magic lives in the For Each and the Invoke Pipeline activities.
The For Each activity, as mentioned in Part 2, has a few settings to leverage.
- Sequential (Boolean)
- Batch count (Integer)
To leverage parallel executions, we will want to disable Sequential. Disabling this will expose the Batch count setting. By default, without specifying a value, the For Each activity will run 20 iterations in parallel. The maximum value you can put here is 50, allowing for 50 iterations to run in parallel. All iterations within a single batch must complete before the For Each activity will start the next batch. Overall performance can worsen if, within a batch, there is a longer running iteration compared to the other iterations. This leads to most of the iterations completing with some remaining in a Running state, reducing the total number of concurrent iterations at any given time.
The recommendation to combat this behavior is to sort your iterations (input array) to group processes that should take about the same time to complete. For example, if we have 100 SQL tables, and a Batch count set to 10, we will want to sort the input list of SQL Table names to where similar sized tables land in the same batch, allowing for each iteration within the batch to complete around the same time, reducing the idle time.
The second magic setting is within the Invoke Pipeline activity. By default, the setting Wait on completion is enabled. This means the parent pipeline will not move forward until the child pipelines report that they completed. With this enabled, the pipeline will run the Batch count of iterations, each running that number of Invoke Pipeline activities in parallel but having to wait for all of them to complete before running the next batch. By disabling Wait on completion, the Invoke Pipeline activity will fire and instantly move forward, allowing for each batch to complete almost instantly, allowing for much higher degree of parallelism.
Example
If we have 200 items in our input array, with a For Each activity, Batch Count = 50, and an inner Invoke Pipeline activity with Wait on Completion = False, our pipeline will run the first 50 in parallel then almost instantaneously move onto the next 50 and so on, essentially running all 200 items at the same time.
By implementing this design using a Parent Pipeline to determine logical partitions and running multiple child pipelines in parallel, I was able to reduce the total duration of historical loads by more than half.
However, this is where Principal 2: For optimal performance you should first understand the limits and capabilities of your source and destination then design around them comes into play. With this design, you can easily overwhelm both your source and destination systems. For example, I implemented this design for Azure SQL Server and quickly received the error: The request limit for the database is 900 and has been reached.
There are a few methods you can use to throttle this design:
- Increase the range of your logical partitions, reducing the total number of partitions, which also reduces the total number of parallel processes (connections made to source/sink).
- Enable wait on completion for the Invoke Pipeline. In combination with the For Each Batch count, this allows you to control how many parallel processes can run.
Challenges
While this design does improve the performance of historical loads by more than 50%, we have a few tradeoffs that I want to be transparent about:
- Additional development overhead – the design is much more difficult to develop, and it just depends on your scenario if 50% data loading improvement is greater than the amount of time to implement.
- Supportability – More complexity makes it more difficult to enhance and make changes without causing any regression issues.
- Monitoring – In current state of Fabric Data pipelines, it is difficult to monitor Child Pipelines, especially when you are spawning multiple and disabling Wait on completion (This will report success to the parent, meaning it successfully invoked the pipeline. This does not mean the child pipeline was actually successful.)
However, even with these tradeoffs, I have worked with many customers on implementing a similar design pattern with positive results, hence why I wanted to share this series with the community.
Conclusion
This wraps up this short series on how to leverage the Expression Language to create logical partitions and run multiple Copy Activities in parallel to improve overall performance of your historical loads.
As always, please feel free to leave a comment or question and thank you for going on this journey with me as we worked through this topic!