Microsoft Fabric Updates Blog

Use Fabric Data Factory Data Pipelines to Orchestrate Notebook-based Workflows

Microsoft Fabric Data Factory’s data pipelines enable data engineers to build complex workflows that can orchestrate many different types of data processing, data movement, data transformation, and other activities. In this post, I want to focus on some good practices for building Fabric Spark notebook workflows with data pipelines in Fabric Data Factory.

Execute Notebook activities in data pipelines using parameter settings
Data pipeline example in Fabric Data Factory executing Notebooks

In the sample pipeline above, I have two Notebook activities, one Teams activity, and one If Condition with two more activities inside the container. Both Notebook activities sit at the start of the pipeline, but only the activity named “Notebook Double” has a connector to the If Condition. This means the Notebook activities will execute in parallel, while the If Condition will execute in sequence after the notebook behind the “Notebook Double” activity completes. The connector between the Notebook Double activity and the If Condition uses the “On Success” port from the first activity, so the If Condition will only execute following a successful signal from the notebook. Notice there is also a red connector attached to the Notebook Double “On Failure” port. If the notebook execution fails, the pipeline will call the Teams activity and send a Teams message. This is a common pattern: use the output ports from your notebook execution to follow different paths for success and failure, as sketched below.
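As a minimal illustration of what drives those two ports (the row-count check below is hypothetical, not taken from the sample pipeline): if a notebook cell raises an unhandled exception, the Notebook activity reports failure and the pipeline follows the red On Failure connector to the Teams activity; if the notebook runs to completion, the On Success connector fires instead.

# Hypothetical validation cell inside an orchestrated Fabric notebook.
# An unhandled exception fails the Notebook activity, which routes the pipeline
# down the red "On Failure" port (the Teams alert in the sample pipeline).
source_row_count = 0  # stand-in for a real check against your source data

if source_row_count == 0:
    raise ValueError("No rows found to process - failing the notebook run")

# If no exception is raised, the activity succeeds and the "On Success" port fires.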

You may notice that the “Another Notebook” activity is greyed out, indicating that its activity state has been set to “Deactivated”. This is another good practice when building and debugging your pipelines. I have not yet completed the configuration for this activity, but I need to test the pipeline with the first notebook path. Setting the activity to “Deactivated” essentially comments it out so that the pipeline can still be saved and run as a test. To run the debug test, just click the Run button on the ribbon as shown below.

Screenshot of activity output ports to enable workflow redirection
Use activity output ports to redirect your workflow

Another good practice when orchestrating your Fabric notebooks is to set the “Retry” property on the activity to a value greater than 0. Because of the transient and ephemeral nature of running many Spark notebooks in an automated pipeline environment, there may be occasions when a notebook execution fails because the pool, cluster, or session is busy or unavailable. Using the retry property on the pipeline Notebook activity allows Data Factory to try the execution again, based on the number of retries you have set. In my sample below, I’ve set the retry to 2, a very common practice.

The general tab of the notebook activity showing the name "Another Notebook" and the activity state set to deactivated
Use the activity settings panel to set activity state and retries

Let’s circle back to the original activity called “Notebook Double”. This activity executes a very simple notebook that I authored in my workspace: it takes a parameter value and exits by returning the incoming numeric value doubled. Inside the pipeline activity, I am sending a value into that notebook parameter, making the execution of the notebook dynamic based upon that parameter value. To make the automated execution of this pipeline even more dynamic, you can use a pipeline parameter and set the value as a parameter or expression using the “Add dynamic content” link. A minimal sketch of what such a notebook might look like follows below.
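Here is a minimal sketch of a notebook like “Notebook Double”, assuming a parameter named input_value (the actual notebook and parameter name from the post are not shown, so treat the names as placeholders): a parameters cell receives the value from the pipeline, and the final cell returns the doubled result to the pipeline.

# Parameters cell (tag this cell as a parameters cell in the Fabric notebook).
# The pipeline's Notebook activity overrides this default at run time.
# NOTE: the parameter name "input_value" is an assumption for illustration.
input_value = 1

# Double the incoming numeric value.
doubled = int(input_value) * 2

# Return the result to the pipeline; it surfaces in the activity output as
# output.result.exitValue. mssparkutils is available by default in Fabric notebooks.
mssparkutils.notebook.exit(str(doubled))

To drive that parameter from the pipeline itself, you could define a pipeline parameter and reference it in the Notebook activity’s base parameter value with an expression such as @pipeline().parameters.inputValue (this parameter name is also hypothetical).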

I’ll finish this article with the If Condition. This is a very common Data Factory pattern for pipelines: I’ve taken the On Success path from the Notebook activity, and from there I examine the output value that my notebook returns to the pipeline execution.

You can use the pipeline If Condition activity for branching and conditional execution
Use the pipeline expression builder to configure the If condition

Inside the notebook, you must send the output value back to the pipeline using:

mssparkutils.notebook.exit(outputval)

This will allow you to examine the output of the notebook execution for branching and conditional execution inside of your pipeline. In my example, the If Condition uses this expression to extract the output value from the notebook:

@equals(activity('Notebook Double').output.result.exitValue,0)

The syntax used is the Data Factory pipeline control flow expression language, and I am checking for a value of 0. If the value is 0, I will treat that as a bad result and send an automated email indicating an error returned from the notebook. Otherwise, the False path of my If Condition executes a Script activity that logs the results of the notebook. For example, if the notebook returns 10 (an input of 5 doubled), the expression evaluates to false and the logging path runs. The updated If Condition will look like this:

Screenshot of the if condition inside of data pipeline with the True and False paths
Use the If Condition control flow to either send an email or log the results
