Data Factory Spotlight: Dataflow Gen2
Data Factory empowers you to ingest, prepare, and transform data across your data estate with a modern data integration experience. Whether you are a citizen or professional developer, Data Factory is your one-stop shop to move or transform data. It offers intelligent transformations, a rich set of activities, connectivity to hundreds of cloud and on-premises data sources, and a growing list of output destinations that span both Microsoft and third-party databases, services, and applications.
Data Factory implements two primary high-level features: dataflows and pipelines.
- Data Pipelines enable you to leverage rich out-of-the-box data orchestration capabilities to compose flexible data workflows that meet your enterprise needs. You can learn more about their capabilities in the Data Factory Spotlight: data pipelines blog post.
- Dataflows Gen2 enable you to leverage a low-code interface and 300+ data and AI-based transformations, letting you transform data more easily and with more flexibility than any other tool.
In this article, we will focus on Dataflow Gen2 and how it works. Dataflow Gen2 is built using the familiar Power Query experience that’s available today across several Microsoft products and services such as Excel, Power BI, Power Platform, Dynamics 365 Insights applications, and more. Power Query empowers all users, ranging from citizen to professional, to perform data ingestion and data transformations across their data estate. With Dataflow Gen2, you can perform joins, aggregations, data cleansing, custom transformations, and much more all from an easy-to-use, highly visual, low-code UI.
Dataflow Gen2 in Fabric is the evolution of Dataflow, redesigned from the ground up with more powerful compute engines and data movement capabilities that help you load data to several Fabric and Azure data destinations. In this blog, we will provide an overview of the Dataflow Gen2 architecture and how the dataflow engine works.
The Dataflow Gen2 Engine
Your data integration project can be as simple as ingesting data from source to destination with little modification, or as complex as blending and transforming data from different structured and unstructured cloud and on-premises data sources. At each end of the spectrum, different strategies are required to either transform data in memory, or land it in staging storage where you can leverage different compute technologies to handle big data volumes. Another challenge in data integration projects is managing the orchestration logic that makes it all work in sync. Building such projects involves the same complexity as coding a program flow in any procedural programming language. As with any computer program, it requires trained professionals who can conceptualize and create the program flow, handle all the branches and exceptions, and configure compute and data providers to work together, and it is similarly rife with bugs that must be weeded out of the system over time.
The dataflow engine abstracts away traditional ETL and ELT complexities described above. In the next sections, we will review the different technologies the dataflow engine brings together and through examples, we will show how complex decisions and configurations are abstracted away.
Dataflows Gen2 are authored using Power Query. Once you publish a dataflow, the dataflow definition is generated: this is the program that runs when the dataflow is refreshed to produce tables in staging storage and/or an output destination. During refresh, the definition of a dataflow is used by the dataflow engine to generate an orchestration plan, manage resources, orchestrate execution of queries across data sources, gateways, and compute engines, and create tables in either the staging storage or the data destination.
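To make the idea of an orchestration plan concrete, here is a minimal Python sketch (using the standard-library `graphlib` module) of how query dependencies can be grouped into phases. The dataflow definition and query names below are hypothetical; this illustrates the concept, not the engine's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical dataflow definition: each query lists the queries it reads from.
dataflow_definition = {
    "Customers": [],
    "StoreTransactions": [],
    "OnlineTransactions": [],
    "AllTransactions": ["StoreTransactions", "OnlineTransactions"],
    "CustomerSpend": ["Customers", "AllTransactions"],
}

def orchestration_plan(definition):
    """Group queries into phases; every query within a phase can run in parallel."""
    sorter = TopologicalSorter(definition)
    sorter.prepare()
    phases = []
    while sorter.is_active():
        # Queries whose dependencies are all satisfied form the next phase.
        ready = sorted(sorter.get_ready())
        phases.append(ready)
        sorter.done(*ready)
    return phases

print(orchestration_plan(dataflow_definition))
# → [['Customers', 'OnlineTransactions', 'StoreTransactions'],
#    ['AllTransactions'], ['CustomerSpend']]
```

Queries with no dependencies land in the first phase; each later phase waits only on the phases before it, which is what lets the engine parallelize independent work.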
The above diagram captures the various components of the dataflow architecture, including the Lakehouse artifact used to stage the data being ingested and the Warehouse artifact used as a compute engine and as a means to write results back to staging or supported output destinations faster. When warehouse compute cannot be used, or when staging is disabled for a query, the mashup engine will extract, transform, or load the data to staging or the destination.
When your first Dataflow Gen2 is created in a workspace, a dataflow staging Lakehouse and Warehouse are provisioned along with their related SQL Endpoint and Datasets. These artifacts are an implementation detail of Dataflow Gen2 that is currently visible and required for its operation. These artifacts should not be deleted, and it is recommended not to use them outside of the dataflows experience, as they will be hidden in the future.
How does Dataflow Gen2 work?
Simple dataflows may only require moving a small dataset from source to destination with little modification and can be performed in memory by the mashup engine. However, this may not work well for all your ETL projects.
In more complex projects with patterns that involve blending data from multiple cloud or on-premises sources, or transforming big data volumes from sources that do not support folding (predicate pushdown), it may be best to first ingest the data into staging, then perform the transformation work using dataflow's specialized compute engine, and finally load the data to an output destination.
Let’s review how dataflows work with a few examples:
Example: Big data transformation using both Dataflow Staging and Compute
I would like to build reports and dashboards for the Contoso Fast Food chain. For my report, I need to analyze store and online transactions per customer. My data includes hundreds of millions of records for customers, physical and online store purchases all stored in ADLS Gen2 storage account.
I need to extract the data from the source, clean it, append the transactions, and merge them with the customer data. Then, I perform aggregations and load the data to the destination I will use for reporting. As you see in the diagram below, I will:
- Create a query for each of the three data sources. By default, these queries are staged. This is indicated by the blue vertical line, and the query name not being italic.
- Then, I create a new query by appending all transactions together. Here too, data is staged.
- Finally, I create another query that joins the customer data with all transaction data, aggregates it, and loads it to an output destination. Here too, my source queries have been staged, and the dataflow engine will leverage the warehouse compute to enhance performance.
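The transformations in the steps above can be sketched in plain Python. The tables, column names, and values below are toy stand-ins for illustration only; the real data lives in ADLS Gen2 and the work runs on the dataflow engine.

```python
# Toy stand-ins for the three staged source tables.
customers = [
    {"customer_id": 1, "name": "Avery"},
    {"customer_id": 2, "name": "Blake"},
]
store_tx = [
    {"customer_id": 1, "amount": 12.50},
    {"customer_id": 2, "amount": 8.00},
]
online_tx = [
    {"customer_id": 1, "amount": 20.00},
]

# Append the two transaction queries into one table.
all_tx = store_tx + online_tx

def spend_per_customer(customers, transactions):
    """Join customer data with transactions and aggregate spend per customer."""
    totals = {}
    for tx in transactions:
        totals[tx["customer_id"]] = totals.get(tx["customer_id"], 0.0) + tx["amount"]
    return [
        {"name": c["name"], "total_spend": totals.get(c["customer_id"], 0.0)}
        for c in customers
    ]

print(spend_per_customer(customers, all_tx))
# → [{'name': 'Avery', 'total_spend': 32.5}, {'name': 'Blake', 'total_spend': 8.0}]
```

In the actual dataflow, the append, join, and aggregation are expressed visually in Power Query and pushed to the warehouse compute rather than evaluated row by row like this.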
When authoring a dataflow, a query containing the lightning icon signals that the enhanced compute engine will be used during refresh.
When I refresh the dataflow, here is what happens behind the scenes:
- Phase 1: All three queries that get data from my data sources are ingested in parallel into my dataflow's staging storage.
- Phase 2: As soon as both upstream transaction queries complete ingestion, the append query will read, compute, and write back to staging the results of the query using Dataflow’s enhanced compute engine.
- Phase 3: The join and aggregate query will start executing as soon as both its dependencies have completed, and leverage dataflow's enhanced compute engine to join the data from staging, aggregate it, and write it back to the staging location.
- Phase 4: The dataflow engine will load the resulting data into the output destination.
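The phased execution above can be modeled with a small Python sketch, using a thread pool to represent the parallelism within each phase. The query names and the per-query work are hypothetical; the real engine's scheduling is, of course, far more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-query work; in the real engine each query runs against
# staging storage or the enhanced (warehouse) compute engine.
def run_query(name):
    return f"{name}: staged"

phases = [
    ["Customers", "StoreTransactions", "OnlineTransactions"],  # Phase 1: parallel ingestion
    ["AllTransactions"],                                       # Phase 2: append
    ["CustomerSpend"],                                         # Phase 3: join + aggregate
]

results = []
with ThreadPoolExecutor() as pool:
    for phase in phases:
        # Queries inside a phase run in parallel; the next phase starts
        # only after every query in the current phase completes.
        results.extend(pool.map(run_query, phase))

print(results)
```

This captures the key orchestration guarantee: the append query never starts before both transaction ingestions finish, and the join/aggregate query never starts before the append completes.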
One significant improvement in Dataflow Gen2's warehouse compute engine architecture is how the dataflow engine orchestrates reading from staging, transforming, and writing results back into staging or supported destinations. The engine leverages the DataflowStagingWarehouse not only to fold (push down) the computation, but also to write the data results back to staging or the destination.
Controlling staging for Dataflows Gen2
You might not always want to stage your data before loading it to an output destination. Here are a few examples where you might want to load data directly to a destination:
- If your data source does not contain big data volumes
- If you are not joining data from different data sources
- If you are not performing compute- or memory-intensive transformations such as joining or aggregating large data volumes
Example: Load data from a 100K row CSV file to an Azure SQL database
In this scenario, I would like to ingest a file containing 100k records into my Azure SQL database. I would also like to apply a few transformations, like renaming and removing columns, and filter out rows that contain invalid data. For these types of transformations and data volume, staging data is not required. I can disable staging for any query by right-clicking the query and deselecting the "Enable Load" action. The query name will become italic, and in the diagram view, the query's outline will be greyed out, as seen in the above image.
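This in-memory pattern can be sketched as follows, with the standard-library `sqlite3` module standing in for the Azure SQL destination and entirely hypothetical column names and rows.

```python
import sqlite3

# Toy rows standing in for the 100K-row CSV (column names are hypothetical).
rows = [
    {"cust_name": "Avery", "amount": 12.5, "legacy_code": "X1"},
    {"cust_name": "",      "amount": 8.0,  "legacy_code": "X2"},  # invalid: empty name
    {"cust_name": "Blake", "amount": 20.0, "legacy_code": "X3"},
]

# In-memory transforms: rename cust_name → customer, drop legacy_code,
# and filter out rows with an empty customer name.
clean = [
    {"customer": r["cust_name"], "amount": r["amount"]}
    for r in rows
    if r["cust_name"]
]

# Load directly to the destination, no staging step (sqlite3 stands in for Azure SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:customer, :amount)", clean)
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
# → 2
```

Because every row is transformed and written in one pass, there is no intermediate copy of the data, which is exactly what skipping staging buys you for small, simple loads.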
Behind the scenes, controlling the staging setting of a query will change orchestration from its default setting:
To orchestration that extracts, transforms, and loads data from source to destination in memory:
Configuring a query’s staging setting
In the Power Query Editor, you can enable or disable staging by right-clicking the query and selecting the Enable load option. By default, it's enabled, as indicated by the checkmark:
It’s important to note that in Dataflow Gen2, disabling load (staging) does not mean data will not be loaded to a destination. This setting controls whether data will or will not be staged. Loading to a destination is managed via the query’s data destination control at the bottom right corner of the editing experience:
Understanding the refresh history experience
Today, a dataflow must contain at least one load-enabled query in order to publish. We are updating the dataflow experience to allow publishing dataflows with no load-enabled queries as long as at least one data destination is configured. This change will be available in all production regions by August 2023.
Let’s review the difference between the refresh history of the above dataflow, when it is refreshed with and without staging enabled.
As you can see below, when staging is disabled, the refresh history will only contain the activity of loading to the output destination:
And when staging is enabled, refresh history will contain both an entry for loading data into staging and the activity of loading data from staging into the output destination:
When staging is enabled, ingestion will take more time. If there is no need to stage the data, staging should be disabled.
When creating dataflows that load data to Fabric Lakehouse or Warehouse, for best performance, it is recommended to co-locate the dataflow and destinations in the same workspace.
The Dataflow Gen2 architecture brings together a range of technologies that help you extract, transform, and load data into a variety of Fabric and Azure destinations, with no code and little effort, making it easy to get started with data analytics in Fabric.
Over the next few weeks, we’ll dive deeper into more Data Factory topics.