Microsoft Fabric Updates Blog

Demystifying Data Ingestion in Fabric: Fundamental Components for Ingesting Data into a Fabric Lakehouse using Fabric Data Pipelines

✎ Co-Author – Abhishek Narain

Overview

Building an effective Lakehouse starts with establishing a robust ingestion layer. Ingestion refers to the process of collecting, importing, and processing raw data from various sources into the data lake. Data ingestion is fundamental to the success of a data lake as it enables the consolidation, exploration, and processing of diverse and raw data. It forms the backbone for downstream analytics, machine learning, and reporting activities, equipping organizations with the flexibility and agility to extract meaningful insights from their data. In this blog, we will walk through the key components of a data Ingestion framework based on experience working with different customers while building a lakehouse in Fabric.

Data Ingestion can be divided in 2 types –

One time Ingestion — One time ingestion/load refers to initial ingestion of historical data to data lake.
Incremental ingestion — Post one time ingestion, incremental ingestion.

Please note that we have only considered batch ingestion. Real time ingestion is not in scope of this blog.

Basic components of Ingestion framework

Below graphic represents the basic building blocks (components) of the ingestion framework.

Basic Building blocks of Metadata Driven Ingestion Framework

Below table details out different components and purpose of each component.

#	Component Name	Purpose
1	Control Table	Control table is used to control and select an item from the source system to be moved into a data lake.
2	Data Ingestion	Data Copy component is used to copy the data from source system to data lake. Typically, same format is used for ingesting data from Source to Bronze Layer/Staging layer but we have the flexibility to change the source type during ingestion.
3	Auditing	Auditing component is used to audit the records ingested into data lake. Auditing includes identifying any errors or issues that may have occurred during the process, as well as reviewing performance metrics to identify areas for improvement.
4	Notification	Notification component is used to notify in case of success or failure events. This is super critical to ensure that the operation team is notified during all critical events.
5	Config Management	Config Management component is used for managing the configuration.
6	Reporting	This component is used for visual reporting of different ingestion metrics.

Control Table – This component is used to control and select an item to be moved into a Lakehouse. This is typically represented using a table stored in a Data warehouse table. Below are the typical columns used to control the ingestion in lakehouse. source type – representing the type of source (Blob, ADLS Gen2, Database, AWS S3), source details – representing the source which could be container details in case of blob or database & table details in case of relational data source, target details specifying the lakehouse or Data warehouse location where the data needs to be ingested. Below is a sample Fabric data pipeline used to look up to this control table to control the ingestion into lakehouse.

Lookup Activity fetches metadata from control table

2. Data Ingestion – This component is essential for copying data from the source. A key consideration is datatype mapping from source to target, as Microsoft Fabric stores data in Delta Lake format. For more information on the datatypes supported by Microsoft Fabric, refer here. There may be scenarios where unsupported data types are encountered. For example, in SQL Server, data types such as geography, money, and datetime2(7) do not have corresponding equivalents in Delta Lake. This must be handled carefully to avoid loss of information when ingesting data into the Microsoft Fabric Lakehouse or warehouse.

3. Auditing – The auditing component is used to inspect the records ingested into the lakehouse. It helps in tracking data lineage, ensuring transparency in data transformations. It logs errors and anomalies, aiding in issue detection and correction. Performance metrics are captured to optimize the transformation processes. It also supports compliance by maintaining historical records of data activities. Additionally, it provides an audit trail for accountability and security.

Fabric’s data warehouse can be used to audit all the information. Some of the key fields which can be audited are –

Field Name	Purpose
source_type	Type of source (Blob,ADLS Gen2,Database,AWS S3)
event_run_id	Captured from data pipeline run id (Data pipeline system variable) e.g. 36ff2fde-dd85-4b1f-982e-651530df7c90.
item_name	Actual Item Captured from source e.g. For files Container/folder name & for Database it will be Database.schemaname.tablename
data_read	Total amount of data retrieved from source data store, including data copied in this run and data resumed from the last failed run.
data_written	The actual amount of data written/committed to the sink. The size may be different from dataRead size, as it relates to how each data store stores the data.
files_read	Total number of files copied from source data store, including files copied in this run and files resumed from the last failed run. Note: In case of database this field will be NA
files_written	The actual amount of data written/committed to the sink. The size may be different from data Read size, as it relates to how each data store stores the data.
rows_read	The actual amount of data read from the source.
rows_written	Number of rows copied to sink. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings.
data_consistency_verification	Tell the result whether the data has been verified to be consistent between source and destination store after being copied.
copy_duration	Total amount of duration take to copy the data.
event_start_time	Time when the pipeline started
event_end_time	Time when the pipeline ended.
source_cutoff_time	Source_cutoff_time will be the incremental pipeline execution time. Incase of full load it will be same as source_cutoff_time in the control table. This will be updated on each incremental load.
load_type	Status either Full or Incr (Incremental).
status	Status either success or failure.
event_triggered_by	Status either Manual or Scheduled.
error_details	Contain error codes and messages that pinpoint the location and nature of the problem.

Sample Copy activity with Auditing on Success and Failure

4. Notification – The notification component is utilized to alert in the event of success or failure. This is crucial to ensure the operations team is informed during all critical events. In Microsoft fabric, the Office 365 Outlook activity in Data Factory allows you to send an email with your Office 365 account. We also have the Teams activity in Data Factory which allows you to send a message to a Teams channel or group chat. The message in Office 365 outlook account and Teams can include dynamic expressions to be customized as much as necessary. Please find below a sample Teams notification sample.

5. Config Management – Config management is used to manage configurations, which are essentially the connection details of the source.

6. Reporting – The ingestion reporting dashboard is a single pane of glass for all the displaying all the information captured by audit. One of the advantages of Fabric is that all audit information captured in data warehouse will be displayed using Power BI. Please find below a sample Teams notification sample.

List of key design considerations

When designing our Ingestion framework, adhere to several fundamental principles to ensure its effectiveness, maintainability, and adaptability.

Prioritize simplicity to reduce errors and facilitate understanding and maintenance.
Embrace modularity by dividing the system into smaller, independent components for ease of development and maintenance, while maintaining high cohesion within each module to focus on specific tasks.
Minimize coupling to allow independent modifications, ensuring scalability to handle growing data volumes, and support incremental loading for efficient updates.
Implement comprehensive monitoring and logging, along with robust metadata management, for tracking data and system health.
Aim for high performance through parallel ingestion and maintain extensive documentation to aid understanding, usage, collaboration, and troubleshooting.

Conclusion

I hope this blog helped you in understanding how to design an effective ingestion framework for a lake house using Fabric platform.

Special shout out to Gyani, Winnie, Polly, Hideo, and Dharmendra for their continuous support!

Visit our product blogs

Power BI

Please enter your information to subscribe to the Microsoft Fabric Blog.

Microsoft Fabric Updates Blog

Demystifying Data Ingestion in Fabric: Fundamental Components for Ingesting Data into a Fabric Lakehouse using Fabric Data Pipelines

Overview

Basic components of Ingestion framework

List of key design considerations

Conclusion

Related blog posts

Microsoft Fabric June 2024 Update

Introducing Capacity Pools for Data Engineering and Data Science in Microsoft Fabric