Microsoft Fabric Updates Blog

Demystifying Data Ingestion in Fabric: Fundamental Components for Ingesting Data into a Fabric Lakehouse using Fabric Data Pipelines

Co-Author – Abhishek Narain

Overview

Building an effective Lakehouse starts with establishing a robust ingestion layer. Ingestion refers to the process of collecting, importing, and processing raw data from various sources into the data lake. Data ingestion is fundamental to the success of a data lake as it enables the consolidation, exploration, and processing of diverse and raw data. It forms the backbone for downstream analytics, machine learning, and reporting activities, equipping organizations with the flexibility and agility to extract meaningful insights from their data. In this blog, we will walk through the key components of a data Ingestion framework based on experience working with different customers while building a lakehouse in Fabric.

Data Ingestion can be divided in 2 types –

  • One time Ingestion — One time ingestion/load refers to initial ingestion of historical data to data lake.
  • Incremental ingestion — Post one time ingestion, incremental ingestion.

Please note that we have only considered batch ingestion. Real time ingestion is not in scope of this blog.

Basic components of Ingestion framework

Below graphic represents the basic building blocks (components) of the ingestion framework.

Basic Building blocks of Metadata Driven Ingestion Framework

Below table details out different components and purpose of each component.

#Component NamePurpose
1Control TableControl table is used to control and select an item from the source system to be moved into a data lake.
2Data IngestionData Copy component is used to copy the data from source system to data lake. Typically, same format is used for ingesting data from Source to Bronze Layer/Staging layer but we have the flexibility to change the source type during ingestion.
3AuditingAuditing component is used to audit the records ingested into data lake. Auditing includes identifying any errors or issues that may have occurred during the process, as well as reviewing performance metrics to identify areas for improvement.
4NotificationNotification component is used to notify in case of success or failure events. This is super critical to ensure that the operation team is notified during all critical events.
5Config ManagementConfig Management component is used for managing the configuration.
6ReportingThis component is used for visual reporting of different ingestion metrics.
  1. Control Table – This component is used to control and select an item to be moved into a Lakehouse. This is typically represented using a table stored in a Data warehouse table. Below are the typical columns used to control the ingestion in lakehouse. source type – representing the type of source (Blob, ADLS Gen2, Database, AWS S3), source details – representing the source which could be container details in case of blob or database & table details in case of relational data source, target details specifying the lakehouse or Data warehouse location where the data needs to be ingested. Below is a sample Fabric data pipeline used to look up to this control table to control the ingestion into lakehouse.
Lookup Activity fetches metadata from control table

2. Data Ingestion – This component is essential for copying data from the source. A key consideration is datatype mapping from source to target, as Microsoft Fabric stores data in Delta Lake format. For more information on the datatypes supported by Microsoft Fabric, refer here.  There may be scenarios where unsupported data types are encountered. For example, in SQL Server, data types such as geography, money, and datetime2(7) do not have corresponding equivalents in Delta Lake. This must be handled carefully to avoid loss of information when ingesting data into the Microsoft Fabric Lakehouse or warehouse.

3. Auditing – The auditing component is used to inspect the records ingested into the lakehouse. It helps in tracking data lineage, ensuring transparency in data transformations. It logs errors and anomalies, aiding in issue detection and correction. Performance metrics are captured to optimize the transformation processes. It also supports compliance by maintaining historical records of data activities. Additionally, it provides an audit trail for accountability and security.

Fabric’s data warehouse can be used to audit all the information. Some of the key fields which can be audited are –

Field NamePurpose
source_typeType of source (Blob,ADLS Gen2,Database,AWS S3)
event_run_idCaptured from data pipeline run id (Data pipeline system variable) e.g. 36ff2fde-dd85-4b1f-982e-651530df7c90.
item_nameActual Item Captured from source e.g. For files Container/folder name & for Database it will be Database.schemaname.tablename
data_readTotal amount of data retrieved from source data store, including data copied in this run and data resumed from the last failed run.
data_writtenThe actual amount of data written/committed to the sink. The size may be different from dataRead size, as it relates to how each data store stores the data.
files_readTotal number of files copied from source data store, including files copied in this run and files resumed from the last failed run. Note: In case of database this field will be NA
files_writtenThe actual amount of data written/committed to the sink. The size may be different from data Read size, as it relates to how each data store stores the data.
rows_readThe actual amount of data read from the source.
rows_writtenNumber of rows copied to sink. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings.
data_consistency_verificationTell the result whether the data has been verified to be consistent between source and destination store after being copied.
copy_durationTotal amount of duration take to copy the data.
event_start_timeTime when the pipeline started
event_end_timeTime when the pipeline ended.
source_cutoff_timeSource_cutoff_time will be the incremental pipeline execution time. Incase of full load it will be same as source_cutoff_time in the control table. This will be updated on each incremental load.
load_typeStatus either Full or Incr (Incremental).
statusStatus either success or failure.
event_triggered_byStatus either Manual or Scheduled.
error_detailsContain error codes and messages that pinpoint the location and nature of the problem.
Sample Copy activity with Auditing on Success and Failure

4. Notification – The notification component is utilized to alert in the event of success or failure. This is crucial to ensure the operations team is informed during all critical events. In Microsoft fabric, the Office 365 Outlook activity in Data Factory allows you to send an email with your Office 365 account. We also have the Teams activity in Data Factory which allows you to send a message to a Teams channel or group chat. The message in Office 365 outlook account and Teams can include dynamic expressions to be customized as much as necessary. Please find below a sample Teams notification sample.

Microsoft Teams notification

5. Config Management – Config management is used to manage configurations, which are essentially the connection details of the source.

6. Reporting – The ingestion reporting dashboard is a single pane of glass for all the displaying all the information captured by audit. One of the advantages of Fabric is that all audit information captured in data warehouse will be displayed using Power BI. Please find below a sample Teams notification sample.

List of key design considerations

When designing our Ingestion framework, adhere to several fundamental principles to ensure its effectiveness, maintainability, and adaptability.

  • Prioritize simplicity to reduce errors and facilitate understanding and maintenance.
  • Embrace modularity by dividing the system into smaller, independent components for ease of development and maintenance, while maintaining high cohesion within each module to focus on specific tasks.
  • Minimize coupling to allow independent modifications, ensuring scalability to handle growing data volumes, and support incremental loading for efficient updates.
  • Implement comprehensive monitoring and logging, along with robust metadata management, for tracking data and system health.
  • Aim for high performance through parallel ingestion and maintain extensive documentation to aid understanding, usage, collaboration, and troubleshooting.

Conclusion

I hope this blog helped you in understanding how to design an effective ingestion framework for a lake house using Fabric platform. 

Special shout out to Gyani, Winnie, Polly, Hideo, and Dharmendra for their continuous support!

Zugehörige Blogbeiträge

Demystifying Data Ingestion in Fabric: Fundamental Components for Ingesting Data into a Fabric Lakehouse using Fabric Data Pipelines

Oktober 15, 2024 von Someleze Diko

This session is part of the Microsoft Fabric and AI Learning Hackathon which focuses on how you can leverage Copilot in Microsoft Fabric. It will guide you through the various capabilities that Copilot offers for you to use Microsoft Fabric, empowering you to enhance productivity and streamline your workflows. We will dive deep into practical … Continue reading “Microsoft Fabric and AI Learning Hackathon: Copilot in Fabric”

Oktober 9, 2024 von Misha Desai

At Fabric, we’re passionate about contributing to the open-source community, particularly in areas that advance the usability and scalability of machine learning tools. One of our recent endeavors has been making substantial contributions back to the FLAML (Fast and Lightweight AutoML) project, a robust library designed to automate the tedious and complex process of machine … Continue reading “Enhancing Open Source: Fabric’s Contributions to FLAML for Scalable AutoML”