Using Azure Databricks with Microsoft Fabric and OneLake
How does Azure Databricks work with Microsoft Fabric? With the recent announcement of Microsoft Fabric, this question might have appeared in your social media feed. This blog post answers that question and explains in more detail how the two systems can work together.
First, let’s quickly recap the basics of both products. Azure Databricks is a unified set of tools for deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. Azure Databricks today has widespread adoption from organizations of all sizes for use as a data processing and analytics engine as well as a data science platform. Microsoft Fabric is a unified analytics platform that brings together all the data and analytics tools that organizations need. Fabric brings together experiences such as Data Engineering, Data Factory, Data Science, Data Warehouse, Real-Time Analytics, and Power BI onto a shared SaaS foundation, all seamlessly integrated into a single service. Microsoft Fabric comes with OneLake, an open & governed, unified SaaS data lake that serves as a single place to store organizational data. This document outlines how Azure Databricks can work with OneLake to simplify organizations’ overall data journey.
Use OneLake with existing data lakes
Microsoft OneLake works as a single SaaS data lake for your organization. It removes data silos by connecting to existing data with shortcuts. Shortcuts are a OneLake feature that enables data to be reused without copying it. Shortcuts function as a symbolic link to data, allowing a live connection to the target data from another location. Shortcuts can be created to any data within OneLake, or to external data lakes such as Azure Data Lake Storage Gen2 (ADLS Gen2) or Amazon S3. Learn more details about OneLake shortcuts.
Many data lakes are built today using Azure Databricks as a general-purpose data and analytics processing engine. The data itself is physically stored in ADLS Gen2, but transformed and cleaned using Azure Databricks.
Creating shortcuts to this existing ADLS data makes it ready for consumption through OneLake and Microsoft Fabric. Power BI in Microsoft Fabric now offers Direct Lake mode, which queries data directly in OneLake with high performance. In this architecture, business users can access reports built directly over the same curated data that the data science team uses to train ML models with Azure Databricks. Direct Lake mode in Power BI simplifies the serving layer and improves performance over existing approaches, all without copying data.
Because OneLake supports the same APIs as ADLS Gen2 and the same Delta Lake (Parquet-based) format for data storage, Azure Databricks notebooks can be updated to use the OneLake endpoints for the data with minimal changes. This keeps the paths consistent across experiences whether the data consumer is querying data through a warehouse in Microsoft Fabric or a notebook in Azure Databricks. Check out Integrate OneLake with Azure Databricks for sample code for querying OneLake data using Azure Databricks.
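To make the path change concrete, here is a minimal sketch of how an ADLS Gen2 path and a OneLake path might be built side by side. The helper names and the workspace, lakehouse, storage account, and table names are hypothetical. The sketch assumes names without spaces or special characters and assumes the cluster is already authenticated to OneLake, which is not shown here.

```python
# Hypothetical helpers for illustration; they are not part of any SDK.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def adls_path(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for data in an ADLS Gen2 storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def onelake_table_path(workspace: str, lakehouse: str, table: str) -> str:
    """Build an abfss:// URI for a Delta table in a Fabric lakehouse.

    Assumes the workspace and lakehouse names contain no spaces or
    special characters.
    """
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/Tables/{table}"


# Example names below are made up.
old_path = adls_path("curated", "mystorageacct", "gold/orders")
new_path = onelake_table_path("myworkspace", "sales", "orders")

# In an Azure Databricks notebook, where `spark` is predefined, the read call
# itself would stay the same and only the path would change:
# df = spark.read.format("delta").load(new_path)
```

Both paths use the same abfss scheme, so under these assumptions only the path string in the notebook changes, not the read logic.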
Use and land data directly in OneLake
In the previous example, we looked at how OneLake and Power BI can be added directly to existing data lakes to create a unified data storage location that is consistent across applications. Now let’s walk through how Azure Databricks can work with data landed directly in OneLake.
The key to building a successful architecture with Microsoft Fabric is to lean on the features of OneLake as a single place to store and manage your data. OneLake data can be used by any service within Microsoft Fabric and even by services outside of Fabric. Data in on-premises systems can be extracted and loaded into OneLake storage for further processing, and data in Amazon S3 can be brought in through shortcuts. When building a single data estate on OneLake, you have all the tools you need to consolidate data from a wide variety of sources. Whether you prefer a medallion or a data mesh architecture, OneLake is the ideal platform for building your data lake.
In the following example, a medallion architecture is built natively in OneLake. After the data is cleaned and processed into the gold layer, that data can be reused in several other places through shortcuts. This single copy of the data then appears as if it were natively part of those other data products, without any copies being made. These data products can then be served to end users through Azure Databricks or any compute engine in Fabric. As noted in the previous section, the shared APIs between OneLake and ADLS make it easy to start using OneLake data with any application, with minimal code changes. This new architecture provides the benefits of fewer data copies and a more consolidated governance solution while still enabling existing users to leverage their preferred apps like Azure Databricks for querying and data science.
Azure Databricks is a powerful tool for data engineering and data science. Many organizations use it extensively, and now that Microsoft OneLake is available in public preview, the two products can coexist to simplify analytics workloads. Whether customers have well-established Azure Databricks practices or are looking to get started, Microsoft Fabric and OneLake bring rich data management features that can be used to increase the value of Azure Databricks usage.
Get started with Microsoft Fabric
Microsoft Fabric is currently in preview. Try out everything Fabric has to offer by signing up for the free trial—no credit card information required. Everyone who signs up gets a fixed Fabric trial capacity, which may be used for any feature or capability from integrating data to creating machine learning models. Existing Power BI Premium customers can simply turn on Fabric through the Power BI admin portal. After July 1, 2023, Fabric will be enabled for all Power BI tenants.