Microsoft Fabric Updates Blog

Fabric Changing the game – OneLake integration

This blog is part of a series of posts covering many aspects of Fabric. The first was a translation of some posts written for Synapse as a PaaS in which I exported data from an SQLDB into the Data Lake using parquet file format, and now I translated them to be reused in Fabric as a SaaS. Microsoft Fabric changing the game: Exporting data and building the Lakehouse | Microsoft Fabric Blog | Microsoft Fabric

With Fabric, you can create your Bronze, Silver, and Gold Zone with Onelake as the idea behind Fabric is to provide an integrated environment with Analytics from end to end. Or you could just use your Storage Account – ADLS Gen 2 with shortcuts to Fabric, Or even you could integrate Fabric and Databricks.

In this post, I want to discuss those options and how really convenient this integration that we can find in Onelake is suitable for Analytics workloads.

Scenario: Export into the raw/bronze zone. Data will be read from Warehouse and Lakehouse depending on the Business scenario.

If you’re not familiar with terms such as Data Lake, Lakehouse, Bronze, Silver, and Gold, it would be helpful to learn more about them. Here are some references to get started:

For this post I am reusing part of this post: Essential tips for exporting and cleaning data with Spark – Microsoft Community Hub

Onelake – Cases and Scenarios

OneLake is a single, unified, logical data lake for the whole organization. Like OneDrive, OneLake comes automatically with every Microsoft Fabric tenant and is designed to be the single place for all your analytics data. OneLake brings customers:

  • One data lake for the entire organization
  • One copy of data for use with multiple analytical engines

Reference: What is OneLake? – Microsoft Fabric | Microsoft Learn

Once the data land in Onelake, there are some interesting things you could do to integrate the way you work with data. Let’s dive into some examples.

Copy Data inside Onelake

  1. Suppose you need to copy Data from one zone ( bronze, silver, and gold), or Folder to another using Spark. How could you accomplish that?

Because Onelake is fully integrated, once the data lands you could for example use 3 lines of Python code to copy the data from, your raw zone to the silver zone. This example was taken from the docs:

files = mssparkutils.fs.ls('Your directory path')

for file in files:

    print(file.name, file.isDir, file.isFile, file.path, file.size)

Follow my example with some parametrization to make my life easier:

source="Files/Raw/FOLDERNAME/"

destiny ='Files/Silver/'

folder_destination =  'FOLDERNAME'

def Copyfile(source,destiny,folder_destination):

    files = mssparkutils.fs.ls(source)

    for file in files:

        mssparkutils.fs.cp(file.path,destiny+folder_destination+ file.name )

Copyfile (source,destiny,folder_destination)

Follow the docs where you can find the reference on how to use mssparkutils: Introduction to Microsoft Spark utilities – Azure Synapse Analytics | Microsoft Learn

Results as follows in Fig 1 – LH Explorer:

Fig 1- LH Explorer

Please note if you want to copy the relative path or the full path from Onelake, both are supported by mssparkutils and that is how you can obtain them ( Lakehouse -> files -> right button at the folder)  , Fig 2 – Path:

Fig 2- Path

Copy Data to a temporary place

2. Suppose you run into the possible scenarios:

You want to copy the files to a temporary place. Or

You want to work “with open “function from Python in a local folder through the Silver Zone, for example ref: https://docs.python.org/3/library/functions.html#open. Or

You want to copy files from another workspace inside of this Onelake to a temporary folder. So the file will exist for your working process while running the notebook and it will be automatically removed at the end of the process, without affecting the current organization of the files in your Lakehouse. Consider you have something like Fig 3, which shows – two workspaces using the same Onelake.

Fig 3 – two workspaces using the same Onelake.

Following the code example for this scenario.

#Copy csv file to a temporary local folder. mssparkutils.fs.cp('abfss://Workspacename@msit-onelake.pbidedicated.windows.net/LakeHousename/Files/Files/namefile.csv', 'file:/tmp/temp/namefile.csv')
# mount the Blob Storage container, and then read the file by using a mount path with open("/tmp/temp/ namefile.csv'") as f: print(f.read())

Results are shown in Fig. 4 – temp copy. As you can different Lakehouses.

Fig. 4 – temp copy

Copy Data between the lakehouse and warehouse

3. Suppose another scenario – If I need to copy from the Onelake, Lakehouse to the Warehouse. Can I use Copy Into? No. But you can use, Select into or Insert Into.

Example in T-SQL for my table product_purchase_consolidate:

IF EXISTS ( SELECT 1 FROM [WarehouseName].[INFORMATION_SCHEMA].[TABLES] WHERE TABLE_NAME='product_purchase_consolidate')

    DROP TABLE WarehouseName.dbo.product_purchase_consolidate

SELECT       [Color]             ,[FinishedGoodsFlag]             ,[MakeFlag]             ,[ModifiedDate]             ,[OrderQty]             ,[ProductID]             ,[ProductNumber]             ,[SafetyStockLevel]             ,[StockedQty]             ,[UnitPrice]

INTO LakehouseName.dbo.product_purchase_consolidate

FROM [WarehouseName].[dbo].[product_purchase_consolidate]

Onelake: External Integration:

What about the external integration of Onelake as part of one analytics solution?

Integrating with ADLS

  1. So, suppose you want to keep the data inside of the Storage Account. How could you easily access it inside the Fabric workspace? And the answer is: with shortcuts, Create an ADLS shortcut – Microsoft Fabric | Microsoft Learn

Results are – Fig 5 – ADLS shortcut:

Fig 5- ADLS shortcut

Administrators is the name of my storage account connected to my Synapse workspace and just follow the steps documented to access from Fabric. Though, please note at this point in time while I am writing this post and the product is in preview. As you can see through the docs I shared private endpoints are not supported as shortcuts in Fabric. Your storage account will need to be configured to use the public internet for this work.

Transfer Data from ADLS directly into Warehouse

2. Another scenario could be: If I need to transfer data from the Storage Account directly to the Warehouse. How can it be accomplished?

Please note: If you need to create a warehouse, follow the steps mentioned here: Data warehouse tutorial – introduction – Microsoft Fabric | Microsoft Learn.

One of the easiest ways to accomplish this task is using the Copy Into.

So, can I use copy into from ADLS to Warehouse which is in the Onelake? yes

For example, Fig 6 – Copy into:

Fig 6 – Copy into

ADF

Fig 7 -ADF

Onelake Explorer

3. How can I easily visualize the folders and files from my Onelake? And the answer is: Onelake Explorer:OneLake File Explorer – Microsoft Fabric | Microsoft Learn . Fig 8 – OneExplorer shows the results of it in my environment:

Fig 8- OneExplorer

Databricks

4. As we walk through Onelake integration, would it be possible to connect Onelake to Databricks and extend my Analytics solution? Yes. Ref: Integrate OneLake with Azure Databricks – Microsoft Fabric | Microsoft Learn

Open Azure Databricks workspace -> compute -> create a cluster ( except for the authentication, there is nothing special required for this cluster configuration):

Please note: To authenticate to OneLake with your Azure AD identity, you must enable Azure Data Lake Storage credential passthrough on your cluster in the Advanced Options.

Fig. 9 – Cluster

Here I am using the yellow taxi example from the docs, Fig 10 – Databricks Yellow taxi:

Fig 10 – Databricks Yellow taxi.

Now let’s check on the Onelake side, after running the notebook inside of Databricks, Fig 11- Onelake integrated:

Fig 11- Onelake integrated

So, this would suit a scenario where you have part of your process happening on the Databricks side and you want to integrate with Fabric.

Another example is as follows; I am reading some data from a mount ADLS storage inside of Databricks into two data frames. I will follow by joining those data frames and transferring them as parquet files into Fabric inside of a silver table of my Lakehouse called: datbricksOnelake.

from pyspark.sql import SparkSession

from pyspark.sql.functions import sum

#reading dataframes

DF_FactInternetSales_Delta = spark.read.load('/mnt/raw-container/FOLDER/FactInternetSales_Delta/OrderDateKey=20010703/', format='delta')

DF_FactInternetSales_Delta.show(10)

DF_FactInternetSalesReason_Delta = spark.read.load('/mnt/raw-container/FOLDER/FactInternetSalesReason_Delta/', format='delta')

DF_FactInternetSalesReason_Delta.show(10)

# Perform the join and aggregation

join_result = DF_FactInternetSales_Delta.join(DF_FactInternetSalesReason_Delta, "SalesOrderNumber")\

  .groupBy(DF_FactInternetSales_Delta.ProductKey, DF_FactInternetSales_Delta.SalesOrderNumber,DF_FactInternetSales_Delta.CustomerPONumber )\

    .agg(

        sum("OrderQuantity").alias("OrderQuantity"),

        sum("UnitPrice").alias("UnitPrice"),

        sum("SalesAmount").alias("SalesAmount")

    )

join_result.show()

#transfer into Fabric - LH datbricksOnelake

oneLakePath = "abfss://FabricWorkspaceName@onelake.dfs.fabric.microsoft.com/datbricksOnelake.lakehouse/Files/Silver/"

join_result.write.parquet(oneLakePath)

Results are:

Fig 12. Transf Databricks into fabric

Last, but not least. If you want to create the whole table script structure inside of Databricks and then save it on Fabric. Would that be possible?

The answer is: Yes, reusing the same logic as before, but saving as Delta tables. I will follow with an example. Please note this important fact, it must be Delta tables otherwise it may not be supported.

oneLakePath = "abfss:/FabricWorkspaceName@onelake.dfs.fabric.microsoft.com/datbricksOnelake.lakehouse/Tables/FactInternetSales_Consold/"

join_result.write \

  .format("delta") \

  .mode("overwrite") \

  .option("path",  oneLakePath )\

  .saveAsTable("FactInternetSales_Consold")

Results are – Fig 13 – dbtable:

Summary:

In this post, I walked through scenarios and ways to ingest the data inside of Fabric Onelake integration and work with it through the Bronze, Silver, and Gold Zone. You could use T-SQL with Stored Procedures on the Warehouse, Spark on the Lakehouse, integrate with your current Databricks solution, use shortcuts from ADLS to Fabric, and once it is in Fabric everything would be integrated with the Onelake.

That is it, Liliam UK.

Related blog posts

Fabric Changing the game – OneLake integration

April 16, 2024 by Ruixin Xu

We are pleased to share a set of key updates regarding the Copilot in Microsoft Fabric experiences. The information in this blog post has also been shared with Fabric tenant administrators. Below are the highlights of the changes. This change is an important milestone to eventually allow Copilot to reach GA within this year. See … Continue reading “Copilot in MS Fabric: Soon available to more users in your organization“

March 11, 2024 by Someleze Diko

Organizations, data analysts and data scientists need to protect the personally identifiable information (PII) of their clients, such as names, addresses, emails, phone numbers, or social security numbers (SSN), that they use to build reports and dashboards. PII can pose risks to both the data subjects and the data holders and can introduce security breach … Continue reading “Protect PII information in your Microsoft Fabric Lakehouse with Responsible AI”