Use Fabric User Data Functions with Pandas DataFrames and Series in Notebooks
We’ve made a major enhancement to the Notebook Integration with Fabric User Data Functions (UDFs)—you can now use Pandas DataFrames and Series as input and output types, powered by native integration with Apache Arrow!
This enhancement brings higher performance, improved efficiency, and better scalability to your Fabric Notebooks—enabling seamless function reuse for large-scale data processing in Python, PySpark, Scala, and R.
Recap: Notebook Integration with Fabric UDFs (Preview)
As part of our initial preview, we introduced the ability to:
- Invoke shared UDFs directly from NotebookUtils.
- Use IntelliSense/autocomplete to find and call functions more easily.
- Explore function signatures and metadata using display(myFunction.functionDetails).
- Call UDFs in Python, PySpark, Scala, and R for streamlined, reusable logic across your notebooks.
This helped teams modularize logic, reduce redundancy, and improve productivity across collaborative data science and engineering projects.
What’s New: Pandas Support via Apache Arrow
In this update, Pandas DataFrames and Series are now supported as first-class input and output types for UDFs—enabled by deep integration with Apache Arrow, a highly efficient columnar memory format optimized for analytics workloads.
Benefits of the Arrow Integration:
- High-performance serialization: Skip costly JSON encoding/decoding.
- Zero-copy data sharing: Minimize overhead during UDF execution.
- Scalable: Work with millions of rows in memory with ease.
- Seamless compatibility with your existing Pandas logic.
Instead of manually converting large datasets to JSON, developers can now natively pass Pandas DataFrames to UDFs, operate on them efficiently, and return processed results—all with minimal latency and memory overhead.
Real-World Example: Revenue Aggregation by Driver
Let’s say you want to aggregate total revenue by driver across a dataset with millions of rows. Now, you can pass a Pandas DataFrame into a shared UDF and perform that operation directly:
Sample Code: Invoking Arrow-Enabled UDFs
PySpark / Python
# Get the functions from the shared UDF item
agg_func = notebookutils.udf.getFunctions("AggregateRevenueByDriver")
# Sample input as Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
"driver_id": [1, 2, 1],
"revenue": [100.0, 150.0, 200.0]
})
# Call UDF with DataFrame input and receive DataFrame output
result_df = agg_func.aggregate(df)
# Display result
print(result_df)
Scala
val aggFunc = notebookutils.udf.getFunctions("AggregateRevenueByDriver")
// Sample input (spark.implicits is required for toDF)
import spark.implicits._
val input = Seq(
(1, 100.0),
(2, 150.0),
(1, 200.0)
).toDF("driver_id", "revenue")
// Call UDF and get DataFrame output
val result = aggFunc.aggregate(input)
// Show result
result.show()
R
agg_func <- notebookutils.udf.getFunctions("AggregateRevenueByDriver")
# Sample input
df <- data.frame(
driver_id = c(1, 2, 1),
revenue = c(100.0, 150.0, 200.0)
)
# Call the UDF
result <- agg_func$aggregate(df)
# View result
print(result)
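Whichever language you call it from, the body of the shared UDF is ordinary Pandas code. Here is a hedged sketch of the aggregation logic a UDF like AggregateRevenueByDriver might contain; the function below is illustrative only, and the real UDF definition would also carry the Fabric UDF registration boilerplate:

```python
import pandas as pd

def aggregate_revenue_by_driver(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical UDF body: sum revenue per driver_id
    return df.groupby("driver_id", as_index=False)["revenue"].sum()

sample = pd.DataFrame({
    "driver_id": [1, 2, 1],
    "revenue": [100.0, 150.0, 200.0],
})
result = aggregate_revenue_by_driver(sample)
print(result)  # driver 1 totals 300.0, driver 2 totals 150.0
```

Because both the input and the return value are Pandas DataFrames, the same function works unchanged whether it is invoked from Python, PySpark, Scala, or R via NotebookUtils.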
Use Case Highlights
With this Arrow-powered enhancement, you can:
- Run fast, interactive analysis on large-scale datasets.
- Simplify cross-team collaboration by sharing tested UDFs across notebooks.
- Accelerate development-to-production workflows for real-time metrics, feature engineering, and aggregation tasks.
Try the new UDF functionality today by using NotebookUtils in your Fabric Notebook. Start by registering a Pandas-compatible UDF, then pass in your DataFrames and enjoy lightning-fast results with Apache Arrow under the hood.
Get Started
For more information, refer to the NotebookUtils for Fabric documentation.