Microsoft Fabric Updates Blog

External data materialization in Fabric Data Warehouse

Data warehouses are at the core of modern analytics, enabling organizations to turn raw data into actionable insights. Fabric Data Warehouse empowers data professionals to analyze many types of data, from classic warehouse tables to Parquet, CSV, and JSONL files.

One of the most important decisions when working with external data is how to access it efficiently: by reading external data from its original location, or by ingesting it into the warehouse.

External files can be exposed as views or ingested and materialized as tables

You can reference an external file using a view, or ingest it into a table. In this post, I'll use the term materialization to refer to the one-time ingestion of the full content of an external file into a Data Warehouse table.

Important: Don’t confuse external data materialization discussed in this article with Materialized lake views. While materialized lake views cache the results of queries defined as Spark views, external data materialization refers to the process of ingesting and caching external data into a DW table.

Referencing external files directly and accessing a materialized copy of the data in a DW table each have their pros and cons. This blog post explores these concepts in detail and provides practical guidance for making the right choice.

Understanding Virtualization: Accessing External Data with Views

Data virtualization allows you to access external data sources without physically moving or copying the data into your warehouse. In Fabric Data Warehouse, you can add virtualization capabilities on top of external Parquet, CSV, and JSONL files using SQL views and the OPENROWSET function that references these files. For example, you can create a view that references external Parquet files containing COVID case data stored in OneLake or ADLS:

CREATE VIEW cases AS
SELECT * FROM OPENROWSET(
    BULK '/Files/bing_covid_cases/*.parquet',
    DATA_SOURCE = 'MyLakehouse'
)

This view automatically inspects the schema of the underlying files using the built-in schema inference mechanism in the OPENROWSET() function, and exposes the Parquet columns as view columns that match the underlying columns by name and type. One key benefit of virtualization is that upstream users interact with a clear, consistent interface for accessing the underlying files, without needing to understand the file format, structure, or storage details. This abstraction simplifies data access, promotes reuse, and reduces the risk of errors caused by misinterpreting raw data.

This approach lets you virtualize access to the latest version of the underlying files in a lakehouse. Whenever the external file is updated, querying the view returns the most recent data. It's an ideal solution when you want to treat files like tables, abstracting away the underlying source and format while keeping your data fresh.
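Once the view exists, downstream users can query it like any other table. A sketch, using the column names from the schema example later in this post (updated, confirmed, recovered, country):

SELECT country, MAX(confirmed) AS total_confirmed
FROM cases
GROUP BY country
ORDER BY total_confirmed DESC;

Each time this query runs, OPENROWSET reads the current Parquet files, so the results always reflect the latest data in the lakehouse.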

Limitations of Virtualization: Performance Considerations

While virtualization offers flexibility, it comes with trade-offs. Reading external data directly can be slower than querying native Data Warehouse tables, especially for large files or frequent, complex queries. Performance also depends on the file format and the speed of the underlying storage. This is the trade-off inherent in most virtualization scenarios: you get real-time access to external data without additional ingestion, but you pay a price in performance.

Solution – materialization!

For scenarios where performance over external data is critical, materialization is the solution. You can ingest the content of external files and physically materialize them as tables in the Data Warehouse, making the data readily available for querying and analysis. This involves copying data from an external source into a physical table within your Fabric Data Warehouse using the CREATE TABLE AS SELECT (CTAS) statement. Here's how you could materialize the COVID cases files used in the previous example:

CREATE TABLE cases AS
SELECT * FROM OPENROWSET(
BULK '/Files/bing_covid_cases/*.parquet',
DATA_SOURCE = 'MyLakehouse'
)

This command is almost identical to the previous example, but it creates a TABLE instead of a VIEW. You don't need to define the table schema because the schema inference in the OPENROWSET() function will automatically detect the underlying columns and their types.

When the table is created, CTAS automatically ingests the content of the external files and materializes it as table data. Subsequent queries run directly against the warehouse data, delivering the best possible performance, especially for repeated or complex analytics.

The trade-off is that the table reflects the data as it existed at the time of creation—if the external file updates, you’ll need to refresh or recreate the table to capture new data.
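One simple way to refresh the materialized table is to drop and recreate it with the same CTAS statement. This is a minimal sketch; in a production workload you might instead build the new copy under a different name and swap it in, so the table stays queryable during the refresh:

DROP TABLE cases;

CREATE TABLE cases AS
SELECT * FROM OPENROWSET(
    BULK '/Files/bing_covid_cases/*.parquet',
    DATA_SOURCE = 'MyLakehouse'
)

You can schedule this refresh with a pipeline or a scheduled task at whatever cadence your freshness requirements demand.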

Performance and Schema Tuning: Optimizing Materialized Tables

When materializing data, it’s generally advisable to define your table schema explicitly rather than relying solely on automatic schema inference. This approach helps ensure accurate data types and minimizes unexpected behavior during query execution.

In the case of Parquet files, explicit schema definition may not be necessary. The Parquet format is self-describing, so if your underlying columns are well-defined, the inferred schema will typically be reliable.

However, for formats like CSV or JSONL (where OPENROWSET() must infer types based on sampled data), or when your Parquet schema includes overly broad types (e.g., VARCHAR(MAX) or unlimited-length strings), explicitly specifying the schema can help avoid inefficiencies and improve query performance.

For example, you can specify the schema for your COVID cases table:

CREATE TABLE cases AS
SELECT * FROM OPENROWSET(
    BULK '/Files/bing_covid_cases/*.parquet',
    DATA_SOURCE = 'MyLakehouse'
)
WITH ( updated DATE, confirmed INT, recovered INT, country VARCHAR(100))

Explicit schema definition becomes particularly important when you need to limit the number of exposed columns or when working with generic textual data—such as CSV or JSONL files—that lack strict typing. In these cases, defining column types manually helps ensure consistency and prevents issues caused by ambiguous or overly permissive type inference.
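For example, if the same COVID case data arrived as CSV files (a hypothetical path, shown here for illustration), OPENROWSET would have to infer types from sampled text, so pinning the schema explicitly is especially valuable:

CREATE TABLE cases_from_csv AS
SELECT * FROM OPENROWSET(
    BULK '/Files/bing_covid_cases/*.csv',
    DATA_SOURCE = 'MyLakehouse'
)
WITH ( updated DATE, confirmed INT, recovered INT, country VARCHAR(100))

The WITH clause guarantees that the materialized table gets precise, query-friendly types instead of whatever the sampling-based inference happens to produce.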

Where possible, it’s best to optimize source files with appropriate data types. However, you can also refine and enforce type definitions during table creation to align with your data modeling and performance goals.

Conclusion

There is no universally right or wrong approach when it comes to virtualization versus materialization. Each method has its own advantages and disadvantages, so the key is understanding the criteria for choosing between them and being aware of the trade-offs involved. Your decision should be guided by factors such as performance, maintainability, cost, and the specific use case.

  • Use virtualization when you need fresh, up-to-date data, especially in scenarios where stale data is unacceptable and continuous ingestion pipelines are impractical or costly. It’s ideal for ad hoc exploration, lightweight reporting, and minimizing storage overhead, particularly when working with evolving datasets or conducting early-stage analysis. By querying data directly from the source, virtualization ensures real-time access without the complexity of continuous ingestion, making it a good choice when agility and data freshness are top priorities.
  • Use materialization when performance is critical, queries are complex or frequent, or when you need to join external data with materialized tables efficiently. Unlike virtualization, which prioritizes agility and freshness, materialization provides optimized query execution, faster performance, and greater reliability for production workloads and large-scale analytics. By materializing external data as a table, you ensure consistent results across repeated queries—making it the preferred choice for stable, high-performance analytics.

Fabric Data Warehouse provides powerful tools for both virtualizing and materializing external data. By understanding the strengths and limitations of each approach, you can design an architecture that balances freshness, performance, and cost. Start with virtualization for agility, and switch to materialization when you need speed and advanced analytics. With these strategies, you'll unlock the full potential of your data warehouse.

Learn more about these functionalities in the Fabric Data Warehouse documentation.
