SQL Telemetry & Intelligence – How we built a Petabyte-scale Data Platform with Fabric
Over the last three years, the SQL Telemetry & Intelligence (T&I) Engineering team has built a 10+ Petabyte Data Lake on Fabric, processing real-time data from globally distributed SQL Server Engines and Control Plane/Data Plane Services.

This article reflects on the modern Data Engineering best practices and foundations that have served us well so far:

Architecture
In 2022, when deciding where to invest our codebase to express business logic, we performed an honest comparison of the various Fabric Engine APIs against our requirements:

Today, our Lakehouse Architecture follows the Lambda pattern with a heavy bias towards Real-time:
Bronze – mirrors the source system, enabling backfills.
Silver – exploded schema to maximize columnar compression.
Gold – Kimball STAR schema with SCD2 (Dimension) tables, synchronous pre-commit quality checks in a Write-Audit-Publish pattern, with post-commit latency SLAs, Anomaly Detection checks for Natural/Business Keys.

1. OpenTelemetry – Our First Party (1P) services, as well as Customer-facing (3P) services are instrumented with OpenTelemetry.
2. Event Hub – Real-time Control plane events from Azure Resource Notifications.
3. Azure Data Explorer – Immutable time-series database.
4. Event Hub to Delta Lake Mirroring – Similar to Confluent’s Tableflow, we wrote a highly performant C#/Rust service hosted on AKS with KEDA that sustains 27+ GB/minute throughput. Refer to the Data Ingestion: Introducing delta-dotnet for Delta Lake YouTube seminar.
5. Azure Data Explorer – APPEND-only to Delta Lake – written directly into ADLS using Continuous Export.
6. Open Mirroring – APPEND-only to Delta Lake – The OTel Collector is written in Go and does not have a mature Delta Lake SDK. Using Open Mirroring, we APPEND-only flush Parquet files into the Landing Zone with massive concurrency.
7. Spark Streaming – We apply schema-on-read by supplying versioned sample payloads to transform tables; we also use Spark Streaming to combine many small files from regional tables (via maxFilesPerTrigger) into optimally sized Parquet files.
The key benefits of using Spark Streaming across the board are:
- <15-30 seconds latency, with <100 milliseconds coming in Spark 4.1.
- All state inside isolated checkpoints with limited blast radius.
- All transformations are horizontally scalable.
- Stateful transformations (via GroupState).
- forEachBatch can fork to multiple sinks against a single read.
- Incremental processing without using brittle watermark columns.
8. Kimball SCD2 tables with Transaction grain
Our strict Kimball Dimensional Model offers these guarantees:
- Unique Primary Keys.
- Valid Foreign Key references.
- Idempotent – replaying previously seen transactions is guaranteed to be a no-op.
- Reliable SCD2 columns based on source time (is_row_effective, start_date, end_date etc.).
Bringing relational-database-style Primary Key enforcement into a large STAR schema was tricky – we designed our tables taking inspiration from Spark’s Storage Partition Join, learning from Hyperspace’s B-Tree indices, and taking heavy advantage of BROADCAST joins.
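The SCD2 merge semantics described above can be sketched without Spark. The following is a minimal, framework-free illustration (function and column names beyond the SCD2 columns mentioned above are hypothetical, not our production code); it closes the current row when a key’s attributes change, inserts the new version, and treats exact replays as no-ops:

```python
def scd2_merge(existing, incoming, key, ts):
    """Apply SCD2 semantics: close the current row for a changed key,
    insert the new version, and no-op on exact replays (idempotency)."""
    rows = [dict(r) for r in existing]
    for new in incoming:
        current = next(
            (r for r in rows if r[key] == new[key] and r["is_row_effective"]),
            None,
        )
        if current is not None:
            # Replaying a previously seen transaction is a no-op.
            if current["attributes"] == new["attributes"]:
                continue
            current["is_row_effective"] = False
            current["end_date"] = new[ts]
        rows.append({
            key: new[key],
            "attributes": new["attributes"],
            "is_row_effective": True,
            "start_date": new[ts],
            "end_date": None,
        })
    return rows
```

Because the merge is keyed on the natural key and compares attributes before writing, applying the same batch twice yields an identical table – the property that makes backfills safe.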
9. Kimball Periodic Snapshot tables for DirectLake
A single DirectLake Semantic Model that describes a full-360 view of SQL Server’s business using:
- Periodic Snapshot FACT tables at the daily/weekly/monthly grain.
- Removing high-cardinality dimensions to make pre-aggregations more effective.
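Conceptually, a periodic snapshot rolls the transaction grain up to a coarser time grain while dropping high-cardinality columns. A plain-Python sketch (names are illustrative; in production this is a Spark aggregation over the fact table):

```python
from collections import defaultdict

def daily_snapshot(transactions):
    """Roll a transaction-grain fact up to a daily periodic snapshot,
    dropping high-cardinality dimensions (e.g. the transaction id)."""
    totals = defaultdict(float)
    for t in transactions:
        day = t["event_time"][:10]  # truncate ISO timestamp to date grain
        totals[(day, t["region"])] += t["amount"]
    return [
        {"snapshot_date": day, "region": region, "total_amount": amount}
        for (day, region), amount in sorted(totals.items())
    ]
```

The output is small and low-cardinality, which is what makes it such a good fit for a DirectLake Semantic Model.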
10. Eventhouse Acceleration
While most of our silver tables are stored as Delta Lake on OneLake, a second copy of popular tables is easily made available in Eventhouse using the Query Acceleration Policy; Azure Managed Grafana powers many of our live-site KPI dashboards on top of the KQL endpoint.
11. Interactive Exploration with SSMS and Notebooks
Self-explanatory.
12. DirectLake Semantic Model
A mixture of Tabular Editor 3, Power BI Desktop, DAX Studio, and Fabric Web Editor are used to model relationships and measures – all contributions must go through Pull Requests via TMDL.
Problems & Solutions
Operating a hyperscale data platform inevitably brings complex scalability challenges. In this section – we discuss specific Data Engineering problems and solutions.
VSCode Devcontainer for local development
Spark has a reputation of being difficult to run locally.
We invested in designing a VSCode Devcontainer that contains all of our development dependencies, including pinned versions of JDK, Spark Engine, and VSCode Extensions. This enables every new developer to have a consistent development environment:

Once the code works locally, we rapidly build the whl/jar, upload and test it in Fabric Spark via the VS Code Extension to invoke a Spark Job Definition, all without leaving the IDE:

Fabric CI/CD at scale
Each member of the team must be able to spin up a private fork of our entire Fabric Production stack, on demand, with zero prerequisites.
Our observations:
- Fabric Deployment Pipelines are low touch and low customizability.
- The Fabric Terraform Provider does not have 100% API surface coverage.
- The fabric-cli project also does not have 100% API surface coverage.
- Fabric-cicd package is limited to git integrated APIs.
For regions or restricted clouds where Fabric is unavailable today, we currently maintain 40+ Production Synapse Workspaces using the tremendously helpful synapse-workspace-deployment project.
Taking inspiration from it, we authored a fabric-workspace-deployment app that wraps the fabric-cli, fabric-cicd, and API models in a GitOps manifest to achieve 100% automation:
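For illustration only, such a GitOps manifest might look something like the following – the field names here are hypothetical, since the actual contract is internal and still changing:

```yaml
# Hypothetical fabric-workspace-deployment manifest (illustrative only)
capacity:
  sku: F64
  autoscaleBilling: true
workspaces:
  - name: sql-ti-silver
    items:
      - type: Lakehouse
        name: telemetry_silver
      - type: SparkJobDefinition
        name: compact_small_files
```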


Within 10 minutes, the app spins up isolated capacities, workspaces, Autoscale billing etc. to give a new Engineer an exact functional replica of Production:

The application code is available for inspiration, but the API contract is not fit for external consumption yet, as we are still making many breaking changes as we move fast to standardize T&I’s Fabric consumption.
Rapid test coverage
The key design characteristic that keeps our Data Processing code robust and near-regression-proof is forcing all changes to data to flow through a simple interface that operates purely on the DataFrame API:
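The pattern can be sketched as pure functions over a DataFrame-like input – here simulated with plain rows so the shape is testable anywhere (in the Spark version the same functions take and return pyspark.sql.DataFrame; the names below are illustrative):

```python
from typing import Callable

Rows = list[dict]
Transform = Callable[[Rows], Rows]

def with_error_flag(rows: Rows) -> Rows:
    """A pure transformation: no I/O, no side effects, deterministic.
    Every change to data flows through functions shaped like this."""
    return [{**r, "is_error": r["severity"] >= 3} for r in rows]

def pipeline(rows: Rows, transforms: list[Transform]) -> Rows:
    # Chaining pure transforms makes both testing and backfills trivial.
    for t in transforms:
        rows = t(rows)
    return rows
```

Because each transform is side-effect free, unit tests construct tiny in-memory inputs and assert exact outputs, with no cluster involved.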

Testing is a discipline of its own, so we will not go into detail here, but two golden tips pay huge dividends when hosting production workloads on Fabric – and Spark specifically – with confidence:
- Parallelized testing significantly speeds up test execution while using sbt forking to eliminate noisy-neighbor tests.
- To dramatically speed up single-node Spark execution speeds on smaller datasets, set spark.sql.shuffle.partitions = 1.
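In sbt terms, the forking and parallelism side of this looks roughly like the following sketch (exact settings depend on your build; the shuffle-partition override is applied in the shared test SparkSession rather than in sbt):

```scala
// build.sbt (sketch): fork tests into separate JVMs to isolate
// noisy-neighbor SparkSessions, and run suites in parallel.
Test / fork := true
Test / parallelExecution := true

// In the shared test fixture:
// SparkSession.builder()
//   .master("local[*]")
//   .config("spark.sql.shuffle.partitions", "1") // avoid 200-partition shuffles on tiny data
//   .getOrCreate()
```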
By making these two small changes we were able to reduce our test runtime by 67%:

Kimball STAR Schema
After referring to The Data Warehouse Toolkit, 3rd Edition, we modelled all relevant data modelling concepts using Object-Oriented Programming principles:

The additional benefit of consistent interfaces is that we’re able to automatically generate DBML (Database Markup Language), which allows consistent generation of ERDs:
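Because every table exposes the same metadata, DBML emission reduces to a small pure function – sketched here with hypothetical field names, not our actual interface:

```python
def to_dbml(table_name, columns, primary_key):
    """Render a table definition as DBML from table metadata.
    `columns` maps column name -> DBML type."""
    lines = [f"Table {table_name} {{"]
    for name, dtype in columns.items():
        pk = " [pk]" if name == primary_key else ""
        lines.append(f"  {name} {dtype}{pk}")
    lines.append("}")
    return "\n".join(lines)
```

Feeding the emitted DBML into standard tooling then renders the ERD, so diagrams never drift from the code.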

By applying the following best practices, we’ve been able to significantly reduce the time it takes for loads to occur:
- Liquid Clustering on Kimball key columns to reduce integrity enforcement runtime.
- Instead of using JOINs to check for orphaned keys, use WHERE EXISTS.
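The EXISTS-style check can short-circuit per row instead of materializing a join. The semantics, in plain Python (illustrative names; in production this is Spark SQL):

```python
def orphaned_keys(fact_rows, dim_keys):
    """Return fact rows whose foreign key has no match in the dimension.
    Equivalent to:
      SELECT * FROM fact f
      WHERE NOT EXISTS (SELECT 1 FROM dim d WHERE d.key = f.fk)
    A set-membership test replaces the join, so no duplicate fan-out occurs."""
    dim = set(dim_keys)
    return [r for r in fact_rows if r["fk"] not in dim]
```

Unlike a JOIN-based check, a matching dimension row with multiple versions cannot fan a fact row out into duplicates here.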

OpenTelemetry with Custom Spark Metrics
Given our deep investments in OpenTelemetry, we took advantage of Spark Plugins to instrument our Spark Cluster with Custom Metrics. The use case is instantly gaining deep visibility into a problematic Spark Job (e.g. ERROR 137) and understanding its memory profile with pinpoint precision:


For more information, refer to How to deeply instrument a Spark Cluster with OpenTelemetry (feat. real time Power BI report):

Data Quality and Anomaly Detection
We found Deequ to have the richest API coverage with highest performance, including Anomaly Detection, flexible Metric Stores (including Delta tables), and simple API constructs that can be extended to fit a particular codebase’s needs without requiring upstream changes.
Furthermore, recent Deequ features such as the Data Quality Definition Language (DQDL) have been phenomenal. DQDL allows us to define a terse set of constraints about a table in a few lines of declarative code:

This simple, easy to follow API allows you to declare and document data quality expectations about the table right alongside the table definition.
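To show the spirit of declaring expectations right next to the table definition, here is a toy evaluator in plain Python – deliberately not the Deequ/DQDL API, just an illustration of the declarative shape:

```python
def check(rows, constraints):
    """Evaluate declarative constraints against a table; return failed names."""
    failures = []
    for name, predicate in constraints.items():
        if not predicate(rows):
            failures.append(name)
    return failures

# Declared alongside the (hypothetical) table definition:
server_dim_constraints = {
    "is_complete server_key": lambda rows: all(r.get("server_key") is not None for r in rows),
    "is_unique server_key": lambda rows: len({r["server_key"] for r in rows}) == len(rows),
}
```

The real value of DQDL is that the equivalent declaration is even terser, and the evaluation, metric storage, and anomaly detection come for free.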
Data Platform SLAs
Our Data Platform is built on top of operational data arriving from every major Azure region on Earth. Several issues can occur at any time – it’s important to offer end users a single pane of glass with regards to:
- What the expected SLAs are for a particular dataset.
- If SLAs are breached, what the justifications and outliers are.
- When the expected mitigation is coming.
- A top-level trend of the entire Data Platform.
Our solution is a simple YAML file checked into git, with a simple YAML -> native object -> DataFrame converter:

SLA failures are evaluated as a Spark DataFrame, and visible by dimension (in this case, region):
Standardizing SLA evaluation through Spark SQL/DataFrame allows us to have robust test coverage with edge-cases captured for complex, maintainable business logic that operates on arbitrary sizes of data.
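The converter and evaluation can be sketched end-to-end in plain Python (YAML parsing elided – the list below stands in for what yaml.safe_load would return, and all names are illustrative; in production the same logic runs as Spark SQL/DataFrame code):

```python
from dataclasses import dataclass

@dataclass
class SlaDefinition:
    dataset: str
    max_latency_minutes: int

def evaluate_slas(definitions, observed):
    """Join declared SLAs against observed latencies per (dataset, region)
    and flag breaches by dimension."""
    slas = {d.dataset: d for d in definitions}
    return [
        {
            "dataset": o["dataset"],
            "region": o["region"],
            "breached": o["latency_minutes"] > slas[o["dataset"]].max_latency_minutes,
        }
        for o in observed
        if o["dataset"] in slas
    ]
```

Keeping the declaration (YAML) separate from the evaluation (a pure function) is what makes the edge cases so easy to capture in tests.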
When an SLA is breached:
- A health check job fails.
- An incident is fired via Fabric Activator.
- An on-call engineer investigates the alert, pushes an updated YAML file into OneLake via a PR. The stakeholders are kept up to date via a Delta table visualized in Power BI.

Fearless, idempotent backfills
Despite having all the unit test coverage in the world, it’s always possible to merge harmless-looking changes – such as using an INNER rather than a LEFT JOIN – that cause unnatural drops in business metrics, such as counts of unique natural keys.
Backfilling the lost data to undo the damage is tricky if the source tables are extremely large. To solve this:
- Capturing all transformation logic in self-contained functions allows us to chain them together to backfill a particular table.
- Having Primary Keys enforced in the GOLD zone allows us to replay transactions from SILVER without fearing repercussions of duplicates.


This is idempotency in action—saving the Lakehouse from what would have been an extremely expensive full reprocess with downstream ripples that would have reset many Streaming Checkpoints:
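Replaying Silver into Gold safely reduces to a keyed upsert: any transaction already present is overwritten with identical content rather than duplicated. A sketch (our production path is a keyed Delta merge; the names here are illustrative):

```python
def replay(gold, silver_batch, transforms, key="pk"):
    """Backfill: re-run the chained transforms over a Silver slice and
    upsert into Gold. Primary-key enforcement makes the replay idempotent."""
    rows = silver_batch
    for t in transforms:
        rows = t(rows)
    merged = {r[key]: r for r in gold}
    merged.update({r[key]: r for r in rows})  # replayed keys overwrite, never duplicate
    return list(merged.values())
```

Running the same backfill twice leaves Gold byte-for-byte unchanged, which is exactly why a targeted replay is so much cheaper than a full reprocess.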

Autoscale Billing for Spark enables large, unpredictable workloads
On an average day, we require about 8,000 cores at peak – as jobs are queued, they grab the cores, run, and exit. At other times, for example on November 23rd below, very large Spark Pools needed to be allocated for backfills over 2 days:

Before Fabric Spark Autoscale Billing, even multiple F2048s would not have been sufficient for this workload profile. With Autoscale Billing, we simply allocate a small SKU against the workspace and use Autoscale Billing in true PAYG fashion.
Incremental View Maintenance to avoid fully reprocessing Petabytes of data during aggregations
The pre-aggregated periodic-snapshot generation in our STAR schema is not incremental:

While it is technically possible to take any SQL query, study the AST (Abstract Syntax Tree) and rewrite it such that it’s computed incrementally (linkedin/coral on GitHub), this differential rewrite is a stateful operation, as it must keep track of the previous commit that was processed per table:

Simply rewriting the query to be incremental isn’t guaranteed to be performant. Depending on the query, the columns involved, clustering, table statistics, and table/file layout, the original query plan that processes the full dataset can be more efficient than the incremental one.
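The stateful nature of the rewrite is easy to see with a running aggregate: the incremental path folds only the new commit’s rows into saved state, and must remember which commit it last processed (illustrative only – real systems track Delta table versions):

```python
def full_refresh(all_rows):
    """Recompute the aggregate over the entire table (cost grows with history)."""
    return sum(r["amount"] for r in all_rows)

def incremental_refresh(state, new_commit_rows):
    """Fold only the delta into prior state (cost grows with the new commit).
    `state` is (last_processed_version, running_total)."""
    version, total = state
    return (version + 1, total + sum(r["amount"] for r in new_commit_rows))
```

Both paths produce the same answer; the optimization problem is deciding, per refresh, which path is cheaper.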
It’s a fascinating optimization problem that can sometimes benefit from Machine Learning over a historical heuristics feedback loop (i.e. evaluating cost over time to learn from the past). Fabric Materialized Lake View Optimal Refresh is trying to solve precisely this for all Fabric users and significantly improve COGS efficiency at scale.
For more information about this interesting problem space, refer to this academic paper: DBSP: Automatic Incremental View Maintenance for Rich Query Languages.
What’s Next?
In our experience, the best part about Fabric is how cohesively it brings together different, highly-specialized computes – such as Spark and the Analysis Services Engine – by unifying state on OneLake. Direct Lake and Spark go hand-in-hand.
Applying the best practices about Parquet that years of Data Engineering have ingrained in us feels extremely intuitive – there’s very little new to learn.
Our Semantic Model is delightful – we’re able to enjoy incredible performance and squeeze fantastic scale (a single model with 100s of tables) out of the Analysis Services Engine via Direct Lake. The trick is to first apply Spark against the massive transaction grain fact tables and perform pre-aggregation using Kimball’s periodic snapshot modelling patterns.
With the core Production Medallion Architecture laid out end-to-end, we next want to polish our Data Platform further to enable a delightful self-serve model using dbt on Fabric, preferably with the fabric-spark adapter to keep local development and PR tests as consistent as possible.
We also aim to apply consistent AI/ML across our data estate, starting with rich, stateful Anomaly Detection on our well-established KPIs and Metrics.
If you’d like a deeper dive on any specific topic discussed in this post, please leave a comment!