How does Fabric make Spark Notebooks Instant?
Discover how Microsoft Fabric’s Forecasting Service reduces Spark startup latency and cloud costs through proactive, ML-driven resource provisioning.
Context & Relevance
Waiting minutes for a Spark cluster to become available can throttle analytics velocity, delay insights, and drive up cloud spend. In a world where data teams expect near-instant execution and seamless burst capacity, that latency ultimately limits innovation.
Within Microsoft Fabric, a unified platform that supports integrated data engineering, analytics, and AI workloads, reducing startup latency while optimizing cost is mission critical. To address this challenge and enable elastic scaling at cloud-optimized cost, we built Fabric Forecasting Service: a machine-learning-backed, optimization-driven system for proactively managing starter pools so that compute is available just in time and idle waste is minimized.
In this blog we explain the technical architecture, algorithms, implementation details, and observed outcomes of Forecasting Service, which is designed to serve scalable data science workloads in production at Microsoft scale.
If you use the default Starter Pool, a Spark session usually starts in a few seconds. That’s not luck. Behind the scenes, Fabric keeps a small fleet of Spark clusters already running and continuously right-sizes that fleet so most requests land on a warm cluster. When traffic spikes, we refill the starter pool quickly. If the starter pool is briefly drained or your workspace needs special networking or environments, we fall back to an on-demand start.
Why it Matters
- For Data Engineers: Faster cluster spin-up and consistent execution times.
- For Cloud Operators: Lower operational cost through predictive pooling.
- For Product Teams: Improved SLA compliance and system resilience.
By integrating ML-driven provisioning into Fabric’s compute layer, Forecasting Service redefines how large-scale data platforms manage elasticity and performance at scale.
What you’ll notice as a user
- Fast starts by default: With the Starter Pool and no extra libraries, notebooks typically start in a few seconds because the cluster and session already exist.
- When it takes longer: Adding custom libraries or Spark properties requires a short personalization step. If a starter pool is momentarily fully used, we create a new cluster.
- Private Link or Managed VNet: These workspaces don’t use Starter Pools (they run in dedicated networks), so starts are on-demand.
Typical cold-start ranges in these cases are ~2–5 minutes (plus time to install libraries if any).
For example:
- Finance analysts experience inconsistent latency during market-hour data refreshes.
- Product telemetry pipelines face SLA breaches due to cluster warm-up lag.
Traditional “static pooling” keeps clusters pre-warmed but wastes massive compute when demand dips. Forecasting Service closes this gap by balancing performance and cost dynamically.

Solution Overview: What is Forecasting Service?
Forecasting Service is Microsoft Fabric’s proactive resource provisioning engine, built directly into the big data infrastructure platform.
It uses a hybrid ML + optimization pipeline to predict demand patterns and auto-tune starter pools, maintaining an optimal starter pool size based on real-time workloads.
Think of this as inventory management for clusters:
1. Keep a starter pool of ready-to-use clusters/sessions. When you start a notebook and grab one, we immediately request another to re-hydrate the starter pool. That’s how we preserve the instant start.
2. Continuously right-size the starter pool. We forecast near-term demand from recent telemetry and then compute the target starter pool size that balances experience (no wait) against cost (idle time). The decision is a small, fast linear program that explicitly trades wait time vs idle time, so it’s explainable and easy to tune.
3. Act fast, recover fast. A pool worker applies the latest recommended target: if usage rises, we scale up; when a starter pool instance is consumed, we re-hydrate without delay. The worker talks to our existing services that create clusters and sessions.
Pool hit: you get a running starter pool instance.
Pool miss: we create one; you see a short cold start.
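The inventory-management loop above can be sketched in a few lines of Python. This is an illustrative toy model, not Fabric’s implementation; the `StarterPool` class, the `create_cluster` callback, and the hit/miss labels are assumptions for the example:

```python
from collections import deque

class StarterPool:
    """Toy model of a starter pool that re-hydrates after every acquisition."""

    def __init__(self, target_size, create_cluster):
        self.create_cluster = create_cluster  # callable that provisions one warm cluster
        self.ready = deque(create_cluster() for _ in range(target_size))

    def acquire(self):
        """Hand out a warm cluster (pool hit) or fall back to a cold start (pool miss)."""
        if self.ready:
            cluster = self.ready.popleft()            # pool hit: instant start
            self.ready.append(self.create_cluster())  # immediately re-hydrate
            return cluster, "hit"
        return self.create_cluster(), "miss"          # pool miss: short cold start

# Usage with a stub provisioner that hands out numbered clusters.
ids = iter(range(100))
pool = StarterPool(target_size=2, create_cluster=lambda: next(ids))
cluster, outcome = pool.acquire()  # a pool hit, and the pool is topped back up
```

The key design point is that re-hydration happens at acquisition time, not on a timer, so the pool is already refilling while your notebook is starting.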
Architecture Overview: What runs behind the scenes
- Starter Pool + re-hydration: We maintain a target number of ready clusters/sessions. Each time one is used, we immediately submit a create request to top the starter pool back up. The algorithm explicitly minimizes both customer wait and cluster idle time.
- Predict, then optimize: A lightweight time-series forecaster predicts the short-term request rate. We use a hybrid (SSA+) approach centered on Singular Spectrum Analysis (SSA) with deep-model enhancements and a cost-aware loss; the predicted demand feeds a Sample Average Approximation (SAA) linear program that picks the target starter pool size. The end-to-end loop runs frequently and refreshes the resource recommendation.
- Production architecture: Recommendations are stored centrally and read by a Pool Worker that calls our Big Data Infra Platform Services (which orchestrate jobs/sessions and provision and stitch VMs) to create/delete starter pool instances. Telemetry flows into the predictor; a simple hyperparameter-tuning loop runs less frequently to keep the cost/experience trade-off healthy.
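To make the SAA linear program concrete, here is a small sketch. This is not the production optimizer; the cost weights and the `scipy` formulation are assumptions, but the structure, penalizing expected shortfall (wait) and surplus (idle) over forecast demand samples, mirrors the trade-off described above:

```python
import numpy as np
from scipy.optimize import linprog

def saa_pool_size(demand_samples, wait_cost=5.0, idle_cost=1.0):
    """Pick a target pool size n minimizing mean(wait_cost*shortfall + idle_cost*surplus)
    over forecast demand samples: a Sample Average Approximation as a linear program."""
    d = np.asarray(demand_samples, dtype=float)
    S = len(d)
    # Decision vector x = [n, wait_1..wait_S, idle_1..idle_S]
    c = np.concatenate([[0.0], np.full(S, wait_cost / S), np.full(S, idle_cost / S)])
    A, b = [], []
    for s in range(S):
        # shortfall: wait_s >= d_s - n   ->   -n - wait_s <= -d_s
        row = np.zeros(1 + 2 * S); row[0] = -1; row[1 + s] = -1
        A.append(row); b.append(-d[s])
        # surplus: idle_s >= n - d_s    ->    n - idle_s <= d_s
        row = np.zeros(1 + 2 * S); row[0] = 1; row[1 + S + s] = -1
        A.append(row); b.append(d[s])
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, None))
    return res.x[0]

# With waiting 5x as costly as idling, the optimum lands at a high demand quantile.
target = saa_pool_size([3, 4, 4, 5, 9])
```

Because the objective is piecewise linear in the pool size, the optimum behaves like a newsvendor quantile of the forecast demand: raising the wait cost pushes the target pool size toward the upper tail of demand, and raising the idle cost pulls it down.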
Key Innovations
- Hybrid AI/ML Forecasting (SSA+): Combines time-series forecasting (Singular Spectrum Analysis) with a shallow neural network to predict demand spikes with high accuracy and low latency.
- Optimization Engine (SAA Optimizer): Uses linear programming to minimize total idle (cost) and wait (latency) time, delivering a Pareto-efficient balance between performance and COGS.
- Self-Adaptive Hyperparameter Tuning: Continuously adjusts sensitivity thresholds to maintain SLA under shifting workload conditions.
- Seamless Integration with Fabric Services: Tightly integrated with Big Data Infrastructure Platform Services for automatic starter pool creation, re-hydration, and telemetry monitoring.
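For readers unfamiliar with SSA, here is a minimal rank-r SSA reconstruction: just the classical embed/SVD/diagonal-average steps, without the neural enhancements, cost-aware loss, or forecasting recurrence of SSA+, and not Fabric’s actual code:

```python
import numpy as np

def ssa_smooth(series, window, rank):
    """Rank-r Singular Spectrum Analysis reconstruction: embed the series in a
    Hankel trajectory matrix, truncate its SVD, and diagonal-average back."""
    x = np.asarray(series, dtype=float)
    N, L = len(x), window
    K = N - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])  # L x K trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = (U[:, :rank] * s[:rank]) @ Vt[:rank]            # rank-r approximation
    # Diagonal averaging (Hankelization) maps the matrix back to a series.
    out, counts = np.zeros(N), np.zeros(N)
    for j in range(K):
        out[j:j + L] += Xr[:, j]
        counts[j:j + L] += 1
    return out / counts

# A noisy daily-cycle signal: the rank-2 reconstruction keeps the oscillation
# (one sinusoid occupies two singular components) and drops most of the noise.
t = np.arange(200)
noisy = np.sin(2 * np.pi * t / 24) + 0.2 * np.random.default_rng(0).normal(size=200)
smooth = ssa_smooth(noisy, window=48, rank=2)
```

SSA is attractive for request-rate telemetry because it separates periodic demand patterns (for example, business-hour cycles) from noise without requiring a hand-specified seasonal model.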
Components
- ML Predictor: Fetches time-series data from Azure Data Explorer and predicts the resource request rate.
- SAA Optimizer: Computes the target starter pool size using linear programming.
- Forecasting Worker: Runs inference pipelines and persists recommendations to Azure Cosmos DB.
- Pool Worker: Executes cluster creation/deletion via the Big Data Infrastructure Platform and maintains starter pool equilibrium.
- Telemetry Dashboard: Tracks pool hit rate, COGS, and latency metrics in real time.
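The Pool Worker’s reconciliation step can be sketched as follows. `reconcile` and the `create`/`delete` callbacks are illustrative stand-ins for the Big Data Infrastructure Platform calls, not real APIs:

```python
def reconcile(pool, target, create, delete):
    """One reconciliation pass of a pool worker: converge the live pool
    toward the recommended target size by creating or deleting instances."""
    while len(pool) < target:
        pool.append(create())      # scale up toward the recommendation
    while len(pool) > target:
        delete(pool.pop())         # trim surplus instances to cut idle cost
    return pool

# Usage with stub callbacks that record what was provisioned and torn down.
created, deleted = [], []
pool = reconcile([], target=3, create=lambda: created.append(1) or "vm",
                 delete=deleted.append)
```

In production such a loop would be level-triggered: it reads the latest stored recommendation on each pass, so a missed run is corrected by the next one rather than leaving the pool permanently off-target.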
Results at Fabric Scale
Targeting a high pool-hit rate, this approach has reduced idle cluster time versus static pre-provisioning, keeping experiences snappy while cutting COGS. It has been deployed across all Fabric regions since November 2023.

Conclusion
Fabric Forecasting Service brings infrastructure intelligence to the heart of the analytics platform. Through forecasting, optimization and feedback-driven automation, Fabric unlocks near-instant compute availability while driving down cost.
The underlying principle: treat compute capacity as a first-class elastic resource, one that learns and adapts automatically rather than remaining a manual dial. This architecture empowers scalable data science and data engineering teams to iterate faster, reduce waste, and deliver business impact more reliably.
References
- Learn more about Microsoft Fabric Spark compute
- Intelligent Pooling: Proactive Resource Provisioning in Large-scale Cloud Service (PVLDB 2024). Deep dive into forecasting, optimization, robustness, and production results
- Apache Spark compute for Data Engineering and Data Science – Microsoft Fabric | Microsoft Learn
Post Authors
Kunal Parekh, Senior Product Manager, Azure Data, Microsoft
Yiwen Zhu, Principal Researcher, Azure Data, Microsoft Research
Subru Krishnan, Principal Architect, Azure Data, Microsoft Spain
Aditya Lakra, Software Engineer, Azure Data, Microsoft
Harsha Nagulapalli, Principal Engineering Manager, Azure Data, Microsoft
Sumeet Khushalani, Principal Engineering Manager, Azure Data, Microsoft
Arijit Tarafdar, Principal Group Engineering Manager, Azure Data, Microsoft