Microsoft Fabric Updates Blog

Announcing the Fabric Apache Spark Diagnostic Emitter: Collect Logs and Metrics

Fabric Apache Spark Diagnostic Emitter for Logs and Metrics is now in public preview. This new feature allows Apache Spark users to collect Spark logs, job events, and metrics from their Spark applications and send them to various destinations, including Azure Event Hubs, Azure Storage, and Azure Log Analytics. It provides robust support for monitoring and troubleshooting Spark applications, enhancing your visibility into application performance.

What Does the Diagnostic Emitter Do?

The Fabric Apache Spark Diagnostic Emitter enables Apache Spark applications to emit critical logs and metrics that can be used for real-time monitoring, analysis, and troubleshooting. Whether you’re sending logs to Azure Event Hubs, Azure Storage, or Azure Log Analytics, this emitter simplifies the process, allowing you to collect data seamlessly and store it in your preferred destinations.

Key Benefits of the Apache Spark Diagnostic Emitter

  • Centralized Monitoring: Send logs and metrics to Azure Event Hubs for real-time data streaming, Azure Log Analytics for deep analysis and querying, or Azure Storage for long-term retention.
  • Flexible Configuration: Easily configure Spark to emit logs and metrics to one or more destinations, with support for connection strings, Azure Key Vault integration, and more.
  • Comprehensive Metrics: Collect a wide range of logs and metrics, including driver and executor logs, event logs, and detailed Spark application metrics.

Below is a quick step-by-step guide for the one-time configuration of the destination for collecting logs and metrics.

Step 1: Create Your Azure Resources as the Destination

To begin, you’ll need an Azure Event Hubs instance, an Azure Log Analytics workspace, or an Azure Blob Storage account, based on your preference. If you don’t already have one, you can quickly create one via the Azure portal.

Step 2: Configure Your Fabric Environment Artifact for Apache Spark

Next, you’ll need to create a Fabric Environment Artifact in Microsoft Fabric and configure it with the required Spark properties.

Here are some of the key configuration properties for the diagnostic emitter:

  • spark.synapse.diagnostic.emitters: Comma-separated names of diagnostic emitters.
  • spark.synapse.diagnostic.emitter.<destination>.type: The destination type (e.g., AzureEventHub).
  • spark.synapse.diagnostic.emitter.<destination>.categories: The log categories to be collected (e.g., DriverLog, ExecutorLog, EventLog, Metrics).
  • spark.synapse.diagnostic.emitter.<destination>.secret: The Azure Event Hubs connection string.
  • spark.synapse.diagnostic.emitter.<destination>.secret.keyVault: Azure Key Vault name for storing the connection string.
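As a rough illustration of how these properties fit together, the following Python sketch assembles them for two Event Hubs emitters, one using a connection string and one using a Key Vault name. The property keys are the ones listed above. The emitter names (MyEventHub, MyEventHubKv), the helper function, and the placeholder values are hypothetical, and a Key Vault setup may need further properties that are not covered here.

```python
from typing import Dict, List

PREFIX = "spark.synapse.diagnostic"


def build_properties(emitters: List[dict]) -> Dict[str, str]:
    """Assemble Spark properties for one or more diagnostic emitters.

    Each emitter dict uses hypothetical keys: name, type, categories,
    and either secret (connection string) or key_vault (Key Vault name).
    """
    # The top-level property lists every emitter name, comma-separated.
    props = {f"{PREFIX}.emitters": ",".join(e["name"] for e in emitters)}
    for e in emitters:
        base = f"{PREFIX}.emitter.{e['name']}"
        props[f"{base}.type"] = e["type"]
        props[f"{base}.categories"] = ",".join(e["categories"])
        if "secret" in e:
            props[f"{base}.secret"] = e["secret"]
        if "key_vault" in e:
            props[f"{base}.secret.keyVault"] = e["key_vault"]
    return props


categories = ["DriverLog", "ExecutorLog", "EventLog", "Metrics"]
props = build_properties([
    # Placeholder values; replace with your own before use.
    {"name": "MyEventHub", "type": "AzureEventHub",
     "categories": categories, "secret": "<connection-string>"},
    {"name": "MyEventHubKv", "type": "AzureEventHub",
     "categories": categories, "key_vault": "<key-vault-name>"},
])

# Print the key/value pairs to enter as Spark properties in the environment.
for key, value in props.items():
    print(f"{key}: {value}")
```

Where possible, the Key Vault variant is preferable because it keeps the connection string itself out of the environment configuration.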

For a full list of configuration options, refer to the official documentation.

Step 3: Attach the Environment Artifact

Once configured, attach your environment artifact to a Notebook or Spark Job Definition.

  • For Notebooks or Spark jobs: Navigate to the specific notebook or Spark job definition and attach the environment with the configured Spark properties.
  • To set the environment as the default for the workspace: Go to your Workspace Settings in Microsoft Fabric, find the Spark settings, and select the configured environment.

After this configuration, you can run your Notebooks or Spark jobs as you normally do. You can now efficiently collect and analyze logs and metrics from your Apache Spark applications using your preferred destination. This feature simplifies monitoring and debugging, allowing you to focus on your core business logic. Additionally, you can query and aggregate the collected logs and metrics in Azure Monitor and create custom alerts that run a query at regular intervals and trigger when your defined criteria are met.

Log Data Sample

Here is a sample log record in JSON format, showing how Spark logs and metrics are captured:

{
  "timestamp": "2024-09-06T03:09:37.235Z",
  "category": "Log|EventLog|Metrics",
  "fabricLivyId": "<fabric-livy-id>",
  "applicationId": "<application-id>",
  "applicationName": "<application-name>",
  "executorId": "<driver-or-executor-id>",
  "properties": {
    "message": "Initialized BlockManager: BlockManagerId(1, vm-04b22223, 34319, None)",
    "logger_name": "org.apache.spark.storage.BlockManager",
    "level": "INFO"
  }
}
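
Once records like this reach your destination, they can be processed with ordinary JSON tooling. The following Python sketch parses the sample record, prints a one-line summary, and counts log records by level. It assumes the category value is "Log" for this record and that log records carry the properties fields shown above; the function names are hypothetical, and records of other categories may have a different properties layout.

```python
import json
from collections import Counter
from typing import Iterable

# The sample record from above, with "category" set to "Log" (an assumption).
SAMPLE_RECORD = """
{
  "timestamp": "2024-09-06T03:09:37.235Z",
  "category": "Log",
  "fabricLivyId": "<fabric-livy-id>",
  "applicationId": "<application-id>",
  "applicationName": "<application-name>",
  "executorId": "<driver-or-executor-id>",
  "properties": {
    "message": "Initialized BlockManager: BlockManagerId(1, vm-04b22223, 34319, None)",
    "logger_name": "org.apache.spark.storage.BlockManager",
    "level": "INFO"
  }
}
"""


def summarize(record: dict) -> str:
    """Format a record as 'timestamp [level] logger: message'."""
    props = record.get("properties", {})
    return "{} [{}] {}: {}".format(
        record["timestamp"],
        props.get("level", "?"),
        props.get("logger_name", "?"),
        props.get("message", ""),
    )


def count_levels(records: Iterable[dict]) -> Counter:
    """Count records of category 'Log' by their log level."""
    return Counter(
        r.get("properties", {}).get("level", "UNKNOWN")
        for r in records
        if r.get("category") == "Log"
    )


record = json.loads(SAMPLE_RECORD)
print(summarize(record))
print(count_levels([record]))
```

A count of ERROR-level records per application is one simple example of a criterion that an alert could be based on.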

Stay tuned for more updates, and happy coding!
