Microsoft Fabric Updates Blog

Building a Custom Sparklens JAR for Microsoft Fabric

Problem Statement

In the previous blog, Profiling Microsoft Fabric Spark Notebooks with Sparklens, we covered how to run Sparklens to profile and tune the performance of your Spark notebooks in Microsoft Fabric. That blog used a custom Sparklens JAR because the Sparklens JARs available in the Maven Central repo support only Spark 2.X, which is not compatible with Microsoft Fabric. In this blog, you will learn how to build the Sparklens JAR for Spark 3.X, which can be used in Microsoft Fabric.

Prerequisite Reading

To learn what Sparklens is, how to run it in a Microsoft Fabric Spark notebook, and how to use it to optimize performance, please check out this blog: Profiling Microsoft Fabric Spark Notebooks with Sparklens

Discussion

Sparklens is an open-source tool for profiling Spark jobs and notebooks. The latest JARs in the Maven Central repo support Spark 2.X and do not work with Spark 3.X. Here are the modifications you need to make to run it on Spark 3.X.

Note: Sparklens is not owned or maintained by Microsoft, so it is crucial that you implement all necessary security measures, similar to the precautions taken when using any package or library. Please check out the Sparklens license details here.

Steps to run Sparklens on Spark 3.X:

1. Set Up the Build Tool:

Sparklens is developed in Scala. To package a Scala project, you can use build tools like sbt (simple build tool). Ensure you have sbt installed on your local machine. This blog uses sbt version 0.13.18.

2. Clone the Repository:

Clone the Sparklens GitHub repository to your local machine from the following link: qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com).

git clone https://github.com/qubole/sparklens.git

3. Prepare Your Development Environment:

Use your preferred IDE to make the necessary changes. For this blog, Visual Studio Code is used. Open the terminal and navigate to the Sparklens directory:

cd sparklens

4. Modify plugins.sbt:

Update the plugins.sbt file to comment out the existing sbt-spark-package plugin line, addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4"):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")

resolvers += "Spark Package Main Repo" at "https://dl.bintray.com/spark-packages/maven"

// addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")

5. Update build.sbt:

Make the following changes to the build.sbt file:

  • Comment out spName, sparkVersion, and spAppendScalaVersion. They are assigned with the := operator as setting keys defined by the sbt-spark-package plugin, which is commented out in plugins.sbt, so the keys no longer exist. Instead, declare these three as plain variables (val).
  • Comment out the line that uses sparkVersion.version and replace it with sparkVersion, since sparkVersion is now a String and has no version property.
  • Change the Scala version to 2.12.0 and the Spark version to 3.0.0, and add the spark-sql 3.0.0 library dependency.

Here are the updated sections of build.sbt:

name := "sparklens"
organization := "com.qubole"

scalaVersion := "2.12.0"

crossScalaVersions := Seq("2.10.6", "2.12.0")

// spName := "qubole/sparklens"

// sparkVersion := "2.0.0"

// spAppendScalaVersion := true

val spName = "qubole/sparklens"

val sparkVersion = "3.0.0"

val spAppendScalaVersion = true


// libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion.version % "provided"

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"
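A possible variation, not part of the build used in this blog: because sparkVersion is now a plain val, the same value can drive both Spark dependencies so the two artifacts stay on one version. The fragment below is only a sketch. It assumes that marking spark-sql as "provided" is acceptable for your build, on the reasoning that the Spark runtime already supplies these libraries when the JAR is used.

```scala
// build.sbt (fragment): hypothetical consolidation of the two Spark dependencies.
// This is an illustrative alternative, not the configuration used in this blog.
val sparkVersion = "3.0.0"

libraryDependencies ++= Seq(
  // "provided": needed at compile time, supplied by the Spark runtime at execution time
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
```

If you use this form, remove the two separate libraryDependencies lines shown above so each dependency is declared only once.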

6. Update QuboleJobListener.scala:

In QuboleJobListener.scala (src/main/scala/com/qubole/sparklens/QuboleJobListener.scala), change attemptId to attemptNumber() as shown in this code snippet:

override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
  val stageTimeSpan = stageMap(stageCompleted.stageInfo.stageId)
  if (stageCompleted.stageInfo.completionTime.isDefined) {
    stageTimeSpan.setEndTime(stageCompleted.stageInfo.completionTime.get)
  }
  if (stageCompleted.stageInfo.submissionTime.isDefined) {
    stageTimeSpan.setStartTime(stageCompleted.stageInfo.submissionTime.get)
  }

  if (stageCompleted.stageInfo.failureReason.isDefined) {
    // Stage failed: the message below now calls attemptNumber() instead of attemptId
    val si = stageCompleted.stageInfo
    failedStages += s""" Stage ${si.stageId} attempt ${si.attemptNumber()} in job ${stageIDToJobID(si.stageId)} failed.
                      Stage tasks: ${si.numTasks}
                      """
    stageTimeSpan.finalUpdate()
  } else {
    val jobID = stageIDToJobID(stageCompleted.stageInfo.stageId)
    val jobTimeSpan = jobMap(jobID)
    jobTimeSpan.addStage(stageTimeSpan)
    stageTimeSpan.finalUpdate()
  }
}

7. Update HDFSConfigHelper.scala:

In HDFSConfigHelper.scala (src/main/scala/com/qubole/sparklens/helper/HDFSConfigHelper.scala), replace the use of SparkHadoopUtil, which became a private class in Spark 3, with a SparkSession-based lookup as shown below:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HDFSConfigHelper {
  // SparkHadoopUtil is private in Spark 3, so it is no longer imported here.
  // The Hadoop configuration is read from the SparkSession instead.
  def getHadoopConf(sparkConfOptional: Option[SparkConf]): Configuration = {
    // Apply the supplied SparkConf to the builder only when one is given.
    val builder = sparkConfOptional match {
      case Some(conf) => SparkSession.builder.config(conf)
      case None       => SparkSession.builder
    }
    builder.getOrCreate().sparkContext.hadoopConfiguration
  }
}
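As a quick sanity check of this helper, something like the following could be run in a Scala notebook cell once the rebuilt JAR is attached to the session. It is a minimal sketch: the package name com.qubole.sparklens.helper is assumed from the file path above, and printing the file system URI is just an illustrative way to confirm that a usable Hadoop configuration came back.

```scala
import org.apache.hadoop.fs.FileSystem
import com.qubole.sparklens.helper.HDFSConfigHelper

// Passing None makes the helper fall back to the active (or a new) SparkSession.
val hadoopConf = HDFSConfigHelper.getHadoopConf(None)

// Resolve the default file system from that configuration and show its URI.
val fs = FileSystem.get(hadoopConf)
println(s"Default file system: ${fs.getUri}")
```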

8. Compile the Revised Code: Run "sbt compile" to compile the project.

9. Package the Compiled Code: Run "sbt package" to package the project as a JAR file.

10. Use the JAR: You can now use the JAR (target/scala-2.12/sparklens_2.12-0.3.2.jar) to run profiling in a Microsoft Fabric notebook, as described in Profiling Microsoft Fabric Spark Notebooks with Sparklens.
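For orientation, here is a minimal sketch of profiling a block of code from a Scala notebook cell. It follows the notebook usage described in the Sparklens README (QuboleNotebookListener and profileIt) and assumes the rebuilt JAR is already attached to the Spark session. The Fabric-specific steps for attaching it are covered in the profiling blog and are not shown here. The names sc and spark are the SparkContext and SparkSession that the notebook provides, and the workload inside profileIt is a placeholder.

```scala
import com.qubole.sparklens.QuboleNotebookListener

// Register the Sparklens notebook listener on the session's SparkContext.
val QNL = new QuboleNotebookListener(sc.getConf)
sc.addSparkListener(QNL)

// Run the code to be profiled inside profileIt.
QNL.profileIt {
  // Placeholder workload: replace with the notebook code you want to profile.
  spark.range(0, 1000000).selectExpr("sum(id)").show()
}
```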

Further Reading

qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com)

Profiling Microsoft Fabric Spark Notebooks with Sparklens | Microsoft Fabric Blog | Microsoft Fabric
