Migration of Standalone Apache Spark Applications to Azure Databricks

Apache Spark is an open-source framework for large-scale data processing. Databricks is a unified analytics platform built on top of Apache Spark that simplifies cluster provisioning and supports highly scalable data pipelines. Databricks also includes an integrated workspace for collaboration in an easy-to-use environment. This makes it easy to take locally developed Spark code to production without re-engineering. In this blog, we will explore how to migrate standalone Apache Spark applications to Azure Databricks.

Below is a diagrammatic representation of the analytics system architecture on a standalone Apache Spark cluster (VM-based architecture):

[Figure: Standalone Apache Spark cluster architecture]

Technologies used:
  1. Apache Spark 3.1.1
  2. Elasticsearch
  3. RabbitMQ
  4. Python Scripts & Crontab
Here is the data flow:
  • The business services and apps publish analytics events to a specific RabbitMQ queue.
  • The Spark application listens to the same queue and dumps the events into Elasticsearch.
  • Data analysis jobs are triggered by Crontab every midnight. Crontab invokes a Python script that calculates the current date and the last analysis date, computes the analysis range, and submits the Spark jobs accordingly.
  • The Spark job performs data massaging and aggregation over different time intervals such as hourly, daily, etc., and stores the analyzed data in the relevant Elasticsearch indexes (see the sketch after this list).
  • The dashboard queries the analyzed data from Elasticsearch.
  • The Angular app shows the analyzed data in charts and/or tabular format as needed.
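For illustration, here is a minimal sketch of the hourly aggregation step. It assumes the raw events sit in an Elasticsearch index named events with eventTime (timestamp), eventType, and value fields, and that the elasticsearch-spark (elasticsearch-hadoop) connector is on the classpath; the index, field, and host names are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hourly-aggregation").getOrCreate()

// Analysis range; in the real job these arrive as parameters from the Python scheduler.
val startDate = "2021-07-01 00:00:00"
val endDate   = "2021-07-02 00:00:00"

// Read raw events from Elasticsearch (hypothetical index and host).
val events = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "elasticsearch.internal")
  .load("events")
  .where(col("eventTime").between(lit(startDate), lit(endDate)))

// Aggregate per hour and store the result in a dedicated hourly index.
val hourly = events
  .groupBy(window(col("eventTime"), "1 hour"), col("eventType"))
  .agg(count(lit(1)).as("eventCount"), avg("value").as("avgValue"))

hourly.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "elasticsearch.internal")
  .mode("append")
  .save("events_hourly")

The same pattern applies to the daily and other roll-ups.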

Migration to Azure Databricks

Every component from the VM-based system, except Apache Spark, is now converted into a separate container running on a Kubernetes cluster.

The scheduler scripts could be moved to Databricks notebooks and triggered using Databricks jobs, but we are not doing that because they are not compute-intensive and we don't want to run them on a Databricks cluster.

A managed Elasticsearch service is not available natively in Azure. Although Elastic Cloud can be deployed on Azure, we used our own Kubernetes Elasticsearch deployment for the time being.

Only the Spark applications will be moved to Databricks, as they are compute-intensive; Databricks replaces the standalone Apache Spark cluster.

Architecture:

[Figure: Spark to Databricks migration architecture]

VNet Peering:

RabbitMQ and Elasticsearch must be accessible from the Databricks cluster. VNet peering between the Databricks workspace VNet and the Kubernetes cluster's VNet provides this connectivity.

Code Changes:

Use the getOrCreate() method to create the Spark session so that the application reuses the Spark session already available on Databricks:

val session = SparkSession.builder().config(conf).getOrCreate()

Do not close the Spark session at the end of the application when running on Databricks. Guard session.stop() with a condition like:

val IS_DATABRICKS: Boolean = sys.env.contains("DATABRICKS_RUNTIME_VERSION")

if (IS_DATABRICKS) {
  // Databricks manages the session lifecycle, so do not stop the shared session.
  log.info("For Databricks, no need to stop Spark session")
} else {
  log.info("stopping Spark session")
  session.stop()
}

Compile and build the Spark application JAR. Then create a Databricks job and upload the application JAR.
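As a reference, here is a minimal build.sbt sketch for producing that JAR; the artifact names and versions are illustrative. The Spark dependencies are marked provided because the Databricks runtime already supplies them, while the Elasticsearch connector is either bundled into the JAR (for example with the sbt-assembly plugin) or attached to the cluster as a library:

// build.sbt (illustrative versions)
name := "spark-analytics"
version := "1.0.0"
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // Provided by the Databricks runtime; keep them out of the application JAR.
  "org.apache.spark" %% "spark-core" % "3.1.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.1.1" % "provided",
  // Bundled with the JAR or attached as a cluster library.
  "org.elasticsearch" %% "elasticsearch-spark-30" % "7.13.0"
)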

Open Databricks workspace → Jobs → Create Job
[Figure: Databricks Create Job screen]

Specify the cluster configuration, such as node type, number of workers, and environment variables.

[Figure: Databricks cluster configuration]

Run the job with the required parameters and verify that it works properly.

The Databricks REST APIs can be used to manage the jobs from our containers. Create an access token from the Databricks portal to use these APIs. The table below describes the Databricks APIs we use and their purpose (a calling sketch follows the table).

Header: Authorization: Bearer <access token>

  • GET /api/2.0/jobs/list
    Lists the available jobs. Used to get the job_id from the user-friendly job name.
  • GET /api/2.0/jobs/runs/list?active_only=true&offset=1&limit=10
    Lists the active job runs. Used to ensure that the same job won't run simultaneously.
  • POST /api/2.0/jobs/run-now
    Body: {"job_id": <job_id_here>, "jar_params": ["startDate", "endDate", …]}
    Runs the specified job with the given parameters and returns a run_id.
  • GET /api/2.0/jobs/runs/get?run_id=<run_id_here>
    Poll on the run_id to check the job status until it finishes.
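Below is a minimal sketch of how a container could trigger a job and poll its run, using Java's built-in HTTP client from Scala; the workspace URL, the DATABRICKS_TOKEN environment variable, and the job_id/run_id values are placeholders:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DatabricksJobClient {
  // Placeholders: the real workspace URL and token come from configuration and secrets.
  val workspaceUrl = "https://adb-1234567890123456.7.azuredatabricks.net"
  val token: String = sys.env.getOrElse("DATABRICKS_TOKEN", "")

  private val client = HttpClient.newHttpClient()

  private def send(request: HttpRequest): String =
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()

  def main(args: Array[String]): Unit = {
    // Trigger the analysis job with the computed date range (job_id 123 is a placeholder).
    val runNow = HttpRequest.newBuilder()
      .uri(URI.create(s"$workspaceUrl/api/2.0/jobs/run-now"))
      .header("Authorization", s"Bearer $token")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(
        """{"job_id": 123, "jar_params": ["2021-07-01", "2021-07-02"]}"""))
      .build()
    println(send(runNow))   // the response JSON contains the run_id

    // Poll the run (run_id 456 is a placeholder) until state.life_cycle_state
    // becomes TERMINATED; JSON parsing is omitted in this sketch.
    val getRun = HttpRequest.newBuilder()
      .uri(URI.create(s"$workspaceUrl/api/2.0/jobs/runs/get?run_id=456"))
      .header("Authorization", s"Bearer $token")
      .GET()
      .build()
    println(send(getRun))
  }
}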
Conclusion:

Databricks, built on top of Apache Spark, delivers scalable and reliable data lakes with end-to-end data security and workflow automation. GS Lab has extensive experience working with data engineering tools and a product engineering DNA that has helped deliver 2000+ successful product releases.

Author
Shivaji Mutkule | Lead Software Engineer

Shivaji has 7+ years of experience in software development. He is an experienced full-stack developer and works on cutting-edge technologies in the healthcare domain. Shivaji has industry experience in web development, analytics, microservices, DevOps, and Azure cloud networking. He completed his M.E. in Computer Science and Engineering from Government College of Engineering, Aurangabad (GECA). His areas of interest include web development, data mining, and machine learning.