Migrating to Spark Managed by Kubernetes
Spark can run on clusters managed by Kubernetes. The following steps show how to create and run applications on Spark managed by Kubernetes.
1. Create application jar:
Build an application jar that will be used in the Docker image creation. Ensure that the Spark session is stopped (spark.stop()) at the end of the application; otherwise, the Spark application pods will keep running indefinitely.
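For this step, a minimal build sketch, assuming a Maven-based project (use the equivalent sbt or Gradle command for other build tools):
# from the application project root; produces the jar under target/
mvn clean package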
2. Spark setup:
Download the Spark tarball from the official site and set it up. The steps below were used on CentOS 7 to set up Spark.
# create the installation directory
mkdir -p /opt/apache
cd /opt/apache
# download spark 3.1.1
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
# extract spark and move it to the spark directory
tar -xzf spark-3.1.1-bin-hadoop2.7.tgz
mv spark-3.1.1-bin-hadoop2.7/ spark
# initialize SPARK_HOME and PATH variables
export SPARK_HOME=/opt/apache/spark
export PATH=$PATH:$SPARK_HOME/bin
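To verify the setup, print the installed version (this assumes the PATH export above is in effect):
spark-submit --version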
3. Create Docker Image:
Spark (version 2.3 or higher) ships with a Dockerfile that can be used as-is or customized to an individual application's needs; it can be found in the kubernetes/dockerfiles/ directory.
Spark also ships with a bin/docker-image-tool.sh script that can be used to build and publish the Docker images to use with Kubernetes.
Copy the application jar to the $SPARK_HOME/jars/ directory; only one application jar is allowed per image. Run the commands below to build and push the Docker image to the repository.
$SPARK_HOME/bin/docker-image-tool.sh -r repo.com/path/spark -t spark-analyze-latest build
$SPARK_HOME/bin/docker-image-tool.sh -r repo.com/path/spark -t spark-analyze-latest push
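With the options above, the tool should tag the JVM image as repo.com/path/spark/spark:spark-analyze-latest (the trailing spark is the tool's default image name; treat this as an assumption and confirm locally). A quick check before pushing:
# list the locally built image and its tags
docker images repo.com/path/spark/spark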
In our case, we need to create a separate Docker image for each application: the analyze application and the streaming application. These images will be used in the spark-submit command later.
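As a preview of how these images are referenced, a minimal spark-submit sketch; the API server address, application class, and jar path are placeholders, and the full submission setup is covered later:
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-analyze \
  --class com.example.AnalyzeApp \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=repo.com/path/spark/spark:spark-analyze-latest \
  local:///opt/spark/jars/spark-analyze.jar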