Creating a PySpark Docker Image and Running PySpark Test Cases

Abhijit Jadhav
2 min read · May 22, 2024

Let's see how we can create a Docker image for PySpark and then use that image to run your PySpark test cases.

Create a file named Dockerfile.pyspark and add the code below to it:

FROM openjdk:8-jdk-alpine

ENV SPARK_VERSION=3.3.1 \
    HADOOP_VERSION=3.4.0 \
    PYSPARK_PYTHON=python3 \
    PYSPARK_DRIVER_PYTHON=python3

# Tools needed to download Spark and run its Python bindings
RUN apk add --no-cache \
    bash \
    curl \
    python3 \
    py3-pip

# Fetch Hadoop from the Apache archive, which keeps every release
# (downloads.apache.org hosts only the current ones)
RUN curl -O https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xzf hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} /usr/local/hadoop && \
    rm hadoop-${HADOOP_VERSION}.tar.gz

# Fetch the Spark distribution prebuilt against Hadoop 3
RUN curl -O https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    tar -xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    mv spark-${SPARK_VERSION}-bin-hadoop3 /usr/local/spark && \
    rm spark-${SPARK_VERSION}-bin-hadoop3.tgz

ENV HADOOP_HOME=/usr/local/hadoop \
    SPARK_HOME=/usr/local/spark \
    PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH

# The pip package gives us `import pyspark` for the test tooling
RUN pip3 install pyspark

# Drop into the interactive PySpark shell by default
CMD ["pyspark"]

Now build the Docker image using the command below:

docker build -t pyspark-custom-image -f Dockerfile.pyspark .
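
Before moving on, you can sanity-check the image. Its default command is pyspark, so running it interactively should drop you straight into the PySpark shell:

docker run --rm -it pyspark-custom-image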

With the image in place, it's time to write some test cases.

Write a simple script that creates a DataFrame in a create_dataframe.py file:

from pyspark.sql import SparkSession


def create_dataframe(json_data):
    spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
    columns = ["Name", "Age"]
    # Convert each dict to a tuple in column order. Passing dicts directly
    # together with a plain list of column names is fragile: dict rows are
    # deprecated in PySpark and their fields are inferred in sorted key
    # order, which can silently misalign values and column names.
    rows = [tuple(record[col] for col in columns) for record in json_data]
    df = spark.createDataFrame(rows, columns)
    return df


if __name__ == "__main__":
    json_data = [
        {"Name": "Alice", "Age": 34},
        {"Name": "Bob", "Age": 45},
        {"Name": "Catherine", "Age": 29},
    ]
    df = create_dataframe(json_data)
    df.show()
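
If you want to try the script on its own first, one option (a sketch, assuming a Unix-like shell) is to run it inside the image we just built, mounting the current directory so the container can see the file; spark-submit is already on the image's PATH:

docker run --rm -v "$PWD":/work -w /work pyspark-custom-image spark-submit create_dataframe.py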

And here is the test case for it, in a test_create_dataframe.py file:

import unittest

from create_dataframe import create_dataframe


class TestCreateDataFrame(unittest.TestCase):

    def test_create_dataframe(self):
        json_data = [
            {"Name": "Foo", "Age": 34},
            {"Name": "bar", "Age": 45},
        ]
        df = create_dataframe(json_data)
        columns = ["Name", "Age"]

        self.assertEqual(df.count(), len(json_data))
        self.assertEqual(df.columns, columns)
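
Row counts and column names are a good start, but asserting on the actual contents catches a whole class of bugs, like values landing in the wrong column. A minimal sketch of an extra method you could add to the same class (the method name is just an illustration):

    def test_dataframe_contents(self):
        json_data = [{"Name": "Foo", "Age": 34}, {"Name": "bar", "Age": 45}]
        df = create_dataframe(json_data)
        # Collect the (tiny) DataFrame back to the driver and compare values
        rows = {(row["Name"], row["Age"]) for row in df.collect()}
        self.assertEqual(rows, {("Foo", 34), ("bar", 45)})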

Now it's time to create a Docker image for the application, built on top of the pyspark-custom-image from above.

Create a Dockerfile.application and add the code below:

FROM pyspark-custom-image

# Test tooling for running the suite and measuring coverage
RUN pip3 install pytest pytest-cov coverage

COPY create_dataframe.py /app/create_dataframe.py
COPY test_create_dataframe.py /app/test_create_dataframe.py

WORKDIR /app

# Exec-form CMD can't chain commands with &&, so wrap the pipeline in sh -c.
# The test file name here must match the file we copied above.
CMD ["sh", "-c", "coverage run -m pytest test_create_dataframe.py && coverage html && coverage report"]

Now it's time to build this Docker image:

docker build -t my-application -f Dockerfile.application .

Let's run the tests now:

docker run --rm my-application
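
The coverage summary prints to the terminal, and coverage html writes an HTML report to htmlcov/ inside the container (its default output directory). If you want to browse that report on the host, one option (a sketch, assuming a Unix-like shell) is to mount a host directory over that path before the tests run:

docker run --rm -v "$PWD/htmlcov":/app/htmlcov my-application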

Hurray! We have successfully created a PySpark image and used a custom image to run our unit tests.
