Creating a PySpark Docker Image and Running PySpark Test Cases

Abhijit Jadhav
2 min read · May 22, 2024

Let's see how we can create a Docker image for PySpark and then use that image to run your PySpark test cases.

Create a file named Dockerfile.pyspark and add the code below to it:

FROM openjdk:8-jdk-alpine

ENV SPARK_VERSION=3.3.1 \
    HADOOP_VERSION=3.4.0 \
    PYSPARK_PYTHON=python3 \
    PYSPARK_DRIVER_PYTHON=python3

# Tools needed to download Spark and run its Python bindings
RUN apk add --no-cache \
    bash \
    curl \
    python3 \
    py3-pip

# Fetch Hadoop from the Apache archive, which keeps every release
# (downloads.apache.org hosts only the current ones)
RUN curl -O https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xzf hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} /usr/local/hadoop && \
    rm hadoop-${HADOOP_VERSION}.tar.gz

# Fetch the Spark distribution prebuilt against Hadoop 3
RUN curl -O https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    tar -xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    mv spark-${SPARK_VERSION}-bin-hadoop3 /usr/local/spark && \
    rm spark-${SPARK_VERSION}-bin-hadoop3.tgz

ENV HADOOP_HOME=/usr/local/hadoop \
    SPARK_HOME=/usr/local/spark \
    PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH

# The pip package gives us `import pyspark` for the test tooling
RUN pip3 install pyspark

# Drop into the interactive PySpark shell by default
CMD ["pyspark"]

Now build the Docker image using the command below:

docker build -t pyspark-custom-image -f Dockerfile.pyspark .
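
Before moving on, you can sanity-check the image. Its default command is pyspark, so running it interactively should drop you straight into the PySpark shell:

docker run --rm -it pyspark-custom-image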

With the image in place, it's time to write some test cases.

Write a simple script that creates a DataFrame in a create_dataframe.py file:

from pyspark.sql import SparkSession


def create_dataframe(json_data):
    spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
    columns = ["Name", "Age"]
    # Convert each dict to a tuple in column order. Passing dicts directly
    # together with a plain list of column names is fragile: dict rows are
    # deprecated in PySpark and their fields are inferred in sorted key
    # order, which can silently misalign values and column names.
    rows = [tuple(record[col] for col in columns) for record in json_data]
    df = spark.createDataFrame(rows, columns)
    return df


if __name__ == "__main__":
    json_data = [
        {"Name": "Alice", "Age": 34},
        {"Name": "Bob", "Age": 45},
        {"Name": "Catherine", "Age": 29},
    ]
    df = create_dataframe(json_data)
    df.show()
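
If you want to try the script on its own first, one option (a sketch, assuming a Unix-like shell) is to run it inside the image we just built, mounting the current directory so the container can see the file; spark-submit is already on the image's PATH:

docker run --rm -v "$PWD":/work -w /work pyspark-custom-image spark-submit create_dataframe.py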

And here is the test case for it, in a test_create_dataframe.py file:

import unittest

from create_dataframe import create_dataframe


class TestCreateDataFrame(unittest.TestCase):

    def test_create_dataframe(self):
        json_data = [
            {"Name": "Foo", "Age": 34},
            {"Name": "bar", "Age": 45},
        ]
        df = create_dataframe(json_data)
        columns = ["Name", "Age"]

        self.assertEqual(df.count(), len(json_data))
        self.assertEqual(df.columns, columns)
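
Row counts and column names are a good start, but asserting on the actual contents catches a whole class of bugs, like values landing in the wrong column. A minimal sketch of an extra method you could add to the same class (the method name is just an illustration):

    def test_dataframe_contents(self):
        json_data = [{"Name": "Foo", "Age": 34}, {"Name": "bar", "Age": 45}]
        df = create_dataframe(json_data)
        # Collect the (tiny) DataFrame back to the driver and compare values
        rows = {(row["Name"], row["Age"]) for row in df.collect()}
        self.assertEqual(rows, {("Foo", 34), ("bar", 45)})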

Now it's time to create a Docker image for the application, built on top of the pyspark-custom-image from above.

Create a Dockerfile.application and add the code below:

FROM pyspark-custom-image

# Test tooling for running the suite and measuring coverage
RUN pip3 install pytest pytest-cov coverage

COPY create_dataframe.py /app/create_dataframe.py
COPY test_create_dataframe.py /app/test_create_dataframe.py

WORKDIR /app

# Exec-form CMD can't chain commands with &&, so wrap the pipeline in sh -c.
# The test file name here must match the file we copied above.
CMD ["sh", "-c", "coverage run -m pytest test_create_dataframe.py && coverage html && coverage report"]

Now it's time to build this Docker image:

docker build -t my-application -f Dockerfile.application .

Let's run the tests now:

docker run --rm my-application
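
The coverage summary prints to the terminal, and coverage html writes an HTML report to htmlcov/ inside the container (its default output directory). If you want to browse that report on the host, one option (a sketch, assuming a Unix-like shell) is to mount a host directory over that path before the tests run:

docker run --rm -v "$PWD/htmlcov":/app/htmlcov my-application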

Hurray! We have successfully created a PySpark image and used a custom image to run our unit tests.
