Creating a PySpark Docker Image and Running PySpark Test Cases
Let's see how to create a Docker image for PySpark and then use that image to run your PySpark test cases.
Create a Dockerfile.pyspark and add the code below to it:
FROM openjdk:8-jdk-alpine

ENV SPARK_VERSION=3.3.1 \
    HADOOP_VERSION=3.4.0 \
    PYSPARK_PYTHON=python3 \
    PYSPARK_DRIVER_PYTHON=python3

RUN apk add --no-cache \
    bash \
    curl \
    python3 \
    py3-pip

# archive.apache.org keeps every release, so the build stays reproducible
# even after a version is rotated off the main download mirror
RUN curl -fsSL -O https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xzf hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} /usr/local/hadoop && \
    rm hadoop-${HADOOP_VERSION}.tar.gz

RUN curl -fsSL -O https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    tar -xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    mv spark-${SPARK_VERSION}-bin-hadoop3 /usr/local/spark && \
    rm spark-${SPARK_VERSION}-bin-hadoop3.tgz

ENV HADOOP_HOME=/usr/local/hadoop \
    SPARK_HOME=/usr/local/spark \
    PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH

# Pin the Python package to the same Spark version to avoid a driver/library mismatch
RUN pip3 install pyspark==${SPARK_VERSION}

CMD ["pyspark"]
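Optionally, add a .dockerignore next to the Dockerfile so local artifacts stay out of the build context (a small sketch; adjust the entries to your project):

```
__pycache__/
*.pyc
.pytest_cache/
htmlcov/
.coverage
```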
Now build the Docker image with the command below:
docker build -t pyspark-custom-image -f Dockerfile.pyspark .
Okay, so now you have a Docker image. It's time to write some test cases.
Write a simple function that creates a DataFrame in a create_dataframe.py file:
from pyspark.sql import SparkSession


def create_dataframe(json_data):
    spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
    columns = ["Name", "Age"]
    df = spark.createDataFrame(json_data, columns)
    return df


if __name__ == "__main__":
    json_data = [
        {"Name": "Alice", "Age": 34},
        {"Name": "Bob", "Age": 45},
        {"Name": "Catherine", "Age": 29},
    ]
    df = create_dataframe(json_data)
    df.show()
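To see why a test can assert on row count and column names, it helps to picture what createDataFrame does with this input: each dict becomes one row, and the column list fixes the column order. Here is a plain-Python sketch of that mapping (an illustration only; Spark's real schema inference is richer, and rows_from_dicts is a hypothetical helper, not a Spark API):

```python
def rows_from_dicts(json_data, columns):
    """Project each record dict onto the given column order, one tuple per row."""
    return [tuple(record[col] for col in columns) for record in json_data]


json_data = [
    {"Name": "Alice", "Age": 34},
    {"Name": "Bob", "Age": 45},
    {"Name": "Catherine", "Age": 29},
]
columns = ["Name", "Age"]

rows = rows_from_dicts(json_data, columns)
print(rows)  # [('Alice', 34), ('Bob', 45), ('Catherine', 29)]
```

This is the shape the assertions in the next file rely on: one row per input record, and columns in the declared order.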
And the test for it in a test_create_dataframe.py file:
import unittest

from create_dataframe import create_dataframe


class TestCreateDataFrame(unittest.TestCase):
    def test_create_dataframe(self):
        json_data = [
            {"Name": "Foo", "Age": 34},
            {"Name": "bar", "Age": 45},
        ]
        df = create_dataframe(json_data)
        columns = ["Name", "Age"]
        self.assertEqual(df.count(), len(json_data))
        self.assertEqual(df.columns, columns)


if __name__ == "__main__":
    unittest.main()
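If you want to check the unittest wiring itself without spinning up Spark, you can drive a test case programmatically. A minimal, self-contained sketch with a stand-in builder (build_rows is a hypothetical helper used here only in place of create_dataframe):

```python
import unittest


def build_rows(json_data):
    # Hypothetical stand-in for create_dataframe: turns list-of-dict records
    # into (rows, columns), assuming every record shares the same keys.
    columns = list(json_data[0])
    rows = [tuple(record[col] for col in columns) for record in json_data]
    return rows, columns


class TestBuildRows(unittest.TestCase):
    def test_build_rows(self):
        json_data = [{"Name": "Foo", "Age": 34}, {"Name": "bar", "Age": 45}]
        rows, columns = build_rows(json_data)
        self.assertEqual(len(rows), len(json_data))
        self.assertEqual(columns, ["Name", "Age"])


# Run the suite programmatically, as a CI wrapper might.
suite = unittest.TestLoader().loadTestsFromTestCase(TestBuildRows)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```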
Now it is time to create an application image on top of the pyspark-custom-image built above.
Create a Dockerfile.application and add the code below:
FROM pyspark-custom-image
RUN pip3 install pytest pytest-cov coverage
COPY create_dataframe.py /app/create_dataframe.py
COPY test_create_dataframe.py /app/test_create_dataframe.py
WORKDIR /app
CMD ["sh", "-c", "coverage run -m pytest test_create_dataframe.py && coverage html && coverage report"]
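If your CI also needs a machine-readable test report, one option (a sketch; --junitxml is pytest's standard report flag, and report.xml is just an example path) is a variant of the CMD that points pytest at the copied test file, test_create_dataframe.py:

```dockerfile
# Variant CMD: also emit a JUnit-style XML report for CI consumption
CMD ["sh", "-c", "coverage run -m pytest --junitxml=report.xml test_create_dataframe.py && coverage html && coverage report"]
```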
Now it's time to build this Docker image:
docker build -t my-application -f Dockerfile.application .
Let's run the tests now.
docker run --rm my-application
Hurray!!! We can now build a PySpark image and run unit tests with a custom image built on top of it.