A single-node PySpark 3 Docker container based on OpenJDK, running Python 3 with PySpark 3.0.3, Spark 3.1.2, and Hadoop 2.7.
The image is intended for Python 3 Spark development: testing, local development, and pipelines.
The image includes AWS tools for Python:
- AWS CLI: https://pypi.org/project/awscli/
- Boto3: https://pypi.org/project/boto3/
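As a minimal sketch of using the bundled Boto3 inside the container, the snippet below lists S3 buckets. It assumes valid AWS credentials are available to the default credential chain (for example via the mount shown below); the script itself is illustrative and not part of the image:

```python
import boto3

# Boto3 resolves credentials from its default chain:
# environment variables, or ~/.aws mounted into the container.
s3 = boto3.client("s3")

# List the buckets visible to the resolved credentials.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```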
The container can use your local AWS configuration and secrets at run time by mounting `~/.aws` into the container's root home, where the AWS CLI and Boto3 look for credentials:

```
docker run --rm=true -v ~/.aws:/root/.aws <etc...>
```
This image can be extended to run any PySpark .py script with python3. Set up a local Docker image that runs your scripts (in this case scripts/main.py), with any supporting data placed in data/*:
```dockerfile
FROM dirkscgm/pyspark3:latest
WORKDIR /app
# Copy your PySpark scripts and any local input data into the image
COPY scripts/ scripts/
COPY data/ data/
# python3 is the entrypoint; CMD supplies the script to run
ENTRYPOINT ["python3"]
CMD ["scripts/main.py"]
```
Build the local image:

```
docker build -t pyspark3 .
```
Run the container; the CMD points python3 at the main entrypoint of the Spark application:

```
docker run --rm=true pyspark3
```
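Because the ENTRYPOINT is python3 and the script path is supplied via CMD, a different script can be run without rebuilding the image by overriding the CMD, e.g. `docker run --rm=true pyspark3 scripts/other.py` (where scripts/other.py stands in for any script you copied into the image).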