PySpark3 Docker container for testing & development. With OpenJDK, Spark 3.1.2, and Hadoop 2.7.

ByteMeDirk/pyspark3-docker

pyspark3-docker

A single-node PySpark 3 Docker container based on OpenJDK, with Python 3, PySpark 3.0.3, Spark 3.1.2, and Hadoop 2.7.

The image is set up to support Python 3 development with Spark: testing, local development, and pipelines.


The image includes AWS tools for Python.
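For instance, a script run inside the container could use boto3 (assuming it is among the bundled AWS tools) to list objects in an S3 bucket. The function name, bucket, and prefix below are hypothetical:

```python
def list_bucket_keys(bucket_name: str, prefix: str = "") -> list:
    # boto3 is assumed to be one of the AWS tools bundled in the image;
    # the import is deferred so the module can be inspected without it.
    import boto3

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    # list_objects_v2 omits "Contents" entirely when there are no matches
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Credentials are resolved by boto3's standard chain, which is why mounting `~/.aws` into the container (shown below) is enough for such scripts to authenticate.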

Running the Docker Image for AWS Development

The Docker image can pick up your local AWS configuration and secrets at run time for specific Python scripts by mounting them into the container:

docker run --rm=true -v ~/.aws:/root/.aws <etc...>
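The mounted ~/.aws directory holds standard INI-style files (config and credentials) that the AWS SDK reads. A sketch of what such a file looks like, parsed here with the standard library; the profile name and key values are placeholders:

```python
import configparser
import tempfile
from pathlib import Path

# A sample file shaped like ~/.aws/credentials (placeholder values)
sample = """[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = examplesecretkey
"""

with tempfile.TemporaryDirectory() as tmp:
    creds_path = Path(tmp) / "credentials"
    creds_path.write_text(sample)

    # boto3 and the AWS CLI read this same INI layout from ~/.aws
    parser = configparser.ConfigParser()
    parser.read(creds_path)
    print(parser["default"]["aws_access_key_id"])  # EXAMPLEKEYID
```

Mounting the directory read-only (`-v ~/.aws:/root/.aws:ro`) is a reasonable hardening step if the scripts only need to read credentials.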

This image can be extended to run any PySpark .py script using python3.
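Such a script might look like the following sketch; the file name scripts/main.py, the app name, and the input path are illustrative only:

```python
# scripts/main.py -- minimal PySpark job sketch (hypothetical names)

try:
    from pyspark.sql import SparkSession
except ImportError:
    # pyspark ships inside the image; allow inspection without it locally
    SparkSession = None


def main():
    spark = (
        SparkSession.builder
        .appName("pyspark3-docker-example")
        .getOrCreate()
    )

    # data/ is copied into the image alongside the scripts
    df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
    df.show(5)

    spark.stop()


if __name__ == "__main__" and SparkSession is not None:
    main()
```

Inside the container this is executed as `python3 scripts/main.py`, matching the ENTRYPOINT/CMD pair in the Dockerfile below.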

Example

Set up a local Docker image that will run your scripts; in this example the entrypoint script is scripts/main.py and the input data lives in data/*:

FROM dirkscgm/pyspark3:latest

WORKDIR /app

# Copy application scripts and input data into the image
COPY scripts/* scripts/
COPY data/* data/

# Run scripts with python3 by default; CMD can be overridden
# at run time to execute a different script
ENTRYPOINT ["python3"]
CMD ["scripts/main.py"]

Build the local image:

docker build -t pyspark3 .

Run the container; the default CMD points at the main entrypoint of the Spark application:

docker run --rm=true pyspark3
