Updates to 2024 RIKEN tutorial for DYAD #29

Merged: 35 commits, Apr 12, 2024
Diff below shows changes from 32 commits
Commits (35)
9db7e92
Updates Dockerfile.spawn and requirements.txt to handle the newest ve…
ilumsden Apr 4, 2024
a4c730b
Moves copy/move of tutorial materials into the root user block
ilumsden Apr 4, 2024
4adc40d
Adds intro, tutorial code cells, and extra dependencies to DYAD porti…
ilumsden Apr 5, 2024
41bb218
Adds cleanup for the data generation directory in DYAD portion of tut…
ilumsden Apr 5, 2024
c4e3f8c
Current progress and debugging
ilumsden Apr 10, 2024
223afdc
DLIO version of the Docker file.
hariharan-devarajan Apr 10, 2024
fe854e1
Starts to break the tutorial down into modules
ilumsden Apr 10, 2024
f036fa2
Minor text changes to 02_flux_scheduling.ipynb
ilumsden Apr 10, 2024
560a48e
Adds a description of the scheduling policies used in flux tree example
ilumsden Apr 10, 2024
223861e
Adds an image for flux tree
ilumsden Apr 10, 2024
72d56e4
Adds attribution to images
ilumsden Apr 10, 2024
26f6834
Adds module labels throughout tutorial
ilumsden Apr 10, 2024
89980ed
Adds Flux logo to other notebooks
ilumsden Apr 10, 2024
eaf4b1a
Makes a couple of fixes for DLIO use case
ilumsden Apr 11, 2024
50d7cd6
Last few bugfixes in DYAD notebook
ilumsden Apr 11, 2024
0ab14fc
Minor changes to Flux scheduling notebook to correct job waiting
ilumsden Apr 11, 2024
13cc28c
Moves the YouTube video in the intro to a code cell so it will work i…
ilumsden Apr 11, 2024
b1a3143
Adds DYAD to LD_LIBRARY_PATH before launching DLIO
ilumsden Apr 11, 2024
07033e1
Renames dyad_dlio.ipynb to 04_dyad_dlio.ipynb
ilumsden Apr 11, 2024
03f62e6
Updates Dockerfile.spawn to get the DLIO use case working
ilumsden Apr 11, 2024
9af1117
Adds a tutorial-specific copy of the DYAD Torch data loader
ilumsden Apr 11, 2024
9d1fecd
Minor bugfixes after moving DLIO extensions into the repo
ilumsden Apr 11, 2024
1613907
Simplifies the DYAD Torch data loader
ilumsden Apr 11, 2024
a874218
Adds reference to LC table and flux proxy optional section
ilumsden Apr 11, 2024
5428b9a
Adds a step to docker-builds to remove unneeded stuff from runner
ilumsden Apr 11, 2024
58ed703
Small consistency tweaks
ilumsden Apr 11, 2024
fc7e37b
Adds module 3
ilumsden Apr 11, 2024
f84cd2c
Small summary change
ilumsden Apr 11, 2024
96897bb
Editing and revisions
ilumsden Apr 12, 2024
0d01733
Finishes the DYAD notebook
ilumsden Apr 12, 2024
f24bbed
Adds the supplement and updates the conclusions
ilumsden Apr 12, 2024
14108ca
Tweaks a figure width
ilumsden Apr 12, 2024
55e27e2
Removes dead changes for testing DLIO
ilumsden Apr 12, 2024
7051f86
Tries uncommenting the rm on apt lists
ilumsden Apr 12, 2024
0cb3fb0
Removes the old DYAD notebook from previous years since it's been sup…
ilumsden Apr 12, 2024
12 changes: 12 additions & 0 deletions .github/workflows/docker-builds.yaml
@@ -21,6 +21,18 @@ jobs:
steps:
- name: Clone the code
uses: actions/checkout@v3

# Note: only works on Ubuntu runner
- name: Remove unneeded stuff in runner to make space for Docker image
uses: jlumbroso/free-disk-space@<version>
with:
tool-cache: false
android: true
dotnet: true
haskell: true
large-packages: true
docker-images: false
swap-storage: true

- name: GHCR Login
if: (github.event_name != 'pull_request')
117 changes: 74 additions & 43 deletions 2024-RIKEN-AWS/JupyterNotebook/docker/Dockerfile.spawn
@@ -1,4 +1,4 @@
FROM fluxrm/flux-sched:jammy
FROM fluxrm/flux-sched:focal

# Based off of https://github.com/jupyterhub/zero-to-jupyterhub-k8s/tree/main/images/singleuser-sample
# Local usage
@@ -9,32 +9,51 @@ USER root
ENV NB_USER=jovyan \
NB_UID=1000 \
HOME=/home/jovyan
# VENV_DIR=/home/jovyan/.flux_tutorial_venv

RUN adduser \
--disabled-password \
--gecos "Default user" \
--uid ${NB_UID} \
--home ${HOME} \
--force-badname \
${NB_USER}
--disabled-password \
--gecos "Default user" \
--uid ${NB_UID} \
--home ${HOME} \
--force-badname \
${NB_USER}

RUN apt-get update \
&& apt-get upgrade -y \
&& apt-get install -y --no-install-recommends \
ca-certificates \
dnsutils \
iputils-ping \
python3-pip \
tini \
# requirement for nbgitpuller
git \
&& rm -rf /var/lib/apt/lists/*
# && apt-get upgrade -y \
&& apt-get install -y --no-install-recommends \
gcc-10 \
g++-10 \
ca-certificates \
dnsutils \
iputils-ping \
python3.9 \
python3.9-dev \
python3-pip \
python3-venv \
openmpi-bin \
openmpi-common \
libopenmpi-dev \
liblz4-dev \
tini \
# requirement for nbgitpuller
git
#&& rm -rf /var/lib/apt/lists/*

Review comment (Member):

Can this be uncommented to clean up extra files?

Reply (Collaborator Author):

Probably, although I haven't tested it since we got DLIO working. I can always uncomment it and let the CI take a crack at building the image. If it works for the CI, it works for me.

Reply (Collaborator Author):

The build on CI worked with this uncommented so it should be fine.
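
For reference, a minimal sketch of the cleanup discussed in this thread, using an abbreviated package list rather than the PR's full one. Removing `/var/lib/apt/lists` only shrinks the image when it happens in the same `RUN` instruction as `apt-get update`, because each `RUN` produces its own layer. Re-enabling the line in the diff above also means adding a trailing backslash after `git` so the command continues onto the `rm`.

```dockerfile
# Sketch only: the package list is abbreviated for illustration.
FROM fluxrm/flux-sched:focal

USER root

# Update, install, and clean up in a single RUN so the apt package
# lists never persist in an image layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
        ca-certificates \
        tini \
        # requirement for nbgitpuller
        git \
 && rm -rf /var/lib/apt/lists/*
```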

COPY ./requirements_venv.txt ./requirements_venv.txt
RUN python3 -m pip install -r requirements_venv.txt

# ENV DLIO_PROFILER_ENABLE=0
# RUN dlio_benchmark workload=unet3d_a100 ++workload.dataset.data_folder=/root/data ++workload.workflow.generate_data=True ++workload.workflow.train=False ++workload.dataset.record_length=4096 ++workload.dataset.record_length_stdev=0 ++workload.dataset.record_length_resize=0 ++workload.reader.batch_size=1 ++workload.dataset.num_files_train=16 ++workload.reader.read_threads=1
#
# RUN dlio_benchmark workload=unet3d_a100 ++workload.dataset.data_folder=/root/data ++workload.workflow.generate_data=False ++workload.workflow.train=True ++workload.dataset.record_length=4096 ++workload.dataset.record_length_stdev=0 ++workload.dataset.record_length_resize=0 ++workload.reader.batch_size=1 ++workload.dataset.num_files_train=16 ++workload.reader.read_threads=1

COPY ./requirements.txt ./requirements.txt
RUN ln -s /usr/bin/python3 /usr/bin/python && \
python -m pip install -r requirements.txt && \
python -m pip install ipython==7.34.0 && \
python -m IPython kernel install
RUN python3 -m pip install -r requirements.txt && \
python3 -m pip install ipython==7.34.0 && \
python3 -m IPython kernel install

COPY ./tutorial /home/jovyan/flux-tutorial-2024

# This is code to install DYAD
# This was added to the RADIUSS 2023 tutorials on AWS
@@ -43,28 +62,31 @@ RUN git clone https://github.com/openucx/ucx.git \
&& git checkout v1.13.1 \
&& ./autogen.sh \
&& ./configure --disable-optimizations --enable-logging --enable-debug --disable-assertions --enable-mt --disable-params-check \
--without-go --without-java --disable-cma --without-cuda --without-gdrcopy --without-verbs --without-knem --without-rmdacm \
--without-rocm --without-xpmem --without-fuse3 --without-ugni --prefix=/usr CC=$(which gcc) CXX=$(which g++) \
--without-go --without-java --disable-cma --without-cuda --without-gdrcopy --without-verbs --without-knem --without-rmdacm \
--without-rocm --without-xpmem --without-fuse3 --without-ugni --prefix=/usr CC=$(which gcc) CXX=$(which g++) \
&& make -j \
&& sudo make install \
&& cd .. \
&& rm -rf ucx

RUN git clone https://github.com/TauferLab/dyad.git \
# RUN $VENV_SOURCE \
RUN git clone https://github.com/flux-framework/dyad.git \
Review comment (Member):

This is commented out too.

Reply (Collaborator Author):

Removed

&& cd dyad \
&& git checkout ucx \
&& cp -r docs/demos/ecp_feb_2023 .. \
&& ./autogen.sh \
&& ./configure --prefix=/usr CC=$(which gcc) CXX=$(which g++) --enable-dyad-debug \
&& make -j \
&& sudo make install \
&& cd .. \
&& git checkout tutorial-riken-2024 \
&& mkdir build \
&& cd build \
&& cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DDYAD_ENABLE_UCX_DATA=ON .. \
&& sudo make install -j \
&& cd ../pydyad \
&& python3 -m build --wheel . \
&& pip install $(ls ./dist/*.whl | head -1) \
&& cd ../.. \
&& rm -rf dyad

RUN mv ecp_feb_2023 /opt/dyad_demo \
&& cd /opt/dyad_demo \
&& CC=$(which gcc) CXX=$(which g++) make all \
&& cd ..
# ENV DLIO_PROFILER_ENABLE=0
#
# RUN dlio_benchmark workload=unet3d_a100 ++workload.dataset.data_folder=/root/data ++workload.workflow.generate_data=False ++workload.workflow.train=True ++workload.dataset.record_length=4096 ++workload.dataset.record_length_stdev=0 ++workload.dataset.record_length_resize=0 ++workload.reader.batch_size=1 ++workload.dataset.num_files_train=16 ++workload.reader.read_threads=1


# This adds the flux-tree command, which is provided in flux-sched source
# but not installed alongside production flux-core
@@ -79,11 +101,17 @@ RUN python3 -m pip install jupyter_app_launcher && \
COPY ./docker/jupyter-launcher.yaml /usr/local/share/jupyter/lab/jupyter_app_launcher/config.yaml
ENV JUPYTER_APP_LAUNCHER_PATH /usr/local/share/jupyter/lab/jupyter_app_launcher

# No permission errors here
USER ${NB_USER}
# Give jovyan user permissions to tutorial materials
RUN chmod -R 777 ~/flux-tutorial-2024

WORKDIR $HOME
COPY ./docker/flux-icon.png $HOME/flux-icon.png

# RUN ${VENV_SOURCE} && \
# pip install --upgrade --force-reinstall cffi && \
# python3 -m ipykernel install --user --name 'dyad_venv' --display-name 'DYAD Venv' && \
# jupyter kernelspec list

# note that previous examples are added via git volume in config.yaml
ENV SHELL=/usr/bin/bash
ENV FLUX_URI_RESOLVE_LOCAL=t
@@ -96,11 +124,14 @@ COPY ./docker/entrypoint.sh /entrypoint.sh

# This is for a local start
COPY ./docker/start.sh /start.sh
CMD ["flux", "start", "--test-size=4", "jupyter", "lab"]

# This won't be available in K8s, but will be for a single container build
COPY ./tutorial /home/jovyan/flux-tutorial-2024
RUN mkdir -p $HOME/.local/share && \
chmod 777 $HOME/.local/share

Review comment (Member):

What is going to go here?

Reply (Collaborator Author):

The diff seems to be a little confused there. Two things are happening:

  1. The COPY directive was moved towards the top of the file
  2. We added a chmod to /home/jovyan/.local/share because we were getting permission denied errors on that directory

Reply (Member):

I think this is happening because all of these commands run as USER root. The adduser command only creates the user; it does not switch to it. So, if I'm reading this right, the install commands (with sudo) are run by root, and the COPY directives produce root-owned files too. If you want the jovyan user to own that space, I would also run COPY, etc. in the context of USER jovyan. Generally, the pattern to follow is:

  1. Do all system installs as root
  2. Change to the user that should own files
  3. Then copy them in
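
A minimal sketch of that pattern, assuming the jovyan user and tutorial path from this Dockerfile; the layout is illustrative, not the PR's final file. One caveat: `COPY` creates files owned by root (UID 0) regardless of the active `USER` unless `--chown` is given, so the sketch sets ownership explicitly rather than relying on the `USER` switch or a later `chmod -R 777`.

```dockerfile
# Sketch only: user name and paths follow the diff; everything else is abbreviated.
FROM fluxrm/flux-sched:focal

# 1. Do all system-level setup as root.
USER root
ENV NB_USER=jovyan \
    NB_UID=1000 \
    HOME=/home/jovyan
RUN adduser \
        --disabled-password \
        --gecos "Default user" \
        --uid ${NB_UID} \
        --home ${HOME} \
        --force-badname \
        ${NB_USER}

# 2. Switch to the user that should own the files.
USER ${NB_USER}
WORKDIR ${HOME}

# 3. Copy the files in, naming the owner explicitly.
#    Without --chown the copied files would belong to root.
COPY --chown=jovyan:jovyan ./tutorial /home/jovyan/flux-tutorial-2024
```

With ownership set at copy time, a separate `RUN chmod -R 777` layer on the tutorial directory should not be needed for the jovyan user to write there.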

Reply (Collaborator Author):

That makes sense. To be honest, I'm not entirely sure why it's complaining about $HOME/.local/share anymore. To my knowledge, nothing is being copied into there.

# Previous command for non-kubernetes
# CMD PATH=$HOME/.local/bin:$PATH \
# flux start --test-size=4 /home/fluxuser/.local/bin/jupyterhub-singleuser
USER ${NB_USER}

# RUN dlio_benchmark workload=unet3d_a100 ++workload.dataset.data_folder=${HOME}/data ++workload.workflow.generate_data=True ++workload.workflow.train=False ++workload.dataset.record_length=4096 ++workload.dataset.record_length_stdev=0 ++workload.dataset.record_length_resize=0 ++workload.reader.batch_size=1 ++workload.dataset.num_files_train=16 ++workload.reader.read_threads=1
#
# RUN dlio_benchmark workload=unet3d_a100 ++workload.dataset.data_folder=${HOME}/data ++workload.workflow.generate_data=False ++workload.workflow.train=True ++workload.dataset.record_length=4096 ++workload.dataset.record_length_stdev=0 ++workload.dataset.record_length_resize=0 ++workload.reader.batch_size=1 ++workload.dataset.num_files_train=16 ++workload.reader.read_threads=1

CMD ["flux", "start", "--test-size=4", "jupyter", "lab"]
9 changes: 9 additions & 0 deletions 2024-RIKEN-AWS/JupyterNotebook/requirements_venv.txt
@@ -0,0 +1,9 @@
# Used for the DYAD notebook
Pygments
build
ipykernel
jsonschema
cffi
ply
pyyaml
dlio_benchmark @ git+https://github.com/argonne-lcf/dlio_benchmark.git
@@ -0,0 +1,35 @@
model: unet3d

framework: pytorch

workflow:
  generate_data: False
  train: True
  checkpoint: False

dataset:
  data_folder: data/unet3d/
  format: npz
  num_files_train: 16
  num_samples_per_file: 1
  record_length: 4096

reader:
  data_loader: pytorch
  batch_size: 1
  read_threads: 1
  file_shuffle: seed
  sample_shuffle: seed
  multiprocessing_context: spawn
  data_loader_classname: dyad_torch_data_loader.DyadTorchDataLoader
  data_loader_sampler: index

train:
  epochs: 1
  computation_time: 1

checkpoint:
  checkpoint_folder: checkpoints/unet3d
  checkpoint_after_epoch: 5
  epochs_between_checkpoints: 2
  model_size: 499153191