[QST] best practice to build a docker image to train merlin tensorflow models? #1233

Open
dking21st opened this issue Dec 20, 2023 · 0 comments
❓ Questions & Help

I have tried several different Docker base images (merlin, tensorflow, rapidsai, nvidia, ...), but I keep running into version and driver issues.
I'm new to Docker and Airflow, and I keep getting confused.
I would appreciate it if someone could look at my requirements and Dockerfile and tell me what I'm doing wrong.

Details

Python Version: 3.10
Driver: 
CUDA - 11.8
CUDNN - 8
Libraries: 
RUN pip install tensorflow==2.12.0
RUN pip install --no-cache-dir \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.12.* dask-cudf-cu11==23.12.*
RUN pip install -U git+https://github.com/NVIDIA-Merlin/models==23.04
RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install merlin-systems==23.04
RUN pip install tf2onnx==1.15.1
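
Because the failures described here are version mismatches, one way to see what actually ended up inside a built image is to compare the installed distributions against the pins. The following is only a minimal sketch using the standard library: the `PINS` dictionary copies the PyPI pins from the list above (the git-installed Merlin packages are left out), and the helper names are made up for illustration.

```python
from importlib import metadata

# Pins copied from the list above (PyPI packages only).
PINS = {
    "tensorflow": "2.12.0",
    "cudf-cu11": "23.12",
    "dask-cudf-cu11": "23.12",
    "merlin-systems": "23.04",
    "tf2onnx": "1.15.1",
}


def version_tuple(version):
    """Turn '23.04.00' into (23, 4, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def matches_pin(installed, pin):
    """True if the installed version starts with the pinned components."""
    want = version_tuple(pin)
    return version_tuple(installed)[: len(want)] == want


def report(pins):
    """Return {name: (installed_version_or_None, matches)} for each pin."""
    result = {}
    for name, pin in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = (None, False)
            continue
        result[name] = (installed, matches_pin(installed, pin))
    return result


if __name__ == "__main__":
    for name, (installed, ok) in report(PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: pinned {PINS[name]}, installed {installed} [{status}]")
```

Running this with the image's `python3` prints one line per package, which makes it easier to tell whether pip silently resolved to a different version than the one requested.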

The Dockerfile that has worked best so far, although it still failed because of a cuDF issue:

FROM --platform=linux/amd64 tensorflow/tensorflow:2.12.0-gpu as prod

WORKDIR /ads_content

COPY ./data-airflow .
COPY ./ads/images/requirements.txt .

RUN apt-get update && yes|apt-get upgrade

# Add sudo
RUN apt-get -y install sudo

# Adding wget and bzip2
RUN apt-get install -y wget bzip2

RUN apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev gcc-x86-64-linux-gnu

WORKDIR /root

# Set requirements
RUN pip install --upgrade pip
RUN apt-get install -y git

# RAPIDS
RUN pip install --no-cache-dir \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.4.* dask-cudf-cu11==23.4.*

RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install -U git+https://github.com/NVIDIA-Merlin/[email protected]
RUN pip install merlin-systems==23.04
RUN pip install tf2onnx==1.15.1

RUN pip install -r /ads_content/requirements.txt

WORKDIR /ads_content

ENTRYPOINT ["python3"]

The problem with this image is that cuDF is limited to 23.04, because the image ships Python 3.8, which cuDF 23.12 does not support. This older cuDF version prevented me from using the GPU for data processing.
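
Since the blocker here is the base image's Python version, a small guard run early in the build can surface the mismatch before pip starts resolving cuDF. This is a sketch only: `ASSUMED_MIN_PYTHON` is an assumption (the text above only establishes that 3.8 is too old for cuDF 23.12), and the function names are hypothetical.

```python
import sys

# Assumption: the minimum Python for cuDF 23.12. The post only establishes
# that 3.8 is too old; confirm the real minimum in the RAPIDS release notes.
ASSUMED_MIN_PYTHON = (3, 9)


def python_supported(minimum, current=None):
    """True if the (major, minor) of `current` is at least `minimum`."""
    if current is None:
        current = sys.version_info
    return tuple(current[:2]) >= tuple(minimum)


if __name__ == "__main__":
    ok = python_supported(ASSUMED_MIN_PYTHON)
    found = ".".join(str(n) for n in sys.version_info[:3])
    wanted = ".".join(str(n) for n in ASSUMED_MIN_PYTHON)
    print(f"Python {found}; assumed minimum {wanted}; supported: {ok}")
    if not ok:
        # A non-zero exit fails a Dockerfile RUN step, stopping the build early.
        sys.exit(1)
```

Saved under a name of your choosing and invoked from a `RUN python3 ...` step placed before the cuDF install, this would stop the build on a Python 3.8 base image instead of failing later on the cuDF pin.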

With other base images, such as rapidsai or merlin, I always run into driver issues.
I see that the recommended image for Merlin is nvcr.io/nvidia/merlin/merlin-tensorflow:23.06.
Can someone share an example Dockerfile showing how to use an existing container, from Merlin or elsewhere, to train a TensorFlow model successfully on Airflow?
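
To help narrow down the driver issue, the host driver version can be read from inside a running container and compared with what the CUDA toolkit needs. The sketch below assumes `nvidia-smi` is available in the container (it normally is when the container is started with GPU access). The minimum used in the example, `(450, 80, 2)`, is an assumption about CUDA 11.x minor-version compatibility and should be checked against NVIDIA's CUDA compatibility table. All function names are hypothetical.

```python
import shutil
import subprocess


def parse_driver_version(text):
    """Parse '525.105.17' into (525, 105, 17) from the first non-empty line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            return tuple(int(part) for part in line.split("."))
        except ValueError:
            return None
    return None


def query_driver_version():
    """Ask nvidia-smi for the driver version; None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return None
    return parse_driver_version(proc.stdout)


def driver_ok(driver, minimum):
    """True if a driver version was found and is at least `minimum`."""
    return driver is not None and driver >= tuple(minimum)


if __name__ == "__main__":
    # Assumed minimum for CUDA 11.x; verify before relying on it.
    assumed_minimum = (450, 80, 2)
    driver = query_driver_version()
    print(f"driver: {driver}, meets assumed minimum: {driver_ok(driver, assumed_minimum)}")
```

If `driver` comes back as `None`, the container most likely was not started with GPU access, which is a different problem from a version mismatch.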
