
Running TensorFlow 2.4 on RTX 30 Series with Docker


Recently I got my hands on a brand-new RTX 3070 GPU for machine learning tasks. However, setting up the environment wasn’t as smooth as I expected, because the hardware is too recent.

Until recently I had been using the Docker images from Nvidia NGC exclusively, because they were the only things that actually worked without going through the horrible compilation process, which takes ages. Later on I found that a new stable release, TF 2.4, was out, and that it seems to support CUDA 11 out of the box.

Here’s a short note of how to make it work.

  1. Pull the tensorflow/tensorflow:2.4.0-gpu image from Docker Hub.

  2. Spin up a container with that image: docker run -itd --rm --network=host --shm-size 16G --gpus all -v $(pwd):/data/ tensorflow/tensorflow:2.4.0-gpu

  3. Apply a temporary fix for the Value 'sm_86' is not defined for option 'gpu-name' issue.

    • Download the CUDA 11.1 installer runfile.
    • Make it executable with chmod +x.
    • Run the runfile with --tar mxvf as the arguments to extract its contents without installing.
    • Replace the ptxas binary inside the Docker image (which is from CUDA 11.0) with the 11.1 version: cp $(find . -name 'ptxas') /usr/local/cuda/bin/ptxas

    Before this fix, there are many warning messages like the one below during the training process, and the training time of the first epoch is hugely affected (about 17 seconds, while it should take only 7 seconds).

    Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

    After applying this hacky fix, the issue seems to be gone.
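The steps above can be condensed into a shell sketch. The exact CUDA 11.1 runfile name used here (cuda_11.1.1_455.32.00_linux.run) is an assumption; substitute whatever file you downloaded from NVIDIA.

```shell
# 1. Pull the stock TF 2.4 GPU image from Docker Hub.
docker pull tensorflow/tensorflow:2.4.0-gpu

# 2. Spin up a container, mounting the current directory at /data.
docker run -itd --rm --network=host --shm-size 16G --gpus all \
    -v $(pwd):/data/ tensorflow/tensorflow:2.4.0-gpu

# 3. Inside the container: extract the CUDA 11.1 installer without
#    installing it (--tar mxvf only unpacks its contents), then
#    overwrite the bundled CUDA 11.0 ptxas with the 11.1 binary.
#    The runfile name below is an assumption.
chmod +x cuda_11.1.1_455.32.00_linux.run
./cuda_11.1.1_455.32.00_linux.run --tar mxvf
cp $(find . -name 'ptxas') /usr/local/cuda/bin/ptxas
```

Note that the replacement only persists for the lifetime of the container; to keep it, commit the container to a new image or bake the copy into a Dockerfile.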

Sources