
Running TensorFlow 2.4 on RTX 30 Series with Docker


Recently I got my hands on a brand-new RTX 3070 GPU for machine learning tasks. However, setting up the environment wasn’t as smooth as I expected, because the hardware is too recent.

Until recently I had been using the Docker images from Nvidia NGC exclusively, because they were the only things that actually worked without going through the horrible compilation process, which takes ages. Later on I found that a new stable release, TF 2.4, was out, and that it seems to support CUDA 11 out of the box.

Here’s a short note of how to make it work.

  1. Pull the tensorflow/tensorflow:2.4.0-gpu image from Docker Hub.

  2. Spin up a container with that image: docker run -itd --rm --network=host --shm-size 16G --gpus all -v $(pwd):/data/ tensorflow/tensorflow:2.4.0-gpu

  3. Apply a temporary fix for the Value 'sm_86' is not defined for option 'gpu-name' issue.

    • Download the CUDA 11.1 installer runfile.
    • Make it executable with chmod +x.
    • Run the runfile with --tar mxvf as the arguments to extract its contents without installing.
    • Replace the ptxas binary inside the Docker image (which is from CUDA 11.0) with the 11.1 version: cp $(find . -name 'ptxas') /usr/local/cuda/bin/ptxas

    Before this fix, there are many warning messages like the one below during the training process, and the training time of the first epoch is hugely affected (about 17 seconds, while it should take only 7 seconds).

    Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

    After applying this hacky fix, the issue seems to be gone.
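The steps above can be condensed into a shell sketch. The exact CUDA 11.1 runfile name used here (cuda_11.1.1_455.32.00_linux.run) is an assumption; substitute whatever file you downloaded from NVIDIA.

```shell
# 1. Pull the stock TF 2.4 GPU image from Docker Hub.
docker pull tensorflow/tensorflow:2.4.0-gpu

# 2. Spin up a container, mounting the current directory at /data.
docker run -itd --rm --network=host --shm-size 16G --gpus all \
    -v $(pwd):/data/ tensorflow/tensorflow:2.4.0-gpu

# 3. Inside the container: extract the CUDA 11.1 installer without
#    installing it (--tar mxvf only unpacks its contents), then
#    overwrite the bundled CUDA 11.0 ptxas with the 11.1 binary.
#    The runfile name below is an assumption.
chmod +x cuda_11.1.1_455.32.00_linux.run
./cuda_11.1.1_455.32.00_linux.run --tar mxvf
cp $(find . -name 'ptxas') /usr/local/cuda/bin/ptxas
```

Note that the replacement only persists for the lifetime of the container; to keep it, commit the container to a new image or bake the copy into a Dockerfile.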

Sources