Development environment:
Ubuntu 20.04.4 LTS + Anaconda3 + TensorFlow 2.4 + RTX 3090
Warning messages:
2022-07-13 09:38:01.471614: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-07-13 09:38:03.812569: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'
2022-07-13 09:38:03.812923: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Unimplemented: /usr/local/cuda-11.0/bin/ptxas ptxas too old. Falling back to the driver to compile.
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2022-07-13 09:38:03.961728: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'
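The warning comes down to the ptxas binary being too old to know the sm_86 target. A quick way to check which ptxas the shell picks up, and whether its release is new enough, might look like the sketch below. The helper names ptxas_release and release_supports_sm86 are made up for illustration, and the 11.1 threshold is an assumption based on sm_86 support having arrived in CUDA 11.1.

```shell
#!/usr/bin/env bash
# Sketch: decide whether the ptxas first on PATH is new enough for sm_86.
# Assumption: sm_86 support arrived in CUDA 11.1, so any release >= 11.1 passes.

# Hypothetical helper: read `ptxas --version` output on stdin and print
# the MAJOR.MINOR that follows the word "release" (for example 11.0).
ptxas_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed if the given release is 11.1 or newer.
release_supports_sm86() {
    major="${1%%.*}"
    rest="${1#*.}"
    minor="${rest%%.*}"
    [ "$major" -gt 11 ] || { [ "$major" -eq 11 ] && [ "$minor" -ge 1 ]; }
}

if command -v ptxas >/dev/null 2>&1; then
    release="$(ptxas --version | ptxas_release)"
    if [ -n "$release" ] && release_supports_sm86 "$release"; then
        echo "ptxas $release at $(command -v ptxas): new enough for sm_86"
    else
        echo "ptxas ${release:-unknown} at $(command -v ptxas): too old for sm_86, or version not recognised"
    fi
else
    echo "no ptxas found on PATH"
fi
```

In the situation logged above, this would be expected to point at /usr/local/cuda-11.0/bin/ptxas and report it as too old.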
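Plenty of articles have analysed the cause. The RTX 3090 is an sm_86 device, and older CUDA releases do not support the newer cards well (the ptxas from CUDA 11.0 in the log above does not recognise sm_86), so upgrading CUDA to 11.1 or later should be enough. Because TensorFlow is tightly coupled to specific CUDA and cuDNN versions, you also need to consult
Build from source on Windows | TensorFlow (google.cn)
to pick a compatible combination.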
Since the code was written against TF 2.4, I first created a fresh conda environment. This did not solve the problem.
conda create -n tf2.4 python=3.8 cudatoolkit=11.2 cudnn=8.1 -c conda-forge
conda activate tf2.4
conda install tensorflow-gpu=2.4 -c conda-forge
/usr/local/cuda-11.1 was already installed on the system, so I tried using it directly. This did not solve the problem either.
pip install tensorflow-gpu==2.4.0
export PATH=/usr/local/cuda-11.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-11.1
Upgrading TensorFlow to a newer version still did not solve the problem.
pip uninstall tensorflow-gpu
pip install tensorflow-gpu==2.6.0
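I then checked the cuDNN library and found that the one used by default is /usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5. It probably needs upgrading, but I have no root access.
Final solution:
Download a suitable CUDA Toolkit and cuDNN from the NVIDIA website and install them under your own working directory. When installing the CUDA Toolkit, do not install the driver. The cuDNN package does not need an installer: just extract it and move its files into the matching subdirectories of the CUDA Toolkit. Finally, update the environment variables.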
#
# Download the installers
#
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
wget https://developer.download.nvidia.cn/compute/machine-learning/cudnn/secure/8.1.0.77/11.2_20210127/cudnn-11.2-linux-x64-v8.1.0.77.tgz
#
# Install the CUDA Toolkit
#
chmod 755 cuda_11.2.2_460.32.03_linux.run
./cuda_11.2.2_460.32.03_linux.run --toolkit --installpath=/home/tanxingjun/programs/nvidia/cuda-11.2
#
# Install cuDNN (the archive unpacks into a top-level cuda/ directory)
#
tar -zxvf cudnn-11.2-linux-x64-v8.1.0.77.tgz
mv cuda/lib64/* /home/tanxingjun/programs/nvidia/cuda-11.2/lib64
mv cuda/include/* /home/tanxingjun/programs/nvidia/cuda-11.2/include
#
# Set the paths (only bin belongs on PATH; lib64 goes on LD_LIBRARY_PATH)
#
export PATH=/home/tanxingjun/programs/nvidia/cuda-11.2/bin:$PATH
export LD_LIBRARY_PATH=/home/tanxingjun/programs/nvidia/cuda-11.2/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/home/tanxingjun/programs/nvidia/cuda-11.2
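After exporting these variables, it may be worth confirming that the shell really resolves ptxas from the new toolkit before starting TensorFlow. The following is only a sketch: ptxas_under_cuda_home is a hypothetical helper name, and it assumes ptxas sits at $CUDA_HOME/bin/ptxas, as it does for the install path used above.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the shell now resolves ptxas inside CUDA_HOME.

# Hypothetical helper: succeed only if the first ptxas on PATH
# is exactly $CUDA_HOME/bin/ptxas.
ptxas_under_cuda_home() {
    [ -n "${CUDA_HOME:-}" ] || return 1
    found="$(command -v ptxas 2>/dev/null)" || return 1
    [ "$found" = "$CUDA_HOME/bin/ptxas" ]
}

if ptxas_under_cuda_home; then
    echo "OK: using $CUDA_HOME/bin/ptxas"
else
    echo "ptxas is not resolved from CUDA_HOME (CUDA_HOME=${CUDA_HOME:-unset})"
fi

# With TensorFlow installed, the visible GPUs can then be listed from the same shell:
#   python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

If the check fails, an older toolkit directory is probably still ahead of the new one on PATH.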
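Open question: some articles say that the TensorFlow 2.4 + cuda-11.1 combination is enough to fix this, but in my tests it was not, and TensorFlow had to be upgraded to a newer version. Even with a newer CUDA library available, TensorFlow 2.4 still loads cuda-11.0: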
2022-07-13 09:37:54.477347: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
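The file name in that log line may not settle the question by itself. As far as I know, CUDA 11.x toolkits expose the runtime under the name libcudart.so.11.0, usually as a symlink to the fully versioned file. One way to see which real file sits behind the name is to resolve it along LD_LIBRARY_PATH, as in the sketch below. resolve_on_ld_path is a hypothetical helper, and it only looks at LD_LIBRARY_PATH, whereas the dynamic loader also consults other locations such as the ldconfig cache.

```shell
#!/usr/bin/env bash
# Sketch: find the real file behind a library name on LD_LIBRARY_PATH.
# Assumption: only LD_LIBRARY_PATH is searched here; the dynamic loader
# itself may also use the ldconfig cache and default system directories.

# Hypothetical helper: print the resolved path of the first match for
# the given library name, or fail if no directory on LD_LIBRARY_PATH has it.
resolve_on_ld_path() {
    name="$1"
    old_ifs="$IFS"
    IFS=:
    for dir in ${LD_LIBRARY_PATH:-}; do
        IFS="$old_ifs"
        if [ -e "$dir/$name" ]; then
            readlink -f "$dir/$name"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}

if real="$(resolve_on_ld_path libcudart.so.11.0)"; then
    echo "libcudart.so.11.0 -> $real"
else
    echo "libcudart.so.11.0 not found on LD_LIBRARY_PATH"
fi
```

If the resolved file carries a newer version in its name, the newer runtime is the one being picked up, even though the log prints the 11.0 name.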