标签归档:ducda9.x

从零开始搭建深度学习服务器: 1080TI四卡并行(Ubuntu16.04+CUDA9.2+cuDNN7.1+TensorFlow+Keras)

Deep Learning Specialization on Coursera

这个系列写了好几篇文章,这是相关文章的索引,仅供参考:

最近公司又弄了一套4卡1080TI机器,配置基本上和之前是一致的,只是显卡换成了技嘉的伪公版1080TI:技嘉GIGABYTE GTX1080Ti 涡轮风扇108TTURBO-11GD

部件	型号	价格	链接	备注
CPU	英特尔(Intel)酷睿六核i7-6850K 盒装CPU处理器 	4599	http://item.jd.com/11814000696.html	
散热器	美商海盗船 H55 水冷	449	https://item.jd.com/10850633518.html	
主板	华硕(ASUS)华硕 X99-E WS/USB 3.1工作站主板	4759	
内存	美商海盗船(USCORSAIR) 复仇者LPX DDR4 3000 32GB(16Gx4条)  	2799 * 2	https://item.jd.com/1990572.html	
SSD	三星(SAMSUNG) 960 EVO 250G M.2 NVMe 固态硬盘	599	https://item.jd.com/3739097.html		
硬盘	希捷(SEAGATE)酷鱼系列 4TB 5900转 台式机机械硬盘 * 2 	629 * 2	https://item.jd.com/4220257.html	
电源	美商海盗船 AX1500i 全模组电源 80Plus金牌	3699	https://item.jd.com/10783917878.html
机箱	美商海盗船 AIR540 USB3.0 	949	http://item.jd.com/12173900062.html
显卡	技嘉(GIGABYTE) GTX1080Ti 11GB 非公版高端游戏显卡深度学习涡轮 * 4 7400 * 4    https://item.jd.com/10583752777.html

这台深度学习主机大概是这样的:

深度学习主机

安装完Ubuntu16.04之后,我又开始了CUDA、cuDnn等深度学习环境和工具的安装之旅,时隔大半年,又有了很多变化,特别是CUDA9.x和cuDnn7.x已经成了标配,这里记录一下。

安装CUDA9.x

注:如果还需要安装Tensorflow1.8,建议这里安装CUDA9.0,我在另一台机器上遇到了一点问题,怀疑和我这台机器先安装CUDA9.0,再安装CUDA9.2有关。

依然从英伟达官方下载当前的CUDA版本,我选择了最新的CUDA9.2:

点选完对应Ubuntu16.04的CUDA9.2 deb版本之后,英伟达官方主页会给出安装提示:

Installation Instructions:
`sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb`
`sudo apt-key add /var/cuda-repo-/7fa2af80.pub`
`sudo apt-get update`
`sudo apt-get install cuda`

在下载完大概1.2G的cuda deb版本之后,实际安装命令是这样的:

sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
sudo apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

官方CUDA下载下载页面还附带了一个cuBLAS 9.2 Patch更新,官方强烈建议安装:

This update includes fix to cublas GEMM APIs on V100 Tensor Core GPUs when used with default algorithm CUBLAS_GEMM_DEFAULT_TENSOR_OP. We strongly recommend installing this update as part of CUDA Toolkit 9.2 installation.

可以用如下方式安装这个Patch更新:

sudo dpkg -i cuda-repo-ubuntu1604-9-2-local-cublas-update-1_1.0-1_amd64.deb 
sudo apt-get update  
sudo apt-get upgrade cuda

CUDA9.2安装完毕之后,1080TI的显卡驱动也附带安装了,可以重启机器,然后用 nvidia-smi 命令查看一下:

最后在在 ~/.bashrc 中设置环境变量:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda

运行 source ~/.bashrc 使其生效。

安装cuDNN7.x

同样去英伟达官网的cuDNN下载页面:https://developer.nvidia.com/rdp/cudnn-download,最新版本是cuDNN7.1.4,有三个版本可以选择,分别面向CUDA8.0, CUDA9.0, CUDA9.2:

cudnn7.1.4 cuda9.2 ubuntu16.04

下载完cuDNN7.1的压缩包之后解压,然后将相关文件拷贝到cuda的系统路径下即可:

tar -zxvf cudnn-9.2-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/ -d 
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

安装TensorFlow 1.8

TensorFlow的安装变得越来越简单,现在TensorFlow的官网也有中文安装文档了:https://www.tensorflow.org/install/install_linux?hl=zh-cn , 我们Follow这个文档,用Virtualenv的安装方式进行TensorFlow的安装,不过首先要配置一下基础环境。

首先在Ubuntu16.04里安装 libcupti-dev 库:

这是 NVIDIA CUDA 分析工具接口。此库提供高级分析支持。要安装此库,请针对 CUDA 工具包 8.0 或更高版本发出以下命令:

$ sudo apt-get install cuda-command-line-tools
并将其路径添加到您的 LD_LIBRARY_PATH 环境变量中:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
对于 CUDA 工具包 7.5 或更低版本,请发出以下命令:

$ sudo apt-get install libcupti-dev

然而我运行“sudo apt-get install cuda-command-line-tools”命令后得到的却是:

E: 无法定位软件包 cuda-command-line-tools

Google后发现其实在安装CUDA9.2的时候,这个包已经安装了,在CUDA的路径下这个库已经有了:

/usr/local/cuda/extras/CUPTI/lib64$ ls
libcupti.so  libcupti.so.9.2  libcupti.so.9.2.88

现在只需要将其加入到环境变量中,在~/.bashrc中添加如下声明并令source ~/.bashrc另其生效即可:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64

剩下的就更简单了:

sudo apt-get install python-pip python-dev python-virtualenv 
virtualenv --system-site-packages tensorflow1.8
source tensorflow1.8/bin/activate
easy_install -U pip
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade tensorflow-gpu

强烈建议将清华的pip源写到配置文件里,这样就更方便快捷了。

最后测试一下TensorFlow1.8:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-06-17 12:15:34.158680: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-17 12:15:34.381812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 5.53GiB
2018-06-17 12:15:34.551451: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.780350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.959199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.966403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-06-17 12:15:36.373745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-17 12:15:36.373785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 
2018-06-17 12:15:36.373798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y 
2018-06-17 12:15:36.373804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y 
2018-06-17 12:15:36.373808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y 
2018-06-17 12:15:36.373814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N 
2018-06-17 12:15:36.374516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5307 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2018-06-17 12:15:36.444426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 5582 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-06-17 12:15:36.506340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 5582 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2018-06-17 12:15:36.614736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 5582 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1
2018-06-17 12:15:36.689345: I tensorflow/core/common_runtime/direct_session.cc:284] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1

安装Keras2.1.x

Keras的后端支持TensorFlow, Theano, CNTK,在安装完TensorFlow GPU版本之后,继续安装Keras非常简单,在TensorFlow的虚拟环境中,直接"pip install keras"即可,安装的版本是Keras2.1.6:

Installing collected packages: h5py, scipy, pyyaml, keras
Successfully installed h5py-2.7.1 keras-2.1.6 pyyaml-3.12 scipy-1.1.0

测试一下:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:http://www.52nlp.cn

本文链接地址:从零开始搭建深度学习服务器: 1080TI四卡并行(Ubuntu16.04+CUDA9.2+cuDNN7.1+TensorFlow+Keras) http://www.52nlp.cn/?p=10334