Create a boot-time service that creates the /dev/nvidia* GPU device nodes

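After a GPU machine reboots, the GPU device nodes are not created automatically. A script therefore has to run at boot to create them, so that Docker can mount them later. The script, gpu-service, is as follows: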

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=$(lspci | grep -i NVIDIA)
  N3D=$(echo "$NVDEVS" | grep -c "3D controller")
  NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")

  # Create one device node per GPU, skipping nodes that already exist.
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 $N); do
    [ -e /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
  done

  [ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`

  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi
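
The device-count step of the script can be checked on its own before anything is created under /dev. The following is a minimal sketch, assuming `lspci` prints one device per line in its usual format; the function name `count_nvidia_gpus` is hypothetical and is not part of the original gpu-service script.

```shell
# count_nvidia_gpus: read `lspci` output on stdin and print how many NVIDIA
# "3D controller" or "VGA compatible controller" entries it contains.
# Hypothetical helper that mirrors the counting done in gpu-service.
count_nvidia_gpus() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function's exit status at 0 while still printing the count.
  grep -i nvidia | grep -c -E "3D controller|VGA compatible controller" || true
}
```

Running `lspci | count_nvidia_gpus` on the machine should print the number of /dev/nvidiaN nodes the script is expected to create.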

A new systemd service is needed. Create the unit file as follows:

touch /etc/systemd/system/gpu.service
chmod 664 /etc/systemd/system/gpu.service

Edit the gpu.service file so that it contains:

[Unit]
Description=auto run gpu construct
[Service]
Type=simple
ExecStart=/usr/sbin/gpu-service
[Install]
WantedBy=multi-user.target
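
Because gpu-service runs once and then exits, a oneshot unit is a common alternative to Type=simple. The sketch below writes such a variant; it is an assumption about what may suit this setup, not the configuration used in these notes. The function name `write_gpu_unit` and the Description text are hypothetical.

```shell
# write_gpu_unit [TARGET]: write a oneshot variant of gpu.service to TARGET
# (default: /etc/systemd/system/gpu.service). Hypothetical helper.
write_gpu_unit() {
  target=${1:-/etc/systemd/system/gpu.service}
  cat > "$target" <<'EOF'
[Unit]
Description=Create NVIDIA device nodes at boot

[Service]
# oneshot: systemd waits for the script to finish before marking the unit
# started; RemainAfterExit keeps it shown as active after the script exits.
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/gpu-service

[Install]
WantedBy=multi-user.target
EOF
}
```

With RemainAfterExit=yes, `systemctl status gpu.service` reports the unit as active after a successful run, which makes a failed boot-time run easier to spot.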

Move the gpu-service script to /usr/sbin/gpu-service:

mv gpu-service /usr/sbin/
chmod 554 /usr/sbin/gpu-service

Use systemctl to enable gpu-service so that it runs at boot:

systemctl daemon-reload
systemctl enable gpu.service
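
After the next reboot, it may be worth confirming that the service actually created the device nodes. The following is a minimal sketch, assuming the node names used by the script above (nvidiaN, nvidiactl, nvidia-uvm); the function name `missing_nvidia_nodes` is hypothetical.

```shell
# missing_nvidia_nodes DEV_DIR GPU_COUNT
# Print the name of every expected NVIDIA device node that is absent from
# DEV_DIR. Prints nothing when all nodes are present. Hypothetical helper.
missing_nvidia_nodes() {
  dev_dir=$1
  count=$2
  for name in nvidiactl nvidia-uvm; do
    [ -e "$dev_dir/$name" ] || echo "$name"
  done
  i=0
  while [ "$i" -lt "$count" ]; do
    [ -e "$dev_dir/nvidia$i" ] || echo "nvidia$i"
    i=$((i + 1))
  done
}
```

For a machine with one GPU, `missing_nvidia_nodes /dev 1` should print nothing once the service has run. If it lists any names, `systemctl status gpu.service` shows whether the script failed.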

Install Nvidia CUDA on Aliyun ECS

1. First confirm that the machine has an NVIDIA GPU and that the operating system version is supported. We use CentOS 7, which is supported.

lspci | grep -i nvidia

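2. Install the CUDA Toolkit
Download it from the official site, http://developer.nvidia.com/cuda-downloads, selecting the package that matches your platform. I chose CentOS 7, for which the download was CUDA Toolkit 9.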

rpm -i cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64.rpm
yum install cuda

CUDA is now installed.

3. Add CUDA to PATH and the library search path

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
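
The two export lines only affect the current shell, so they are lost after a reboot. One way to make them persistent is a file under /etc/profile.d, which CentOS 7 sources for login shells. This is a sketch under that assumption; the file name cuda.sh and the function name `write_cuda_profile` are hypothetical.

```shell
# write_cuda_profile [TARGET]: write the CUDA environment settings to TARGET
# (default: /etc/profile.d/cuda.sh). Hypothetical helper.
write_cuda_profile() {
  target=${1:-/etc/profile.d/cuda.sh}
  # The quoted 'EOF' keeps $PATH and $LD_LIBRARY_PATH unexpanded, so they
  # are evaluated at login time rather than when the file is written.
  cat > "$target" <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF
}
```

Calling `write_cuda_profile` as root with no argument writes the default file. New login sessions then pick up the settings.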

4. Reboot the server, then verify the installation

[root@emr-worker-1 release]# nvidia-smi -L
GPU 0: Tesla M40 (UUID: GPU-5ac5582f-b7cb-225a-b698-2c1da3cb1646)
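
For scripted checks, the `nvidia-smi -L` listing can be reduced to a GPU count. A minimal sketch, assuming each GPU appears on its own line starting with "GPU N:" as in the output above; the function name `count_listed_gpus` is hypothetical.

```shell
# count_listed_gpus: read `nvidia-smi -L` output on stdin and print the
# number of GPUs listed. Hypothetical helper.
count_listed_gpus() {
  # "|| true" keeps the exit status at 0 when grep counts no matches.
  grep -c '^GPU [0-9][0-9]*:' || true
}
```

Usage: `nvidia-smi -L | count_listed_gpus`. The result can be compared with the number of NVIDIA controllers that `lspci` reports.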

At this point, the installation is complete.