TensorFlow source build: a roundup of problems

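Following the official guide, first install protobuf 3.0 or later. Then run configure and answer each prompt for the features you need. Finally, build with bazel. Note that the bazel version must be below 0.6 (the tree built here is tensorflow-1.3.0).
Build with bazel: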

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

The build then fails with:

ERROR: /mnt/disk1/taokelu/tensorflow-1.3.0/tensorflow/tools/pip_package/BUILD:134:1: error loading package 'tensorflow/contrib/session_bundle': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf//': java.io.IOException: Error downloading [https://github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz, http://mirror.bazel.build/github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz] to /root/.cache/bazel/_bazel_root/d3cc9e5e7119c18dd166b716d8b55c4b/external/protobuf/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz: Checksum was e5fdeee6b28cf6c38d61243adff06628baa434a22b5ebb7432d2a7fbabbdb13d but wanted 6d43b9d223ce09e5d4ce8b0060cb8a7513577a35a64c7e3dad10f0703bf3ad93 and referenced by '//tensorflow/tools/pip_package:build_pip_package'.

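This is a SHA-256 checksum mismatch: the downloaded protobuf archive does not match the hash the build expects. The workaround is to sidestep the failing comparison by deleting the github.com URL line from tensorflow/workspace.bzl, which leaves the mirror.bazel.build URL in place. The address uses @ as its delimiter, so it needs a leading backslash:

sed -i '\@https://github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz@d' tensorflow/workspace.bzl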
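In sed, an address can use a delimiter other than / only when the first delimiter is prefixed with a backslash, which matters for patterns full of slashes such as URLs. The sketch below shows the same delete on a scratch file. It assumes GNU sed (for -i without a suffix); the helper name drop_url_line and the example URLs are made up for illustration and are not the real contents of tensorflow/workspace.bzl.

```shell
#!/bin/bash
# Sketch: delete every line that contains a given URL, using '@' as the
# address delimiter. drop_url_line is a hypothetical helper name.
drop_url_line() {
  local url="$1" file="$2"
  # The leading backslash tells sed that '@' (not '/') delimits the pattern.
  sed -i "\\@${url}@d" "$file"
}

# Scratch file standing in for a list of download URLs (made-up content).
demo=$(mktemp)
cat > "$demo" <<'EOF'
urls = [
    "https://github.com/example/archive/abc.tar.gz",
    "http://mirror.example/github.com/example/archive/abc.tar.gz",
],
EOF

drop_url_line "https://github.com/example/archive/abc.tar.gz" "$demo"
cat "$demo"
```

Only the line with the https://github.com URL is removed; the mirror line stays because it does not contain that exact prefix.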

TensorFlow on Docker: a roundup of build problems

1. Docker storage location

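By default Docker stores its data under /var/lib/docker/. If the system disk is small, it fills up easily. To move the storage location, edit the startup unit. On CentOS, edit /usr/lib/systemd/system/docker.service and add the following -g line to ExecStart: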

ExecStart=/usr/bin/dockerd-current \
          -g /mnt/disk1/docker_home \

2. Opening the management port

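For security reasons (the port is unauthenticated and unencrypted), port 2375 is not open by default. To use management tools such as docker-java, it has to be opened. The method is the same as above: edit /usr/lib/systemd/system/docker.service and add the -H options to ExecStart: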

    --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
    -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock \

3. Environment variables

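Environment variables set through docker commit, or passed in at run time, were unreliable in one way or another; they need to be set in the Dockerfile. TensorFlow, for example, needs HADOOP_HDFS_HOME, LD_LIBRARY_PATH and CLASSPATH configured in order to read data from Hadoop, but passing them with -e did not take effect.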
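When debugging this kind of problem, it helps to confirm from inside the container which variables the process actually sees. The following is a minimal sketch, assuming bash is available in the image; missing_env is a hypothetical helper name, not part of Docker or TensorFlow.

```shell
#!/bin/bash
# Sketch: print the names of the given environment variables that are
# unset or empty. missing_env is a hypothetical helper.
missing_env() {
  local name missing=""
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name}" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space; an empty line means nothing is missing.
  echo "${missing# }"
}

# Run inside the container: lists whichever of the three are not visible.
missing_env HADOOP_HDFS_HOME LD_LIBRARY_PATH CLASSPATH
```

Running the same check in a container started with -e and in one built from the Dockerfile shows where the two approaches differ.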

4. NVIDIA driver installation

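To use the GPU inside Docker, install the same CUDA version and the same cuDNN version on both the host and inside the container. In addition, the GPU devices must be mapped into the container when it is started. The devices to map are: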

--device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm
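The flags above cover a single GPU. As a sketch of how the list might be generated for several GPUs, the helper below assumes that additional cards appear as /dev/nvidia1, /dev/nvidia2 and so on; gpu_device_flags and the image name my-tensorflow-image are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch: print the --device flags for a host with the given number of GPUs.
# gpu_device_flags is a hypothetical helper name.
gpu_device_flags() {
  local count="$1" flags="" i
  # One /dev/nvidiaN node per GPU, numbered from 0.
  for ((i = 0; i < count; i++)); do
    flags="$flags --device /dev/nvidia$i:/dev/nvidia$i"
  done
  # The control and unified-memory devices are shared by all GPUs.
  flags="$flags --device /dev/nvidiactl:/dev/nvidiactl"
  flags="$flags --device /dev/nvidia-uvm:/dev/nvidia-uvm"
  echo "${flags# }"
}

# Print (not run) the resulting command for a two-GPU host.
echo docker run -it $(gpu_device_flags 2) my-tensorflow-image
```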

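There is one catch: after a reboot, these three device nodes are not created by default. The following script, run at startup, loads the kernel modules and creates them.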

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`

  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi
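The counting step in the script can be tried without touching /dev. The sketch below applies the same two patterns to sample text; count_nvidia_controllers is a hypothetical helper name, and the three input lines are illustrative stand-ins for lspci output, not captured from a real machine.

```shell
#!/bin/bash
# Sketch: count NVIDIA "3D controller" and "VGA compatible controller"
# lines in lspci-style text read from stdin.
count_nvidia_controllers() {
  grep -i nvidia | grep -c -E '3D controller|VGA compatible controller'
}

# Illustrative input only; on a real host, pipe in `lspci` instead.
sample='03:00.0 3D controller: NVIDIA Corporation Device
04:00.0 VGA compatible controller: NVIDIA Corporation Device
00:1f.2 SATA controller: Intel Corporation Device'

n=$(printf '%s\n' "$sample" | count_nvidia_controllers)
# The script creates /dev/nvidia0 through /dev/nvidia(n-1).
echo "controllers: $n, highest /dev/nvidia index: $((n - 1))"
```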

5. The complete Dockerfile

FROM centos:7.3.1611

RUN     yum update -y
RUN     yum install -y java-1.8.0-openjdk-devel.x86_64
RUN     yum install -y vim
RUN     yum install -y wget
RUN     yum -y install epel-release
RUN     yum install -y python-pip
RUN     yum -y install python-devel
RUN     pip install --upgrade pip

ADD ./hadoop-2.7.2-1.2.8.tar.gz /usr/local

RUN     mkdir /install

COPY ./cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm /install
COPY ./cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm /install

RUN     rpm -i /install/cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm
RUN     yum -y install cuda
RUN     rpm -i /install/cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm
RUN     yum -y install cuda-cublas-8-0

ADD ./cudnn-8.0-linux-x64-v6.0.tar.gz /install

RUN     cp /install/cuda/include/cudnn.h /usr/local/cuda/include/
RUN     cp -d /install/cuda/lib64/libcudnn* /usr/local/cuda/lib64/
RUN     chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*


ENV JAVA_HOME /etc/alternatives/java_sdk_1.8.0
ENV HADOOP_HOME /usr/local/hadoop-2.7.2-1.2.8
ENV HADOOP_HDFS_HOME $HADOOP_HOME
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:${JAVA_HOME}/jre/lib/amd64/server:$LD_LIBRARY_PATH
ENV PATH $JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
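The Dockerfile sets HADOOP_HDFS_HOME and LD_LIBRARY_PATH but not CLASSPATH, which section 3 also lists as required. One possible way to fill it in at container start is sketched below. It assumes the hadoop launcher in this distribution supports classpath --glob, as stock Hadoop 2.7 does; whether the custom 2.7.2-1.2.8 build behaves the same is not confirmed here. hdfs_classpath is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: derive CLASSPATH from the Hadoop install before starting TensorFlow.
# hdfs_classpath is a hypothetical helper name.
hdfs_classpath() {
  # `hadoop classpath --glob` prints the classpath with wildcard entries
  # expanded to individual jar paths.
  "${HADOOP_HDFS_HOME}/bin/hadoop" classpath --glob
}

# Only set CLASSPATH when the launcher is actually present and executable.
if [ -x "${HADOOP_HDFS_HOME}/bin/hadoop" ]; then
  export CLASSPATH="$(hdfs_classpath)"
fi
```

Sourcing this from the container's entrypoint keeps the value in step with whatever Hadoop version the image ships.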