Creating a Boot-Time Service to Set Up the /dev/nvidia GPU Devices

After a GPU machine reboots, the GPU device nodes are not created automatically, so a script has to run at boot to create them so that Docker can later map them into containers. The gpu-service script is as follows:

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`

  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

Register it as a new systemd service:

touch /etc/systemd/system/gpu.service
chmod 664 /etc/systemd/system/gpu.service

Edit gpu.service to contain:

[Unit]
Description=auto run gpu construct
[Service]
Type=simple
ExecStart=/usr/sbin/gpu-service
[Install]
WantedBy=multi-user.target

Copy the gpu-service script to /usr/sbin/gpu-service:

mv gpu-service /usr/sbin/
chmod 554 /usr/sbin/gpu-service

Then enable it as a boot-time service with systemctl:

systemctl daemon-reload
systemctl enable gpu.service
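To verify without rebooting, the service can be started once by hand and the device nodes listed (a quick sanity check; the number of /dev/nvidiaN nodes depends on how many GPUs lspci reports):

systemctl start gpu.service
ls -l /dev/nvidia*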

Summary of TensorFlow Source Build Issues

Following the official guide, first install protobuf 3.0+. Run configure and answer the prompts to enable the components you need, then build with Bazel (the Bazel version needs to be below 0.6):

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

The build then fails with:

ERROR: /mnt/disk1/taokelu/tensorflow-1.3.0/tensorflow/tools/pip_package/BUILD:134:1: error loading package 'tensorflow/contrib/session_bundle': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf//': java.io.IOException: Error downloading [https://github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz, http://mirror.bazel.build/github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz] to /root/.cache/bazel/_bazel_root/d3cc9e5e7119c18dd166b716d8b55c4b/external/protobuf/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz: Checksum was e5fdeee6b28cf6c38d61243adff06628baa434a22b5ebb7432d2a7fbabbdb13d but wanted 6d43b9d223ce09e5d4ce8b0060cb8a7513577a35a64c7e3dad10f0703bf3ad93 and referenced by '//tensorflow/tools/pip_package:build_pip_package'.

This is a SHA-256 checksum mismatch. The workaround used here is to remove the offending download entry from tensorflow/workspace.bzl so the checksum comparison no longer trips:

sed -i '\@https://github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz@d' tensorflow/workspace.bzl
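If you would rather keep checksum verification, a quick way to see what checksum the archive actually has is to download and hash it yourself (curl and sha256sum are standard tools; the URL is the one from the error above), and then update the protobuf sha256 entry in tensorflow/workspace.bzl if it matches:

curl -L -o protobuf.tar.gz https://github.com/google/protobuf/archive/0b059a3d8a8f8aa40dde7bea55edca4ec5dfea66.tar.gz
sha256sum protobuf.tar.gz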

Summary of TensorFlow-on-Docker Build Issues

1. Docker storage location

By default Docker stores all of its data under /var/lib/docker/. If the system disk is small, it fills up quickly. To change the storage location, edit the startup unit; on CentOS, add a -g option to /usr/lib/systemd/system/docker.service:

ExecStart=/usr/bin/dockerd-current \
          -g /mnt/disk1/docker_home \
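After editing the unit file, reload systemd and restart Docker; docker info should then report the new location (the field name below is what recent Docker versions print, so treat this as a rough check):

systemctl daemon-reload
systemctl restart docker
docker info | grep 'Docker Root Dir'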

2. Opening the management port

For security reasons, port 2375 is not open by default. To use management tools such as docker-java, open it by editing /usr/lib/systemd/system/docker.service in the same way:

    --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
    -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock
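Reload and restart again, then hit the remote API to confirm the TCP port is listening (the /version endpoint is part of the standard Docker remote API):

systemctl daemon-reload
systemctl restart docker
curl http://127.0.0.1:2375/version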

3. Environment variable issues

Environment variables passed in via docker commit or via -e tend to be unreliable, so they should be set in the Dockerfile instead. For example, TensorFlow needs HADOOP_HDFS_HOME, LD_LIBRARY_PATH and CLASSPATH configured in order to read data from HDFS, but passing them with -e did not take effect.
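A minimal sketch of the usual workaround for CLASSPATH, which has to be expanded at runtime from the Hadoop jars inside the image (train_on_hdfs.py is a hypothetical script reading hdfs:// paths; HADOOP_HOME is set in the Dockerfile below):

# Compute the full Hadoop classpath at container runtime, then launch training
export CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob)
python train_on_hdfs.py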

4. NVIDIA driver installation

To use the GPU inside a Docker container, the same CUDA and cuDNN versions must be installed on the host and inside the container. When starting the container, the GPU devices also have to be mapped into it; the devices to map are:

--device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm

One caveat: after a host reboot these three devices are not present by default; they are created at boot by the same gpu-service script shown at the beginning of this post.


5. The complete Dockerfile

FROM centos:7.3.1611

# Base packages: JDK 8, Python/pip and common tools
RUN     yum update -y
RUN     yum install -y java-1.8.0-openjdk-devel.x86_64
RUN     yum install -y vim
RUN     yum install -y wget
RUN     yum -y install epel-release
RUN     yum install -y python-pip
RUN     yum -y install python-devel
RUN     pip install --upgrade pip

# Hadoop client, used by TensorFlow to read data from HDFS
ADD ./hadoop-2.7.2-1.2.8.tar.gz /usr/local

RUN     mkdir /install

# CUDA 8.0 local repository RPMs (must match the CUDA version on the host)
COPY ./cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm /install
COPY ./cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm /install

RUN     rpm -i /install/cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm
RUN     yum -y install cuda
RUN     rpm -i /install/cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm
RUN     yum -y install cuda-cublas-8-0

# cuDNN 6.0: copy the header and libraries into the CUDA install
ADD ./cudnn-8.0-linux-x64-v6.0.tar.gz /install

RUN     cp /install/cuda/include/cudnn.h /usr/local/cuda/include/
RUN     cp -d /install/cuda/lib64/libcudnn* /usr/local/cuda/lib64/
RUN     chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*


# Environment variables baked into the image (see the note on -e above)
ENV JAVA_HOME /etc/alternatives/java_sdk_1.8.0
ENV HADOOP_HOME /usr/local/hadoop-2.7.2-1.2.8
ENV HADOOP_HDFS_HOME $HADOOP_HOME
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:${JAVA_HOME}/jre/lib/amd64/server:$LD_LIBRARY_PATH
ENV PATH $JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
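Putting the pieces together, a hedged example of building the image and starting a container with the device mappings from step 4 (the image tag tf-gpu:centos7 is just a placeholder):

# Build the image from the Dockerfile above
docker build -t tf-gpu:centos7 .
# Run it with the GPU device nodes mapped in
docker run -it \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  tf-gpu:centos7 /bin/bash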

Download-Only Modes for Linux Packages

1. pip download

pip download <package>
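For example, to download a package and its dependencies into a directory for offline use (directory and package name are illustrative):

pip download -d /tmp/wheels tensorflow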

2. yum downloadonly

First install the downloadonly plugin:

yum install yum-plugin-downloadonly

Usage:

yum install --downloadonly --downloaddir=<directory> <package>
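For example (directory and package are illustrative), download nginx and its dependencies without installing, then install offline later:

yum install --downloadonly --downloaddir=/tmp/rpms nginx
# later, on the offline machine:
yum localinstall /tmp/rpms/*.rpm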

Resolving the Jersey 2.x / Jersey 1.x Conflict – the UriBuilder Problem

I recently ran into a problem while using docker-java to control Docker: docker-java depends on Jersey 2.x. Its Maven dependency is:

<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java</artifactId>
    <version>3.0.13</version>
</dependency>

The Hadoop dependencies on the cluster, however, are all Jersey 1.x, so the following error was thrown:

Exception in thread "main" java.lang.AbstractMethodError: javax.ws.rs.core.UriBuilder.uri(Ljava/lang/String;)Ljavax/ws/rs/core/UriBuilder;
        at javax.ws.rs.core.UriBuilder.fromUri(UriBuilder.java:119)
        at org.glassfish.jersey.client.JerseyWebTarget.<init>(JerseyWebTarget.java:71)
        at org.glassfish.jersey.client.JerseyClient.target(JerseyClient.java:290)
        at org.glassfish.jersey.client.JerseyClient.target(JerseyClient.java:76)
        at com.github.dockerjava.jaxrs.JerseyDockerCmdExecFactory.init(JerseyDockerCmdExecFactory.java:237)
        at com.github.dockerjava.core.DockerClientImpl.withDockerCmdExecFactory(DockerClientImpl.java:161)
        at com.github.dockerjava.core.DockerClientBuilder.build(DockerClientBuilder.java:45)

This is clearly caused by a Jersey version mismatch. I took quite a few detours trying to fix it, including using the maven-shade-plugin and excluding some jars, none of which solved the problem.
In the end I went back to the error itself: it means that UriBuilder.fromUri ends up calling an abstract method for which no implementation class is on the classpath.
Looking at the code, the declaration in Jersey 2.x is:

public abstract UriBuilder uri(String var1);

Its implementation class is org.glassfish.jersey.uri.internal.JerseyUriBuilder, so the root cause is that the implementation class is missing. I therefore added the corresponding dependencies to the pom:

        <dependency>
            <groupId>org.glassfish.jersey.core</groupId>
            <artifactId>jersey-server</artifactId>
            <version>2.23.1</version>
        </dependency>
        <dependency>
            <groupId>org.glassfish.jersey.containers</groupId>
            <artifactId>jersey-container-servlet-core</artifactId>
            <version>2.23.1</version>
        </dependency>

After adding them and repackaging, the problem was solved.
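When diagnosing conflicts like this, it also helps to see which Jersey artifacts actually end up on the classpath and which dependency pulls them in; a quick check with Maven (using the group ids as filters) is:

mvn dependency:tree -Dincludes=com.sun.jersey,org.glassfish.jersey.core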

Install Nvidia CUDA on Aliyun ECS

1. First confirm that the machine has an NVIDIA GPU and that the OS version is supported. We use CentOS 7, which is supported:

lspci | grep -i nvidia

2. Install the CUDA toolkit.
Download it from http://developer.nvidia.com/cuda-downloads and choose the matching platform. I picked CentOS 7, which corresponds to CUDA Toolkit 9:

rpm -i cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64.rpm
yum install cuda

This installs CUDA.

3. Set PATH and LD_LIBRARY_PATH

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
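To make these settings survive a reboot, one option (assuming a system-wide profile script is acceptable) is to drop them into /etc/profile.d:

cat > /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF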

4. Reboot the server, then verify:

[root@emr-worker-1 release]# nvidia-smi -L
GPU 0: Tesla M40 (UUID: GPU-5ac5582f-b7cb-225a-b698-2c1da3cb1646)
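nvidia-smi only proves the driver is working; to also confirm the toolkit is reachable via the PATH set above, check the compiler version:

nvcc --version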

With that, the installation is complete.

Recent News

I recently left Sina and rejoined Alibaba Group. I now work on the Alibaba Cloud EMR team, and my work still revolves around big data infrastructure.

An Introduction to maven-shade-plugin

When we submit jobs to the cluster, the application jar often conflicts with jars on the Hadoop CLASSPATH, producing NoSuchMethodError and similar failures.
To solve this we use the maven-shade-plugin. Its main job is to relocate (rename) packages inside the shaded jar to new names, so that referencing a different version of a library no longer causes conflicts. For example:

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.0.0</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <relocation>
                  <pattern>org.codehaus.plexus.util</pattern>
                  <shadedPattern>org.shaded.plexus.util</shadedPattern>
                  <excludes>
                    <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                    <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                  </excludes>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  ...
</project>
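To confirm the relocation actually took effect, build and then list the classes in the shaded jar (the jar name under target/ depends on your artifactId and version, so it is a placeholder here):

mvn clean package
jar tf target/<your-artifact>.jar | grep org/shaded/plexus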

Some Notes About Ops in Druid OLAP System

I have written about Druid quite a bit on this blog, including its architecture and how to change the time zone. This post focuses on the basic operations you run every day.

First, how to write a Hadoop MapReduce indexing spec file:


{
  "dataSchema" : {
    "dataSource" : "ingestion_test",
    "parser" : {
      "type" : "hadoopyString",
      "parseSpec" : {
        "format" : "tsv",
        "timestampSpec" : {
          "column" : "dt",
          "format" : "posix"
        },
        "dimensionsSpec" : {
          "dimensions": ["grade","src_flag","school_id","gender_desc","prov_name","city_name"]
        },
        "delimiter":"\u0001",
        "listDelimiter":"\u0002",
        "columns":  ["dt","uid","grade","src_flag","school_id","gender_desc","prov_name","city_name","flag"]
      }
    },
    "metricsSpec" : [
              {
                    "type": "count",
                    "name": "count_druid"
              },
              {
                    "type": "hyperUnique",
                    "name": "uv",
                    "fieldName" : "uid"
              },
              {
                    "type": "longSum",
                    "name": "count",
                    "fieldName" : "flag"
              }
    ],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "HOUR",
      "queryGranularity" : "NONE",
      "intervals" : [ "2017-3-13/2017-3-14" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/data//000000_0"
    },
    "metadataUpdateSpec" : {
                "type":"mysql",
                "connectURI":"jdbc:mysql://ip:3306/druid",
                "password" : "password",
                "user" : "user",
                "segmentTable" : "druid_segments"
    },
    "segmentOutputPath" : "hdfs://ns1/user/druid/localStorage"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath": "hdfs://ns1/user/druid/localStorage/workdir",
    "partitionsSpec" : {
      "type" : "hashed",
      "numShards" : 3
    },
    "shardSpecs" : { },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "jobProperties" : { },
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 300000,
    "useCombiner" : true,
    "buildV9Directly" : true
  }
}

In the spec file you can set the number of reducers (one per shard) via the numShards parameter.
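A hedged example of launching the standalone Hadoop indexer with this spec (the classpath, memory settings and Hadoop config directory depend on your Druid installation layout; hadoop_index_spec.json is the file above):

java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "lib/*:/etc/hadoop/conf" \
  io.druid.cli.Main index hadoop hadoop_index_spec.json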

Second, an example spec file that writes directly to Druid using Tranquility:


{
  "dataSources": [
    {
      "spec": {
        "dataSchema": {
          "dataSource": "main_static_log_tranq1",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {
                "column": "timestamp",
                "format": "posix"
              },
              "dimensionsSpec": {
                "dimensions": ["typeSignII",  "typeSignI", "typeSignIII", "typeSignIV",  "responseCode",  "processTotalTime", "serverIp", "terminal", "type", "service"],
                "dimensionExclusions": [],
                "spatialDimensions": []
              }
            }
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },{
              "type": "doubleSum",
              "name": "mProcessTotalTime",
              "fieldName" : "mProcessTotalTime"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "SIX_HOUR",
            "queryGranularity": "MINUTE"
          }
        },
        "tuningConfig": {
          "type": "realtime",
          "maxRowsInMemory": 100000,
          "intermediatePersistPeriod": "PT10m",
          "windowPeriod": "PT60m"
        }
      },
      "properties" : {
            "task.partitions" : "1",
            "task.replicants" : "2"
      }
    }
  ],
  "properties": {
    "zookeeper.connect": "10.39.2.161:2181",
    "druid.selectors.indexing.serviceName": "overlord",
    "druid.discovery.curator.path": "/druid/discovery",
    "druidBeam.overlordPollPeriod": "PT20S"
  }
}

You can set the number of partitions and replicas with the task.partitions and task.replicants properties.
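Assuming the standalone Tranquility distribution is used, a config like the one above can be started with Tranquility Server (the config file path is an example):

bin/tranquility server -configFile conf/server.json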

How to Submit a Merge Job to Index Service in Druid

Normally we use the indexing service instead of realtime nodes to ingest realtime data into Druid. If a single time interval has multiple partitions and each of them is small, you should merge them into one bigger segment to improve query efficiency.
For example, suppose we have two segments in the same time interval, as shown below.

[Figure: two segments in the same time interval]
What we want is to merge them together. Here is how to write the merge.json file and submit it:


{
    "type": "merge",
    "dataSource": "main_static_log_tt",
    "aggregations": [
                {
                    "type": "count",
                    "name": "count"
                },{
                    "type": "doubleSum",
                    "name": "mProcessTotalTime",
                    "fieldName" : "mProcessTotalTime"
                }
    ],
    "rollup": "false",
    "segments": [
{"dataSource":"main_static_log_tt","interval":"2017-03-27T10:05:00.000Z/2017-03-27T10:06:00.000Z","version":"2017-03-27T10:05:00.000Z","loadSpec":{"type":"local","path":"/data0/test/file/main_static_log_tt/2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z/2017-03-27T10:05:00.000Z/0/index.zip"},"dimensions":"processTotalTime,responseCode,serverIp,typeSignI,typeSignII,typeSignIII,typeSignIV","metrics":"count,mProcessTotalTime","shardSpec":{"type":"none"},"binaryVersion":9,"size":129991,"identifier":"main_static_log_tt_2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z_2017-03-27T10:05:00.000Z"},
{"dataSource":"main_static_log_tt","interval":"2017-03-27T10:05:00.000Z/2017-03-27T10:06:00.000Z","version":"2017-03-27T10:05:00.000Z","loadSpec":{"type":"local","path":"/data0/test/file/main_static_log_tt/2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z/2017-03-27T10:05:00.000Z/1/index.zip"},"dimensions":"processTotalTime,responseCode,serverIp,typeSignI,typeSignII,typeSignIII,typeSignIV","metrics":"count,mProcessTotalTime","shardSpec":{"type":"none"},"binaryVersion":9,"size":190243,"identifier":"main_static_log_tt_2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z_2017-03-27T10:05:00.000Z_1"}
    ]
}

Remember to change the shardSpec type to none: the merge task only merges segments of that type and ignores hashed or linear shard specs. Simply switching the type to none works around this, but it has its own problems; in a later post I will describe how to change the code so it works properly.
After editing the JSON file, submit it to your overlord node as follows:

curl http://host:port/druid/indexer/v1/task -H "Content-Type:application/json" -X POST --data @merge.json
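The submit call returns a task id, whose progress can then be polled from the overlord (the <taskId> below is a placeholder for the returned id):

curl http://host:port/druid/indexer/v1/task/<taskId>/status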

Kill the Job

Sometimes you just run a test and later want to kill the task to free its slot in the indexing service. Here is how to write the kill.json file and submit it:


{
    "type": "kill",
    "id": "sbsina",
    "dataSource": "main_static_log_tt",
    "interval": "2017-03-22T07:47:00.000Z/2017-03-28T07:48:00.000Z"
}

Submit it:

curl http://host:port/druid/indexer/v1/task -H "Content-Type:application/json" -X POST --data @kill.json

Disable a Middle Manager Before an Update

Submit a POST to the middle manager's HTTP port:

curl -X POST http://ip:port/druid/worker/v1/disable
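Once the running tasks drain, the node can be upgraded and then re-enabled through the counterpart worker endpoints:

curl http://ip:port/druid/worker/v1/enabled
curl -X POST http://ip:port/druid/worker/v1/enable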

A Look at HftpFileSystem

Recently I was troubleshooting a distcp job that kept failing. distcp copies data between two clusters; for a remote cluster or one running a different version, we use hftp to pull the data, like this:

hadoop distcp hftp://sourceFS:50070/src hdfs://destFS:50070/dest

The hftp scheme starts with hftp://, and the implementing class is HftpFileSystem.
So what does an hftp read look like? For reading a file, the flow is as follows:
[Figure: hftp read flow]
Walking through the hftp flow in detail:
1. First, hftp builds the NameNode's HTTP URL. All hftp interaction goes over HTTP; the NameNode's default HTTP port is 50070. The code is:

f = f.makeQualified(getUri(), getWorkingDirectory());
String path = "/data" + ServletUtil.encodePath(f.toUri().getPath());
String query = addDelegationTokenParam("ugi=" + getEncodedUgiParameter());
URL u = getNamenodeURL(path, query);
return new FSDataInputStream(new RangeHeaderInputStream(connectionFactory, u));

At this point no connection has been made to the NameNode yet. The actual connection is only established in openInputStream:

protected InputStream openInputStream() throws IOException {
  // Use the original url if no resolved url exists, eg. if
  // it's the first time a request is made.
  final boolean resolved = resolvedURL.getURL() != null;
  final URLOpener opener = resolved ? resolvedURL : originalURL;

  final HttpURLConnection connection = opener.connect(startPos, resolved);

2. The NameNode handles all HTTP requests in the NamenodeWebHdfsMethods class. For the OPEN request sent by hftp, it picks the most suitable DataNode based on locality and returns a redirect whose URL points at that DataNode's HTTP port:

final URI uri = redirectURI(namenode, ugi, delegation, username, doAsUser,
    fullpath, op.getValue(), offset.getValue(), -1L, offset, length, bufferSize);
return Response.temporaryRedirect(uri).type(MediaType.APPLICATION_OCTET_STREAM).build();

This completes the client's interaction with the NameNode; the client then talks to the DataNode.
3. Using the redirected URL, the client connects to the DataNode that holds the data. Note that this also goes over HTTP; the DataNode's default HTTP port is 50075.

resolvedURL.setURL(getResolvedUrl(connection));

InputStream in = connection.getInputStream();

4. Inside the DataNode, HTTP requests are handled by the DatanodeWebHdfsMethods class. An OPEN request is handled as follows:

case OPEN:
{
  final int b = bufferSize.getValue(conf);
  final DFSClient dfsclient = newDfsClient(nnId, conf);
  HdfsDataInputStream in = null;
  try {
    in = new HdfsDataInputStream(dfsclient.open(fullpath, b, true));
    in.seek(offset.getValue());
  } catch (IOException ioe) {
    IOUtils.cleanup(LOG, in);
    IOUtils.cleanup(LOG, dfsclient);
    throw ioe;
  }

  final long n = length.getValue() != null ?
      Math.min(length.getValue(), in.getVisibleLength() - offset.getValue()) :
      in.getVisibleLength() - offset.getValue();

  /**
   * Allow the Web UI to perform an AJAX request to get the data.
   */
  return Response.ok(new OpenEntity(in, n, dfsclient))
      .type(MediaType.APPLICATION_OCTET_STREAM)
      .header("Access-Control-Allow-Methods", "GET")
      .header("Access-Control-Allow-Origin", "*")
      .build();
}

Internally, if short-circuit local reads are enabled, the DFSClient reads the data through a local reader (BlockReaderLocal); otherwise it opens a socket connection to read it.
5, 6. The input stream is returned to the client for reading data. That completes an hftp read.

The problem we hit

While running distcp with hftp on one end, every task failed reading the same file. The machine holding that file was in a bad state: its network connections showed a large number of CLOSE_WAIT connections on port 50075, and the DataNode process was reading and writing very slowly, so its networking was clearly unhealthy. The NameNode's HTTP interface returns only a single DataNode address, unlike the RPC interface, which returns a list of DataNodes so that a client can fall back to another replica when one fails. As a result, this one misbehaving DataNode made the read fail over and over. The fix at the time was to restart that DataNode process so reads were redirected to other DataNodes.
As for the root cause, it was related to the DataNode's state at the time. Since we did not save the full stack traces, we can only continue the investigation the next time it happens, but the working theory is that the DataNode's socket server was responding slowly because of its internal logic.
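For reference, commands of the kind used during this investigation (the DataNode pid is a placeholder): count CLOSE_WAIT connections on the DataNode HTTP port and capture thread stacks for later analysis.

netstat -ant | grep ':50075' | grep CLOSE_WAIT | wc -l
jstack <datanode_pid> > datanode_stack.txt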