TensorFlow .so file conflict with a new PyArrow version

Recently I have been using TensorFlow to train models on an ECS machine. Everything was fine for the first few days; then suddenly TensorFlow stopped working, and even a trivial command such as `import tensorflow` could not be executed.
Python crashed with a segmentation fault, so I enabled faulthandler to get a traceback.
The error is shown below:

Fatal Python error: Segmentation fault

Current thread 0x00007f5f2acd5740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 922 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 684 in _load
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/imp.py", line 343 in load_dynamic
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/imp.py", line 243 in load_module
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28 in swig_import_helper
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 32 in <module>

The code is simple:

_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)

It just loads a .so file.

This confused me for two days, and I had to retrace everything I had done before TensorFlow stopped working. Finally I remembered that I had just upgraded pyarrow from 0.10.0 to 0.14.0.
I downgraded from 0.14.0 back to 0.10.0, and everything worked again.
It is very hard to tell what the root cause of a Python segmentation fault is. If I had not remembered upgrading pyarrow, I would never have suspected a conflict between shared object files.
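Enabling faulthandler, as mentioned above, is what turns a bare segmentation fault into the traceback shown earlier. A minimal sketch of how to turn it on (standard library only; the crashing `import tensorflow` itself is of course not reproduced here):

```python
import faulthandler

# Install handlers for fatal signals (SIGSEGV, SIGFPE, SIGABRT, ...)
# so that the Python traceback of every thread is dumped to stderr
# before the interpreter dies.
faulthandler.enable()

print(faulthandler.is_enabled())  # → True

# Any crash in a C extension from here on (e.g. a bad `import`)
# now prints a "Fatal Python error" report like the one above.
```

The same effect is available without touching the code via `python -X faulthandler` or the `PYTHONFAULTHANDLER=1` environment variable.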

Hadoop: Mitigating Block Report Storms (HDFS Block Report Storm)

For a large Hadoop cluster, the biggest challenge after a maintenance shutdown is the block report storm. After the NameNode restarts, a huge number of DataNodes report their blocks so that the NameNode can rebuild its internal data structures, including the block-to-DataNode mapping and the set of blocks held by each DataNode. These structures account for a large share of the NameNode's memory. On top of that, the flood of DataNode reports tends to go straight into the old generation of the heap, causing GC pauses, so block report handling becomes slow and exceeds the 30s socket timeout. The NameNode may even have the connection dropped halfway through processing a block report, wasting the write lock it was holding, while the DataNode backs off for a while and then resends the full block report. This repeats over and over, producing an avalanche effect. That is the so-called HDFS block report storm.

To address this, the community introduced the parameter dfs.blockreport.split.threshold, with a default of 1,000,000. If a DataNode holds fewer than one million blocks, the block report is packed into a single RPC, i.e. only one RPC is sent. If the threshold is exceeded, the DataNode splits the big block report into one report per disk; if any single disk's report times out, the reports for all disks are redone. In principle this lowers the risk of a block report storm and speeds up the restart. I ran a simple test to verify the effect.
First I created a cluster on Alibaba Cloud EMR with the following configuration:
30 D1 instances:
NameNode: 16 cores, 128 GB
DataNode: 32 cores, 128 GB, 5.5 TB * 12
Then I loaded data until the cluster held 232290855 files and directories, 215077946 blocks = 447368801 total filesystem object(s).
Each DataNode held roughly 20 million blocks.
After the data was loaded I restarted the NameNode; loading the image took about 23 minutes.
By my estimate a single full block report takes around 30s, so a one-RPC block report is very likely to hit the socket timeout.
1. With dfs.blockreport.split.threshold set to 100 million:
After the NameNode restarts, each DataNode reports all of its blocks in a single shot.

2019-11-11 10:51:49,910 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17b00cec3b356,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 4263 msec to generate and 435410 msecs for RPC and NN processing. Got back no commands.
2019-11-11 10:55:02,672 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17b8514c85fdf,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 4175 msec to generate and 60061 msecs for RPC and NN processing. Got back no commands.
2019-11-11 11:52:57,836 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17ebc082c478e,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 5277 msec to generate and 678 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:17,754 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183dfe776d467,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5238 msec to generate and 542 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:23,126 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e1265a5171,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5367 msec to generate and 563 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:28,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e26411ce59,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5347 msec to generate and 547 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:33,719 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e3a2783dc1,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5359 msec to generate and 484 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:38,967 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e4db86a462,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5269 msec to generate and 480 msecs for RPC and NN processing. Got back no commands.

After three hours, the reports still had not succeeded.

2. With dfs.blockreport.split.threshold set to 1,000,000:

2019-11-11 14:18:15,588 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18695e8fff624,  containing 12 storage report(s), of which we sent 2. The reports had 21426179 total blocks and used 2 RPC(s). This took 10670 msec to generate and 137642 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:20:04,027 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186c1389c74b2,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5248 msec to generate and 60062 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:20:13,913 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186bf29b0ccc7,  containing 12 storage report(s), of which we sent 2. The reports had 21426179 total blocks and used 2 RPC(s). This took 5339 msec to generate and 78788 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:23:29,457 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186e125f12a27,  containing 12 storage report(s), of which we sent 4. The reports had 21426179 total blocks and used 4 RPC(s). This took 5219 msec to generate and 128366 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:32:14,183 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18745b6b058ae,  containing 12 storage report(s), of which we sent 6. The reports had 21426179 total blocks and used 6 RPC(s). This took 5107 msec to generate and 221167 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:36:13,401 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18778e2f96d16,  containing 12 storage report(s), of which we sent 6. The reports had 21426179 total blocks and used 6 RPC(s). This took 5229 msec to generate and 240599 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:36:43,543 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x187825e7cb772,  containing 12 storage report(s), of which we sent 12. The reports had 21426179 total blocks and used 12 RPC(s). This took 5510 msec to generate and 230014 msecs for RPC and NN processing. Got back no commands.

After about 20 minutes the reports succeeded. Roughly 10 minutes later the heartbeats were normal and the cluster could serve requests again. The total restart time was about 20+10+20 minutes, i.e. around one hour, which shortened the restart dramatically.
The test shows that tuning dfs.blockreport.split.threshold is an effective way to reduce the impact of block report storms and speed up reporting.
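For reference, the knob tested above is set in hdfs-site.xml on the DataNodes; a minimal fragment (1000000 is already the default, shown here only to make the setting explicit):

```xml
<!-- hdfs-site.xml (DataNode side) -->
<property>
  <name>dfs.blockreport.split.threshold</name>
  <!-- If a DataNode holds more blocks than this, it sends one
       block report RPC per storage (disk) instead of a single
       huge RPC covering all disks. -->
  <value>1000000</value>
</property>
```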

Troubleshooting high DataNode IO in Hadoop

Recently a customer reported that data written to HDFS through Flume frequently timed out or failed to write. I took a look at his cluster: it uses Aliyun D1 instances, i.e. local disks, where a single disk normally sustains 100 MB/s+ writes, so raw disk performance should not be much of a problem.
After logging into the cluster and checking basic IO and CPU metrics, I found many disks at 100% IO util. iotop showed that the Hadoop DataNode process was the main IO consumer, and drilling down further revealed that a du -sk run by the DataNode was generating most of the IO.
By default the DataNode scans its disk usage every 10 minutes using the du command. If the directories contain a huge number of frequently changing files, du issues a large number of random reads and drives IO utilization through the roof.
In this customer's case, Flume had written a huge number of small files, about 1.5 million per 6 TB disk, so a single du run took more than 5 minutes. Most of the disk time was spent on this useless IO, so the useful writes were delayed and connections were even dropped.
There are two ways to fix this:
1. https://issues.apache.org/jira/browse/HADOOP-9884
This JIRA essentially replaces du with df.
2. Increase the fs.du.interval parameter,
e.g. so that the disk-usage check runs once per hour.

In the end the customer chose the second option, raising the disk-usage check interval from 10 minutes to 1 hour, which basically solved the write problem.
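The change the customer made can be sketched as a core-site.xml fragment; fs.du.interval is specified in milliseconds, so one hour is 3600000 (the default is 600000, i.e. 10 minutes):

```xml
<!-- core-site.xml -->
<property>
  <name>fs.du.interval</name>
  <!-- Refresh disk usage (the du scan) once per hour instead of
       every 10 minutes. -->
  <value>3600000</value>
</property>
```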

TensorFlow Session restore checkpoint

I recently wrote some small TensorFlow programs and ran into a problem where the session could not restore a checkpoint. The code was very simple:

with tf.train.MonitoredTrainingSession(master = server.target,
                                           is_chief = task_index == 0,
                                           checkpoint_dir= checkpoint_dir,
                                           save_checkpoint_secs=20) as sess:

In theory tf.train.MonitoredTrainingSession can both save and restore checkpoints. Testing showed that saving worked fine, but every run trained a brand-new model instead of continuing from the checkpoint.
After a lot of searching I found the correct way to write it.
According to the TF documentation, saving and restoring should just be a matter of using tf.train.Saver(). But when I implemented it following the docs I got many errors. What finally worked was the following: first obtain the latest checkpoint file, i.e. the most recent checkpoint containing the model parameters, then restore the model from it, and don't forget to reset the graph for this training run beforehand.
The code is as follows:

checkpoint_dir = "hdfs://emr-header-1:9000/movie"

# Reset the default graph *before* building the model and the Saver,
# not inside the session.
tf.reset_default_graph()
# ... build the model (and the `init` op) here ...
saver = tf.train.Saver()
epoch = 0

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=task_index == 0,
                                       checkpoint_dir=checkpoint_dir,
                                       save_checkpoint_secs=20) as sess:
    sess.run(init)
    # Find the most recent checkpoint and restore the variables from it.
    latest_path = tf.train.latest_checkpoint(checkpoint_dir=checkpoint_dir)
    if latest_path:
        saver.restore(sess, latest_path)

Hadoop CGroup setup

Setting up CGroups on Hadoop 2.8: on CentOS 7 the default CGroup hierarchy lives under /sys/fs/cgroup/cpu and friends. If Hadoop is not configured to mount CGroups itself but uses the default mount location, it will mount at /sys/fs/cgroup/cpu/hadoop-yarn. So on Hadoop 2.8, do the following:

Assuming YARN is started as the hadoop user:
mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn
chown -R hadoop:hadoop /sys/fs/cgroup/cpu/hadoop-yarn

Set the following configuration:
yarn.nodemanager.container-executor.class   org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
yarn.nodemanager.linux-container-executor.resources-handler.class  org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler
yarn.nodemanager.linux-container-executor.group hadoop

Then restart the NodeManager and it will work.
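The three properties above, written out as a yarn-site.xml fragment for copy-paste convenience:

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```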

Fixing lost MPI logs in Docker containers

Recently, while running TensorFlow in Docker, I often found that the TensorFlow logs could not be retrieved through the docker-java API. The log-collection code is as follows:

  private class DockerLogReader extends LogContainerResultCallback {
    @Override
    public void onStart(Closeable stream) {
      System.out.println("start");
    }

    @Override
    public void onNext(Frame item) {
      System.out.print(new String(item.getPayload()));
    }

    @Override
    public void onError(Throwable throwable) {
      System.out.print(throwable.getMessage());
    }

    @Override
    public void onComplete() {
      super.onComplete();
      System.out.println("Docker exit.");
    }
  }

Normally the code reaches onNext and collects the logs of the TensorFlow job running inside Docker. Since we use Horovod, the program is actually launched through MPI underneath. During a run, however, only the logs of the Python program were printed; all the TensorFlow logs were lost.
To track this down, I started two Docker containers separately and launched mpi by hand with:

mpirun  --allow-run-as-root -np 2 ***** >stdout 2>stderr

After redirecting the output I found that this stdout/stderr is exactly what docker-java retrieves, yet the console did print the logs. Searching the mpi documentation turned up the -tag-output option; combining it with redirecting stderr into stdout:

mpirun -tag-output --allow-run-as-root -np 2 ***** 2>&1

With this change all the logs were printed. Problem solved.

PySpark with Jupyter

First generate the jupyter config file:

jupyter-notebook --generate-config

Edit the jupyter config file:

c.NotebookApp.port = 18888
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.allow_root = True

Spark itself must of course be configured correctly; on an EMR cluster it already is. Then configure the pyspark driver:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Then just start pyspark:

pyspark --master yarn

Recovering Hadoop NameNode metadata

I ran into this recently while helping a user: someone had deleted all of the standby NameNode's metadata. To recover it, do the following:

1. Stop all jobs.
2. Put the NameNode into safe mode:

hdfs dfsadmin -safemode enter

3. Save the namespace (persist the metadata):

hdfs dfsadmin -saveNamespace

4. Back up the active NameNode's metadata:

Back up everything under /mnt/disk1/hdfs.

5. Copy the active NameNode's data to the standby:
Copy the data under /mnt/disk1/hdfs to the standby.
6. Restart the standby.
7. After the restart succeeds, leave safe mode:

hdfs dfsadmin -safemode leave

8. Resume the jobs.

Installing and using gperftools

Installation
1. Clone the code from GitHub onto the server.
2. Run ./autogen.sh
autoconf, libtool, gcc-c++ and libunwind need to be installed beforehand. The build tools come from yum:

yum install -y autoconf libtool gcc-c++

libunwind is built from its source package:

./configure && make && make install

Usage

export LD_PRELOAD=/usr/lib/libtcmalloc.so:/usr/lib/libprofiler.so
CPUPROFILE=/tmp/cpu java ****
HEAPPROFILE=/tmp/heap java ****

View the results:
pprof --text /bin/java /tmp/cpu
pprof --text /bin/java /tmp/heap

Fixing the Mac OS X ssh LC_CTYPE warning

Ever since upgrading my Mac, logging into a Linux server always produces the following error:

warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory

and all Chinese text shows up garbled. A bit of googling shows that ssh on the Mac forwards the LANG environment variable to the server, where it does not match the server's locales. The fix is simple: stop forwarding the LANG variables:

sudo vi /etc/ssh/ssh_config
Comment out the line: SendEnv LANG LC_*