Mitigating Block Report Storms in Hadoop (HDFS Block Report Storm)

For a large Hadoop cluster, the biggest challenge after planned downtime is the block report storm. When the NameNode restarts, a large number of DataNodes send it full block reports so it can rebuild its in-memory data structures, including the block-to-DataNode mapping and the set of blocks held by each DataNode. These structures account for a large share of the NameNode's memory footprint. Worse, the flood of incoming reports tends to be promoted straight into the old generation, triggering GC pauses, so block report handling becomes sluggish and exceeds the 30 s socket timeout. The NameNode may even drop the connection midway through processing a report, wasting the time it spent holding the write lock, while the DataNode backs off for a while and then resends the entire report. Round after round of this snowballs into an avalanche, and that is the so-called HDFS block report storm.
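To make the feedback loop concrete, here is a deliberately simplified sketch of the DataNode-side behavior. This is illustrative Java only, not the actual BPServiceActor code; the NameNodeRpc interface, the report payload, and the backoff value are stand-ins:

import java.net.SocketTimeoutException;
import java.util.List;

// Simplified model of the retry loop behind a block report storm.
// NOT real Hadoop source: NameNodeRpc, the payload type, and backoffMs
// are illustrative assumptions.
public class ReportStormSketch {
    interface NameNodeRpc {
        // Stand-in for the real blockReport RPC.
        void blockReport(List<long[]> storageReports) throws SocketTimeoutException;
    }

    private final NameNodeRpc nameNode;
    private final List<long[]> allStorageReports; // every replica on this DataNode
    private final long backoffMs = 5_000L;        // illustrative retry delay

    ReportStormSketch(NameNodeRpc nn, List<long[]> reports) {
        this.nameNode = nn;
        this.allStorageReports = reports;
    }

    void reportLoop() throws InterruptedException {
        boolean succeeded = false;
        while (!succeeded) {
            try {
                // One monolithic RPC: the NameNode walks every replica
                // while holding its namespace write lock.
                nameNode.blockReport(allStorageReports);
                succeeded = true;
            } catch (SocketTimeoutException e) {
                // The client gave up, but the NameNode may still be processing,
                // so its write-lock time is wasted; the same oversized call
                // will simply be retried, by every DataNode at once.
                Thread.sleep(backoffMs);
            }
        }
    }
}

The pathology is that a timeout does not cancel the work already queued on the NameNode side; it only guarantees that the same expensive request will be sent again.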
To address this, the community introduced the parameter dfs.blockreport.split.threshold, which defaults to 1,000,000. If a DataNode holds fewer than one million blocks, the entire block report is packed into a single RPC, i.e. only one RPC is sent. Above the threshold, the DataNode splits the large report into one report per disk (storage); if any single disk's report times out, the DataNode redoes the block report for all disks. In principle this lowers the risk of a block report storm and speeds up the restart; the configuration excerpt below shows where the knob is set. I ran a simple test to verify the effect.
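For reference, the parameter is set in hdfs-site.xml on the DataNodes; a minimal excerpt (the value shown, 1000000, is also the default):

<!-- Split the block report into one RPC per storage (disk) once the
     number of replicas on this DataNode exceeds the threshold. -->
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value>
</property>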
First, I provisioned a cluster on Alibaba Cloud EMR with the following configuration:
30 D1 instances:
NameNode: 16 cores, 128 GB RAM
DataNode: 32 cores, 128 GB RAM, 5.5 TB * 12 disks
I then loaded data until the namespace reached 232290855 files and directories, 215077946 blocks = 447368801 total filesystem object(s).
Each DataNode held roughly 20 million block replicas (215M blocks * 3 replicas / 30 DataNodes ≈ 21.5M, which matches the per-node counts in the logs below).
With the data loaded, I restarted the NameNode; loading the fsimage took about 23 minutes.
A rough calculation put a single full block report at around 30 seconds of NameNode processing, so a one-RPC block report would very likely run into a socket timeout.
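To see why, run the numbers from this cluster. For a monolithic report to complete inside a 30 s window, the NameNode would have to absorb

    21,507,382 replicas / 30 s ≈ 717,000 replicas per second

while holding its namespace write lock. The timings in the logs below (for example, 435410 msecs of RPC and NN processing for the first attempt) show it ran far slower than that under contention, so timeouts were effectively guaranteed.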
1. Set dfs.blockreport.split.threshold to 100,000,000 (100 million).
After the NameNode restart, each DataNode tries to send its entire report in a single RPC:

2019-11-11 10:51:49,910 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17b00cec3b356,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 4263 msec to generate and 435410 msecs for RPC and NN processing. Got back no commands.
2019-11-11 10:55:02,672 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17b8514c85fdf,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 4175 msec to generate and 60061 msecs for RPC and NN processing. Got back no commands.
2019-11-11 11:52:57,836 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17ebc082c478e,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 5277 msec to generate and 678 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:17,754 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183dfe776d467,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5238 msec to generate and 542 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:23,126 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e1265a5171,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5367 msec to generate and 563 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:28,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e26411ce59,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5347 msec to generate and 547 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:33,719 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e3a2783dc1,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5359 msec to generate and 484 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:38,967 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e4db86a462,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5269 msec to generate and 480 msecs for RPC and NN processing. Got back no commands.

Even after three hours, no report had succeeded. Note "of which we sent 0" in every line: the monolithic RPC never completed.

2. Set dfs.blockreport.split.threshold to 1,000,000 (the default).

2019-11-11 14:18:15,588 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18695e8fff624,  containing 12 storage report(s), of which we sent 2. The reports had 21426179 total blocks and used 2 RPC(s). This took 10670 msec to generate and 137642 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:20:04,027 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186c1389c74b2,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5248 msec to generate and 60062 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:20:13,913 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186bf29b0ccc7,  containing 12 storage report(s), of which we sent 2. The reports had 21426179 total blocks and used 2 RPC(s). This took 5339 msec to generate and 78788 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:23:29,457 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186e125f12a27,  containing 12 storage report(s), of which we sent 4. The reports had 21426179 total blocks and used 4 RPC(s). This took 5219 msec to generate and 128366 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:32:14,183 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18745b6b058ae,  containing 12 storage report(s), of which we sent 6. The reports had 21426179 total blocks and used 6 RPC(s). This took 5107 msec to generate and 221167 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:36:13,401 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18778e2f96d16,  containing 12 storage report(s), of which we sent 6. The reports had 21426179 total blocks and used 6 RPC(s). This took 5229 msec to generate and 240599 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:36:43,543 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x187825e7cb772,  containing 12 storage report(s), of which we sent 12. The reports had 21426179 total blocks and used 12 RPC(s). This took 5510 msec to generate and 230014 msecs for RPC and NN processing. Got back no commands.

After about 20 minutes, the block reports succeeded ("of which we sent 12 ... used 12 RPC(s)"). Roughly 10 minutes later, heartbeats were normal and the cluster could serve traffic again. Total restart time was therefore about 20 + 20 + 10 minutes (fsimage load, block reports, heartbeat recovery), i.e. around one hour, a large reduction.
In summary, the test shows that tuning dfs.blockreport.split.threshold is an effective way to reduce the impact of a block report storm and speed up block reporting.