Hadoop 减轻Block Report风暴方法(HDFS Block Report Storm)

对于大的Hadoop集群,停机运维后最大的挑战就是Block Report风暴问题。所谓Block Report风暴,是指NameNode重启后,大量DataNode同时汇报Block,用于构建NameNode内部的数据结构,包括Block和DataNode的映射、DataNode包含的所有Block等等。这部分是NameNode内存占用中比较大的一块。另外,大量DataNode同时汇报,产生的对象往往直接进入老年代,引发GC,使得NameNode对Block Report的响应不及时,超过30s的socket超时时间;NameNode甚至会在处理Block Report的中途被中断连接,白白耗费了写锁。而DataNode端会休息一段时间后重新发起Block Report,如此往复,造成了雪崩效果。这就是所谓的HDFS Block Report风暴。
为了解决这个问题,社区引入了参数dfs.blockreport.split.threshold,默认值是100万:如果单机Block数量在100万以下,block report会封装成一个RPC,也就是只有一次RPC发送;如果超过了这个阈值,DataNode端会将一个大的block report切分到单盘粒度去做,如果其中有一块盘汇报超时,再重新做全部盘的block report。原理上说,这将降低block report风暴的风险,并加快重启过程。我做了一个简单的测试,来验证效果。
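该参数的配置方式如下(hdfs-site.xml 片段,数值仅为示意):

```xml
<!-- 单机 Block 数超过该阈值时,DataNode 会将 block report 按存储目录(单盘)拆分为多个 RPC -->
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value><!-- 默认值 100 万 -->
</property>
```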
首先通过阿里云 EMR 购买了一个集群,集群配置如下:
使用30台D1集群:
NameNode 16核 128G
DataNode 32核 128G 5.5T * 12
然后灌入数据,最后灌入了232290855 files and directories, 215077946 blocks = 447368801 total filesystem object(s).
DataNode单机的Block数量在2000万左右。
数据灌完后,重启NameNode,Image Load时间大概是23分钟。
经过计算,大概一次的Block Report时间会在30s左右,所以一次Block Report的RPC大概率会造成Socket timeout。
1.将dfs.blockreport.split.threshold设置成1亿。
NameNode重启后,DataNode一次性汇报所有Block:

2019-11-11 10:51:49,910 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17b00cec3b356,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 4263 msec to generate and 435410 msecs for RPC and NN processing. Got back no commands.
2019-11-11 10:55:02,672 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17b8514c85fdf,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 4175 msec to generate and 60061 msecs for RPC and NN processing. Got back no commands.
2019-11-11 11:52:57,836 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x17ebc082c478e,  containing 12 storage report(s), of which we sent 0. The reports had 21507382 total blocks and used 0 RPC(s). This took 5277 msec to generate and 678 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:17,754 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183dfe776d467,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5238 msec to generate and 542 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:23,126 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e1265a5171,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5367 msec to generate and 563 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:28,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e26411ce59,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5347 msec to generate and 547 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:33,719 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e3a2783dc1,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5359 msec to generate and 484 msecs for RPC and NN processing. Got back no commands.
2019-11-11 13:26:38,967 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x183e4db86a462,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5269 msec to generate and 480 msecs for RPC and NN processing. Got back no commands.

经过了3个小时,也没有汇报成功。

2.将dfs.blockreport.split.threshold设置成100万。

2019-11-11 14:18:15,588 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18695e8fff624,  containing 12 storage report(s), of which we sent 2. The reports had 21426179 total blocks and used 2 RPC(s). This took 10670 msec to generate and 137642 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:20:04,027 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186c1389c74b2,  containing 12 storage report(s), of which we sent 0. The reports had 21426179 total blocks and used 0 RPC(s). This took 5248 msec to generate and 60062 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:20:13,913 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186bf29b0ccc7,  containing 12 storage report(s), of which we sent 2. The reports had 21426179 total blocks and used 2 RPC(s). This took 5339 msec to generate and 78788 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:23:29,457 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x186e125f12a27,  containing 12 storage report(s), of which we sent 4. The reports had 21426179 total blocks and used 4 RPC(s). This took 5219 msec to generate and 128366 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:32:14,183 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18745b6b058ae,  containing 12 storage report(s), of which we sent 6. The reports had 21426179 total blocks and used 6 RPC(s). This took 5107 msec to generate and 221167 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:36:13,401 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x18778e2f96d16,  containing 12 storage report(s), of which we sent 6. The reports had 21426179 total blocks and used 6 RPC(s). This took 5229 msec to generate and 240599 msecs for RPC and NN processing. Got back no commands.
2019-11-11 14:36:43,543 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x187825e7cb772,  containing 12 storage report(s), of which we sent 12. The reports had 21426179 total blocks and used 12 RPC(s). This took 5510 msec to generate and 230014 msecs for RPC and NN processing. Got back no commands.

大概经历20分钟,汇报成功。再经过10分钟左右,心跳正常,能够正常提供服务。重启总时间大概是23分钟(Image Load)+20分钟(Block Report)+10分钟(心跳恢复),在1小时左右,大大缩短了时间。
经过测试,调整dfs.blockreport.split.threshold是有效减少Block Report风暴影响、加快汇报速度的方法。

Hadoop DataNode IO高问题排查

最近,有个客户反馈,他通过Flume写入HDFS的数据经常超时、写入失败等。我看了一下他的集群,使用的是阿里云D1机型,即本地盘,单盘写入速度一般都在100MB/s以上,所以磁盘性能上应该不存在太大问题。
登录集群后,看了一下基本的IO、CPU等信息,发现有大量磁盘的IO Util达到了100%。通过iotop查看IO较多的进程,发现是Hadoop DataNode进程占用较多,进而定位到DataNode的一个du -sk命令占用了大量的IO。
DataNode默认每10分钟使用du命令对磁盘空间进行一次扫描。但如果目录中存在大量文件,并且文件经常变化,du就需要大量的随机读,并引发IO使用率飙升。
这个客户的情况是:他通过Flume写入了大量的小文件,每块6TB磁盘写入了150万个文件,一次du耗时5分钟以上。这样磁盘大部分时间都在进行无意义的IO操作,从而使有效的IO写入操作被延迟,甚至中断连接。
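du与df的开销差异,可以用下面一个纯JDK的小例子示意(与Hadoop无关,文件数等均为假设):du式统计必须遍历目录下的每一个文件,代价随文件数线性增长;df式查询只需一次文件系统级调用。

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class DuVsDf {
    // du 式统计:遍历目录下每个文件,触发大量元数据读取
    static long duWalk(Path dir) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path f, BasicFileAttributes attrs) {
                total[0] += attrs.size();
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("du-demo");
        for (int i = 0; i < 1000; i++) {
            Files.write(dir.resolve("f" + i), new byte[]{1}); // 模拟 1000 个小文件
        }
        long used = duWalk(dir);                    // 开销随文件数线性增长
        long free = dir.toFile().getUsableSpace();  // df 式:一次 statfs 调用,O(1)
        System.out.println("du-style walk saw " + used + " bytes");
        System.out.println("df-style query is O(1), free space known: " + (free > 0));
    }
}
```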
针对这一问题,有两种方式解决:
1.https://issues.apache.org/jira/browse/HADOOP-9884
这个jira基本上是希望通过df来代替du
2.修改fs.du.interval参数
修改这个参数,比如改为每小时进行一次磁盘空间大小检查。
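对应的配置片段如下(core-site.xml,3600000毫秒即1小时):

```xml
<property>
  <name>fs.du.interval</name>
  <value>3600000</value><!-- 默认 600000(10 分钟) -->
</property>
```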

最后用户采用了第二种方式,将磁盘空间检查间隔从10分钟调整到1小时,基本上解决了写入问题。

Hadoop CGroup 设置

Hadoop 2.8的CGroup设置:CentOS 7默认的CGroup位置在/sys/fs/cgroup/cpu等。如果Hadoop不设置自动mount,而是使用默认mount位置的话,会mount到/sys/fs/cgroup/cpu/hadoop-yarn。所以如果使用Hadoop 2.8版本,请做以下操作:

假设以Hadoop用户启动YARN:
mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn
chown -R hadoop:hadoop /sys/fs/cgroup/cpu/hadoop-yarn

修改如下配置:
yarn.nodemanager.container-executor.class   org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
yarn.nodemanager.linux-container-executor.resources-handler.class  org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler
yarn.nodemanager.linux-container-executor.group hadoop
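上面三项配置写成yarn-site.xml片段即:

```xml
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```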

然后重启NodeManager就可以了。

Hadoop NameNode元数据恢复

最近在为用户解决问题时发现,个别用户删除了NameNode standby的所有元数据。为了恢复数据,可以做以下操作:

1.停止任务
2.namenode 进入safemode

hdfs dfsadmin -safemode enter

3.namenode 保存元数据

hdfs dfsadmin -saveNamespace

4.备份active元数据

备份 /mnt/disk1/hdfs 下所有数据

5.拷贝active数据到standby
将/mnt/disk1/hdfs 数据拷贝到standby
6.重启standby
7.重启成功后,退出safemode

hdfs dfsadmin -safemode leave

8.恢复任务

谈谈HftpFileSystem

最近在排查一个distcp反复失败的问题。我们知道distcp是在两个集群间进行数据拷贝的工具;对于远程集群或者版本不一致的集群,我们使用hftp拉取数据,使用方法如下:

hadoop distcp hftp://sourceFS:50070/src hdfs://destFS:50070/dest

Hftp的Schema是hftp开头的,实现的类为HftpFileSystem。
那么Hftp的读取流程是怎样的呢?以读取一个文件为例,下面仔细说明一下Hftp的流程:
1、首先Hftp会拼出NameNode的http URL,所有的Hftp交互都是通过http端口进行的,NameNode默认的http端口是50070,代码为

f = f.makeQualified(getUri(), getWorkingDirectory());
String path = "/data" + ServletUtil.encodePath(f.toUri().getPath());
String query = addDelegationTokenParam("ugi=" + getEncodedUgiParameter());
URL u = getNamenodeURL(path, query);
return new FSDataInputStream(new RangeHeaderInputStream(connectionFactory, u));

这个时候还没有与NameNode进行连接。当openInputStream的时候才会进行实际连接,代码为

protected InputStream openInputStream() throws IOException {
// Use the original url if no resolved url exists, eg. if
// it's the first time a request is made.
final boolean resolved = resolvedURL.getURL() != null;
final URLOpener opener = resolved? resolvedURL: originalURL;

final HttpURLConnection connection = opener.connect(startPos, resolved);

2、NameNode通过NamenodeWebHdfsMethods类来处理所有的HTTP请求。对于hftp发送的Open请求,NameNode的处理逻辑是根据Block位置选择一个最合适的DataNode,然后以重定向的方式,把URL设置成该DataNode的HTTP端口,返回给Hftp client端。

final URI uri = redirectURI(namenode, ugi, delegation, username, doAsUser,
fullpath, op.getValue(), offset.getValue(), -1L, offset, length, bufferSize);
return Response.temporaryRedirect(uri).type(MediaType.APPLICATION_OCTET_STREAM).build();

至此,完成了Client与NameNode的交互,Client然后与DataNode进行交互。
3、Client根据重定向的URL与需要读取数据的DataNode建立连接,注意这一步走的还是HTTP端口,DataNode默认的HTTP端口是50075。

resolvedURL.setURL(getResolvedUrl(connection));

InputStream in = connection.getInputStream();

4、在DataNode内部通过DatanodeWebHdfsMethods类来处理HTTP请求,对于OPEN的请求,处理方式为

case OPEN:
{
final int b = bufferSize.getValue(conf);
final DFSClient dfsclient = newDfsClient(nnId, conf);
HdfsDataInputStream in = null;
try {
in = new HdfsDataInputStream(dfsclient.open(fullpath, b, true));
in.seek(offset.getValue());
} catch(IOException ioe) {
IOUtils.cleanup(LOG, in);
IOUtils.cleanup(LOG, dfsclient);
throw ioe;
}

final long n = length.getValue() != null ?
Math.min(length.getValue(), in.getVisibleLength() - offset.getValue()) :
in.getVisibleLength() - offset.getValue();

/**
* Allow the Web UI to perform an AJAX request to get the data.
*/
return Response.ok(new OpenEntity(in, n, dfsclient))
.type(MediaType.APPLICATION_OCTET_STREAM)
.header("Access-Control-Allow-Methods", "GET")
.header("Access-Control-Allow-Origin", "*")
.build();
}

内部如果开启了本地读,那么就生成一个使用本地读(BlockReaderLocal)的DFSClient来读取数据;如果没有开启本地读,那么就通过Socket连接来读取数据。
5、6、把InputStream返回给client用于数据读取。至此,一个Hftp读取流程结束。

我们遇到的问题

当执行distcp且其中一端为hftp时,发现任务总是在读取某一个文件时失败,该文件所在的机器状态不太正常:查看网络连接,50075端口有很多CLOSE_WAIT,并且DataNode进程写入、读取都非常慢,判断当时网络连接不正常。因为NameNode的HTTP接口只返回一个DataNode地址,并不像RPC接口那样返回DataNode列表,如果这个DataNode出现问题,无法换到其他DataNode读取。这台出问题的DataNode就导致了反复无法读取数据。当时的解决方法是重启该DataNode进程,使读取落到其他DataNode上。
至于深层原因,跟当时的DN状态有关。由于没有保存完整的堆栈信息,只能在下次出错时继续排查,但基本思路是DN的socket server响应慢,是由内部逻辑导致的。

Hadoop 2运行distCpV1

记录一下:有些情况下需要使用distcpV1,命令为

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-extras-2.4.0.jar org.apache.hadoop.tools.DistCpV1 $source $dest

Hadoop Web Page Information

Sometimes we want to make sure from which branch we compiled the Hadoop code. Normally, we can find this information on the Hadoop web page, for example the NameNode web page:


If we want to change this information ourselves, we should modify the file
$HADOOP_HOME/hadoop-common-project/hadoop-common/src/main/resources/common-version-info.properties

version=${pom.version}
revision=${version-info.scm.commit}
branch=${version-info.scm.branch}
user=${user.name}
date=${version-info.build.time}
url=${version-info.scm.uri}
srcChecksum=${version-info.source.md5}
protocVersion=${protobuf.version}

Just modify the values and you can change the NameNode web page. For example, we can change the branch to branch=tag-20160817 as in the image above.

Hadoop Balancer问题

最近关注了一下集群Balancer的运转。由于集群非常大,并且有一些长时任务一直运行,磁盘不均衡的问题非常严重,所以快速均衡数据是非常必要的。
但是从日志上看,我们的Balancer经常hang住。之前的做法是通过一个cron脚本进行检查,hang住一段时间后,自动重启Balancer。
这两天特意看了一下日志和代码,发现hang住的原因是NPE导致的。具体的代码如下:

             // update locations
             for (String datanodeUuid : blk.getDatanodeUuids()) {
               final BalancerDatanode d = datanodeMap.get(datanodeUuid);
-              if (datanode != null) { // not an unknown datanode
+              if (d != null) { // not an unknown datanode
                 block.addLocation(d);
               }
             }

进行判断的对象写错了,将datanode改为d就好了。原先的写法会导致Balancer提交的copy任务抛出NPE错误。

    for (Source source : sources) {
      futures[i++] = dispatcherExecutor.submit(source.new BlockMoveDispatcher());
    }

    // wait for all dispatcher threads to finish
    for (Future<?> future : futures) {
      try {
        future.get();
      } catch (ExecutionException e) {
        LOG.warn("Dispatcher thread failed", e.getCause());
      }
    }

在这部分,由于提交的BlockMoveDispatcher任务抛出了NPE,同时future.get()没有设置超时,就会一直hang住。
解决方法:除了修复NPE以外,还可以给future.get()设置超时时间,确保不会一直hang住。
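给future.get()设置超时的做法可以用下面的纯JDK示例说明(类名、任务内容均为假设,并非Balancer源码):用一个永不结束的任务模拟hang住的BlockMoveDispatcher线程,带超时的get保证等待是有界的。

```java
import java.util.concurrent.*;

public class GetWithTimeout {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // 模拟一个因内部异常而永远不结束的 BlockMoveDispatcher 任务
        Future<?> stuck = pool.submit(() -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) {}
        });
        try {
            // 用带超时的 get 代替无限期的 future.get()
            stuck.get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("dispatcher timed out, cancelling");
            stuck.cancel(true); // 中断该任务,让本轮 balance 迭代可以结束并重试
        }
        pool.shutdownNow();
    }
}
```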

ResourceManager dispatcher handler slow because RMStore Synchronized Method

Recently, I have noticed that the async dispatcher in our resource manager gets backlogged from time to time.
Here are some logs:

2016-05-24 00:46:20,398 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 24000
2016-05-24 00:46:21,008 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 25000
2016-05-24 00:46:21,632 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 26000
2016-05-24 00:46:22,251 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 27000
2016-05-24 00:46:22,873 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 28000
2016-05-24 00:46:23,501 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 29000
2016-05-24 00:46:24,109 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 30000  

As we all know, the async dispatcher in the resource manager normally handles events quickly enough, but from the log we can see that the backlog was serious.
So we investigated this problem and ran jstack against the resource manager process while it was pending. Here is the jstack information:

"AsyncDispatcher event handler" prio=10 tid=0x00007f4d6db10000 nid=0x5bca waiting for monitor entry [0x00007f4d3aa8c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplication(RMStateStore.java:375)
        - waiting to lock <0x00000003bae88af0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppNewlySavingTransition.transition(RMAppImpl.java:881)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppNewlySavingTransition.transition(RMAppImpl.java:872)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        - locked <0x0000000394cbae40> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:645)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:82)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:690)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:674)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:662)

"AsyncDispatcher event handler" daemon prio=10 tid=0x00007f4d6d8f6000 nid=0x5c32 in Object.wait() [0x00007f4d3a183000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2031)
        - locked <0x000000032bc7bd58> (a java.util.LinkedList)
        at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2015)
        at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2113)
        - locked <0x000000032bc7ba80> (a org.apache.hadoop.hdfs.DFSOutputStream)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:528)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:329)
        - locked <0x00000003bae88af0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:625)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:770)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:765)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:662)

It seems the async dispatcher in the resource manager is blocked by the synchronized method storeNewApplication in RMStateStore.
From the code we know there are two async dispatchers in the resource manager process. One is the main dispatcher for the whole resource manager, which deals with application submission, scheduling and other work. The other is the dispatcher inside the RMStateStore (the function of the rmstore is explained in this blog). Because the RMStateStore uses HDFS or ZooKeeper for backup and its processing is slow, it has its own dispatcher so as not to block the main dispatcher of the resource manager.
Unfortunately, we use HDFS for our RMStateStore backup. Deep inside the code:

  public synchronized void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app
                                            .getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    ApplicationState appState =
        new ApplicationState(app.getSubmitTime(), app.getStartTime(), context,
          app.getUser());
    dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
  }

The main dispatcher handles the event to store the new application context information. This method only passes the event to the RMStateStore dispatcher and returns immediately, but it is synchronized. And in the child class of RMStateStore, FileSystemRMStateStore, the code that stores the application to HDFS is as follows:

  @Override
  public synchronized void storeApplicationStateInternal(ApplicationId appId,
      ApplicationStateData appStateDataPB) throws Exception {
    String appIdStr = appId.toString();
    Path appDirPath = getAppDir(rmAppRoot, appIdStr);
    fs.mkdirs(appDirPath);
    Path nodeCreatePath = getNodePath(appDirPath, appIdStr);

    LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
    byte[] appStateData = appStateDataPB.getProto().toByteArray();
    try {
      // currently throw all exceptions. May need to respond differently for HA
      // based on whether we have lost the right to write to FS
      writeFile(nodeCreatePath, appStateData);
    } catch (Exception e) {
      LOG.info("Error storing info for app: " + appId, e);
      throw e;
    }
  }

This method is also synchronized. So if the dispatcher in the RMStateStore holds the lock while a write to HDFS is slow, the main dispatcher easily gets pending waiting for the same lock. The lock is meaningless for the main dispatcher, so we can simply get rid of it.
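The effect can be sketched with a plain-JDK example (hypothetical class and method names, not the YARN source): a hand-off method that only enqueues an event still stalls when it shares a monitor with a slow synchronized writer.

```java
import java.util.concurrent.*;

public class SyncHandoffDemo {
    static class Store {
        final BlockingQueue<String> events = new LinkedBlockingQueue<>();
        final CountDownLatch lockHeld = new CountDownLatch(1);
        // slow writer, like storeApplicationStateInternal writing to HDFS
        synchronized void writeSlowly() throws InterruptedException {
            lockHeld.countDown();     // signal that the monitor is now held
            Thread.sleep(1500);       // simulate a slow HDFS write
        }
        // fast hand-off, like storeNewApplication: only enqueues an event,
        // yet needlessly synchronized on the same monitor as the slow writer
        synchronized void storeNewApplication(String app) { events.add(app); }
    }

    public static void main(String[] args) throws Exception {
        Store store = new Store();
        Thread writer = new Thread(() -> {
            try { store.writeSlowly(); } catch (InterruptedException ignored) {}
        });
        writer.start();
        store.lockHeld.await();               // writer now owns the monitor
        long t0 = System.nanoTime();
        store.storeNewApplication("app_1");   // the "main dispatcher" blocks here
        long waitedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(waitedMs > 500 ? "enqueue blocked behind slow writer"
                                          : "enqueue returned quickly");
        writer.join();
    }
}
```

Removing synchronized from the hand-off method (as YARN-4398 does) makes the enqueue return immediately.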
There is a related jira for this problem: YARN-4398.