nodemanager异常 org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch

  • Post author:
  • Post category:其他


系统环境:ubuntu14.04 server cloudera CDH 5.10 共计28个节点。单节点内存128G。有6台机器是24核,22台机器是32核。有10台机器磁盘容量是1.8T 18台机器磁盘容量是12.8T。


resourceManager,namenode 在同一节点上,且该节点不包括datanode和nodeManager服务。


剩余28节点均包含了datanode和nodeManager服务


最近集群做TB级数据随机化实验,hdfs各节点数据分布不太均衡,10台磁盘容量为1.8T的机器磁盘全部占满,导致这些节点上yarn的nodemanager 没有足够的空间,致使9个nodemanager 停止接受服务,1个nodemanager 服务终止,且无法启动。

考虑将10台低容量磁盘机器datanode角色停止授权,数据分发到其他大容量机器上,并删除hdfs数据。

在停止授权期间,我给集群又添加了一台32核,128G内存 1.8T磁盘的机器,分配了nodeManager角色。


上述工作顺利完成后,重新启动yarn.

大部分机器正常工作,只有上述服务终止的nodemanager依然无法启动,查询日志报错如下:


FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager


Error starting NodeManager


org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)


Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)


at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)


at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)


… 5 more

google搜索 org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch 给出的解决方案:


网址:

https://community.hortonworks.com/questions/86301/nodemanager-fails-to-start.html



Maybe a sst file got corrupt can you try to remove the folder of /var/log/hadoop-yarn/nodemanager/recovery-state from failed nodemanagers and check if starts?


These files stays in the system even if you decomission the nodes.


上述文件在节点上并不存在。故继续查看日志:

上午9点56:15.339分 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager


registered UNIX signal handlers for [TERM, HUP, INT]


4月 14, 上午9点56:16.426分 INFO org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService


Using state database at /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state for recovery


4月 14, 上午9点56:16.457分 INFO org.apache.hadoop.service.AbstractService


Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)


at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)


at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)


4月 14, 上午9点56:16.466分 INFO org.apache.hadoop.service.AbstractService


Service NodeManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:227)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:544)


at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:591)


Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch


at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)


at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)


at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:944)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:931)


at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)


at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)


… 5 more

发现疑似目录:/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state下存在:
005615.sst  005638.log  005640.log  CURRENT  LOCK  MANIFEST-004397
移除所有文件。重启nodemanager 成功。
回顾错误原因可能是,我在该nodemanager终止情况下,在集群中添加了新的nodemanager,使得角色数目增加,而启动失败的nodemanager时,它使用存储的状态来恢复,在和数据库校验过程中发现数目不符合而启动失败。因此删除上述目录下的文件。