2. Hadoop Cluster HA: Repairing a Damaged Standby NameNode 2



1. Check the state of the Standby NameNode.

[hadoop@big82 current]$ hdfs dfs -ls /

# My HDFS contains a single directory: /test1

Found 1 items

drwxr-xr-x   - hadoop supergroup          0 2022-04-16 18:47 /test1

[hadoop@big82 current]$ hdfs haadmin -getAllServiceState

big81:9000                                         active

big82:9000                                         standby    # state is normal
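
The state check above can also be scripted. The following is a minimal sketch, assuming the two-column output format shown above; `ha_state_summary` is a hypothetical helper name, not a Hadoop command. It reads the command output from standard input and succeeds only when exactly one NameNode is active and one is standby.

```shell
# Sketch: summarize `hdfs haadmin -getAllServiceState` output read from stdin.
# ha_state_summary is a hypothetical helper, not part of Hadoop.
ha_state_summary() {
    awk '
        $2 == "active"  { active++ }
        $2 == "standby" { standby++ }
        END {
            printf "active=%d standby=%d\n", active, standby
            # A healthy two-NameNode HA pair has exactly one of each.
            exit !(active == 1 && standby == 1)
        }'
}

# Usage on a live cluster (not executed here):
#   hdfs haadmin -getAllServiceState 2>/dev/null | ha_state_summary
```

Reading from standard input keeps the helper usable against saved command output as well as a live cluster.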


2. Simulate damage to the Standby NameNode.


[hadoop@big82 current]$ pwd

/data02/current

[hadoop@big82 current]$ ll

total 16

-rw-rw-r-- 1 hadoop hadoop 388 Apr 17 09:50 fsimage_0000000000000000000

-rw-rw-r-- 1 hadoop hadoop  62 Apr 17 09:50 fsimage_0000000000000000000.md5

-rw-rw-r-- 1 hadoop hadoop   2 Apr 17 09:50 seen_txid

-rw-rw-r-- 1 hadoop hadoop 216 Apr 17 09:50 VERSION

[hadoop@big82 current]$ rm -rf *     # delete all of the Standby NameNode's metadata


At this point the cluster has not yet noticed that the Standby NameNode is damaged. We restart the Standby NameNode so that the cluster learns it is gone.
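
Because `rm -rf *` cannot be undone, a reversible variant of the simulation may be preferable on anything other than a throwaway cluster. The sketch below is one possible approach, not the method used in this post: it renames the metadata directory and leaves an empty one in its place. `simulate_nn_loss` and the `.bak.<timestamp>` suffix are hypothetical names chosen for illustration.

```shell
# Sketch: "damage" a NameNode metadata directory by renaming it, so the
# experiment can be undone. simulate_nn_loss is a hypothetical helper.
simulate_nn_loss() {
    current_dir=$1                           # e.g. /data02/current
    backup_dir="${current_dir}.bak.$(date +%Y%m%d%H%M%S)"
    [ -d "$current_dir" ] || { echo "not a directory: $current_dir" >&2; return 1; }
    mv "$current_dir" "$backup_dir" || return 1
    mkdir "$current_dir" || return 1         # leave an empty directory, like rm -rf *
    echo "$backup_dir"                       # print where the original files went
}

# Usage (not executed here):
#   simulate_nn_loss /data02/current
```

The printed backup path can be moved back into place if the repair in the following steps does not succeed.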


3. Restart the Standby NameNode.


[hadoop@big82 current]$ hdfs --daemon stop namenode

[hadoop@big82 current]$ hdfs --daemon start namenode

[hadoop@big82 current]$ jps

18594 ResourceManager

22088 Jps

17803 DFSZKFailoverController

[hadoop@big82 current]$ hdfs haadmin -getAllServiceState

big81:9000                                         active

2022-04-17 10:12:43,122 INFO ipc.Client: Retrying connect to server: big82/192.168.1.82:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)



big82:9000                                         Failed to connect: Call From big82/192.168.1.82 to big82:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

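As shown, the Standby NameNode fails to start: no NameNode process appears in the jps output. At the same time, the DataNode logs report IPC connection errors: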


2022-04-17 10:12:38,881 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: big82/192.168.1.82:9000. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)


2022-04-17 10:15:30,301 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService

java.net.ConnectException: Call From big91/192.168.1.91 to big82:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.GeneratedConstructorAccessor11.newInstance(Unknown Source)
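
To gauge how often a DataNode is failing to reach a particular NameNode, the log can be filtered for the exception shown above. This is a minimal sketch that assumes the message format in the excerpt; `count_refused` is a hypothetical helper, and the DataNode log path depends on the installation, so it is left as a placeholder in the usage comment.

```shell
# Sketch: count "Connection refused" exceptions for one NameNode address.
# count_refused is a hypothetical helper; it reads a DataNode log from stdin.
count_refused() {
    target=$1                                # e.g. big82:9000
    # Keep only exceptions raised for calls to the target, then count them.
    grep -F "to ${target} failed on connection exception" |
        grep -c "Connection refused"
}

# Usage (not executed here; substitute the real log file):
#   count_refused big82:9000 < /path/to/datanode.log
```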


4. Repair the Standby NameNode.

[hadoop@big82 current]$ hdfs namenode -bootstrapStandby       # re-format the Standby NameNode and copy the metadata from the active NameNode


Java HotSpot(TM) 64-Bit Server VM warning: Cannot open file /data02/logs/hadoop/gc.log due to No such file or directory

log4j:ERROR Could not find value for key log4j.appender.DRFAAUDIT

log4j:ERROR Could not instantiate appender named "DRFAAUDIT".

2022-04-17 10:17:00,047 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = big82/192.168.1.82

STARTUP_MSG:   args = [-bootstrapStandby]

STARTUP_MSG:   version = 3.1.1

………………………………………………………………………………

STARTUP_MSG:   build = https://github.com/apache/hadoop -r 2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c; compiled by 'leftnoteasy' on 2018-08-02T04:26Z

STARTUP_MSG:   java = 1.8.0_261

************************************************************/

2022-04-17 10:17:00,059 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]

2022-04-17 10:17:00,066 INFO namenode.NameNode: createNameNode [-bootstrapStandby]

2022-04-17 10:17:00,279 INFO ha.BootstrapStandby: Found nn: fgedunn1, ipc: big81/192.168.1.81:9000

=====================================================

About to bootstrap Standby ID fgedunn2 from:

Nameservice ID: fgeduns

Other Namenode ID: fgedunn1

Other NN's HTTP address: http://big81:50070

Other NN's IPC  address: big81/192.168.1.81:9000

Namespace ID: 166178331

Block pool ID: BP-1145621526-192.168.1.81-1650104239176

Cluster ID: CID-6e446eb6-01fa-4f97-9e08-9f5842bf335a

Layout version: -64

isUpgradeFinalized: true

=====================================================

Re-format filesystem in Storage Directory root= /data02; location= null ? (Y or N)

y  # confirm the re-format


2022-04-17 10:17:34,117 INFO common.Storage: Will remove files: []

2022-04-17 10:17:34,129 INFO common.Storage: Storage directory /data02 has been successfully formatted.   # format succeeded


2022-04-17 10:17:34,760 INFO namenode.FSEditLog: Edit logging is async:true  # edit logging is asynchronous; the fsimage is then downloaded from the active NameNode


2022-04-17 10:17:35,015 INFO namenode.TransferFsImage: Opening connection to http://big81:50070/imagetransfer?getimage=1&txid=0&storageInfo=-64:166178331:1650104239176:CID-6e446eb6-01fa-4f97-9e08-9f5842bf335a&bootstrapstandby=true

2022-04-17 10:17:35,060 INFO common.Util: Combined time for file download and fsync to all disks took 0.00s. The file download took 0.00s at 0.00 KB/s. Synchronous (fsync) write to disk of /data02/current/fsimage.ckpt_0000000000000000000 took 0.00s.

2022-04-17 10:17:35,060 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000000 size 388 bytes.

2022-04-17 10:17:35,077 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at big82/192.168.1.82

************************************************************/
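
Once the bootstrap has exited, the downloaded image can be sanity-checked before the daemon is started. The sketch below assumes that each `fsimage_*.md5` file uses the `md5sum`-compatible `<digest> *<filename>` layout and that GNU `md5sum` is installed; `verify_fsimage` is a hypothetical helper name.

```shell
# Sketch: verify every fsimage in a NameNode "current" directory against its
# .md5 companion file. verify_fsimage is a hypothetical helper.
# Assumption: the .md5 files are in md5sum-compatible format.
verify_fsimage() {
    dir=$1                                   # e.g. /data02/current
    status=0
    for sum in "$dir"/fsimage_*.md5; do
        [ -e "$sum" ] || { echo "no fsimage checksum files in $dir" >&2; return 1; }
        # md5sum -c resolves the listed file name relative to the working
        # directory, so run it from inside the storage directory.
        ( cd "$dir" && md5sum -c "$(basename "$sum")" ) || status=1
    done
    return $status
}

# Usage (not executed here):
#   verify_fsimage /data02/current
```

A non-zero return means either that no checksum file was found or that at least one image does not match its recorded digest.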


5. Start the Standby NameNode.


[hadoop@big82 data02]$ cd current

[hadoop@big82 current]$ ll   # after the format, the fsimage has been regenerated

total 16

-rw-rw-r-- 1 hadoop hadoop 388 Apr 17 10:17 fsimage_0000000000000000000

-rw-rw-r-- 1 hadoop hadoop  62 Apr 17 10:17 fsimage_0000000000000000000.md5

-rw-rw-r-- 1 hadoop hadoop   2 Apr 17 10:17 seen_txid

-rw-rw-r-- 1 hadoop hadoop 216 Apr 17 10:17 VERSION

[hadoop@big82 current]$ hdfs --daemon start namenode



[hadoop@big82 current]$ jps

18594 ResourceManager

17803 DFSZKFailoverController

22475 Jps



22396 NameNode      # the Standby NameNode is running again



[hadoop@big82 current]$ hdfs haadmin -getAllServiceState

big81:9000                                         active



big82:9000                                         standby     # back in the standby state
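
A freshly started NameNode may take a moment before it reports its state, so a scripted repair would typically poll rather than check once. This is a rough sketch built on the same `hdfs haadmin -getAllServiceState` command used above; `wait_for_state`, its default of 30 tries, and its 2-second pause are hypothetical choices.

```shell
# Sketch: poll until a NameNode reports the wanted HA state.
# wait_for_state HOST:PORT STATE [TRIES] [SLEEP_SECONDS]  (hypothetical helper)
wait_for_state() {
    target=$1; want=$2; tries=${3:-30}; pause=${4:-2}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Column 1 is the NameNode address, column 2 its reported state.
        state=$(hdfs haadmin -getAllServiceState 2>/dev/null |
                awk -v t="$target" '$1 == t { print $2 }')
        if [ "$state" = "$want" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    echo "$target did not reach state '$want' after $tries checks" >&2
    return 1
}

# Usage (not executed here):
#   wait_for_state big82:9000 standby
```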

6. Check whether the DataNode still reports errors.

2022-04-17 10:19:20,659 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1145621526-192.168.1.81-1650104239176 (Datanode Uuid 31d0fd01-959e-434b-998e-016345cf3d7e) service to big82/192.168.1.82:9000 successfully registered with NN

2022-04-17 10:19:20,687 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x95b5123c058e9e1f,  containing 3 storage report(s), of which we sent 3. The reports had 0 total blocks and used 1 RPC(s). This took 1 msec to generate and 19 msecs for RPC and NN processing. Got back no commands

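The DataNode has registered with big82 and sent it a block report, which shows that the Standby NameNode is handling requests normally again.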
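
The same check can be scripted by searching the DataNode log for the registration message. This is a minimal sketch that assumes the message wording in the excerpt above; `registered_with` is a hypothetical helper, and the log path is a placeholder.

```shell
# Sketch: succeed if a DataNode log (read from stdin) contains a successful
# block-pool registration with the given NameNode host.
# registered_with is a hypothetical helper.
registered_with() {
    grep -q "service to $1/.* successfully registered with NN"
}

# Usage (not executed here; substitute the real log file):
#   registered_with big82 < /path/to/datanode.log && echo "registered"
```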

7. Check our HDFS directory.

[hadoop@big82 current]$ hdfs dfs -ls /

Found 1 items

drwxr-xr-x   - hadoop supergroup          0 2022-04-16 18:47 /test1

The /test1 directory is still present, so the repair is complete.



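Copyright notice: This is an original article by weixin_43346403, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.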