[HDFS] A small bug in NameNode sshFence

2017-01-12

A few days ago, during an abnormal failover in our cluster, the following warning logs appeared:

[2017-01-10T01:42:37.234+08:00] [WARN] hadoop.ha.SshFenceByTcpPort.pump(StreamPumper.java 88) [nc -z xxx-xxx-17224.hadoop.xxx.com 8021 via ssh: StreamPumper for STDERR] : nc -z xxx-xxx-17224.hadoop.xxx.com 8021 via ssh: nc: invalid option -- 'z'
[2017-01-10T01:42:37.235+08:00] [WARN] hadoop.ha.SshFenceByTcpPort.pump(StreamPumper.java 88) [nc -z xxx-xxx-17224.hadoop.xxx.com 8021 via ssh: StreamPumper for STDERR] : nc -z xxx-xxx-17224.hadoop.xxx.com8021 via ssh: Ncat: Try `--help' or man(1) ncat for more information, usage options and help. QUITTING.

We know that when tryGracefulFence does not succeed, the next step is to fence the NN process in question. Let's look at the code:

private boolean doFence(Session session, InetSocketAddress serviceAddr)
    throws JSchException {
  int port = serviceAddr.getPort();
  try {
    // This log line did appear, so nothing to see here.
    LOG.info("Looking for process running on port " + port);
    int rc = execCommand(session,
        "PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp " + port);
    // The success log below did not appear, so the command must have returned rc == 1.
    if (rc == 0) {
      LOG.info("Successfully killed process that was " +
          "listening on port " + port);
      // exit code 0 indicates the process was successfully killed.
      return true;
    } else if (rc == 1) {
      // exit code 1 indicates either that the process was not running
      // or that fuser didn't have root privileges in order to find it
      // (eg running as a different user)
      LOG.info(
          "Indeterminate response from trying to kill service. " +
          "Verifying whether it is running using nc...");
      // This command is meant to check whether the port is still listening,
      // but it failed and returned 2. That is not a completed check that
      // returned 1: nc itself was invoked with an option it does not support.
      rc = execCommand(session, "nc -z " + serviceAddr.getHostName() +
          " " + serviceAddr.getPort());
      if (rc == 0) {
        // the service is still listening - we are unable to fence
        LOG.warn("Unable to fence - it is running but we cannot kill it");
        return false;
      } else {
        LOG.info("Verified that the service is down.");
        return true;
      }
    } else {
      // other
    }
    LOG.info("rc: " + rc);
    return rc == 0;
  } catch (InterruptedException e) {
    LOG.warn("Interrupted while trying to fence via ssh", e);
    return false;
  } catch (IOException e) {
    LOG.warn("Unknown failure while trying to fence via ssh", e);
    return false;
  }
}

Running this command on CentOS 7 then gives:

[XXX@XXX-XXX-17223 ~]$ nc -z
nc: invalid option -- 'z'
Ncat: Try `--help' or man(1) ncat for more information, usage options and help. QUITTING.
[XX@XXX-XXX-17223 ~]$ echo $?
2

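This means that a non-zero return value from the statement below does not prove that the fence succeeded: depending on the nc version shipped with the operating system, the return value can be non-zero simply because nc rejected the option and exited with 2.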

rc = execCommand(session, "nc -z " + serviceAddr.getHostName() +
" " + serviceAddr.getPort());
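To make the consequence concrete, here is a minimal sketch that models only the decision taken on the nc exit code in the doFence branch quoted earlier. The class and method names (LegacyNcCheck, fenceReportedSuccessful) are hypothetical and exist only for illustration; they are not part of Hadoop.

```java
// Hypothetical model of the nc branch in doFence(); not Hadoop code.
final class LegacyNcCheck {
    private LegacyNcCheck() {}

    /**
     * Mirrors the existing branch: only exit code 0 counts as
     * "service still listening"; every other code is reported as fenced.
     */
    static boolean fenceReportedSuccessful(int ncExitCode) {
        if (ncExitCode == 0) {
            // the service is still listening, so fencing is reported as failed
            return false;
        }
        // exit code 1 (port closed) and exit code 2 (nc usage error) both end up here
        return true;
    }

    public static void main(String[] args) {
        for (int rc : new int[] {0, 1, 2}) {
            System.out.println("nc exit code " + rc
                + " -> fence reported successful: " + fenceReportedSuccessful(rc));
        }
    }
}
```

With exit code 2, the value observed on CentOS 7, this model reports the fence as successful even though nc never checked the port.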

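Put more strictly: unless the return value is exactly 1, we cannot conclude that the fence succeeded either. On a machine where nc does support -z, an open port returns 0 and a closed one returns 1: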

➜  ~ nc -z 127.0.0.1 3200
➜ ~ echo $?
0
➜ ~ nc -z 127.0.0.3 3200
➜ ~ echo $?
1
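A stricter interpretation along these lines could be sketched as follows. This is only an illustration under the assumption, taken from the shell session above, that nc exits with 0 when the port is open and 1 when it is not; the names StrictNcCheck, PortState, interpret and fenceSucceeded are hypothetical, and this is not the patch proposed in the community tickets.

```java
// Hypothetical sketch of a stricter reading of the nc exit code; not Hadoop code.
final class StrictNcCheck {
    private StrictNcCheck() {}

    enum PortState { LISTENING, CLOSED, UNKNOWN }

    /** Maps the nc exit code to what we actually know about the port. */
    static PortState interpret(int ncExitCode) {
        switch (ncExitCode) {
            case 0:
                return PortState.LISTENING; // connection succeeded
            case 1:
                return PortState.CLOSED;    // assumed: nc ran and could not connect
            default:
                return PortState.UNKNOWN;   // e.g. 2: nc rejected the -z option
        }
    }

    /** The fence counts as successful only when the port is known to be closed. */
    static boolean fenceSucceeded(int ncExitCode) {
        return interpret(ncExitCode) == PortState.CLOSED;
    }

    public static void main(String[] args) {
        for (int rc : new int[] {0, 1, 2}) {
            System.out.println("nc exit code " + rc + " -> " + interpret(rc)
                + ", fence succeeded: " + fenceSucceeded(rc));
        }
    }
}
```

Treating every unexpected exit code as UNKNOWN errs on the side of reporting a failed fence, which seems the safer default when the goal is to avoid two active NameNodes.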

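This problem in the code has already been reported to the community and a fix has been proposed, but it has not been merged. For details, see HDFS-3618 and HDFS-11308.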