Ambari Operations Guide, Part 4

Continued from Ambari Operations Guide, Part 3: https://blog.csdn.net/devalone/article/details/80813176

8 Managing Alerts and Notifications

—————————————————————————————————————————————–

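Ambari uses a predefined set of seven alert types (WEB, PORT, METRIC, AGGREGATE, SCRIPT, SERVER, and RECOVERY) for each cluster component and host. You can use these alerts to monitor cluster health and to alert other users so that they can help identify and troubleshoot problems. You can modify an alert's name, description, and check interval, and you can disable and re-enable alerts.

You can also create groups of alerts and set up notification targets for each group, so that different sets of alerts are sent to different groups of users by different methods.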

8.1 Understanding Alerts

—————————————————————————————————————————————–

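Ambari predefines a set of alerts that monitor the cluster components and hosts. Each alert is defined by an alert definition, which specifies the alert type, the check interval, and the thresholds. When a cluster is created or modified, Ambari reads the alert definitions and creates alert instances for the specific items to monitor. For example, if the cluster includes Hadoop Distributed File System (HDFS), there is an alert definition that monitors "DataNode Process", and an instance of that definition is created for each DataNode in the cluster.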
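The relationship between one alert definition and its per-host instances can be pictured with a small Python sketch. This is only an illustration: the class names, field names, and the `layout` dictionary are hypothetical and do not reflect Ambari's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    name: str              # e.g. "DataNode Process"
    alert_type: str        # one of the seven alert types
    component: str         # component the definition watches (hypothetical key)
    interval_minutes: int  # check interval


@dataclass(frozen=True)
class AlertInstance:
    definition: AlertDefinition
    host: str


def create_instances(definition, hosts_by_component):
    """Create one alert instance per host that runs the watched component."""
    hosts = hosts_by_component.get(definition.component, [])
    return [AlertInstance(definition, host) for host in hosts]


# Hypothetical cluster layout: three DataNode hosts and one NameNode host.
layout = {"DATANODE": ["dn1", "dn2", "dn3"], "NAMENODE": ["nn1"]}
datanode_process = AlertDefinition("DataNode Process", "PORT", "DATANODE", 1)
instances = create_instances(datanode_process, layout)
print([instance.host for instance in instances])
```

One definition yields three instances here, one for each DataNode host.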

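In Ambari Web, click the Alerts tab to browse the list of alert definitions for the cluster. You can search and filter alert definitions by current status, by last status change, and by the service that the alert definition is associated with. Click an alert definition name to view details about that alert, to modify its properties (such as check interval and thresholds), and to see the list of alert instances associated with that definition.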

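Each alert instance reports an alert status, defined by severity. The most common severity levels are OK, WARNING, and CRITICAL; there are also UNKNOWN and NONE. Alert notifications are sent when the alert status changes (for example, from OK to CRITICAL).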

8.1.1 Alert Types

—————————————————————————————————————————————–

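Alert thresholds and threshold units depend on the type of the alert. The following list describes each alert type, its possible statuses, and which threshold units can be configured, if the thresholds are configurable.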

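WEB Alert Type: WEB alerts watch a web URL on a given component, and the alert status is determined by the HTTP response code. You therefore cannot change which HTTP response codes determine the thresholds for WEB alerts. You can customize the response text for each threshold and the overall web connection timeout. A connection timeout is considered a CRITICAL alert. Threshold units are seconds.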

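Response codes map to WEB alert statuses as follows:

● OK status: the web URL responds with a code below 400.

● WARNING status: the web URL responds with a code equal to or higher than 400.

● CRITICAL status: Ambari cannot connect to the web URL.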
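The mapping above is small enough to write down as a function. The following Python sketch is illustrative only: `web_alert_status` is a hypothetical name, not part of Ambari, and `None` is used here as an assumed stand-in for a failed or timed-out connection.

```python
def web_alert_status(response_code):
    """Map an HTTP response code to a WEB alert status.

    Pass None when the connection could not be made or timed out
    (an assumption of this sketch, not an Ambari convention).
    """
    if response_code is None:
        return "CRITICAL"
    if response_code < 400:
        return "OK"
    return "WARNING"


for code in (200, 302, 404, 503, None):
    print(code, web_alert_status(code))
```

Note that a 5xx response still counts as WARNING under this mapping; only a failed connection is CRITICAL.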

PORT Alert Type: PORT alerts check the response time to connect to a given port. Threshold units are seconds.
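A PORT-style check can be approximated by timing a TCP connect. The sketch below assumes hypothetical threshold values (`warning_s`, `critical_s`) and a hypothetical function name; it shows the idea and is not Ambari's implementation.

```python
import socket
import time


def port_alert_status(host, port, warning_s=1.5, critical_s=5.0):
    """Time a TCP connect and compare it with two thresholds in seconds.

    The default thresholds are hypothetical values chosen for illustration.
    """
    start = time.monotonic()
    try:
        # The critical threshold doubles as the connect timeout.
        with socket.create_connection((host, port), timeout=critical_s):
            elapsed = time.monotonic() - start
    except OSError:
        # Refused, unreachable, or slower than the critical threshold.
        return "CRITICAL"
    return "WARNING" if elapsed >= warning_s else "OK"
```

For example, `port_alert_status("dn1.example.com", 50010)` would time a connect to a host and port of your choosing; both values here are placeholders.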

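METRIC Alert Type: METRIC alerts check the value of a single metric or of multiple metrics (if a calculation is performed). The metric is accessed from a URL endpoint available on a given component. A connection timeout is considered a CRITICAL alert.

Thresholds are adjustable, and the unit of each threshold depends on the metric. For example, for CPU utilization alerts the unit is percent; for RPC latency alerts the unit is milliseconds.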
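The threshold comparison for a METRIC alert can be sketched as follows. The function name is hypothetical, and the sketch assumes a metric for which larger values are worse; the example threshold values are made up for illustration.

```python
def metric_alert_status(value, warning, critical):
    """Return the status for a metric where larger values are worse.

    value, warning, and critical must share the same unit
    (for example percent, or milliseconds).
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Hypothetical RPC latency thresholds, in milliseconds.
print(metric_alert_status(1200, warning=3000, critical=5000))
print(metric_alert_status(3500, warning=3000, critical=5000))
print(metric_alert_status(8000, warning=3000, critical=5000))
```

The same function works for a percentage metric as long as the value and both thresholds are expressed in percent.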

AGGREGATE Alert Type: AGGREGATE alerts aggregate the alert status as a percentage of the alert instances affected. For example, the Percent DataNode Process alert aggregates the DataNode Process alert.
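As a rough illustration of the aggregation, the Python sketch below computes the percentage of instances that are not OK and compares it with two percentage thresholds. Treating every non-OK instance as "affected" is an assumption of this sketch, and the function name and threshold values are hypothetical.

```python
def aggregate_alert_status(instance_statuses, warning_pct, critical_pct):
    """Aggregate instance statuses as the percentage that are not OK.

    Returns a (status, affected_percent) tuple.
    """
    if not instance_statuses:
        return "UNKNOWN", 0.0
    affected = sum(1 for status in instance_statuses if status != "OK")
    pct = 100.0 * affected / len(instance_statuses)
    if pct >= critical_pct:
        return "CRITICAL", pct
    if pct >= warning_pct:
        return "WARNING", pct
    return "OK", pct


# Ten hypothetical DataNode Process instances, two of them CRITICAL.
statuses = ["OK"] * 8 + ["CRITICAL"] * 2
print(aggregate_alert_status(statuses, warning_pct=10, critical_pct=30))
```

With two of ten instances affected (20%), the aggregate lands between the two example thresholds and reports WARNING.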

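SCRIPT Alert Type: SCRIPT alerts execute a script that determines the status, such as OK, WARNING, or CRITICAL. You can customize the response text, the values of the properties, and the thresholds for SCRIPT alerts.

SERVER Alert Type: SERVER alerts execute a server-side runnable class that determines the alert status, such as OK, WARNING, or CRITICAL.

RECOVERY Alert Type: RECOVERY alerts are handled by the Ambari Agents and monitor process restarts. The OK, WARNING, and CRITICAL statuses are based on the number of times a process is restarted automatically. This is useful for knowing when processes are terminating and Ambari is restarting them automatically.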

8.2 Modifying Alerts

—————————————————————————————————————————————–

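General properties of an alert include its name, description, check interval, and thresholds.

The check interval defines how often Ambari checks the alert status. For example, "1 minute" means that Ambari checks the alert status every minute.

The configuration options for thresholds depend on the alert type.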

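To modify the general properties of an alert:

① In Ambari Web, browse to the Alerts section.

② Find the alert definition and click it to view the definition details.

③ Click Edit to modify the name, description, check interval, and thresholds (if applicable).

④ Click Save.

⑤ Changes take effect on all alert instances at the next check interval.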

8.3 Modifying Alert Check Counts

—————————————————————————————————————————————–

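Ambari lets you set the number of checks to perform before an alert dispatches a notification. If the alert status changes during a check, Ambari checks the condition a specified number of times (the check count) before dispatching a notification.

Alert check counts do not apply to AGGREGATE alert types. A state change for an AGGREGATE alert results in a notification dispatch.

If your environment frequently experiences transient issues that result in false alerts, you can increase the check count. In this case, the alert state change is still recorded, but as a SOFT state change. If the alert condition is still triggered after the specified number of checks, the state change is considered HARD, and notifications are sent.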
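The SOFT and HARD distinction can be modeled as a small state machine. The Python sketch below is one plausible reading of the behavior described above, not Ambari's actual code: the class name and the exact counting rule (the first changed check counts as check number one) are assumptions.

```python
class CheckCountTracker:
    """Track SOFT and HARD state changes for one alert instance."""

    def __init__(self, check_count, initial_state="OK"):
        self.check_count = check_count
        self.hard_state = initial_state  # last state that produced a notification
        self.pending_state = None        # changed state that is still SOFT
        self.pending_checks = 0

    def record(self, state):
        """Record one check result; return True when a notification is due."""
        if state == self.hard_state:
            # The condition cleared before it became HARD: drop the candidate.
            self.pending_state = None
            self.pending_checks = 0
            return False
        if state != self.pending_state:
            # A new candidate state starts its own count.
            self.pending_state = state
            self.pending_checks = 0
        self.pending_checks += 1
        if self.pending_checks >= self.check_count:
            # The change persisted for check_count checks: it is now HARD.
            self.hard_state = state
            self.pending_state = None
            self.pending_checks = 0
            return True
        return False


tracker = CheckCountTracker(check_count=3)
results = ["OK", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]
print([tracker.record(state) for state in results])
```

In this run the single CRITICAL blip is absorbed as a SOFT change, and only the third consecutive CRITICAL result produces a notification.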

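You generally set the check count globally for all alerts, but you can also override the global value for individual alerts if one or more of them experience transient issues in practice.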

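To modify the global alert check count:

① In Ambari Web, browse to the Alerts section.

② In the Actions menu, click Manage Alert Settings.

③ Update the Check Count value.

④ Click Save.

Changes to the global alert check count might take a few seconds to appear in the Ambari UI for individual alerts.

To override the global alert check count for an individual alert:

① In Ambari Web, browse to the Alerts section.

② Select the alert for which you want to set a specific Check Count value.

③ On the right, click the Edit icon next to the Check Count property.

④ Update the Check Count value.

⑤ Click Save.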

8.4 Disabling and Re-enabling Alerts

—————————————————————————————————————————————–

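You can disable alerts. When an alert is disabled, no alert instances are in effect and Ambari no longer performs the checks for that alert. As a result, no alert status changes are recorded and no notifications are sent.

① In Ambari Web, browse to the Alerts section.

② Find the alert definition and click the Enabled or Disabled text next to it to enable or disable the alert.

③ Alternatively, click the alert to view the definition details, and then click Enabled or Disabled to enable or disable the alert.

④ Confirm the change when prompted.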
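The same toggle can in principle be scripted against the Ambari REST API. The Python sketch below only builds the request and does not send it. Everything specific in it is an assumption to verify against your Ambari version: the `/api/v1/clusters/<cluster>/alert_definitions/<id>` path, the `AlertDefinition`/`enabled` body shape, and the `X-Requested-By` header. The host name, cluster name, definition id, and credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_toggle_request(base_url, cluster, definition_id, enabled, user, password):
    """Build (but do not send) a PUT request that enables or disables
    an alert definition. URL layout and body shape are assumptions."""
    url = f"{base_url}/api/v1/clusters/{cluster}/alert_definitions/{definition_id}"
    body = json.dumps({"AlertDefinition": {"enabled": enabled}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "ambari",
        },
    )


# Placeholder values throughout; nothing is sent over the network here.
request = build_toggle_request(
    "http://ambari.example.com:8080", "c1", 42, False, "admin", "secret"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out deliberately so the sketch can be inspected without touching a cluster.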

8.5 Tables of Predefined Alerts

—————————————————————————————————————————————–

8.5.1 HDFS Service Alerts

—————————————————————————————————————————————–

□ Alert name: NameNode Blocks Health

————————————————————————————————————————————-

Alert type: METRIC

Description: This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold.

Potential causes: Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes.

The corrupt or missing blocks are from files with a replication factor of 1. New replicas cannot be created because the only replica of the block is missing.

Possible remedies: For critical data, use a replication factor of 3.

Bring up the failed DataNodes with missing or corrupt blocks.

Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command.

Delete the corrupt files and recover them from backup, if one exists.

□ Alert name: NFS Gateway Process

————————————————————————————————————————————-

Alert type: PORT

Description: This host-level alert is triggered if the NFS Gateway process cannot be confirmed as active.

Potential causes: NFS Gateway is down.

Possible remedies: Check for a non-operating NFS Gateway in Ambari Web.

□ Alert name: DataNode Storage

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties.

Potential causes: Cluster storage is full.

If cluster storage is not full, DataNode is full.

Possible remedies: If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.

If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.
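The arithmetic behind this alert can be sketched in Python from the two properties named in the description. The function name is hypothetical; only the 90% critical threshold comes from the text above, and how Ambari itself combines Capacity and Remaining is assumed here to be a plain used-space percentage.

```python
def datanode_storage_status(capacity_bytes, remaining_bytes, critical_pct=90.0):
    """Return (status, used_percent) from the Capacity and Remaining values."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    used_pct = 100.0 * (capacity_bytes - remaining_bytes) / capacity_bytes
    status = "CRITICAL" if used_pct >= critical_pct else "OK"
    return status, used_pct


# Hypothetical readings: 1000 units of capacity with 50 units remaining.
print(datanode_storage_status(1000, 50))
```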

□ Alert name: DataNode Process

————————————————————————————————————————————-

Alert type: PORT

Description: This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network for the configured critical threshold, in seconds.

Potential causes: DataNode process is down or not responding.

DataNode is not down but is not listening to the correct network port/address.

Possible remedies: Check for non-operating DataNodes in Ambari Web.

Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary.

Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

□ Alert name: DataNode Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the DataNode web UI is unreachable.

Potential causes: The DataNode process is not running.

Possible remedies: Check whether the DataNode process is running.

□ Alert name: NameNode Host CPU Utilization

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning, 250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is available only if you are running JDK 1.7.

Potential causes: Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the sign of an issue in the daemon.

Possible remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

□ Alert name: NameNode Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the NameNode web UI is unreachable.

Potential causes: The NameNode process is not running.

Possible remedies: Check whether the NameNode process is running.

□ Alert name: Percent DataNodes with Available Space

————————————————————————————————————————————-

Alert type: AGGREGATE

Description: This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn, 30% critical).

Potential causes: Cluster storage is full.

If cluster storage is not full, DataNode is full.

Possible remedies: If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.

If the cluster is full, delete unnecessary data or increase storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.

□ Alert name: Percent DataNodes Available

————————————————————————————————————————————-

Alert type: AGGREGATE

Description: This alert is triggered if the number of non-operating DataNodes in the cluster is greater than the configured critical threshold. This aggregates the DataNode process alert.

Potential causes: DataNodes are down.

DataNodes are not down but are not listening to the correct network port/address.

Possible remedies: Check for non-operating DataNodes in Ambari Web.

Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.

Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

□ Alert name: NameNode RPC Latency

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations.

Potential causes: A job or an application is performing too many NameNode operations.

Possible remedies: Review the job or the application for potential bugs causing it to perform too many NameNode operations.

□ Alert name: NameNode Last Checkpoint

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This alert is triggered if the last time that the NameNode performed a checkpoint was too long ago or if the number of uncommitted transactions is beyond a certain threshold.

Potential causes: Too much time elapsed since last NameNode checkpoint.

Uncommitted transactions beyond threshold.

Possible remedies: Set NameNode checkpoint.

Review threshold for uncommitted transactions.

□ Alert name: Secondary NameNode Process

————————————————————————————————————————————-

Alert type: WEB

Description: This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable when NameNode HA is configured.

Potential causes: The Secondary NameNode is not running.

Possible remedies: Check that the Secondary NameNode process is running.

□ Alert name: NameNode Directory Status

————————————————————————————————————————————-

Alert type: METRIC

Description: This alert checks if the NameNode NameDirStatus metric reports a failed directory.

Potential causes: One or more of the directories are reporting as not healthy.

Possible remedies: Check the NameNode UI for information about unhealthy directories.

□ Alert name: HDFS Capacity Utilization

————————————————————————————————————————————-

Alert type: METRIC

Description: This service-level alert is triggered if the HDFS capacity utilization exceeds the configured thresholds (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties.

Potential causes: Cluster storage is full.

Possible remedies: Delete unnecessary data.

Archive unused data.

Add more DataNodes.

Add more or larger disks to the DataNodes.

After adding more storage, run the load balancer.
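A utilization percentage can be derived from the two properties this alert reads. The Python sketch below assumes that total capacity is CapacityUsed plus CapacityRemaining, which is a simplification made for illustration; the function name is hypothetical, and only the 80% and 90% thresholds come from the description above.

```python
def hdfs_capacity_status(capacity_used, capacity_remaining,
                         warning_pct=80.0, critical_pct=90.0):
    """Return (status, used_percent), assuming total = used + remaining."""
    total = capacity_used + capacity_remaining
    if total <= 0:
        raise ValueError("total capacity must be positive")
    used_pct = 100.0 * capacity_used / total
    if used_pct >= critical_pct:
        return "CRITICAL", used_pct
    if used_pct >= warning_pct:
        return "WARNING", used_pct
    return "OK", used_pct


# Hypothetical readings in arbitrary but identical units.
print(hdfs_capacity_status(85, 15))
```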

□ Alert name: DataNode Health Summary

————————————————————————————————————————————-

Alert type: METRIC

Description: This service-level alert is triggered if there are unhealthy DataNodes.

Potential causes: A DataNode is in an unhealthy state.

Possible remedies: Check the NameNode UI for the list of non-operating DataNodes.

□ Alert name: HDFS Pending Deletion Blocks

————————————————————————————————————————————-

Alert type: METRIC

Description: This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.

Potential causes: A large number of blocks are pending deletion.

Possible remedies:

□ Alert name: HDFS Upgrade Finalized State

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if HDFS is not in the finalized state.

Potential causes: The HDFS upgrade is not finalized.

Possible remedies: Finalize any upgrade you have in process.

□ Alert name: DataNode Unmounted Data Dir

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted.

Potential causes: If the mount history file does not exist, then report an error if a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the root partition, which is undesirable.

Possible remedies: Check the data directories to confirm they are mounted as expected.

□ Alert name: DataNode Heap Usage

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are percentages.

Potential causes:

Possible remedies:
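To make the heap percentage concrete, the Python sketch below extracts MemHeapUsedM and MemHeapMaxM from a JSON payload shaped like the output of a Hadoop /jmx endpoint (a top-level `beans` array). The sample payload, its values, and the function name are hypothetical; only the two property names come from the description above.

```python
import json


def heap_usage_percent(jmx_json):
    """Return heap usage as a percentage from a /jmx-style JSON payload."""
    for bean in json.loads(jmx_json).get("beans", []):
        if "MemHeapUsedM" in bean and "MemHeapMaxM" in bean:
            return 100.0 * bean["MemHeapUsedM"] / bean["MemHeapMaxM"]
    raise KeyError("no bean with MemHeapUsedM and MemHeapMaxM found")


# Hypothetical payload; a real response contains many more beans and fields.
sample = json.dumps({"beans": [{
    "name": "Hadoop:service=DataNode,name=JvmMetrics",
    "MemHeapUsedM": 512.0,
    "MemHeapMaxM": 1024.0,
}]})
print(heap_usage_percent(sample))  # 50.0
```

The resulting percentage is what would be compared against the alert's warning and critical thresholds.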

□ Alert name: NameNode Client RPC Queue Latency

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC queue latency on client port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential causes:

Possible remedies:

□ Alert name: NameNode Client RPC Processing Latency

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential causes:

Possible remedies:

□ Alert name: NameNode Service RPC Queue Latency

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC latency on the DataNode port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential causes:

Possible remedies:

□ Alert name: NameNode Service RPC Processing Latency

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC latency on the DataNode port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential causes:

Possible remedies:

□ Alert name: HDFS Storage Capacity Usage

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a given period. This alert will monitor Daily and Weekly periods.

Potential causes:

Possible remedies:

□ Alert name: NameNode Heap Usage

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a given period. This alert will monitor Daily and Weekly periods.

Potential causes:

Possible remedies:

8.5.2 HDFS HA Alerts

—————————————————————————————————————————————–

□ Alert name: JournalNode Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The JournalNode process is down or not responding.

The JournalNode is not down but is not listening to the correct network port/address.

Possible remedies:

□ Alert name: NameNode High Availability Health

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.

Potential causes: The Active, Standby, or both NameNode processes are down.

Possible remedies: On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web.

On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.

□ Alert name: Percent JournalNodes Available

————————————————————————————————————————————-

Alert type: AGGREGATE

Description: This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.

Potential causes: JournalNodes are down.

JournalNodes are not down but are not listening to the correct network port/address.

Possible remedies: Check for dead JournalNodes in Ambari Web.

□ Alert name: ZooKeeper Failover Controller Process

————————————————————————————————————————————-

Alert type: PORT

Description: This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.

Potential causes: The ZKFC process is down or not responding.

Possible remedies: Check if the ZKFC process is running.

8.5.3 NameNode HA Alerts

—————————————————————————————————————————————–

□ Alert name: JournalNode Process

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The JournalNode process is down or not responding.

The JournalNode is not down but is not listening to the correct network port/address.

Possible remedies: Check if the JournalNode process is running.

Check for non-operating JournalNodes in Ambari Web.

8.5.4 YARN Alerts

—————————————————————————————————————————————–

□ Alert name: App Timeline Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the App Timeline Server Web UI is unreachable.

Potential causes: The App Timeline Server is down.

The App Timeline Server is not down but is not listening to the correct network port/address.

Possible remedies: Check for a non-operating App Timeline Server in Ambari Web.

□ Alert name: Percent NodeManagers Available

————————————————————————————————————————————-

Alert type: AGGREGATE

Description: This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process alert checks.

Potential causes: NodeManagers are down.

NodeManagers are not down but are not listening to the correct network port/address.

Possible remedies: Check for non-operating NodeManagers.

Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary.

Run the netstat -tuplpn command to check if the NodeManager process is bound to the correct network port.

□ Alert name: ResourceManager Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the ResourceManager Web UI is unreachable.

Potential causes: The ResourceManager process is not running.

Possible remedies: Check if the ResourceManager process is running.

□ Alert name: ResourceManager RPC Latency

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations.

Potential causes: A job or an application is performing too many ResourceManager operations.

Possible remedies: Review the job or the application for potential bugs causing it to perform too many ResourceManager operations.

□ Alert name: ResourceManager CPU Utilization

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds (200% warning, 250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.

Potential causes: Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Possible remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

□ Alert name: NodeManager Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the NodeManager process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: NodeManager process is down or not responding.

NodeManager is not down but is not listening to the correct network port/address.

Possible remedies: Check if the NodeManager is running.

Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary.

□ Alert name: NodeManager Health Summary

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This host-level alert checks the node health property available from the NodeManager component.

Potential causes: NodeManager Health Check script reports issues or is not configured.

Possible remedies: Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary.

Check in the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors.

□ Alert name: NodeManager Health

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This host-level alert checks the nodeHealthy property available from the NodeManager component.

Potential causes: The NodeManager process is down or not responding.

Possible remedies: Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary.

8.5.5 MapReduce2 Alerts

—————————————————————————————————————————————–

□ Alert name: History Server Web UI

————————————————————————————————————————————-

Alert type: WEB

Description: This host-level alert is triggered if the HistoryServer Web UI is unreachable.

Potential causes: The HistoryServer process is not running.

Possible remedies: Check if the HistoryServer process is running.

□ Alert name: History Server RPC Latency

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for HistoryServer operations.

Potential causes: A job or an application is performing too many HistoryServer operations.

Possible remedies: Review the job or the application for potential bugs causing it to perform too many HistoryServer operations.

□ Alert name: History Server CPU Utilization

————————————————————————————————————————————-

Alert type: METRIC

Description: This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured critical threshold.

Potential causes: Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Possible remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

□ Alert name: History Server Process

————————————————————————————————————————————-

Alert type: PORT

Description: This host-level alert is triggered if the HistoryServer process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: HistoryServer process is down or not responding.

HistoryServer is not down but is not listening to the correct network port/address.

Possible remedies: Check that the HistoryServer is running.

Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart the HistoryServer, if necessary.

8.5.6 HBase Service Alerts

—————————————————————————————————————————————–

□ Alert name: Percent RegionServers Available

————————————————————————————————————————————-

Alert type:

Description: This service-level alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert and 30% to produce a CRITICAL alert. It aggregates the results of RegionServer process down checks.

Potential causes: Misconfiguration or less-than-ideal configuration caused the RegionServers to crash.

Cascading failures brought on by some workload caused the RegionServers to crash.

The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS.

GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper.

Possible remedies: Check the dependent services to make sure they are operating correctly.

Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information.

If the failure was associated with a particular workload, try to understand the workload better.

Restart the RegionServers.

□ Alert name: HBase Master Process

————————————————————————————————————————————-

Alert type:

Description: This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The HBase master process is down.

The HBase master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.

Possible remedies: Check the dependent services.

Look at the master log files (usually /var/log/hbase/*.log) for further information.

Look at the configuration files (/etc/hbase/conf).

Restart the master.

□ Alert name: HBase Master CPU Utilization

————————————————————————————————————————————-

Description: This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200% warning, 250% critical). It checks the HBase Master JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.

Potential causes: Unusually high CPU utilization can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Possible remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

□ Alert name: RegionServers Health Summary

————————————————————————————————————————————-

Description: This service-level alert is triggered if there are unhealthy RegionServers.

Potential causes: The RegionServer process is down on the host.

The RegionServer process is up and running but not listening on the correct network port (default 60030).

Possible remedies: Check for dead RegionServers in Ambari Web.

□ Alert name: HBase RegionServer Process

————————————————————————————————————————————-

Description: This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The RegionServer process is down on the host.

The RegionServer process is up and running but not listening on the correct network port (default 60030).

Possible remedies: Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web.

Run the netstat -tuplpn command to check if the RegionServer process is bound to the correct network port.

8.5.7 Hive Alerts

—————————————————————————————————————————————–

□ Alert name: HiveServer2 Process

————————————————————————————————————————————-

Alert type:

Description: This host-level alert is triggered if the HiveServer cannot be determined to be up and responding to client requests.

Potential causes: HiveServer2 process is not running.

HiveServer2 process is not responding.

Possible remedies: Using Ambari Web, check the status of the HiveServer2 component. Stop and then restart it.

□ Alert name: HiveMetastore Process

————————————————————————————————————————————-

Description: This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The Hive Metastore service is down.

The database used by the Hive Metastore is down.

The Hive Metastore host is not reachable over the network.

Possible remedies: Using Ambari Web, stop the Hive service and then restart it.

□ Alert name: WebHCat Server Status

————————————————————————————————————————————-

Alert type:

Description: This host-level alert is triggered if the WebHCat server cannot be determined to be up and responding to client requests.

Potential causes: The WebHCat server is down.

The WebHCat server is hung and not responding.

The WebHCat server is not reachable over the network.

Possible remedies: Restart the WebHCat server using Ambari Web.

8.5.8 Oozie Alerts

—————————————————————————————————————————————–

□ Alert name: Oozie Server Web UI

————————————————————————————————————————————-

Description: This host-level alert is triggered if the Oozie server Web UI is unreachable.

Potential causes: The Oozie server is down.

The Oozie server is not down but is not listening to the correct network port/address.

Possible remedies: Check for a dead Oozie server in Ambari Web.

□ Alert name: Oozie Server Status

————————————————————————————————————————————-

Description: This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests.

Potential causes: The Oozie server is down.

The Oozie server is hung and not responding.

The Oozie server is not reachable over the network.

Possible remedies: Restart the Oozie service using Ambari Web.

8.5.9 ZooKeeper Alerts

—————————————————————————————————————————————–

□ Alert name: Percent ZooKeeper Servers Available

————————————————————————————————————————————-

Alert type: AGGREGATE

Description: This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds. It aggregates the results of ZooKeeper process checks.

Potential causes: The majority of your ZooKeeper servers are down and not responding.

Possible remedies: Check the dependent services to make sure they are operating correctly.

Check the ZooKeeper logs (/var/log/hadoop/zookeeper.log) for further information.

If the failure was associated with a particular workload, try to understand the workload better.

Restart the ZooKeeper servers from the Ambari UI.

□ Alert name: ZooKeeper Server Process

————————————————————————————————————————————-

Alert type: PORT

Description: This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The ZooKeeper server process is down on the host.

The ZooKeeper server process is up and running but not listening on the correct network port (default 2181).

Possible remedies: Check for any errors in the ZooKeeper logs (/var/log/hbase/) and restart the ZooKeeper process using Ambari Web.

Run the netstat -tuplpn command to check if the ZooKeeper server process is bound to the correct network port.

8.5.10 Ambari Alerts

—————————————————————————————————————————————–

□ Alert name: Host Disk Usage

————————————————————————————————————————————-

Alert type: SCRIPT

Description: This host-level alert is triggered if the amount of disk space used on a host goes above specific thresholds (50% warn, 80% crit).

Potential causes: The amount of free disk space left is low.

Possible remedies: Check host for disk space to free or add more storage.
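A disk usage check of this kind is easy to approximate with the Python standard library. The sketch below uses the 50% and 80% thresholds quoted in the description; the function names are hypothetical, and checking the single path "/" is an assumption made for the example, not a statement about which paths Ambari's script inspects.

```python
import shutil


def classify_disk_usage(used_bytes, total_bytes, warning_pct=50.0, critical_pct=80.0):
    """Return (status, used_percent) for the given byte counts."""
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= critical_pct:
        return "CRITICAL", used_pct
    if used_pct >= warning_pct:
        return "WARNING", used_pct
    return "OK", used_pct


def disk_usage_status(path="/"):
    """Classify the file system that contains the given path."""
    usage = shutil.disk_usage(path)
    return classify_disk_usage(usage.used, usage.total)


status, used_pct = disk_usage_status("/")
print(f"{status}: {used_pct:.1f}% used")
```

Keeping the classification separate from the `shutil.disk_usage` call makes the threshold logic easy to exercise with fixed numbers.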

□ Alert name: Ambari Agent Heartbeat

————————————————————————————————————————————-

Alert type: SERVER

Description: This alert is triggered if the server has lost contact with an agent.

Potential causes: Ambari Server host is unreachable from the Agent host.

Ambari Agent is not running.

Possible remedies: Check the connection from the Agent host to Ambari Server.

Check that the Agent is running.

□ Alert name: Ambari Server Alerts

————————————————————————————————————————————-

Alert type: SERVER

Description: This alert is triggered if the server detects that there are alerts which have not run in a timely manner.

Potential causes: Agents are not reporting alert status.

Agents are not running.

Possible remedies: Check that all Agents are running and heartbeating.

8.5.11 Ambari Metrics Alerts

—————————————————————————————————————————————–

□ Alert name: Metrics Collector Process

————————————————————————————————————————————-

Description      : This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for a number of seconds equal to the threshold.

Potential causes : The Metrics Collector process is not running.

Resolution       : Check that the Metrics Collector is running.

□ Alert name: Metrics Collector - ZooKeeper Server Process

————————————————————————————————————————————-

Alert type       :

Description      : This host-level alert is triggered if the Metrics Collector ZooKeeper Server Process cannot be determined to be up and listening on the network.

Potential causes : The Metrics Collector process is not running.

Resolution       : Check that the Metrics Collector is running.

□ Alert name: Metrics Collector - HBase Master Process

————————————————————————————————————————————-

Alert type       :

Description      : This alert is triggered if the Metrics Collector HBase Master Processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes : The Metrics Collector process is not running.

Resolution       : Check that the Metrics Collector is running.

□ Alert name: Metrics Collector - HBase Master CPU Utilization

————————————————————————————————————————————-

Alert type       :

Description      : This host-level alert is triggered if CPU utilization of the Metrics Collector exceeds certain thresholds.

Potential causes : Unusually high CPU utilization is generally the sign of an issue in the daemon configuration.

Resolution       : Tune the Ambari Metrics Collector.

□ Alert name: Metrics Monitor Status

————————————————————————————————————————————-

Alert type       :

Description      : This host-level alert is triggered if the Metrics Monitor process cannot be confirmed to be up and running on the network.

Potential causes : The Metrics Monitor is down.

Resolution       : Check whether the Metrics Monitor is running on the given host.

□ Alert name: Percent Metrics Monitors Available

————————————————————————————————————————————-

Description      : This is an AGGREGATE alert of the Metrics Monitor Status.

Potential causes : Metrics Monitors are down.

Resolution       : Check that the Metrics Monitors are running.

□ Alert name: Metrics Collector - Auto-Restart Status

————————————————————————————————————————————-

Description      : This alert is triggered if the Metrics Collector has been auto-started a number of times equal to the start threshold within a one-hour timeframe. By default, if it is restarted 2 times in an hour, you receive a Warning alert. If it is restarted 4 or more times in an hour, you receive a Critical alert.

Potential causes : The Metrics Collector is running but is unstable and causing restarts. This could be due to improper tuning.

Resolution       : Tune the Ambari Metrics Collector.
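The counting rule in the description can be sketched as follows. This is only an illustration of the stated defaults (2 restarts in an hour for Warning, 4 or more for Critical); the function name and the way restart timestamps are supplied are assumptions, and treating 3 restarts as Warning is an inference from those two thresholds.

```python
from datetime import datetime, timedelta


def restart_alert_state(restart_times, now,
                        warn_count=2, crit_count=4,
                        window=timedelta(hours=1)):
    """Classify by how many restarts fall inside the trailing window."""
    recent = [t for t in restart_times
              if timedelta(0) <= now - t <= window]
    if len(recent) >= crit_count:
        return "CRITICAL"
    if len(recent) >= warn_count:
        return "WARNING"
    return "OK"
```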

□ Alert name: Grafana Web UI

————————————————————————————————————————————-

Description      : This host-level alert is triggered if the AMS Grafana Web UI is unreachable.

Potential causes : The Grafana process is not running.

Resolution       : Check whether the Grafana process is running. Restart it if it has gone down.

8.5.12 SmartSense Alerts

—————————————————————————————————————————————–

□ Alert name: SmartSense Server Process

————————————————————————————————————————————-

Description      : This alert is triggered if the HST server process cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes : The HST server is not running.

Resolution       : Start the HST server process. If startup fails, check the hst-server.log.

□ Alert name: SmartSense Bundle Capture Failure

————————————————————————————————————————————-

Description      : This alert is triggered if the last triggered SmartSense bundle failed or timed out.

Potential causes : Some nodes timed out or failed during data capture. The upload to Hortonworks may also have failed.

Resolution       : On the "Bundles" page, check the status of the bundle. Next, check which agents have failed or timed out, and review their logs.

You can also initiate a new capture.

□ Alert name: SmartSense Long Running Bundle

————————————————————————————————————————————-

Description      : This alert is triggered if the in-progress SmartSense bundle may not complete successfully on time.

Potential causes : Service components that are being collected may not be running, or some agents may be timing out during data collection or upload.

Resolution       : Restart the services that are not running. Force-complete the bundle and start a new capture.

□ Alert name: SmartSense Gateway Status

————————————————————————————————————————————-

Description      : This alert is triggered if the SmartSense Gateway server process is enabled but cannot be reached.

Potential causes : The SmartSense Gateway is not running.

Resolution       : Start the gateway. If the gateway fails to start, review hst-gateway.log.

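8.6 Managing Notifications

—————————————————————————————————————————————–

Alert groups and notifications let you create groups of alerts and set up notification targets for each group, so that different sets of alerts can be delivered to different cluster stakeholders in different ways. For example, you might want your Hadoop Operations team to receive all alerts by email, regardless of status, while your System Administration team receives only RPC and CPU-related alerts that are in Critical status, and only by Simple Network Management Protocol (SNMP).

To achieve these different results, you can have one alert notification that manages email for all alert groups and all severity levels, and a different alert notification that manages SNMP on Critical severity for an alert group containing only the RPC and CPU alerts.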
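The routing in this example can be modelled in a few lines. The sketch below is a hypothetical data model for illustration only, not Ambari's internal representation: the class `Notification`, the function `recipients`, and the group name `rpc-cpu` are all invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    name: str
    method: str            # "EMAIL" or "SNMP"
    groups: frozenset      # alert group names; empty means "all groups"
    severities: frozenset  # alert states this notification is sent for


def recipients(notifications, alert_group, state):
    """Return the names of notifications that fire for this group and state."""
    return [n.name for n in notifications
            if (not n.groups or alert_group in n.groups)
            and state in n.severities]
```

With an email notification covering all groups and states, and an SNMP notification limited to Critical alerts in one group, a Critical alert in that group reaches both targets, while a Warning in the same group reaches only the email target.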

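8.7 Creating and Editing Notifications

—————————————————————————————————————————————–

① In Ambari Web, click Alerts.

② On the Alerts page, click the Actions menu, then click Manage Notifications.

③ In Manage Alert Notifications, click + to create a new alert notification.

In Create Alert Notification:

● In Name, enter a name for the notification.

● In Groups, click All or Custom to assign the notification to every group or to a set of groups that you specify.

● In Description, enter a short phrase that describes the notification.

● In Method, click EMAIL, SNMP (for MIB-based), or Custom SNMP as the method by which the Ambari server sends the notification.

④ Complete the fields for the notification method you selected.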

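● For email notification, provide information about your SMTP infrastructure, such as the SMTP server, port, "from" address, and whether the server requires authentication.

You can add custom properties to the SMTP configuration, based on JavaMail SMTP options.

Email To           : A comma-separated list of one or more email addresses to which alert emails are sent

SMTP Server        : The FQDN or IP address of the SMTP server used to send alert emails

SMTP Port          : The SMTP port on the SMTP server

Email From         : A single email address to be used as the sender of alert emails

Use Authentication : Whether the SMTP server requires authentication before sending messages. If it does, also provide the username and password credentials.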

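● For MIB-based SNMP notification, provide the version, community, host, and port to which the SNMP trap should be sent.

Version : SNMPv1 or SNMPv2c, depending on the network environment

Hosts   : A comma-separated list of one or more host FQDNs to which to send the trap

Port    : The port on which a process is listening for SNMP traps

For SNMP notifications, Ambari uses a "MIB", a text file manifest of alert definitions, to transfer alert information. The MIB summarizes how object IDs map to objects or attributes.

You can find the MIB file for your cluster on the Ambari Server host, at:

/var/lib/ambari-server/resources/APACHE-AMBARI-MIB.txt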

● For Custom SNMP notification, provide the version, community, host, and port to which the SNMP trap should be sent.

The OID parameter must be configured correctly. If you do not have a custom OID, use the OID listed below.

Version : SNMPv1 or SNMPv2c, depending on the network environment

OID     : 1.3.6.1.4.1.18060.16.1.1

Hosts   : A comma-separated list of one or more host FQDNs to which to send the trap

Port    : The port on which a process is listening for SNMP traps

⑤ Click Save.
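Before entering the SMTP settings in Ambari, it can help to confirm that the server, port, and credentials work at all. The sketch below uses only the Python standard library and is independent of Ambari; the function names, the subject line, and the choice to call STARTTLS when credentials are supplied are assumptions, so adjust them to what your SMTP server supports.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(email_to: str, email_from: str) -> EmailMessage:
    """Build a test mail from a comma-separated 'Email To' value."""
    recipients = [a.strip() for a in email_to.split(",") if a.strip()]
    msg = EmailMessage()
    msg["From"] = email_from
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "SMTP settings test"
    msg.set_content("Test message sent before configuring alert notifications.")
    return msg


def send_test_message(msg, smtp_server, smtp_port,
                      username=None, password=None):
    """Send the message; authenticate only when a username is given."""
    with smtplib.SMTP(smtp_server, smtp_port, timeout=10) as smtp:
        if username is not None:
            smtp.starttls()  # assumption: the server offers STARTTLS
            smtp.login(username, password)
        smtp.send_message(msg)
```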

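8.8 Creating or Editing Alert Groups

—————————————————————————————————————————————–

① In Ambari Web, click Alerts.

② On the Alerts page, click the Actions menu, then click Manage Alert Groups.

③ In Manage Alert Groups, click + to create a new alert group.

④ In Create Alert Group, enter a group name and click Save.

⑤ Click the custom group in the list to add or delete alert definitions in that group and to change its notification targets.

⑥ When you have finished your assignments, click Save.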

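8.9 Dispatching Notifications

—————————————————————————————————————————————–

When an alert is enabled and its status changes (for example, from OK to CRITICAL or from CRITICAL to OK), Ambari sends either an email or an SNMP notification, depending on how notifications are configured.

For email notifications, Ambari sends a single email that includes all alert status changes. For example, if two alerts become critical, Ambari sends one email message that says:

Alert A is CRITICAL and Alert B is CRITICAL

Ambari does not send another email notification until the status changes again.

For SNMP notifications, Ambari sends one SNMP trap per alert status change. For example, if two alerts become critical, Ambari sends two SNMP traps, one for each alert, and sends traps again when the status of those two alerts changes.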
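The difference between the two delivery styles can be shown with a small model. This is a sketch of the behaviour described above, not Ambari's dispatcher; the function name and the plain-string message format are assumptions.

```python
def dispatch(changes, method):
    """Return the messages sent for a batch of (alert_name, new_state) changes.

    EMAIL: one combined message for the whole batch.
    SNMP:  one message (trap) per status change.
    """
    if not changes:
        return []
    lines = [f"{name} is {state}" for name, state in changes]
    if method == "EMAIL":
        return [" and ".join(lines)]
    if method == "SNMP":
        return lines
    raise ValueError(f"unsupported method: {method}")
```

For two alerts that turn critical at the same time, the EMAIL branch yields a single message and the SNMP branch yields two.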

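8.10 Viewing the Alert Status Log

—————————————————————————————————————————————–

Whether or not Ambari is configured to send alert notifications, it writes alert status changes to a log on the Ambari Server host. To view this log:

① On the Ambari Server host, browse to the log directory:

cd /var/log/ambari-server/

② View the ambari-alerts.log file.

③ Log entries include the time of the status change, the alert status, the alert definition name, and the response text.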
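To pull only the entries for one status out of the log, a small filter is enough. The sketch below assumes that each entry carries its status as a bracketed token such as `[CRITICAL]`; that layout is an assumption about the log format, so check a few real lines of ambari-alerts.log before relying on it. The function name is hypothetical.

```python
import re

# Assumption: the alert status appears in square brackets on each entry.
STATE_TOKEN = re.compile(r"\[(OK|WARNING|CRITICAL|UNKNOWN)\]")


def entries_with_state(lines, state):
    """Return the log lines whose first status token equals `state`."""
    matches = []
    for line in lines:
        found = STATE_TOKEN.search(line)
        if found and found.group(1) == state:
            matches.append(line.rstrip("\n"))
    return matches
```

Opening /var/log/ambari-server/ambari-alerts.log and passing the file object as `lines` with `state="CRITICAL"` lists the entries that moved to CRITICAL.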

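8.10.1 Customizing Notification Templates

—————————————————————————————————————————————–

The content of the notifications that Ambari produces depends on the notification type. Email and SNMP notifications both have customizable templates that are used to generate content. This section describes the steps needed to change the templates Ambari uses when creating alert notifications.

Alert Templates XML Location

————————————————————————————————————————————-

By default, Ambari ships with an alert-templates.xml file. This file contains all of the templates for every known type of notification (for example, EMAIL and SNMP). The file is bundled in the Ambari Server .jar file, so the template is not exposed on disk; it is nevertheless used as the reference example in the text that follows.

When you customize the alert template, you are effectively overriding the default alert template's XML, as follows:

① On the Ambari Server host, browse to the /etc/ambari-server/conf directory.

② Edit the ambari.properties file.

③ Add an entry for the location of your new template:

alerts.template.file=/foo/var/alert-templates-custom.xml

④ Save the file and restart Ambari Server.

After you restart Ambari Server, any notification types defined in the new template override the template definitions bundled with Ambari. If you choose to provide your own template file, you only need to define the types that you want to override. If a notification template type is not found in the customized template, Ambari uses the default template bundled in the JAR file.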
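A quick way to confirm that the entry was added correctly is to read it back. The helper below is a hypothetical, deliberately simplified reader for `key=value` lines; it does not implement the full Java properties syntax (line continuations, `:` separators, escapes).

```python
def read_property(properties_text, key):
    """Return the value for `key` from simple key=value text, or None."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None
```

Reading /etc/ambari-server/conf/ambari.properties and calling `read_property(text, "alerts.template.file")` should return the path entered in step ③.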

Alert Templates XML Structure

————————————————————————————————————————————-

The structure of the template file is defined as follows. Each <alert-template> element declares the type of alert notification it is used for:

<alert-templates>
  <alert-template type="EMAIL">
    <subject>
      Subject Content
    </subject>
    <body>
      Body Content
    </body>
  </alert-template>
  <alert-template type="SNMP">
    <subject>
      Subject Content
    </subject>
    <body>
      Body Content
    </body>
  </alert-template>
</alert-templates>
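The per-type structure also explains how a custom file combines with the defaults: a type present in the custom file replaces the default of the same type, and any other type falls back to the default. The sketch below illustrates that rule with the standard library XML parser; it is not Ambari's loader, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_templates(xml_text):
    """Parse <alert-templates> XML into {type: {"subject": ..., "body": ...}}."""
    root = ET.fromstring(xml_text)
    templates = {}
    for node in root.findall("alert-template"):
        templates[node.get("type")] = {
            "subject": (node.findtext("subject") or "").strip(),
            "body": (node.findtext("body") or "").strip(),
        }
    return templates


def effective_templates(default_xml, custom_xml):
    """Custom types override defaults; missing types fall back to defaults."""
    merged = load_templates(default_xml)
    merged.update(load_templates(custom_xml))
    return merged
```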

Template Variables

————————————————————————————————————————————-

The template uses Apache Velocity to render all tokenized content. The following variables are available for use in your template:

$alert.getAlertDefinition()                  : The definition of which the alert is an instance.
$alert.getAlertText()                        : The specific alert text.
$alert.getAlertName()                        : The name of the alert.
$alert.getAlertState()                       : The alert state (OK, WARNING, CRITICAL, or UNKNOWN).
$alert.getServiceName()                      : The name of the service that the alert is defined for.
$alert.hasComponentName()                    : True if the alert is for a specific service component.
$alert.getComponentName()                    : The component, if any, that the alert is defined for.
$alert.hasHostName()                         : True if the alert was triggered for a specific host.
$alert.getHostName()                         : The hostname, if any, that the alert was triggered for.
$ambari.getServerUrl()                       : The Ambari Server URL.
$ambari.getServerVersion()                   : The Ambari Server version.
$ambari.getServerHostName()                  : The Ambari Server hostname.
$dispatch.getTargetName()                    : The notification target name.
$dispatch.getTargetDescription()             : The notification target description.
$summary.getAlerts(service,alertState)       : A list of all alerts for a given service or alert state (OK|WARNING|CRITICAL|UNKNOWN).
$summary.getServicesByAlertState(alertState) : A list of all services for a given alert state (OK|WARNING|CRITICAL|UNKNOWN).
$summary.getServices()                       : A list of all services that are reporting an alert in the notification.
$summary.getCriticalCount()                  : The CRITICAL alert count.
$summary.getOkCount()                        : The OK alert count.
$summary.getTotalCount()                     : The total alert count.
$summary.getUnknownCount()                   : The UNKNOWN alert count.
$summary.getWarningCount()                   : The WARNING alert count.
$summary.getAlerts()                         : A list of all of the alerts in the notification.
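When editing a template by hand, it is easy to mistype one of these variable names. The following sketch scans template text for `$alert`, `$ambari`, `$dispatch`, and `$summary` calls and reports any method that is not in the list above. It is a hypothetical lint helper, not part of Ambari or Velocity, and it only inspects the first method in a chained call.

```python
import re

DOCUMENTED = {
    "alert": {"getAlertDefinition", "getAlertText", "getAlertName",
              "getAlertState", "getServiceName", "hasComponentName",
              "getComponentName", "hasHostName", "getHostName"},
    "ambari": {"getServerUrl", "getServerVersion", "getServerHostName"},
    "dispatch": {"getTargetName", "getTargetDescription"},
    "summary": {"getAlerts", "getServicesByAlertState", "getServices",
                "getCriticalCount", "getOkCount", "getTotalCount",
                "getUnknownCount", "getWarningCount"},
}

TOKEN = re.compile(r"\$(alert|ambari|dispatch|summary)\.(\w+)\(")


def undocumented_tokens(template_text):
    """Return '$root.method' strings that are not in the documented list."""
    unknown = []
    for root, method in TOKEN.findall(template_text):
        if method not in DOCUMENTED[root]:
            unknown.append(f"${root}.{method}")
    return unknown
```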

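Example: Modify Alert EMAIL Subject

————————————————————————————————————————————-

The following example shows how to change the subject line of all outbound email notifications so that it includes a hard-coded identifier:

① Download the alert-templates.xml code as your starting point.

② On the Ambari Server, save the template to a location such as /var/lib/ambari-server/resources/alert-templates-custom.xml.

③ Edit the alert-templates-custom.xml file and modify the subject line of the <alert-template type="EMAIL"> template.

④ Save the file.

⑤ Browse to the /etc/ambari-server/conf directory.

⑥ Edit the ambari.properties file.

⑦ Add an entry for the location of your new template file:

alerts.template.file=/var/lib/ambari-server/resources/alert-templates-custom.xml

⑧ Save the file and restart Ambari Server.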
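The subject edit in step ③ can also be scripted. The sketch below prepends an identifier to the EMAIL subject and leaves other template types untouched. The function name and the identifier used in the usage note are hypothetical, and there is one caveat: the standard library parser drops XML comments and rewrites CDATA sections as escaped text, so compare the output with the original before using it.

```python
import xml.etree.ElementTree as ET


def prefix_email_subject(xml_text, identifier):
    """Return template XML with `identifier` prepended to the EMAIL subject."""
    root = ET.fromstring(xml_text)
    for node in root.findall("alert-template"):
        if node.get("type") != "EMAIL":
            continue  # leave SNMP and any other types unchanged
        subject = node.find("subject")
        if subject is not None:
            current = (subject.text or "").strip()
            subject.text = f"{identifier} {current}"
    return ET.tostring(root, encoding="unicode")
```

For example, `prefix_email_subject(xml_text, "[MyCluster]")` turns a subject of `Subject Content` into `[MyCluster] Subject Content`.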
