PowerHA Problem Determination
Here are a few notes I made of some of the more useful day-to-day commands to help look at problems in your PowerHA (HACMP) clusters. If you are running an older version of HACMP, such as 5.4 and below, some of the paths may differ. Most of this should be correct for 5.5, 6.1 and 7.1, though I need to do some revision and testing for 7.1.
Cluster Status
It's normally a good idea to get a status of the cluster and try to work out how serious the problem is. Depending on the cluster/node status, some of the commands will need to be run on an active node, so in the case of a fail-over, the standby node -
/usr/es/sbin/cluster/utilities/clfindres - shows the current status of the resource groups.
/usr/es/sbin/cluster/clstat - shows cluster status and substate in real time, needs clinfo to be running.
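If clstat complains that clinfo is not running, it can usually be started with the startsrc command below, and on a console without X you can run clstat in ASCII mode. This is just a rough sketch - the exact flags can vary slightly by level, so check the man page on your release -
startsrc -s clinfoES
/usr/es/sbin/cluster/clstat -a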
Now these 2 commands are good to show you the current status of the cluster, but you can also get more information with -
lssrc -ls clstrmgrES - shows the cluster manager state.
lssrc -ls topsvcs - shows heartbeat information.
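A quick way to check just the cluster manager state is to grep it out of the lssrc output; when everything is settled you would expect to see something like 'Current state: ST_STABLE', while states such as ST_INIT or ST_UNSTABLE point at a node that has not joined or a cluster in the middle of an event -
lssrc -ls clstrmgrES | grep -i state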
If you want even more details about the cluster then the following commands should be able to help you further -
/usr/es/sbin/cluster/utilities/cltopinfo - shows the current topology status and some information about the cluster.
/usr/es/sbin/cluster/utilities/clshowres - shows short resource group information.
/usr/es/sbin/cluster/utilities/cldisp - shows cluster information such as monitor and retry intervals.
/usr/es/sbin/cluster/utilities/cllscf - lists the network configuration of a HACMP cluster.
/usr/es/sbin/cluster/utilities/cllsif - shows network interface information.
/usr/es/sbin/cluster/utilities/clRGinfo -p -t - shows current RG state.
/usr/es/sbin/cluster/utilities/clRGinfo -m - shows RG monitor status.
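To see which node each resource group is currently on, plain clRGinfo with no flags is often the quickest check. The output looks something like this (rg_app, node1 and node2 are just example names here, yours will differ) -
/usr/es/sbin/cluster/utilities/clRGinfo
Group Name     State            Node
rg_app         ONLINE           node1
               OFFLINE          node2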
Cluster Logs -
Once you have an idea of the status of the cluster you can start looking at the log files to try to determine the problem; a good place to start is the cluster.log -
/usr/es/adm/cluster.log - I tend to filter this as follows so that I can get an idea of the events that have occurred, but also look out for 'config_too_long'.
cat /usr/es/adm/cluster.log | grep ' EVENT ' | more - should look something like this -
Jan 13 18:51:29 <node1> HACMP for AIX: EVENT COMPLETED: node_up <node1> 0
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT START: node_up_complete <node1>
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT START: start_server cluster_AS
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT COMPLETED: start_server cluster_AS 0
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT COMPLETED: node_up_complete <node1> 0
Jan 13 18:59:16 <node1> HACMP for AIX: EVENT START: node_down <node1>
Jan 13 18:59:16 <node1> HACMP for AIX: EVENT START: stop_server cluster_AS
Jan 13 19:00:34 <node1> HACMP for AIX: EVENT COMPLETED: stop_server cluster_AS 0
Jan 13 19:01:04 <node1> HACMP for AIX: EVENT START: release_service_addr
From these log entries you can see what events have completed and failed. Generally you will look out for 'config_too_long', EVENT FAILED, or events that have not completed - every event that has a start should also be followed in the log at some point by a completed, otherwise this indicates a problem too.
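As a rough sketch, a case-insensitive grep will pull out both the failed events and any config_too_long entries in one go -
grep -iE 'event fail|config_too_long' /usr/es/adm/cluster.log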
Once you have picked out any problems you might have, you can use these as a key to look further into this log, or into /var/hacmp/log/hacmp.out, as it lists the same events but in much more detail; this should give you some more clues as to where to look from there. Depending on what you find, the problem may well be RSCT related or HA core.
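As a rough sketch, one way to do this is to find the failed event in hacmp.out and then print the block of script output around it - the line numbers below are only an example, use whatever grep -n returns on your system -
grep -n 'EVENT FAILED' /var/hacmp/log/hacmp.out
sed -n '4200,4400p' /var/hacmp/log/hacmp.out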
Now if the problem relates to the HA core then you can look further into logs such as the following -
/var/hacmp/log/clstrmgr.debug - clstrmgrES activity, even more detail than hacmp.out, a good place to look if events are missing from other logs.
/var/hacmp/log/cspoc.log - the C-SPOC log for the node that the command was issued on. It contains time-stamped, formatted messages generated by HACMP C-SPOC commands.
/var/hacmp/clcomd/clcomd.log - cluster communication daemon log.
/var/hacmp/clverify/clverify.log - cluster verification log, a good place to look to make sure the cluster was working before the problem.
/var/hacmp/adm/history/cluster.mmddyyyy - contains time-stamped, formatted messages generated by HACMP scripts.
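As a quick sketch, a case-insensitive grep of the verification log is usually enough to see whether the cluster was verifying cleanly before the problem started -
grep -iE 'fail|error' /var/hacmp/clverify/clverify.log | tail -n 20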
RSCT (Reliable Scalable Cluster Technology)
These are stored in /var/ha/log, and RSCT relates to site-to-site communication and heart-beating, which means that the RSCT logs are a good place to look if you are having problems with the fail-over adapters or the HB'ing disks. The logs are broken up into a number of related areas -
grpsvcs.* - Group Services log; HACMP cluster communication log.
nim.topsvcs.<interface>* - Network Interface Module Topology Services log; deals with specific interface communication and logging, there will be one of these for each cluster interface on each node, plus the disk/serial HB'ing devices.
nmDiag.nim.topsvcs.<interface>* - Network Module Diagnostic messages; these relate only to the network devices.
topsvcs.* - Topology Services log; a summary and some further detail of all the network topology activity occurring within the cluster, a good place to look to get a status of the adapters and the cluster.
topsvcs.default - Daily Topology Services log which is run at the beginning of the day to confirm the topsvcs status.
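As a rough sketch (assuming the default /var/ha/log location), listing the newest Topology Services logs and grepping the most recent one for words like 'error' or 'down' is a quick way in -
ls -lt /var/ha/log/topsvcs* | head
grep -iE 'error|down' /var/ha/log/topsvcs.default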