Monday, 5 December 2011

Mounting ISO images in AIX 6 and 7

In AIX version 6 a rather useful command was brought in; god knows where I've been to miss this one -

# loopmount -i <cdrom.iso> -o "-V cdrfs -o ro" -m /<mount-point>

So, got an ISO of a DVD that has that one base filesystem package you can't get anywhere else?  Then this might just be the perfect command for you.
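
For example, to mount an ISO at a hypothetical path of /tmp/aix_extras.iso onto /mnt/iso -

# mkdir -p /mnt/iso
# loopmount -i /tmp/aix_extras.iso -o "-V cdrfs -o ro" -m /mnt/iso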


Thursday, 1 December 2011

RHEL 6.1 yum repository

So you have Red Hat Enterprise Linux 6.1 installed, but now you need to set up a local repository from the DVD from which the system was installed.  Here is a quick guide on how to do it -

First mount the installation DVD.
# mount /dev/dvd /mnt

Then install the createrepo command and its dependencies, otherwise we cannot create our yum repository.
# rpm -ivh /mnt/Packages/deltarpm-3.5-0.5.20090913git.el6.ppc64.rpm
# rpm -ivh /mnt/Packages/python-deltarpm-3.5-0.5.20090913git.el6.ppc64.rpm
# rpm -ivh /mnt/Packages/createrepo-0.9.8-4.el6.noarch.rpm

Copy over the packages to a folder.
# mkdir /etc/yum.repos.d/rhel6.1
# cp /mnt/Packages/* /etc/yum.repos.d/rhel6.1

When you copy over the files you will also find an *.xml group file, either in the same directory as the packages or in a subdirectory, which needs to be copied over for the next command.
Create the repo with the createrepo command.
# createrepo -g repomd.xml /etc/yum.repos.d/rhel6.1

Create the repo file.
# vi /etc/yum.repos.d/rhel61.repo
[RHEL-Repository]
name=RHEL 6.1
baseurl=file:///etc/yum.repos.d/rhel6.1
enabled=1
gpgcheck=0

Update yum's cache and list the repositories.
# yum clean all
# yum repolist
Loaded plugins: product-id, subscription-manager
Updating Red Hat repositories.
RHEL-Repository                                              | 1.3 kB     00:00 ...
RHEL-Repository/primary                                      | 1.5 MB     00:00 ...
RHEL-Repository                                                           3210/3210
repo id                                repo name                             status
RHEL-Repository                        RHEL 6.1                              3,210
repolist: 3,210

Then test.
# yum search httpd
Loaded plugins: product-id, subscription-manager
Updating Red Hat repositories.
======================== N/S Matched: httpd =======================================
httpd.ppc64 : Apache HTTP Server
httpd-devel.ppc : Development interfaces for the Apache HTTP server
httpd-devel.ppc64 : Development interfaces for the Apache HTTP server
httpd-manual.noarch : Documentation for the Apache HTTP server
httpd-tools.ppc64 : Tools for use with the Apache HTTP Server
mod_dav_svn.ppc64 : Apache httpd module for Subversion server
mod_dnssd.ppc64 : An Apache HTTPD module which adds Zeroconf support

  Name and summary matches only, use "search all" for everything.
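
With the repository working you can install packages straight from the local copy; for example, to pull in the Apache server found above -

# yum install httpd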

SLES 11 on POWER no SSH

So you have installed SLES 11 on your POWER system, but for some odd reason you can't ssh to the server.  You can ssh out, the daemon is up and running as you would expect, and the iptables rules are all empty, so what can the issue be?
In this case it's the firewall -


# chkconfig --list SuSEfirewall2_setup
SuSEfirewall2_setup       0:off  1:off  2:off  3:on   4:on   5:on   6:off
# chkconfig --list SuSEfirewall2_init
SuSEfirewall2_init        0:off  1:off  2:off  3:on   4:on   5:on   6:off

So you can stop it as follows -

# chkconfig SuSEfirewall2_setup off
# chkconfig SuSEfirewall2_init off

Just make sure you perform the actions in the order above, else the command will fail as the two services are tied into each other.
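
Note that chkconfig only stops the firewall starting at the next boot; if SuSEfirewall2 is already running, the standard SLES rc script (assuming it is in place on your install) will let you stop it straight away -

# rcSuSEfirewall2 stop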

Monday, 28 November 2011

OVF to OVA Conversion

Now that you have your VMControl working and have used it to capture a virtual appliance, you might want to move it to other environments, such as importing it into IWD to deploy on any number of environments.  The best thing to do is convert it to an OVA, which just happens to be a tar file; just remember AIX tar files are limited to 8GB.


First you will need to log onto the Systems Director and take a note of the appliance's unique identifier, which is located as follows -
Left task menu -> System Configuration -> VMControl. Once this is open select the 'Virtual Appliances' tab followed by the 'Image Repository', then select the relevant one if you have more than one, and you should see a field called 'Unique Identifier'.


Now that we have the identifier, perform the following steps on the NIM master as the root user -
Change to the appliance directory of the virtual appliance that you want to export using the following command:
    # cd /export/nim/appliances/<unique identifier>

Copy the .ovf file and the mksysb file found in the directory to a new location.

In the copied .ovf file, change all file references to use a relative path -
    <ovf:File ovf:href="file:///export/nim/appliances/<Unique Identifier>/image1.mksysb" ovf:id="vimRef1" ovf:size="<size>"/>
    to:
    <ovf:File ovf:href="image1.mksysb" ovf:id="vimRef1" ovf:size="<size>"/>

Now we can create our tar archive, ensuring that the OVF is the first file in the archive -
    # tar cvf <file-name>.ova <file-name>.ovf

    # tar uvf <file-name>.ova <file-name>.mksysb
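
To double-check that the OVF really is the first entry in the archive you can list the contents (same placeholder file name as above) -

    # tar tvf <file-name>.ova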

Friday, 25 November 2011

Virtual Appliance Errors

    DNZIMN867E Could not exchange SSH key with 172.19.105.25 due to the following error:
    2760-287 [dkeyexch] Internal error - exchange script returns unknown error: 1


This means your ssh keys have not been set up, or you have simply never logged into the HMC from your NIM server to do the ssh exchange, so check with -

# ssh hscroot@cld_hmc1 lssyscfg -r sys -F name

If that fails then take a look at this developerWorks page -

https://www.ibm.com/developerworks/wikis/display/WikiPtype/Manual+steps+to+enable+NIM+master+to+HMC+or+IVM+communication


    DNZIMN880E Failed to extract NIM Shared Product Object Tree (SPOT) resource from image appliance-0_image-1


There are two possible reasons for this: either the /export/nim filesystem needs more space, or it is trying to locate a language file.  So check the Director agent's error log here -

# /opt/ibm/director/agent/logs/error-log-0.html

For this message -
0042-001 nim: processing error encountered on "master":
Then along with it you should find something like this -
/usr/lib/nls/msg/de_AT not found
/usr/lib/nls/msg/de_LU not found


Just touch these files on the NIM server and you should be able to rerun your deploy with no further occurrences of this error.
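
For the two example message files above that would be -

# touch /usr/lib/nls/msg/de_AT /usr/lib/nls/msg/de_LU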


    DNZIMN356E Could not connect to the image repository


This means that there is an issue with the Systems Director talking to the image repository on the NIM server; the repository might still show up in VMControl, but all your deploys will continue to fail.  In some cases this is an issue with which network interface the system is talking over, so if you do have multiple interfaces then you can lock it down as follows -

# vi /opt/ibm/director/data/IPPreference.properties
com.ibm.director.server.preferred.ip=<preferred IP>
com.ibm.director.endpoint.excluded.ip.prefix=<excluded IP>,<excluded IP>


This is not true in all cases; it could relate to firewall issues, or, in my case, an issue with the levels of the system.  Run the following on the NIM server to check the install -

# /opt/ibm/director/agent/bin/./lwiupdatemgr.sh -listFeatures|grep nim       
    com.ibm.director.im.rf.nim.subagent_2.3.1.2 Enabled


Here we can see that the NIM subagent is enabled; next, look to see whether the plugin is active -

# /opt/ibm/director/agent/bin/./lwiplugin.sh -status | grep com.ibm.director.im
214:ACTIVE:Events Plug-in:2.3.1.2:com.ibm.director.im.rf.events
215:RESOLVED:NIM RF NLS Fragment:2.3.1.2:com.ibm.director.im.rf.nim.imaster.nl1
216:ACTIVE:NIM Master Interfaces Plug-in:2.3.1.2:com.ibm.director.im.rf.nim.imaster
217:ACTIVE:NIM Master CAS Agent Plug-in:2.3.1.2:com.ibm.director.im.rf.nim.master

Above is a correctly active agent.  If instead you see the following -
214:ACTIVE:Events Plug-in:2.3.1.2:com.ibm.director.im.rf.events
215:RESOLVED:NIM RF NLS Fragment:2.3.1.2:com.ibm.director.im.rf.nim.imaster.nl1
216:((LAZY)):NIM Master Interfaces Plug-in:2.3.1.2:com.ibm.director.im.rf.nim.imaster
217:INSTALLED:NIM Master CAS Agent Plug-in:2.3.1.2:com.ibm.director.im.rf.nim.master


Then the agent is not installed correctly, and it probably means your fix level is too high.  In my case it happened when I performed the following update -

6100-06_AIX_ML:DirectorCommonAgent:6.2.0.1:6.2.0.0:-:AIX 6100-06 Update
6100-06_AIX_ML:DirectorPlatformAgent:6.2.0.1:6.2.0.0:-:AIX 6100-06 Update


Moving up from 6.2.0.0 the issue above occurred, and the only way I could recover was to restore my server from a mksysb.

Wednesday, 16 November 2011

POWER Cloud Build Notes

- Requirements -
- IBM Systems Director with VMControl -

    
1/2 procs - Ent1 - Uncapped - 2/4 virtual - 4GB memory - 4-8GB storage, extra if it is to be the NIM master too.  OS 5.3 - 6.1 - 7.1
NIM server requires the Director Common Agent (plus the VMControl subagent) installed if it is not on the same node
POWER System 6/7
HMC
Integrated Virtualization Manager (IVM)

NOTES -

After an initial install I hit a number of problems getting the Director server to install NIM agents on the NIM server, plus other issues such as a complete Systems Director failure after it had not been used for a week.  In addition there were further problems getting VMControl to work and getting Director to rediscover and inventory known systems correctly.  This revision should resolve those issues and get a system built ready to be integrated into any Cloud environment.
It should be stated that this is only the second revision, so some details have been taken from memory or from searches of history files from when I was working round a problem.  I have plans for further systems and more tests when more equipment arrives.
I will also be looking into a possible issue where I can't get VMControl working correctly due to the HMC and Systems Director/NIM server being on different VLANs; the plan is to see if I get the same failure when the HMC is moved to the same VLAN as everything else.

- VIO/System Director server install -

First you will need to get a VIO server installed - the normal details can be followed to do this.  Once it's up and running and you have created a number of virtual adapters to use, you will need to create a partition for the Systems Director to be installed on using the specifications above, so we need some storage pools created -
- Good VIO quick sheet - http://www.tablespace.net/quicksheet/powervm-quickstart.html

First let's create a boot storage pool on the VIO server; this is where we will keep all the boot/rootvg data.  It will be followed by a data pool.

Create boot storage pool -
# mksp cloud_boot hdisk1

Then the data pool -

# mksp cloud_data hdisk2 hdisk3

Note - the disks for the storage pools can be anything you like, either from the POWER system's internal disks or SAN attached; it's your choice depending on what is available.  In this case it's all internal disk.


Then change the default storage pool from rootvg to cloud_boot -

# chsp -default cloud_boot

list those pools -

# lssp
Pool            Size(mb)   Free(mb)   Alloc Size(mb)    BDs   Type
rootvg             69888      46848              128      0   LVPOOL
cloud_boot         69888      69888               64      0   LVPOOL
cloud_data        139776     139776               64      0   LVPOOL


Create a client disk for the System Director on adapter vhost0 from cloud_boot storage pool -

# mkbdsp -sp cloud_boot 20G -bd lv_pclm1_boot -vadapter vhost0

This server will also be a NIM server so we will need to create some data storage for it -

# mkbdsp -sp cloud_data 40G -bd lv_pclm1_data -vadapter vhost0

Note - keep the LV names no longer than 15 characters, else the disks will not map correctly to the vhost as the command is passed over, and 'lsmap -all' fails.
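
To confirm that the backing devices have mapped correctly you can check the vhost on the VIO server (vhost0 here matches the adapter used above) -

# lsmap -vadapter vhost0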


To list the details in the storage pool -

# lssp -bd                    - shows the default pool, in this case cloud_boot
# lssp -bd -sp cloud_data    - to list other pools.


Once these pools are created and assigned over to the server you can get it booted for the network install; at this stage you need to boot into SMS, select the network device and configure the IP details.


- System Director Install -

Obtain the latest Systems Director code; this example uses 6.2.1.2.
Extract the gzip/tar file into a relevant directory with enough space, about 3GB in all.  The source file is SysDir6_2_Server_AIX.tar.
Kick off the install with -
# ./dirinstall.server
It should automatically expand the /opt filesystem to ensure that there is sufficient space for the install.
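
A minimal sketch of staging the media, assuming the tar file has been copied to a hypothetical /tmp/director directory -

# mkdir /tmp/director
# cd /tmp/director
# tar xvf SysDir6_2_Server_AIX.tar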

For proper environment setup for the database on this platform, please go to the IBM Systems Director Info Center:

http://publib.boulder.ibm.com/infocenter/director/v6r2x/topic/com.ibm.director.install.helps.doc/fqm0_t_install_config_database_application.html

You must configure the agent manager prior to starting the server.

To configure the agent manager, run
# /opt/ibm/director/bin/configAgtMgr.sh
This will allow you to set the Systems Director user (root) and the default passwords; in our working environment they all match those of 'root'.

To start the server manually, run

# /opt/ibm/director/bin/smstart    (this path can be added to the .profile PATH variable)

Monitor the status of the start with the following command as it can take some time to start -

# smstatus -r
Ctrl-c breaks out once the status changes to active.

- Install the IBM Systems Director VMControl 2.3.1 Plugin -


Confirm that IBM System Director 6.2.1.2 is installed on your IBM Systems Director server.

Open a browser to
https://<director>:8422/ibm/console and log on.
On the Welcome to IBM Systems Director panel, click on the Manage tab.  Confirm IBM Systems Director 6.2.1.2 is installed

Logout of IBM Systems Director.
Start a telnet (or ssh) session to your IBM Systems Director server and sign on.
Create a directory for installing VMControl Enterprise Edition 2.3.1 and then change directories to it.
# mkdir /tmp/sd
# cd /tmp/sd

The IBM System Director VMControl Enterprise Edition 2.3.1 for AIX 60 day trial can be downloaded at ->
http://www.ibm.com/systems/software/director/vmcontrol/enterprise.html
Unzip and decompress the IBM System Director VMControl Enterprise Edition 2.3.1 installation file
# gunzip -c /tmp/sd/SysDir_VMControl_2_3_1_Linux_AIX.tar.gz | tar xvf -

Using vi from PuTTY or another editor, edit the /tmp/sd/installer.properties file as follows and save it.

Installer_UI = silent
License_accepted = true


Install IBM System Director VMControl Enterprise Edition 2.3.1 as follows:

# /tmp/sd/IBMSystemsDirector-VMControl-Setup.sh -i silent

When completed successfully, stop and restart IBM System Director

 
# /opt/ibm/director/bin/smstop
# /opt/ibm/director/bin/smstart
# /opt/ibm/director/bin/smstatus -r


Once IBM Systems Director is active, point your browser to
https://<director>:8422/ibm/console and log on.
On the Welcome to IBM Systems Director panel, click on the Manage tab.  
You should find that VMControl 2.3.1 is now listed.

- Part 2: Installing the VMControl 2.3.1.2 fix pack -
Start a telnet (or ssh) session to your IBM Systems Director server and sign on.
Log into the Systems Director and create the following directory for the download to be installed from.
# mkdir /tmp/sd/vmc
# cd /tmp/sd/vmc


This fixpack can be downloaded from IBM Fix Central at ->
http://www-933.ibm.com/support/fixcentral/ - Package should be com.ibm.director.vmc.collection_2.3.1.2
Confirm that the VMControl 2.3.1.2 fix pack files are located in the /tmp/sd/vmc directory.
I have tried at this point to install this on the local server using the graphical interface, but so far it fails.  I have found that I need to run an up-to-date discovery of the system first -
# smcli collectinv -p "All Inventory" -n <node_name>

Once this has completed you can then update the system with the VMControl package.

# smcli importupd -vr /tmp/sd/vmc
# smcli installupd -v -n pclm1_sdvmc -u com.ibm.director.vmc.collection_2.3.1.2


When completed successfully, stop and restart IBM System Director

 
# /opt/ibm/director/bin/smstop
# /opt/ibm/director/bin/smstart
# /opt/ibm/director/bin/smstatus -r


When IBM Systems Director is active, point your browser to
https://<director>:8422/ibm/console and log on.

Confirm that VMControl is now at 2.3.1.2 via the Welcome to Systems Director/Manage tab
 - NIM Master install -
Build the LPAR you are going to install the NIM master on - either a standalone LPAR or, in this case, the Systems Director LPAR -

Install the AIX OS release that you want the NIM master to serve, remembering that the NIM server has to be at the same or a higher OS/patch level than the clients it serves.


Make sure you are happy with the network setup of the NIM master (speed and so on) so you can get things set up.  Remember you can always make changes later, but it is easier to ensure this is all correct first.


Install the following filesets if they are not installed already:

dsm.core
openssh.base.client
openssl.base
bos.sysmgt.nim.master
bos.sysmgt.nim.spot
bos.sysmgt.nim.client
bos.net.nfs.server
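
A quick way to check whether these are already installed is lslpp, for example -

# lslpp -L dsm.core openssh.base.client openssl.base bos.sysmgt.nim.master \
    bos.sysmgt.nim.spot bos.sysmgt.nim.client bos.net.nfs.server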


Now we can create the NIM master with the following script -

# /usr/sbin/nim_master_setup -a mk_resource=yes -a volume_group=nimvg -a device=/mnt/610TL06_lpp_res

In the example above I had mounted a known good lpp_source from another NIM server and used that, but you can use a mksysb or a DVD image (the default).

# lsnim        - lists the newly created NIM resources.

From here you can then add other lpp_sources and SPOTs, but at the moment we want to make sure that the Systems Director picks up the NIM repository and installs the NIM agent.  Now that you have created the NIM filesystem it would be best to increase its size to about 30-40GB.

# chfs -a size=30G /export/nim
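
You can confirm the new size with -

# df -g /export/nim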

Then perform an inventory on the Director server to pick up the changes to the server.

# smcli collectinv -p "All Inventory" -n <node>


- Installing the VMControl NIM subagent -


In the IBM Systems Director navigation pane, expand Systems Configuration and then click VMControl -

Notice the message -
DNZIMC708I - No image repositories were detected in IBM Systems Director.  Click Install Agents.

On the Welcome to Agent Installation Wizard panel, click Next.

Click on Common Agent Subagent Packages.

Select CommonAgentSubagent_VMControl_NIM-2.3.1.2 and click Add, then click Next.

Select your IBM Systems Director server and click Add, and then click Next.

Review the summary of the Install Agent Task and then click Finish.


Run the job now and display its properties.  Monitor the job for a successful completion.

Collect inventory on your IBM Systems Director (and NIM master) server
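
From the command line this is the same inventory collection used earlier -

# smcli collectinv -p "All Inventory" -n <node_name>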

When complete, in the IBM Systems Director navigation pane, expand Systems Configuration and then click VMControl.


Ensure that message DNZIMC709I is now displayed, indicating that no virtual appliances were found.

Then ensure that you restart the Director service -

# smstop;smstart;smstatus -r

- NIM Server further configuration -


We can now update the bosinst_data resource so that we have a hands-free installation.  We copy a file from the master system to the NIM volume group and define it; this time, though, we also modify the resource.  Read carefully!

# cp /bosinst.data /export/nim/bosinst_data/6100-05bid_ow
        This is the NIM resource (check it with 'lsnim -l 6100-05bid_ow') that the NIM creation script created.

For an automatic installation there are a number of lines we need to change:


The lines to change are from:


    PROMPT = yes
    RECOVER_DEVICES = Default
    ACCEPT_LICENSES =
    ACCEPT_SWMA =
    IMPORT_USER_VGS =
    CONSOLE = Default
    CREATE_JFS2_FS = Default

to:
    PROMPT = no
    RECOVER_DEVICES = no
    ACCEPT_LICENSES = yes
    ACCEPT_SWMA = yes
    IMPORT_USER_VGS = no
    CONSOLE = /dev/vty0
    CREATE_JFS2_FS = yes

add
  
  ENABLE_64BIT_KERNEL = yes

And simplify the target_disk_data: stanza from:

target_disk_data:
        PVID = 00f67207e26dbb8d
        PHYSICAL_LOCATION = U8233.E8B.107207P-V41-C21-T1-L8100000000000000
        CONNECTION = vscsi0//810000000000
        LOCATION =
        SIZE_MB = 70006
        HDISKNAME = hdisk0

to:
target_disk_data:
        HDISKNAME = hdisk0


Remember it is the file /export/nim/bosinst_data/6100-05bid_ow that needs to be edited, though you can of course change any other lines that you might want.

If you want to create and add any other additional resources then the following commands should help -


First, here is a package check to pick out any duplicate files in an lpp_source -

# /usr/lib/instl/lppmgr -d /export/nim/lpp_source/<lpp_source>  -u -b -x -r

From an lpp_source you can then create a SPOT, in this case from the lpp_source 610TL06-lpp into a SPOT called 610TL06_spot.

# nim -o define -t spot -a server=master -a source=610TL06-lpp -a location=/export/nim/spot/610TL06_spot 610TL06_spot

To perform a command line discovery -

# smcli discover -i <ip-address>,<ip-address>

Then you should be able to perform an install via the NIM server -

# nim -o bos_inst -a lpp_source=<lpp-source> -a spot=<spot> -a bosinst_data=<bos-inst> -a no_client_boot=no <server-name>
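
To watch the install from the NIM master you can check the client object's state (same placeholder client name as above) -

# lsnim -l <server-name>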
 

Friday, 23 September 2011

RHEL Kernel image too large

Interesting issue today trying to install Red Hat 6 on a POWER system; if you see this error -


Welcome to yaboot version 1.3.14 (Red Hat 1.3.14-35.el6)
Enter "help" to get some basic usage information
boot: linux
Please wait, loading kernel...
   Elf64 kernel loaded...
Loading ramdisk...
Claim failed for initrd memory at 02000000 rc=ffffffff



Now, after some searching on Bugzilla I found what looked like my problem; from this I did a little testing and resolved the issue as detailed below:


Boot the POWER server and enter the Open Firmware prompt with selection 8 at the IBM menu on start-up, then run the following command at the OK prompt -


0 > printenv real-base

If you see the following then real-base is still at its default of 32MB -
-------------- Partition: common -------- Signature: 0x70 ---------------
real-base                2000000             2000000



Now, if I have understood this correctly, the firmware is expecting an image of 32MB, and in our case for RHEL6 it needs to be 16MB (or smaller), so we need to set it correctly, as below -


0 > setenv real-base 1000000
0 > printenv real-base
-------------- Partition: common -------- Signature: 0x70 ---------------
real-base                1000000              2000000



Then reboot with a -

0 > reset-all



Then boot off the RHEL image as normal and you should find all installs from this point will work OK; in this case it was a virtual server, so the setting holds until the LPAR is removed from the HMC.


Ref -
  • 2000000 is 32MB
  • 1800000 is 24MB
  • 1000000 is 16MB
  • c00000 is 12MB

Monday, 5 September 2011

PowerHA Failure Rate

Changing Module failure rate -

Within PowerHA you can increase the rate at which HA detects the failure of the various heartbeat networks; there are 3 predefined values: Slow, Normal and Fast.

# smitty cm_config_networks
    - then select 'Change a Network Module using Predefined Values' and select the module you wish to change, in this case 'ether'

                               [Entry Fields]
* Network Module Name          ether
  Description                  Ethernet Protocol
  Failure Detection Rate       Normal                     +

  NOTE: Changes made to this panel must be
        propagated to the other nodes by
        Verifying and Synchronizing the cluster


As you can see, once the setting has been changed it is important to verify and sync the cluster ('smitty cm_ver_and_sync.select').
If you wish to tune this further then you can select the 'Custom Values' option -

                                     [Entry Fields]
* Network Module Name                   ether
  Description                           [Ethernet Protocol]
  Address Type                          Address                       +
  Path                                  [/usr/sbin/rsct/bin/hats_nim]  /
  Parameters                            []
  Grace Period                          [60]                           #
  Supports gratuitous arp               [true]                         +
  Entry type                            [adapter_type]
  Next generic type                     [transport]
  Next generic name                     [Generic_UDP]
  Supports source routing               [true]                         +
  Failure Cycle                         [10]                           #
  Interval between Heartbeats (seconds) [1.00]
 
  Heartbeat rate is the rate at which cluster servic
  es sends 'keep alive' messages between adapters in
  the cluster. The combination of heartbeat rate and
  failure cycle determines how quickly a failure can
  be detected and may be calculated using this
  formula:
  (heartbeat rate) * (failure cycle) * 2 seconds

  NOTE: Changes made to this panel must be
        propagated to the other nodes by
        Verifying and Synchronizing the cluster


So with the default values above it is 10 * 1.00 * 2 = 20.00 seconds before a failure of the network is declared.
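For example, halving the failure cycle to 5 with the same one-second heartbeat interval would give 5 * 1.00 * 2 = 10 seconds before a failure is detected; these numbers just illustrate the formula and are not the predefined Fast values.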

Then once you have this set, using 'Show a Network Module' will give you a summary of the current values.

PowerHA (HACMP) Notes

Useful PowerHA (HACMP) Notes -

Here are a few little notes that I have picked up over the years that relate to some of the functions of PowerHA and HACMP.  They have proved useful to me in some situations and hopefully will to you too.


HA ODMs

The HA ODMs are the object databases that HA uses to ensure that the cluster is all in sync.  If any of them differ in some way, in some cases 'lazy update' will tidy up the records and sync the nodes, but in other cases this will cause a command/action to fail, so you will need to go and sync the cluster.


DCD - The default HA ODM - HACMPtopsvcs. Look for 'instanceNum' - it should be the same on all cluster nodes, else it shows things have not synced correctly!


ACD - Active - this is stored in /usr/es/sbin/cluster/etc/objrepos/active; to read this ODM, change your ODM path to this location, and remember to change it back!


SCD - Staging


So to read the DCD ODM -


# odmget HACMPtopsvcs


HACMPtopsvcs:
        hbInterval = 1
        fibrillateCount = 4
        runFixedPri = 1
        fixedPriLevel = 38
        tsLogLength = 5000
        gsLogLength = 5000
        instanceNum = 4


The topsvcs daemon will also show the current instance number on the node.


# lssrc -ls topsvcs|grep Instance
  Configuration Instance = 4
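
To compare the instance number quickly across nodes, here is a sketch that assumes ssh works between the nodes (the node names are hypothetical) -

# for n in nodeA nodeB; do echo $n; ssh $n 'odmget HACMPtopsvcs | grep instanceNum'; done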


# odmget HACMPnode
    - the line ‘version’ shows the current HA version
    Version 11 = 6.1
    Version 10 = 5.5
    Version 9 = 5.4
    Version 8 = 5.3
    Version 7 = 5.2


Resource Group Dependencies

If you are having trouble with RG dependencies then the following command will list resource group start interdependencies -
"clrgdependency -t'PARENT_CHILD' -sp"


# clrgdependency -t [PARENT_CHILD | NODECOLLOCATION | ANTICOLLOCATION | SITECOLLOCATION] -sl
# clrgdependency -t PARENT_CHILD -sl
Parent Child
rg1 rg2
rg1 rg3
# clrgdependency -t NODECOLLOCATION -sl
# clrgdependency -t ANTICOLLOCATION -sl
HIGH:INTERMEDIATE:LOW
rg01::rg03frigg
# clrgdependency -t SITECOLLOCATION -sl
rg01 rg03 frigg

Heart-Beating


# lssrc -ls topsvcs
- Shows the status of the system heart-beating. 

Friday, 2 September 2011

PowerHA


PowerHA Problem Determination
Here are a few notes that I have made of some of the more useful day-to-day commands to help look at problems in your PowerHA (HACMP) clusters. If you are running an older version of HACMP, such as 5.4 or below, some of the paths may differ. Most of this should be correct for 5.5, 6.1 and 7.1, though I need to do some revision and testing for 7.1.

Cluster Status
It's normally a good idea to get a status of the cluster and try to work out how serious the problem is. Depending on the cluster/node status, some of the commands will need to be run on an active node, so in the case of a fail-over, the standby node -

/usr/es/sbin/cluster/utilities/clfindres - shows current status of resource groups.

/usr/es/sbin/cluster/clstat - shows cluster status and substate in real time, needs clinfo to be running.
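If clinfo is not running you should be able to start it via the SRC (assuming the usual clinfoES subsystem name) - startsrc -s clinfoES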

Now these 2 commands are good to show you the current status of the cluster, but you can also get more information with -

lssrc -ls clstrmgrES - shows the cluster manager state
lssrc -ls topsvcs - shows heartbeat information

If you want even more details about the cluster then the following commands should be able to help you further -
/usr/es/sbin/cluster/utilities/cltopinfo - show current topology status and some information about the cluster.

/usr/es/sbin/cluster/utilities/clshowres - shows short resource group information.
/usr/es/sbin/cluster/utilities/cldisp - shows cluster information such as monitor and retry intervals.
/usr/es/sbin/cluster/utilities/cllscf - lists the network configuration of an HACMP cluster.
/usr/es/sbin/cluster/utilities/cllsif - shows network interface information.
/usr/es/sbin/cluster/utilities/clRGinfo -p -t - shows current RG state.
/usr/es/sbin/cluster/utilities/clRGinfo -m - shows RG monitor status.

Cluster Logs -
Once you have an idea of the status of the cluster you can start looking at the log files to try to determine the problem. A good place to start is the cluster.log -
/usr/es/adm/cluster.log - I tend to filter this as follows so that I can get an idea of the events that have occurred, but also look out for 'config_too_long'.
cat /usr/es/adm/cluster.log |grep ' EVENT ' |more - Should look something like this -
Jan 13 18:51:29 <node1> HACMP for AIX: EVENT COMPLETED: node_up <node1> 0
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT START: node_up_complete <node1>
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT START: start_server cluster_AS
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT COMPLETED: start_server cluster_AS 0
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT COMPLETED: node_up_complete <node1> 0
Jan 13 18:59:16 <node1> HACMP for AIX: EVENT START: node_down <node1>
Jan 13 18:59:16 <node1> HACMP for AIX: EVENT START: stop_server cluster_AS
Jan 13 19:00:34 <node1> HACMP for AIX: EVENT COMPLETED: stop_server cluster_AS 0
Jan 13 19:01:04 <node1> HACMP for AIX: EVENT START: release_service_addr
From these log entries you can see which events have completed and which have failed; generally you will look out for 'config_too_long', EVENT FAILED, or events that have not completed, as every event that has a START should be followed at some point in the log by a COMPLETED, else this indicates a problem too.
Once you have picked out any problems you can use these as a key to look further into this log, or into hacmp.out, which lists the same events in much more detail; this should give you some more clues as to where to look from there. Depending on what you find, the problem may well be RSCT related or HA core.
Now if the problem relates to the HA core then you can look further into logs such as the following -
/var/hacmp/log/clstrmgr.debug - clstrmgrES activity, in even more detail than hacmp.out; a good place to look if events are missing from other logs.
/var/hacmp/log/cspoc.log - the C-SPOC log for the node that the command was issued on. It contains time-stamped, formatted messages generated by HACMP C-SPOC commands.
/var/hacmp/clcomd/clcomd.log - cluster communication daemon log.
/var/hacmp/clverify/clverify.log - cluster verification log, good place to look to make sure the cluster was working before the problem.
/var/hacmp/adm/history/cluster.mmddyyyy - It contains time-stamped, formatted messages generated by HACMP scripts.

RSCT (Reliable Scalable Cluster Technology)
These are stored in /var/ha/log. RSCT handles the site-to-site communication and heart-beating, which means that the RSCT logs are a good place to look if you are having problems with the fail-over adapters or the heartbeating disks. The logs are broken up into a number of related areas -
grpsvcs.* - Group Services log; HACMP cluster communication log.
nim.topsvcs.<interface>* - Network Interface Module Topology Services log; deals with specific interface communication and logging. There will be one of these for each cluster interface on each node, plus the disk/serial heartbeating devices.
nmDiag.nim.topsvcs.<interface>* - Network Interface Module diagnostic messages; this relates only to the network devices.
topsvcs.* - Topology Services log; a summary and some further detail of all the network topology activity occurring within the cluster. A good place to look to get a status of the adapters and the cluster.
topsvcs.default - Daily Topology Services log which is run at the beginning of the day to confirm the topsvcs status.

First Post

Well, this is the first post of this new blog, so let's cover what I'm hoping to do here.  I've been working with IBM's version of UNIX, AIX, since 1998, though not just at IBM; I have had jobs at other companies and have worked with other flavours such as HP-UX, Solaris and the now-dead Dynix/ptx.  Along with these I have also used various flavours of Linux, mainly Ubuntu, Red Hat and SLES.
Over the years I have collected a lot of knowledge and information, some of it, like Dynix, now dead and gone, but a lot of it still very useful; and more importantly, as things change I too am learning new things all the time.
Hopefully I can pass some of that information on and create a useful resource not just for myself but for others too.