Problems Overcome During a Major LTM Software/Hardware Upgrade

I recently completed a successful major LTM hardware and software migration which accomplished two high-level goals:

· Software upgrade from v9.3.1HF8 to v10.1.0HF1
· Hardware platform migration from 6400 to 6900
 
I encountered several problems during the migration event that would have stopped me in my tracks had I not (in most cases) already run into them during my testing. This is a list of those issues and what I did to address them. While I may not have full documentation for every problem or even fully understand all the details, the bottom line is that these fixes worked. My hope is that someone else will benefit from this list when it counts the most (and you know what I mean).
 
 
Problem #1 – Unable to Access the Configuration Utility (admin GUI)
The first issue I had to resolve was apparent immediately after the upgrade finished. When I tried to access the Configuration utility, I was denied:
 
Access forbidden!
You don't have permission to access the requested object.
Error 403
 
I happened to find the resolution in SOL7448: Restricting access to the Configuration utility by source IP address. The SOL refers to bigpipe commands, which is what I used initially:
 
bigpipe httpd allow all add
bigpipe save
 
Since then, I’ve worked out the corresponding TMSH commands, since tmsh is F5’s long-term direction for managing the system:
 
tmsh modify sys httpd allow replace-all-with { all }
tmsh save sys config
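Once you can reach the GUI again, it’s worth re-restricting access to your management networks rather than leaving it wide open. The same command accepts specific addresses or subnets; the values below are placeholders only, so substitute your own and double-check the accepted formats against SOL7448:
 
# the address and subnet below are examples only
tmsh modify sys httpd allow replace-all-with { 10.0.0.5 192.168.1.0/24 }
tmsh save sys config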
 
 
Problem #2 – Incompatible Profile
I encountered the second issue after the upgraded configuration was loaded for the first time:
 
[root@bigip2:INOPERATIVE] config # BIGpipe unknown operation error: 01070752:3: Virtual server vs_0_0_0_0_22 (forwarding type) has an incompatible profile.
 
By reviewing the /config/bigip.conf file, I found that my forwarding virtual servers had a TCP profile applied:
 
virtual vs_0_0_0_0_22 {
 destination any:22
 ip forward
 ip protocol tcp
 translate service disable
 profile custom_tcp
}
 
Apparently v9 did not care about this, but v10 would not load the configuration until I manually removed the TCP profile references from all of my forwarding virtual servers.
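For reference, this is all the change amounts to; here is the same virtual server with the profile line removed:
 
virtual vs_0_0_0_0_22 {
 destination any:22
 ip forward
 ip protocol tcp
 translate service disable
}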
 
 
Problem #3 – BIGpipe parsing error
Next, I ran into another problem while attempting to load the configuration for the first time:
 
BIGpipe parsing error (/config/bigip.conf Line 6870): 012e0022:3: The requested value (x.x.x.x:3d-nfsd {) is invalid (show | <pool member list> | none) [add | delete]) for 'members' in 'pool'
 
While examining this error, I noticed that the port number had been translated into a service name – “3d-nfsd”. Fortunately, during my initial v10 research I had come across SOL11293 - The default /etc/services file in BIG-IP version 10.1.0 contains service names that may cause a configuration load failure. I had already added a step to my upgrade process to stop the LTM from translating port numbers into service names, but it was not scheduled until after the configuration had been successfully loaded on the new hardware. Instead, I had to move this step up in the overall process flow:
 
bigpipe cli service number
b save
 
The corresponding TMSH commands are:
 
tmsh modify cli global-settings service number
tmsh save sys config
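Before attempting the load again, a quick grep of the configuration file will tell you whether any translated service names are still lurking. I’m using the name from my own error here; yours may differ:
 
# "3d-nfsd" is just the service name from the error above
grep -n "3d-nfsd" /config/bigip.conf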
 
 
Problem #4 – Command is not valid in current event context
This was the final error we encountered when trying to load the upgraded configuration for the first time:
 
BIGpipe rule creation error: 01070151:3: Rule [www.mycompany.com] error: line 28: [command is not valid in current event context (HTTP_RESPONSE)] [HTTP::host]
 
While reviewing the iRule, it was obvious that we had a statement that didn’t make any sense, since there is no Host header in an HTTP response. Apparently this didn’t bother v9, but v10 didn’t like it:
 
when HTTP_RESPONSE {
 switch -glob [string tolower [HTTP::host]] {
    <do some stuff>
 }
}
 
We simply removed that event from the iRule.
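For what it’s worth, if the host-based logic is genuinely needed, it belongs on the request side, where HTTP::host is valid. A minimal sketch, keeping the original placeholder:
 
when HTTP_REQUEST {
 switch -glob [string tolower [HTTP::host]] {
    <do some stuff>
 }
}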
 
 
Problem #5 – Failed Log Rotation
After I finished my first migration, I found myself in a situation where none of the logs in the /var/log directory were being rotated. The /var/log/secure log file held the best clue about the underlying issue:
 
warning crond[7634]: Deprecated pam_stack module called from service "crond"
 
I had to open a case with F5, who found that the PAM crond configuration file (/config/bigip/auth/pam.d/crond) had been pulled from the old unit:
 
#
# The PAM configuration file for the cron daemon
#
#
auth    sufficient      pam_rootok.so
auth    required        pam_stack.so service=system-auth
auth    required        pam_env.so
account required        pam_stack.so service=system-auth
session required        pam_limits.so
#session        optional        pam_krb5.so
 
I had to update the file from a clean unit (which I was fortunate enough to have at my disposal):
 
#
# The PAM configuration file for the cron daemon
#
#
auth       sufficient pam_rootok.so
auth       required   pam_env.so
auth       include    system-auth
account    required   pam_access.so
account    sufficient pam_permit.so
account    include    system-auth
session    required   pam_loginuid.so
session    include    system-auth
 
and restart crond:
 
bigstart restart crond
 
or in the v10 world:
 
tmsh restart sys service crond
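To confirm the fix without waiting for the next rotation cycle, you can watch the secure log after restarting crond; if the PAM file is correct, the pam_stack warning should stop appearing. This is plain shell, nothing BIG-IP specific:
 
tail -f /var/log/secure | grep crond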


Problem #6 – LTM/GTM SSL Communication Failure

This particular issue is the sole reason that my most recent migration took 10 hours instead of four. Even if you do have a GTM, you are not likely to run into it, since it was a result of our own configuration, but I thought I’d include it because it isn’t something you’ll see documented by F5. One of the steps in my migration plan was to validate successful LTM/GTM communication with iqdump. When I got to this point in the process, I found that iqdump was failing in both directions because of SSL certificate verification, despite having installed the new Trusted Server Certificate on the GTM and Trusted Device Certificates on both the LTM and GTM. After several hours of troubleshooting, I decided to run a tcpdump to see if I could gain any insight from what was happening on the wire. I didn’t catch it at first, but when I looked at the trace again later, I saw that the hostname on the certificate the LTM was presenting was not correct. It was a very small detail that could easily have been missed, but it was the key to identifying the root cause.
 
Having dealt with Device Certificates in the past, I knew that the Device Certificate file was /config/httpd/conf/ssl.crt/server.crt. When I looked in that directory on the filesystem, I found a number of certificates (and, correspondingly, private keys in /config/httpd/conf/ssl.key) that should not have been there. I also found that these certificates and keys had been pulled from the configuration on the old hardware. So I removed the extraneous certificates and keys from these directories and restarted the httpd service ("bigstart restart httpd", or in the v10 world, "tmsh restart sys service httpd"). After I did that, the LTM presented the correct Device Certificate and LTM/GTM communication was restored. I'm still not sure to this day how those certificates got there in the first place...
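For anyone chasing a similar problem, the two checks that finally exposed the root cause boil down to a couple of commands: inspect the subject of the Device Certificate the LTM is serving, and capture the iQuery exchange on the wire. This is just a sketch; iQuery runs over TCP port 4353, 0.0 is the BIG-IP catch-all capture interface, and the capture file name is only a placeholder:
 
openssl x509 -in /config/httpd/conf/ssl.crt/server.crt -noout -subject -issuer
tcpdump -ni 0.0 -s0 -w /var/tmp/iquery.cap port 4353
 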
Published May 27, 2010
Version 1.0

3 Comments

  • j_pedley_46776
    Historic F5 Account
    Great work! Another thing to worry about with iRules is data group naming. In 9.4.x, I have data groups such as ::dg_name. This works fine in 9.4 but fails in 10.2; the fix is to add the $. You can also do this pre-migration, but be warned: if your data group name has a hyphen in it, 9.4 will truncate the variable, and since it no longer matches a valid data group, the iRule will abort.

    Clients really don't like to see TCP resets...
  • "Jpendley: [For data groups the] fix is to add the $."

    The better fix is to remove $:: or :: from any data group names to preserve CMP compatibility.

    Aaron