Forum Discussion

alex100_194614
May 27, 2015

Saving UCS on vCMP guest hangs

OK, I have a vCMP-enabled Viprion 2400 which runs 4 guests (3 LTM, 1 LTM + APM). When I run tmsh save sys ucs, config_save hangs. Now a bit more detail: we have 30 BIG-IP "boxes" across our production environment, most running 10.2.4 and some running 11.6; a few are VE and a few are vCMP guests. So we have a mix of hardware platforms, including two new Viprion 2400 chassis with two blades each, configured as a redundant pair.

I have a daily UCS backup job running from a remote Linux box. It's a bash script kicked off from cron every evening which connects to the BIG-IP appliances via SSH, executes "tmsh save sys ucs", does an SCP, cleans up locally, and then does the same thing creating and copying an SCF file. This script ran with great success for almost a year, until I added the Viprions to the environment.

It appears that at some point saving the UCS file hangs, so when I try to save a UCS the next day, something goes wrong and it fails to create an archive. If I do it from the GUI I get no error whatsoever. For a while I see the standard icon with the down pointer and the default message "Receiving configuration data from your device"; after a few minutes it just disappears with no error (and no UCS file created in /var/local/ucs). When executing tmsh save sys ucs from the command line, I get the following message:

Waiting for process config_save (pid:23131) to complete.
Waiting for process config_save (pid:23131) to complete.
Waiting for process config_save (pid:23131) to complete.
Waiting for process config_save (pid:23131) to complete.

It looks like a previous config_save job hung and never finished, and now a new one cannot start. So something gets stuck in the process responsible for saving the UCS file, which prevents subsequent jobs from running. There is nothing in the LTM or audit logs, only a reference that config_save is now executing...
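For context, the nightly job I mentioned above boils down to something like the loop below. This is a simplified sketch, not the production script; the host list, user, and paths are placeholders:

    #!/bin/bash
    # Simplified sketch of the nightly backup loop (placeholder hosts and paths)
    HOSTS="bigip-a.example.net bigip-b.example.net"
    DEST=/backups/f5
    STAMP=$(date +%Y%m%d)

    for host in $HOSTS; do
        # create the UCS archive on the box, copy it off, then remove the remote copy
        ssh root@"$host" "tmsh save sys ucs /var/local/ucs/${host}-${STAMP}.ucs"
        scp "root@${host}:/var/local/ucs/${host}-${STAMP}.ucs" "$DEST/"
        ssh root@"$host" "rm -f /var/local/ucs/${host}-${STAMP}.ucs"

        # same idea for the SCF file (written under /var/local/scf by default)
        ssh root@"$host" "tmsh save sys config file ${host}-${STAMP}.scf"
        scp "root@${host}:/var/local/scf/${host}-${STAMP}.scf" "$DEST/"
        ssh root@"$host" "rm -f /var/local/scf/${host}-${STAMP}.scf"
    done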

 

Redeploying (essentially rebooting) the vCMP guest corrects the issue for some time, but then it reappears randomly; you can never tell when. Sometimes it happens a few days later and sometimes I can go a month without an issue, but sooner or later it happens. This only occurs on the Viprion, never on the physical appliances or the VEs. Also, saving an SCF file never fails or hangs on these vCMP guests; only UCS is affected.

 

I have a case open with F5 support and they have not been able to determine a root cause. The proposed solution was to provision the APM module and then remove it. According to them there was a bug in 11.4 (if I am not mistaken) which could cause the UCS save to hang if APM had previously been provisioned on the system. I am on the latest 11.6 HF at the moment. Also, APM was never provisioned on 3 out of the 4 vCMP guests, so I doubt this is going to work.
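If I understand the suggestion correctly, the workaround would amount to toggling APM provisioning, something along these lines (my reading of their proposal, not a procedure they gave me verbatim):

    # provision APM, let the provisioning change settle, then deprovision it again
    # note: provisioning changes restart services, so this needs a maintenance window
    tmsh modify sys provision apm level nominal
    tmsh modify sys provision apm level none
    tmsh save sys config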

 

So I really wonder what the heck is going on and why config_save hangs. Has anyone come across a similar issue with Viprion and vCMP? Please share your thoughts and ideas!

 

8 Replies

  • According to them there was a bug in 11.4 (if I am not mistaken) which could cause the UCS save to hang if APM had previously been provisioned on the system. I am on the latest 11.6 HF at the moment. Also, APM was never provisioned on 3 out of the 4 vCMP guests, so I doubt this is going to work.

     

    I understand you mean ID453545/sol16089. If it does not fix it (i.e., the problem comes back), could you ask support to also check ID521272?

     

    ID521272 AuthTokenWorker causes OutOfMemory if AuthTokens requested at high rate

     

  • 11.6.0 has a detailed statistics reporting engine for troubleshooting guest details. Maybe this will help?

     

    "Description

     

    As of BIG-IP 11.6.0, the vCMP hypervisor can view detailed guest performance statistics such as Disk usage, CPU usage and Network Throughput using Analytics, also called Application Visibility and Reporting (AVR). You can use AVR to view current and trending data regarding vCMP guest resource and network utilization. You can generate PDF reports and either download or email them from the BIG-IP system."

     

    https://support.f5.com/kb/en-us/solutions/public/15000/600/sol15684.html

     

    My two cents: if all four guests are running on the same blade, and the blade has a spinning disk (not SSD), you might get better results by running a staggered UCS save. That is, do guest 1, and only once that completes begin the UCS save on guest 2 (see the sketch below). This would avoid what appears to be a problem with initiating a save on guests 1 and 2 at the same time. (Probably not what you want to hear, but this should work until the concurrent-save issue is sorted out.)
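    Something like this would keep the saves strictly sequential, only moving on once config_save has finished on the previous guest (an illustrative sketch; guest names are placeholders):

        for guest in guest1.example.net guest2.example.net; do
            # wait until no config_save is still running on this guest before starting its save
            while ssh root@"$guest" "pgrep -f config_save >/dev/null"; do
                sleep 30
            done
            ssh root@"$guest" "tmsh save sys ucs nightly.ucs"
        done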

     

    • alex100_194614
      Thanks for the info on Analytics. I am already running the UCS save job in a staggered way; my script runs one job at a time, traversing down the inventory list. For now I have been able to find a workaround.
  • So, I was able to find a workaround for now by manually killing the config_save process. Previously I had tried to kill it with a plain kill, which did not seem to work; however, kill -9 does the trick (see the sketch below). Once the process is killed, things seem to go back to normal. I have not encountered any issues for a week, but I have a feeling it's just a matter of time until the problem recurs. All subsequent UCS save jobs are running OK so far. I will keep an eye on it for now.
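    For anyone hitting the same thing, the cleanup I do before retrying is roughly this (a rough sketch of my workaround, not an F5-recommended procedure):

        # look for a config_save left over from the previous (hung) save attempt
        pid=$(pgrep -f config_save)
        if [ -n "$pid" ]; then
            # a plain kill did nothing here; only SIGKILL cleared it
            kill -9 $pid
        fi
        # after that, the save runs normally again
        tmsh save sys ucs manual-backup.ucs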

     

  • So, I am back to the point where I started. The problem is back and is recurring on multiple vCMP guests. SOL16089 did not solve it, so I am reopening the case with F5 support. In the meantime I have made a few observations. The problem seems to resurface when configuration changes are made to the system: I had been working on the SNMP agent configuration on all guests, and suddenly the issue re-emerged on two of the four vCMP guests. However, I could be wrong and it could just be a coincidence. Analyzing the situation, I have come to the conclusion that it has to be related to the fact that I have a system with multiple blades. It seems to me it has something to do with inter-blade communication, but I could be wrong...

     

    I will post some updates as I progress towards the resolution.

     

  • Update:

     

    HF5 seemed to address this issue for the most part; however, we still had one single instance of the backup process going into a zombie state. Now we are on 11.6 HF6 and the issue with hanging UCS saves seems to be completely resolved. I have also noticed that UCS creation takes significantly less time after upgrading to the latest HF.