Forum Discussion

Jason_40733
Oct 28, 2013

Viprion w/ VCMP Guest loss of network connectivity - Solved

Posting this as an FYI. The solution, such as it is, is included. This was a royal pain to track down, even with F5 support. It's probably a rare occurrence, if it ever happens again. But should someone run into it, I'd like to save you many hours or days of headaches.

 

Design: VCMP Host00 running 11.2.0, with 4 x VCMP guests running 11.2.0: Guest02, 04, 06, 08

VCMP Host01 running 11.2.0, with 4 x VCMP guests running 11.2.0: Guest03, 05, 07, 09

Host00 and Host01 are standalone. Guests are paired 2-3, 4-5, 6-7, 8-9 in active/standby, with the even-numbered guests active.

 

Initial Incident: VCMP Host00 has a power supply blip that causes the system to reboot hard. All four active nodes on Host00 failover ( shouldn't this be 'succeedover'? ) to their odd-numbered partners. Upon reboot, Host00 restarts all four guests. Guest04, 06, and 08 all assume the 'standby' role when connecting with their partners. Guest02 assumes the active role because it fails to talk to its partner over the config-sync VLAN. Split brain ensues and all VIPs are useless.
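
For anyone who wants to watch for this condition, here is a minimal sketch of the check itself: given the failover state each guest reports, flag any pair where both members claim 'active'. The pairing comes from the design above. How you collect the states is up to you; the `states` dictionary and the function name are hypothetical, not anything from F5.

```python
from typing import Dict, List, Tuple

# Failover pairs from the design above: even guests on Host00, odd guests on Host01.
PAIRS: List[Tuple[str, str]] = [
    ("Guest02", "Guest03"),
    ("Guest04", "Guest05"),
    ("Guest06", "Guest07"),
    ("Guest08", "Guest09"),
]


def find_split_brain(
    states: Dict[str, str], pairs: List[Tuple[str, str]] = PAIRS
) -> List[Tuple[str, str]]:
    """Return every pair in which both members report 'active'."""
    return [
        (a, b)
        for a, b in pairs
        if states.get(a) == "active" and states.get(b) == "active"
    ]


# States as described in the incident, after Host00 rebooted.
# (Hypothetical input format: guest name -> reported failover state.)
after_reboot = {
    "Guest02": "active", "Guest03": "active",
    "Guest04": "standby", "Guest05": "active",
    "Guest06": "standby", "Guest07": "active",
    "Guest08": "standby", "Guest09": "active",
}

print(find_split_brain(after_reboot))
```

With the states from this incident, only the Guest02/Guest03 pair is flagged.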

 

Bandaid: We immediately hard-offline Guest02 so that Guest03 has sole access to all VIPs/Pools.

 

Research: We worked with F5 support to run down the issues. Guest02 remained unable to communicate over the interface used for config-sync and cluster communication, even though all 7 other guests and both hosts could communicate on it. We verified, re-verified, and then verified again all network configs, VCMP host configs, and VCMP guest configs. No changes were made during the reboot; we verified this against the subversion backup of the configs ( all files… ).

 

Initial Findings: On the VLAN interface and IP used for cluster communication, we found that Guest02 was able to see traffic ( tcpdump ) from other guests/hosts. However, it would not update its ARP table. This behavior survived a guest reboot, and a host/guest reboot.
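
The symptom above (peers visible in tcpdump but never resolved in the ARP table) can be checked with a rough sketch like the following. It assumes you have already pulled the peer IPs out of a capture and have ARP table text in the Linux `arp -n` style, where unresolved entries show `(incomplete)` instead of a MAC. The function names, sample addresses, and the `ha_vlan` interface name are all made up for illustration.

```python
import re
from typing import Iterable, List, Set

# Matches a colon-separated MAC address such as 00:00:5e:00:53:04.
MAC_RE = re.compile(r"\b[0-9a-f]{2}(?::[0-9a-f]{2}){5}\b", re.IGNORECASE)


def resolved_ips(arp_output: str) -> Set[str]:
    """Collect the IPs whose ARP table line contains a MAC address."""
    resolved = set()
    for line in arp_output.splitlines():
        fields = line.split()
        # Header lines and '(incomplete)' entries have no MAC and are skipped.
        if fields and MAC_RE.search(line):
            resolved.add(fields[0])
    return resolved


def unresolved_peers(seen_ips: Iterable[str], arp_output: str) -> List[str]:
    """Return peers seen on the wire that have no resolved ARP entry."""
    return sorted(set(seen_ips) - resolved_ips(arp_output))


# Hypothetical sample data (documentation address range, made-up interface name).
SAMPLE_ARP = """\
Address      HWtype  HWaddress          Flags Mask  Iface
192.0.2.4    ether   00:00:5e:00:53:04  C           ha_vlan
192.0.2.3            (incomplete)                   ha_vlan
"""
seen_in_capture = ["192.0.2.3", "192.0.2.4"]

print(unresolved_peers(seen_in_capture, SAMPLE_ARP))
```

Any address this returns is a peer the guest can hear but has not learned, which is the state Guest02 was stuck in.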

 

Initial Fix: We tried removing and re-creating the interface by removing the VLAN from the guest and at the host level, then re-creating it, even using a different IP on the VLAN. Same results.

 

Second Fix: Based on a rare issue with VCMP, F5 recommended we upgrade from 11.2.0 to 11.2.1 HF9. The upgrade was done on both the Host and the Guest, with identical network results. As soon as Guest02 came up, the network issue was immediately present, and the downtime used to try to bring it online immediately resulted in split-brain again. ( This whole time, communication between the other 7 guests and two hosts was fine on this same VLAN/subnet. )

 

Third Fix: We moved another Guest into the cluster with Guest03 to determine if there was perhaps something odd going on with Guest03 itself. Everything went well with the alternate Guest in the cluster. The problem was definitively identified as belonging to Guest02.

 

Final Fix: We nuked Guest02 from the VCMP Host… including destroying the disk image. Then we recreated the guest ( same name, mgmt IP ) and re-created the partitions, route-domains, VLANs, and Self-IPs ( not Floating IPs ). We brought the "new" Guest02 up and it was able to immediately communicate with Guest03 with no issues. We finally added it to the cluster with Guest03 and did a sync-push from Guest03 to Guest02 to restore complete functionality. Problem solved.

 

Theory: ( feel free to correct/add to this if you like ) There was some sort of undetected corruption of the Guest02 disk image, potentially caused by the initial sudden loss of power. It was small enough for everything to look swell and log ZERO errors anywhere ( yes, the whole time no errors were shown other than "can't talk to my partner" ). Across multiple QkView examinations, 3 different sets of eyes from our company plus 2 different F5 engineers and F5 developers could find nothing wrong.

 

Parting Thought: The upgrade from 11.2.0 to 11.2.1 HF9 was quite simple, and I highly recommend it for the GUI enhancements to the config-sync Overview. And at this point, I think anyone in my group can move BIG-IP devices ( virtual/physical ) into and out of sync-failover clusters within minutes by editing files.

 

Hopefully nobody will ever need to use this. But in case you do, perhaps this will save you from tracking down DenverCoder9.

 

1 Reply

  • Jason, We recently had an issue that sounds similar and came across your post. We didn't hard reboot, but we had connectivity issues that would come and go, constantly flapping all the monitors on the box. We are running vCMP on 11.4.1 HF2. Support kept trying to go down the ARP/L2 rabbit hole. We were able to figure out that the packets didn't even show up in tcpdump, so TMM wasn't even trying to send them out. Support was at a loss. Eventually we did this - https://support.f5.com/kb/en-us/solutions/public/13000/000/sol13030.html This guest was configured on 2 slots. As soon as we switched over to the other slot, it was working. We applied the above solution on the slot that was having issues. When it came back online, we switched back to that slot and the problem was gone. MCPD creates a binary image of the config to use, and each slot creates its own. The theory was that this image got messed up. Reboots won't fix it. May or may not be related. Just thought I would share in case it helps anyone out. Thanks, Chad