ARP/MAC Tables Not Updating on Core Switches After F5 LTM Failover (GARP Issue?)

We have two F5 LTM 5250v appliances configured with 2 vCMP instances each in an HA pair (Active/Standby). Each F5 5250v has a 10G uplink to two core switches (Cisco Nexus 7010) configured as an LACP port-channel on the F5 side and a Port-Channel/vPC on the Nexus side.

Port-Channel127/vPC127 = F5ADC01
Port-Channel128/vPC128 = F5ADC02

When I look at the MAC address tables on both 7K1 and 7K2, I can see all the individual F5 MACs for each VLAN we have configured on the F5 vCMP instances.

We are having an issue during automatic or manual failover where the MAC addresses for the virtual-servers are not being updated. If F5ADC01 is Active and we force it to Standby, it immediately changes to Standby and F5ADC02 immediately takes over the Active role. However, the ARP tables on the Nexus 7K core switches do not get updated, so all the virtual-servers continue to be associated with the MAC address of F5ADC01.

We have multiple partitions on each vCMP instance with several VLANs associated with each partition. Each partition only has a single route-domain the VLANs are allocated to. For traffic to virtual-servers, we are using Auto-MAP to SNAT to the floating Self-IP and using Auto-Last Hop so return traffic passes through the correct source VLAN. We are not using MAC masquerading.

The ARP timeout on the Nexus 7Ks is 1500 seconds (the default), so it takes 25 minutes after a failover for the network to recover fully. Eventually the ARP entries for all virtual servers age out and get refreshed with the correct MAC address. Obviously this is not acceptable.

I found an SOL article that describes when GARPs can be missed after failover: SOL7332: Gratuitous ARPs may be lost after a BIG-IP failover event. We have confirmed the upstream core switches are not dropping any GARPs. As a test, I manually disabled all virtual-servers and then re-enabled them, and all MACs updated immediately.
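
For readers less familiar with what a gratuitous ARP looks like on the wire, here is a minimal sketch that builds one as raw bytes. It is not taken from BIG-IP; the helper names, the MAC address and the 192.0.2.10 address are made up for illustration, and the choice of the ARP request opcode with an all-zero target hardware address is an assumption (implementations differ, and some send the reply form instead). The defining property is that the sender and target protocol addresses are the same IP.

```python
import struct

def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))

def ip_to_bytes(ip: str) -> bytes:
    """Convert a dotted-quad IPv4 address into 4 raw bytes."""
    return bytes(int(octet) for octet in ip.split("."))

def build_garp(sender_mac: str, sender_ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP (request form)."""
    smac = mac_to_bytes(sender_mac)
    sip = ip_to_bytes(sender_ip)
    # Ethernet header: broadcast destination, sender source, EtherType 0x0806 (ARP)
    eth = b"\xff" * 6 + smac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC (zeros, assumed) and target IP == sender IP
    arp += smac + sip + b"\x00" * 6 + sip
    return eth + arp

# Hypothetical virtual address and MAC, for illustration only
frame = build_garp("02:00:00:00:00:01", "192.0.2.10")
print(len(frame), frame.hex())
```

A device that announces many virtual addresses at failover sends one such frame per address, so the receiving switch sees a burst of ARP traffic in a very short time.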

I have opened a support case with F5 and we have yet to determine where the issue lies. Does anybody have any ideas what the issue might be? If I need to provide more information about our configuration, let me know.

We are pretty new to the F5 platform. We recently migrated from the Cisco ACE30 platform. Failover on the ACE platform worked perfectly. Similar cabling setup (two port-channels to two separate Catalyst 6509 switches with an ACE30 module in each switch). After ACE failover, the MAC tables/ARP caches immediately updated.

Thank You!

22 Replies

tatmotiv
Cirrostratus
Sep 23, 2016
...that means that shared MAC would only be learned on one Port-Channel interface.

Right. Or, to be 100% correct, it should only be learned on one vPC at a time.

When failover occurs, you're saying the new Active will start sending packets including the shared MAC and the switch CAM tables should update to reflect the MAC now being on a new Port-Channel interface (thus no need for GARPs)?

Right. I think it will actually send GARPs nevertheless, but those are not needed to update the ARP tables on all neighboring devices; those can be left unchanged. Thus, the network will be less error-prone with respect to the N7K not learning the ARP update (regardless of the reason why that happens...).
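
The reasoning above can be sketched with a toy model that keeps a switch's two tables apart: the ARP table (IP to MAC) and the CAM table (MAC to port). This is a simplification, not Nexus behavior in detail; the class, the port names `po_adc01`/`po_adc02` and all addresses are hypothetical, and it assumes that MAC learning still happens for a frame even when the ARP update carried in it is lost.

```python
class ToySwitch:
    """Toy L3 switch: ARP table maps IP -> MAC, CAM table maps MAC -> port."""

    def __init__(self):
        self.arp = {}
        self.cam = {}

    def receive_frame(self, src_mac, port):
        # CAM learning: any frame updates where the source MAC lives
        self.cam[src_mac] = port

    def receive_garp(self, ip, mac, port):
        # A GARP that is fully processed updates both tables
        self.receive_frame(mac, port)
        self.arp[ip] = mac

    def egress_port(self, ip):
        return self.cam.get(self.arp.get(ip))

VIP = "192.0.2.10"  # hypothetical virtual address

def failover_without_masquerade():
    sw = ToySwitch()
    sw.receive_garp(VIP, "02:00:00:00:00:01", "po_adc01")  # unit 1 active
    # Failover: unit 2 announces its own MAC, but the ARP update is lost
    sw.receive_frame("02:00:00:00:00:02", "po_adc02")
    return sw.egress_port(VIP)

def failover_with_masquerade():
    sw = ToySwitch()
    shared = "02:00:00:00:00:aa"  # shared (masquerade) MAC, made up
    sw.receive_garp(VIP, shared, "po_adc01")  # unit 1 active
    # Failover: unit 2 sends frames with the same MAC; ARP entry stays valid
    sw.receive_frame(shared, "po_adc02")
    return sw.egress_port(VIP)

print(failover_without_masquerade())  # po_adc01 (stale until ARP ages out)
print(failover_with_masquerade())     # po_adc02
```

In the first case traffic keeps going to the old unit until the ARP entry is refreshed; in the second the ARP entry never needs to change, only the CAM entry moves.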
Kai_Wilke
MVP
Sep 23, 2016
Hi Ron,

I have to second Tatmotiv's suggestion to enable MAC masquerading on your Traffic-Groups.

We're successfully running this setup on vPC-enabled Nexus devices. Our switch CAM tables detect the failover event immediately on the very first GARP packet received and perform a single MAC flap.

Cheers, Kai
tatmotiv
Cirrostratus
Sep 23, 2016
Same here. I was also sceptical regarding MAC masquerading, but after experiencing the above-mentioned ARP problems in the Nexus infrastructure (which were somehow related to FabricPath, if I remember correctly), I converted all (60+) traffic-groups on my devices to MAC masquerading and didn't have any bad experiences. We are also connecting to Nexus switches using vPC (albeit N5Ks as spines, cross-connected via N7Ks as hubs), and we are also using vCMP (on VIPRION though, not 5250).
Ron_Peters_2122
Altostratus
Sep 23, 2016
Thank you all for the responses. I do have a TAC case open, and we are going to investigate why the Nexus 7Ks are dropping the GARP traffic. We will be coordinating another maintenance window to perform an ELAM capture to determine what is happening on the 7K side. Regardless of what we find, it does look like MAC masquerading may be the way to go, as it would also have the added benefit of improving failover speed.

I will respond again once I know more from Cisco TAC as to what the cause of the dropped GARP traffic is for those that may be curious.
Ron_Peters_2122
Altostratus
Oct 06, 2016
We have finally resolved this issue and, as promised, here is what the cause was. We confirmed 100% with a tcpdump on the F5s that they were sending gratuitous ARPs out their 10G interfaces for all virtual-addresses after a failover event.

We opened a TAC case with Cisco and found that there is a hardware rate-limiter in place on the particular F1 card (a very old card) that these F5s were terminating into. The rate-limit for class rl-4, which ARP was assigned to, was set to 100 packets per second. This is way too low to support the amount of ARP traffic the F5 generates, and we had millions of ARP drops on this particular card.

We analyzed the pcap file and found the rate at which the F5 transmitted these GARPs and adjusted the rate-limit on the rl-4 class to 3000 packets per second. We performed failover tests and the MAC addresses on both 7Ks updated immediately for all virtual-addresses.
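
The kind of pcap analysis described above can be sketched as follows. This is not the script we used; it is a minimal illustration that assumes the GARP capture timestamps (in seconds) have already been extracted from the pcap, and it approximates the rate-limiter with fixed one-second buckets, which a real hardware policer may not match exactly. The function names and the sample burst are made up.

```python
import math
from collections import Counter

def per_second_counts(timestamps):
    """Count packets in each whole-second bucket."""
    return Counter(math.floor(t) for t in timestamps)

def peak_pps(timestamps):
    """Highest number of packets seen in any one-second bucket."""
    return max(per_second_counts(timestamps).values(), default=0)

def dropped_at_limit(timestamps, limit_pps):
    """Packets exceeding limit_pps, summed over all one-second buckets."""
    counts = per_second_counts(timestamps)
    return sum(max(0, n - limit_pps) for n in counts.values())

# Synthetic burst (made-up numbers): 2500 GARPs in the first second
# after failover, then 500 more in the following second.
burst = [i / 2500 for i in range(2500)] + [1 + i / 500 for i in range(500)]

print(peak_pps(burst))                # 2500
print(dropped_at_limit(burst, 100))   # 2800
print(dropped_at_limit(burst, 3000))  # 0
```

With numbers like these, a 100 pps limit discards most of the burst, while a limit set above the observed peak lets every GARP through.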

Thanks for all the input you guys provided.
- tatmotiv
  Cirrostratus
  Oct 07, 2016
  Interesting! Thanks for the update!
- Destiny3986_116
  Nimbostratus
  Oct 07, 2016
  I think you could try configuring a "MAC Masquerade Address" for the traffic group.
- eddiepar_317026
  Nimbostratus
  Apr 10, 2017
  How was the rl-4 adjustment done on the switches? Was it done on the interfaces in pairs? On each switch?
  
  Thank you
BRUCE_A_NOLAN_1
Nimbostratus
Apr 30, 2018
We ran into a similar issue. We had several virtual servers on different VLANs that did not have self IPs or floating IPs configured. In our situation, allocating SIPs/FIPs from those VLANs' own subnets was not an option, because we were migrating specific virtual servers (DNS virtual servers) from one data center to another, not the networks, and did not want to readdress the virtual servers.

Our solution was to allocate SIPs/FIPs from 198.51.100.0/24, which is a reserved subnet assigned as "TEST-NET-2" for use in documentation and examples. It should not be used publicly.

On failover, a GARP is issued for these addresses. Since there are no virtual servers, SNATs or anything else routable or listening on them, no response is required. But the GARP for those addresses triggers the failover from the switch perspective to the STANDBY port-channel.

Failover for the virtual servers is instant.

MAC Masquerade is enabled for the traffic group.
