Forum Discussion

Ovov
Altostratus
Sep 01, 2023
Solved

Sync-failover group doesn't sync properly

Hello,

I need some help with a basic Active/Standby setup where I can't get the two nodes to sync data. This is the problem I end up with: "did not receive last sync successfully"

VLANs are configured like this:

VLAN     Tag   Tagged interface
Client   11    1.1
HA       13    1.3
Server   12    1.2
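
In tmsh terms, that VLAN layout corresponds roughly to the following (interface numbers are the ones from the table; this is only a sketch of the CLI equivalent):

tmsh create net vlan Client interfaces add { 1.1 { tagged } } tag 11
tmsh create net vlan Server interfaces add { 1.2 { tagged } } tag 12
tmsh create net vlan HA interfaces add { 1.3 { tagged } } tag 13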

Self IPs and routes are as follows:

 

 

[root@bigip1:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt  metric 4096 
10.11.11.0/24 dev Client  proto kernel  scope link  src 10.11.11.111 
10.12.12.0/24 dev Server  proto kernel  scope link  src 10.12.12.121 
10.13.13.0/24 dev HA  proto kernel  scope link  src 10.13.13.131 
127.1.1.0/24 dev tmm  proto kernel  scope link  src 127.1.1.254 
127.7.0.0/16 via 127.1.1.253 dev tmm 
127.20.0.0/16 dev tmm_bp  proto kernel  scope link  src 127.20.0.254 
192.168.159.0/24 dev mgmt  proto kernel  scope link  src 192.168.159.129


[root@bigip2:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt  metric 4096 
10.11.11.0/24 dev Client  proto kernel  scope link  src 10.11.11.112 
10.12.12.0/24 dev Server  proto kernel  scope link  src 10.12.12.122 
10.13.13.0/24 dev HA  proto kernel  scope link  src 10.13.13.132 
127.1.1.0/24 dev tmm  proto kernel  scope link  src 127.1.1.254 
127.7.0.0/16 via 127.1.1.253 dev tmm 
127.20.0.0/16 dev tmm_bp  proto kernel  scope link  src 127.20.0.254 
192.168.159.0/24 dev mgmt  proto kernel  scope link  src 192.168.159.130 

 

 

Floating IPs on both devices are set to:
- Client: 10.11.11.110
- Server: 10.12.12.120
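
For reference, the self IP layout above corresponds roughly to this tmsh (shown for bigip1; bigip2 uses .112/.122/.132; the object names and allow-service values are just illustrative, with Allow Default or higher needed on the HA VLAN):

tmsh create net self client-self address 10.11.11.111/24 vlan Client allow-service none
tmsh create net self server-self address 10.12.12.121/24 vlan Server allow-service none
tmsh create net self ha-self address 10.13.13.131/24 vlan HA allow-service default
tmsh create net self client-float address 10.11.11.110/24 vlan Client traffic-group traffic-group-1 allow-service none
tmsh create net self server-float address 10.12.12.120/24 vlan Server traffic-group traffic-group-1 allow-service none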

Both devices have device certificates, their time is in sync via NTP, and they run the same version, 17.1.0.2 Build 0.0.2 (provisioned from the same OVA), with the same license.

Config sync is set to: HA self IPs
Failover network is: HA + Management
Mirroring: HA + Server
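
In tmsh, those per-device HA addresses map roughly to this (shown for bigip1; property names are from memory, so double-check against "tmsh list cm device all-properties"):

tmsh modify cm device bigip1.sq.cloud configsync-ip 10.13.13.131
tmsh modify cm device bigip1.sq.cloud unicast-address { { ip 10.13.13.131 port 1026 } { ip 192.168.159.129 port 1026 } }
tmsh modify cm device bigip1.sq.cloud mirror-ip 10.13.13.131 mirror-secondary-ip 10.12.12.121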

BigIP1 is Online and BigIP2 is Forced Offline before I start building the cluster.

The hosts are connected via VMware Workstation LAN Segments, so no filtering is applied. I double-checked that I can see packets with "tcpdump -nn -i <interface>" on each of the Client/Server/HA interfaces, for example when establishing an SSH connection from the other host to the self IP of the interface being watched.

Then I add device trust. Soon both devices are shown as "In sync" in the device_trust_group.
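
For completeness, the device trust step can also be done from bigip1's tmsh, roughly like this (the peer admin password is a placeholder):

tmsh modify cm trust-domain Root ca-devices add { 10.13.13.132 } name bigip2.sq.cloud username admin password <peer-admin-password>
tmsh show cm sync-status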

Then I create a sync-failover group of the two devices with Automatic Incremental Sync and a max sync size of 10240. After this, the sync statuses are as follows:
- device_trust_group = In Sync
- Sync-Failover-Group = Awaiting Initial Sync
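
For reference, the device group creation in tmsh would be roughly the following (the incremental sync size property name is from memory; verify with "tmsh list cm device-group Sync-Failover-Group all-properties"):

tmsh create cm device-group Sync-Failover-Group type sync-failover devices add { bigip1.sq.cloud bigip2.sq.cloud } auto-sync enabled full-load-on-sync false incremental-config-sync-size-max 10240
tmsh show cm sync-status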

If I run "tcpdump -nn -i any tcp", I mostly see packets on the HA network on ports 1029 and 4343.
If I run "tcpdump -nn -i any udp", I mostly see packets on the HA network on port 1026.
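
To watch just the config-sync (CMI) traffic, a narrower filter also works, using the ports that show up in the logs and captures further down:

tcpdump -nn -i HA tcp port 4353 or tcp port 6699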

tmm log

 

 

Sep 1 22:39:29 bigip1.sq.cloud notice mcpd[7261]: 01071436:5: CMI listener established at 10.13.13.131 port 6699
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:32 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:39:34 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 22:44:48 bigip1.sq.cloud notice mcpd[7261]: 01071038:5: Master Key updated by user %cmi-mcpd-peer-10.13.13.132
Sep 1 22:52:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:57:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries

 

 

Lastly, I push the configuration from the device that is in the Online state to the Sync-Failover-Group.
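
The equivalent tmsh push (from the unit whose configuration should win) would be:

tmsh run cm config-sync to-group Sync-Failover-Group

("tmsh run cm config-sync from-group Sync-Failover-Group" pulls the peer's config instead.)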

Then the sync status ends up as quoted at the beginning of this message ("did not receive last sync successfully"). The suggested sync actions (push A or B to the group) do not help. I have looked through K63243467 and K13946.

I'd appreciate any suggestions that could resolve this or properly push/pull the config. Thank you!


4 Replies

  • Ovov
    Altostratus

    Thank you for the hints! I've followed some of the actions described in ID882609, though it wasn't exactly the situation I had. Specifically, one of the devices failed to correctly restart tmm with "bigstart restart tmm"; it started spawning the message "Re-starting mcpd" every two seconds.

    I restarted that second device and ran "tail -f /var/log/tmm" on both hosts.

    First device

     

    Sep 2 13:55:11 bigip2.xx.yyyy notice mcpd[6967]: 01b00004:5: There is an unfinished full sync already being sent for device group /Common/Sync-Failover-Group on connection 0xea1726c8, delaying new sync until current one finishes.

     

    The second device, the one with the sync issues, kept logging "end_transaction message timeout":

     

    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries

     

    That error message led me to K25064172 and K10142141. Although I'm not running in AWS, my VMware Workstation VMs used the vmxnet3 driver, so I tried switching to the sock driver as suggested in those KBs.

    [root@bigip1:Standby:Not All Devices Synced] config # lspci -nn | grep -i eth
    03:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    0b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    13:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    1b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    
    [root@bigip1:Standby:Not All Devices Synced] config # tmctl -d blade tmm/device_probed
    pci_bdf      pseudo_name type      available_drivers     driver_in_use
    ------------ ----------- --------- --------------------- -------------
    0000:03:00.0             F5DEV_PCI xnet, vmxnet3, sock,
    0000:13:00.0 1.2         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
    0000:0b:00.0 1.1         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
    0000:1b:00.0 1.3         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3

    The fix for VMware is:

    echo "device driver vendor_dev 15ad:07b0 sock" >> /config/tmm_init.tcl

    After I restarted both nodes, I saw the desired "In Sync" status.
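
    For anyone repeating this, the apply-and-verify sequence boils down to the following on each unit (I restarted both nodes, shown here as a reboot; a tmm restart might also be enough):

    echo "device driver vendor_dev 15ad:07b0 sock" >> /config/tmm_init.tcl
    reboot
    # after the unit is back up:
    tmctl -d blade tmm/device_probed    # driver_in_use should now show "sock"
    tmsh show cm sync-status            # should report "In Sync" after a config sync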

    Interestingly, I hit this issue on two separate computers running the same VMware Workstation version. I also reinstalled three different BIG-IP versions and always got the same result. Another odd thing: if I created a Sync-Only group instead of a Sync-Failover group, there were no issues at all. I think it must be some compatibility issue.

  • Check the connectivity between the BIG-IPs via the HA interface IPs:

    10.13.13.131 and 10.13.13.132

    Also check the port lockdown settings for the HA self IPs.

    Make sure the HA VLAN tagging (tagged or untagged) matches on both devices.

    Do a telnet on port 4353 between the BIG-IPs over the HA self IPs, as sketched below.
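
    For example, from bigip1's bash prompt (the HA self IP object name is a placeholder, adjust to yours):

    ping -c 3 10.13.13.132                                  # reachability over the HA VLAN
    tmsh list net self <ha-self-name> allow-service         # port lockdown of the HA self IP
    tmsh list net vlan HA interfaces                        # tagged vs untagged on the HA VLAN
    timeout 3 bash -c '</dev/tcp/10.13.13.132/4353' && echo "tcp/4353 reachable"   # telnet substitute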

     

    • Ovov
      Altostratus

      Thank you for the suggestion.

      However, I haven't found the issue:

      • Port lockdown for the HA self IPs is set to "Allow All" on both devices
      • Both HA interfaces are tagged with the same VLAN, 13
      • The port 4353 connection is working fine; I can see packets travelling both ways on both hosts. Checked with: tcpdump -nn -i HA tcp port 4353

         First host

       

      09:39:39.272348 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [P.], seq 71446:72894, ack 0, win 9018, options [nop,nop,TS val 1419664648 ecr 1419664639], length 1448 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.272436 IP 10.13.13.131.57460 > 10.13.13.132.4353: Flags [.], ack 72894, win 65535, options [nop,nop,TS val 1419664647 ecr 1419664648], length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.283026 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [.], seq 72894:74342, ack 0, win 9018, options [nop,nop,TS val 1419664651 ecr 1419664647], length 1448 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.283110 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [P.], seq 74342:74400, ack 0, win 9018, options [nop,nop,TS val 1419664651 ecr 1419664647], length 58 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.793529 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [P.], seq 1:203, ack 1, win 12316, length 202 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.793643 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 203, win 16189, length 0 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.811879 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [P.], seq 1:76, ack 203, win 16189, length 75 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.813850 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 76, win 12391, length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.824753 IP 10.13.13.131.57460 > 10.13.13.132.4353: Flags [P.], seq 0:202, ack 72894, win 65535, options [nop,nop,TS val 1419665200 ecr 1419664648], length 202 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=

       

      Second host

       

      09:41:24.654511 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 39154:40551, ack 1, win 6565, options [nop,nop,TS val 1419770029 ecr 1419770026], length 1397 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:24.658487 IP 10.13.13.131.51678 > 10.13.13.132.4353: Flags [.], ack 40551, win 65535, options [nop,nop,TS val 1419770030 ecr 1419770029], length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:24.658558 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 40551:42079, ack 1, win 6565, options [nop,nop,TS val 1419770033 ecr 1419770030], length 1528 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.189243 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 3575478456, win 13042, length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.190545 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 1, win 18138, length 0 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.190633 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 1, win 13042, length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.191423 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 1, win 18138, length 0 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.658648 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [.], seq 40551:41999, ack 1, win 6565, options [nop,nop,TS val 1419771033 ecr 1419770030], length 1448 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.764044 IP 10.13.13.131.51678 > 10.13.13.132.4353: Flags [.], ack 41999, win 65535, options [nop,nop,TS val 1419771136 ecr 1419771033], length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.764175 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 41999:42079, ack 1, win 6565, options [nop,nop,TS val 1419771139 ecr 1419771136], length 80 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.764206 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 42079:43527, ack 1, win 6565, options [nop,nop,TS val 1419771139 ecr 1419771136], length 1448 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=