Forum Discussion

portoalegre
Jun 04, 2013

Clock advanced case with F5

Since the end of last year, either of our LTM 6900 units will appear to freeze for whatever reason and log errors like the following:

 

Mon Feb 11 06:35:49 GMT 2013 notice tmm2 tmm2[10266] 01010029 Clock advanced by 3457 ticks

 

Mon Feb 11 06:35:49 GMT 2013 notice tmm3 tmm3[10267] 01010029 Clock advanced by 3458 ticks

 

Mon Feb 11 06:35:49 GMT 2013 notice tmm1 tmm1[10265] 01010029 Clock advanced by 3456 ticks

 

Mon Feb 11 06:35:49 GMT 2013 notice tmm tmm[10264] 01010029 Clock advanced by 3459 ticks

 

Mon Feb 11 06:35:51 GMT 2013 notice f5unit sod[6369] 010c0025 Toggle from active to standby to active.

 

Mon Feb 11 06:35:51 GMT 2013 notice f5unit sod[6369] 010c0025 Toggle from active to standby to active.
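For anyone wanting to check how often this pattern shows up in their own logs, here is a minimal sketch in Python. It assumes log lines in exactly the format pasted above; the function names (parse_line, summarize) and the 5-second correlation window are made up for illustration and are not anything F5 provides.

```python
import re
from datetime import datetime

# Matches lines in the format pasted in the post, e.g.
# "Mon Feb 11 06:35:49 GMT 2013 notice tmm2 tmm2[10266] 01010029 Clock advanced by 3457 ticks"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d) GMT (?P<year>\d{4}) notice "
    r"(?P<host>\S+) (?P<proc>\w+)\[\d+\] (?P<code>[0-9a-f]{8}) (?P<msg>.*)$"
)
TICKS_RE = re.compile(r"Clock advanced by (\d+) ticks")

# Sample lines copied from the post.
SAMPLE = """\
Mon Feb 11 06:35:49 GMT 2013 notice tmm2 tmm2[10266] 01010029 Clock advanced by 3457 ticks
Mon Feb 11 06:35:49 GMT 2013 notice tmm3 tmm3[10267] 01010029 Clock advanced by 3458 ticks
Mon Feb 11 06:35:49 GMT 2013 notice tmm1 tmm1[10265] 01010029 Clock advanced by 3456 ticks
Mon Feb 11 06:35:49 GMT 2013 notice tmm tmm[10264] 01010029 Clock advanced by 3459 ticks
Mon Feb 11 06:35:51 GMT 2013 notice f5unit sod[6369] 010c0025 Toggle from active to standby to active.
Mon Feb 11 06:35:51 GMT 2013 notice f5unit sod[6369] 010c0025 Toggle from active to standby to active.
"""


def parse_line(line):
    """Return a dict for a recognised log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(f"{m['ts']} {m['year']}", "%a %b %d %H:%M:%S %Y")
    ticks = TICKS_RE.search(m["msg"])
    return {
        "when": when,
        "proc": m["proc"],
        "code": m["code"],
        "ticks": int(ticks.group(1)) if ticks else None,
    }


def summarize(lines, window_s=5):
    """List (toggle time, largest tick jump, number of jumps) for each sod
    toggle that follows a clock jump within window_s seconds."""
    events = [e for e in map(parse_line, lines) if e]
    jumps = [e for e in events if e["code"] == "01010029" and e["ticks"] is not None]
    toggle_times = sorted({e["when"] for e in events if e["code"] == "010c0025"})
    report = []
    for when in toggle_times:
        near = [j for j in jumps if 0 <= (when - j["when"]).total_seconds() <= window_s]
        if near:
            report.append((when, max(j["ticks"] for j in near), len(near)))
    return report


if __name__ == "__main__":
    for when, max_ticks, count in summarize(SAMPLE.splitlines()):
        print(f"{when}: toggle after {count} clock jumps, largest {max_ticks} ticks")
```

Duplicate toggle lines with the same timestamp are collapsed into one entry, so the two identical sod lines above count as a single failover event.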

 

We run a pair of LTM 6900s across data centres (10GB dark fibre) in HA (hot/standby mode). The above errors have caused us no end of problems, because a lot of our applications run trading/price platforms. We don't yet have VIP mirroring: given the high small packet traffic volume, we are not sure the F5 could cope, and there is the current risk of breaking connections again (NB: the CPU runs at about 20%, memory is OK). However, as you can see above, the unit goes into panic mode, Active then Standby then Active again, breaking these critical TCP sessions. The case has now been with F5 since December 2012. Still open!

 

We applied the following; the errors have been happening less often since the command was applied about 3 months ago.

 

tmsh modify sys db failover.nettimeoutsec value 6

 

The latest suggestion from their labs is to disable the Linux NMI watchdog:

 

echo 0 > /proc/sys/kernel/nmi_watchdog

 

Based on their lab tests, they say the following:

 

Here's where we stand concerning the NMI Watchdog:

 

The escalation engineer believes it will be worthwhile to turn off the NMI (non-maskable interrupt) watchdog on your device as the next step.

 

On our devices, this should be relatively safe because the NMI watchdog is really only used to detect serious failures in system components which might cause an ordinary computer to freeze. On F5 devices, we have our own hardware watchdog systems in place which cover this use case. The NMI watchdog is in fact disabled in LTM VE and vCMP. It is not required for normal operation of our equipment.

 

We'd like to emphasise that we're not seeing any watchdog triggering, but we are seeing peculiar behaviour with some of the interrupts during the 3.5 second pauses. We've run one of our lab units for 2 weeks with NMI watchdog disabled, and we did not see any incidence of the 3.5s pause. After re-enabling NMI watchdog and rebooting the device, we had two incidences within 3 days.
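As a rough sanity check on the numbers, the sketch below converts the logged tick counts to seconds and compares them with a failover timeout. It assumes 1 tick = 1 ms, which is only inferred from the "3.5 second pauses" wording next to tick counts of about 3457; it is plain arithmetic and does not model how sod actually decides to fail over. The 6 is the failover.nettimeoutsec value from the post; the 3 is a hypothetical shorter value used only for contrast.

```python
# Assumption: 1 tick = 1 ms, inferred only from "3.5 second pauses" vs ~3457 ticks.
TICKS_PER_SECOND = 1000


def pause_seconds(ticks):
    """Convert a 'Clock advanced by N ticks' value to seconds."""
    return ticks / TICKS_PER_SECOND


def outlasts_timeout(ticks, timeout_s):
    """True if the pause is at least as long as the given timeout."""
    return pause_seconds(ticks) >= timeout_s


observed_ticks = [3457, 3458, 3456, 3459]  # from the log lines in the post
longest = max(observed_ticks)

# 6 is the failover.nettimeoutsec value from the post; 3 is a hypothetical
# shorter value used only for contrast.
for timeout_s in (3, 6):
    verdict = "outlasts" if outlasts_timeout(longest, timeout_s) else "fits within"
    print(f"{pause_seconds(longest):.3f}s pause {verdict} a {timeout_s}s timeout")
```

If the assumption holds, a pause of this length would fit inside a 6-second timeout but not a 3-second one, which would be consistent with the toggles happening less often after the change.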

 

We upgraded code from 10.2.1 (HF3) to 11.1.0 (HF5) as recommended a few months back, but these errors still persist. We have also replaced both hardware units and moved power.

 

Any comments out there? Help!

 

 

2 Replies

  • JG
    It seems to me that "Clock advanced by" is notoriously hard to troubleshoot.

     

     

    Have you run your config through iHealth? iHealth checks and provides pointers to solution articles if it finds a problem.

     

     

    You might want to upgrade to take advantage of all the bug fixes.

     

     

    We had lots of problems with 11.2.0 and upgraded to v11.3 right after it came out just before last Christmas.

     

     

    V11.3 adopted a threaded model, which is a different kettle of fish. There are not a lot of hardware threads on the 6900, so v11.3 might not help you much, capacity-wise. I'd go for v11.2.1 HF5.

     

     

    -Jie
  • JG

    Also, 20% CPU usage is normal; that is what we have here too.

     

    And you probably meant "connection mirroring".

     

    Why would "high small packet traffic" matter that much? It's the number of connections that matters. I used to have connection mirroring on and had no problems. But then I heard (read somewhere) that failback would be problematic for these connections. If that was true and is still true, it would not help you.

     

    Do you have a dedicated link for HA?

     

     

    -Jie