monitor timeouts vs actual behaviour
Hi guys,
the default http health monitor (v10.2.4) uses a 5 second interval and a 16 second timeout. To me, this says that a probe fires every 5 seconds, and should no probe succeed for 16 seconds the pool member is marked down.
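For context, those defaults look like the usual "3n+1" pairing (timeout = 3 × interval + 1), which is presumably where the 16 comes from. A trivial sketch of the arithmetic, just to make the convention explicit:

```python
# "3n+1" pairing: allow three missed probes plus one second of grace
# before the member is marked down.
interval = 5                 # seconds between probes
timeout = 3 * interval + 1   # the default 16 s questioned below
print(timeout)
```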
Yet this really REALLY doesn't match what actually happens on the network:
```
pool blah_pool {
   monitor all http
   members 1.2.3.4:1234 {}
}
```
a tcpdump shows:
```
11:01:36.761159 IP 10.101.131.4.35514 > 1.2.3.4.1234: S
11:03:13.742647 IP 10.101.131.4.46160 > 1.2.3.4.1234: S
11:03:16.742445 IP 10.101.131.4.46160 > 1.2.3.4.1234: S
11:03:22.742838 IP 10.101.131.4.46160 > 1.2.3.4.1234: S
11:03:34.741285 IP 10.101.131.4.46160 > 1.2.3.4.1234: S
11:03:58.740435 IP 10.101.131.4.46160 > 1.2.3.4.1234: S
11:04:46.736725 IP 10.101.131.4.46160 > 1.2.3.4.1234: S
11:06:23.738147 IP 10.101.131.4.48428 > 1.2.3.4.1234: S
11:06:26.737763 IP 10.101.131.4.48428 > 1.2.3.4.1234: S
11:06:32.737102 IP 10.101.131.4.48428 > 1.2.3.4.1234: S
11:06:44.735753 IP 10.101.131.4.48428 > 1.2.3.4.1234: S
```
So we have only a single TCP attempt in flight at any one time, not one every 5 seconds. And whilst the monitor does still mark the node down after 16 seconds, the TCP connection keeps retrying until the TCP/IP stack times it out. So once the member is down after 16 seconds, there's still a huge wait before the next probe: no new connection is attempted until the single current one finishes. So if, for some (presumably pretty stupid) reason, that specific connection is simply never replied to (a weird FW rule, an IPS action, a silent drop), LTM can't check status on a new connection for three minutes and ten seconds.
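The gaps in that capture fit a classic exponential SYN backoff. A rough sketch of the timeline, assuming a 3 s initial retransmission timeout that doubles each retry and 5 SYN retransmits (which is what the capture shows; the function name is just for illustration):

```python
def syn_timeline(initial_rto=3.0, syn_retries=5):
    """Return (send_times, abort_time): seconds after the first SYN at
    which each SYN goes out, and when the stack finally gives up."""
    sends = [0.0]
    rto = initial_rto
    for _ in range(syn_retries):
        sends.append(sends[-1] + rto)  # retransmit after the current RTO
        rto *= 2                       # exponential backoff
    abort = sends[-1] + rto            # one final RTO after the last SYN
    return sends, abort

sends, abort = syn_timeline()
print(sends)   # [0.0, 3.0, 9.0, 21.0, 45.0, 93.0]
print(abort)   # 189.0
```

The send times line up with the 3/6/12/24/48 second gaps between SYNs above, and the 189 s abort matches the ~190 s gap before the next fresh connection at 11:06:23, i.e. the "three minutes and ten seconds" the monitor is stuck for.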
I've also seen equivalent behaviour with an HTTP GET that's simply never answered: again the monitor has to wait until the TCP connection is reset, or the webserver finally responds, well after the "timeout" period has expired, before a new probe will fire. Testing just now, I also see the HTTP monitor crudely stuffing additional GETs down the same connection that's still waiting for a response, what's that all about??
I can't make any sense of this, and, TBH, it goes right against everything I've designed for, sticking to the 3n+1 rule etc. What merit does 3n+1 have in this sort of situation? I see no logic in it at all if additional probes can't run in parallel. Who would want to be forwarding to a web server that routinely takes, say, 15 seconds to reply (3n+1 minus 1s) when all the other members in the pool take 0.01s to serve the same gif file? Shouldn't a timeout actually always be something like 4 seconds (to at least give two SYNs time to hit the back end)? Even in that case, though, I'm still stuffed until the next connection is allowed to be attempted.
Any thoughts on this would be appreciated!