Forum Discussion

Chris_Phillips
Nimbostratus
Jan 26, 2012

Does LB_Failed have the same criteria as using an HTTP Fallback Host?

Nice quick one...

Is there any difference at all between when the LB_FAILED event fires and an HTTP Fallback Host configured on an HTTP profile would fire?

We occasionally have about 0.1% of connections, over a narrow 10-second period, receiving the fallback host redirect from a number of very busy virtual services. Something like 200 failed connections out of 50,000,000 per day!

Trying to track down this needle in a haystack, we're trying to completely understand when the fallback 302 would be sent, and it appears that it fires for exactly the same reasons the LB_FAILED event would fire. That would mean a scenario where, say, an HTTP request IS made to a pool member and then has its connection reset etc. would NOT cause the fallback host to kick in? Once a TCP connection is established to the member, neither the event nor the fallback redirect can ever occur?

In terms of once we're out on the wire, we're looking only at unanswered SYNs or connections reset instantly before the three-way handshake completes. Yup?

4 Replies

  • Hi Chris,

    Trying to track down this needle in a haystack we're trying to completely understand when the fallback 302 would be sent, and it appears that it's exactly 100% of the same reasons the LB_FAILED event would fail.

    I think that's correct.

    an HTTP request IS made to a pool member and then has its connection reset etc. would NOT cause the fallback host to kick in? Once a TCP connection is established to the member, neither the event nor the fallback redirect can ever occur?

    That's also correct. You can handle this failure scenario using the after command. You'd need to set a timeout in milliseconds to wait for a server response. If it doesn't come then you could send an HTTP response back to the client and/or log something. The second example on the after wiki page should be a good start:

    http://devcentral.f5.com/wiki/iRules.after.ashx
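
    Something along the lines of this rough sketch, based on that wiki pattern (untested here; the 5000 ms timeout, the log text and the 503 response are placeholder choices you'd tune for your application):

        when HTTP_REQUEST {
            # Start a watchdog once the request arrives. If the server hasn't
            # answered within 5000 ms (placeholder value), respond to the
            # client ourselves and log the event.
            set watchdog [after 5000 {
                log local0. "No server response within 5s, client=[IP::client_addr]"
                HTTP::respond 503 content "Service temporarily unavailable"
            }]
        }

        when HTTP_RESPONSE {
            # The pool member answered in time, so cancel the pending watchdog.
            if { [info exists watchdog] } {
                after cancel $watchdog
                unset watchdog
            }
        }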

    I put in an RFE to support this type of response timeout in an HTTP profile. The ID is BZ373937. You could open a case with F5 Support to raise the visibility of the request.

    Aaron
  • Hmmmmmmmm, so why, with a default tcp-lan-optimized profile on an HTTP vs, are we getting LB_FAILED after as long as 72 seconds?? This suggests, to me at least, that the connection is (half?) opened, but maybe no data ever gets acked back from it? I'm sensing more of a subtlety about when a connection is officially deemed to have been balanced. Being 72 seconds, that naturally feels like some sort of timeout period expiring...

    when LB_FAILED {
        log local0. "LB_FAILED EVENT! vs=[virtual name] local_addr=[IP::local_addr] client=[IP::client_addr]:[TCP::client_port] LB_pool=[LB::server pool] LB_addr=[LB::server addr] age=[IP::stats age]ms"
    }

    Jan 27 00:05:46 [10.X] tmm1 tmm1[5195]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:57786 LB_pool=t2_XXX_pool LB_addr=10.X age=17861ms
    Jan 27 00:05:46 [10.X] local/tmm1 info tmm1[5195]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:57786 LB_pool=t2_XXX_pool LB_addr=10.X age=17861ms
    Jan 27 00:05:46 [10.X] tmm1 tmm1[5195]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:57776 LB_pool=t2_XXX_pool LB_addr=10.X age=25016ms
    Jan 27 00:05:46 [10.X] local/tmm1 info tmm1[5195]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:57776 LB_pool=t2_XXX_pool LB_addr=10.X age=25016ms
    Jan 27 00:05:47 [10.X] tmm tmm[5129]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:45299 LB_pool=t2_XXX_pool LB_addr=10.X age=33885ms
    Jan 27 00:05:48 [10.X] tmm tmm[5572]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:58975 LB_pool=t2_XXX_pool LB_addr=10.X age=72012ms
    Jan 27 00:05:48 [10.X] tmm tmm[5572]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:45281 LB_pool=t2_XXX_pool LB_addr=10.X age=38729ms
    Jan 27 00:05:50 [10.X] tmm1 tmm1[5573]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:59032 LB_pool=t2_XXX_pool LB_addr=10.X age=10473ms
    Jan 27 00:05:50 [10.X] tmm1 tmm1[5243]: Rule _temp_LB_FAILED_logging_rule : LB_FAILED EVENT! vs=t2_XXX_vs local_addr=10.X client=10.X:59028 LB_pool=t2_XXX_pool LB_addr=10.X age=11962ms

    So that's overnight, with logs from multiple LTMs going to multiple members (the XXXs obscure that fact, though) via a forwarding vs on a different pair of LTMs. Can you explain this huge delay before the LB fails??
  • Hi Chris,

    See the LB_FAILED wiki page for details. What do you have the max syn retransmits set to on your TCP profile? Let me know if the LB_FAILED info doesn't match up with what you're seeing in your TCP profile(s).

    http://devcentral.f5.com/wiki/iRules.lb_failed.ashx

    LB_FAILED is triggered when LTM is ready to send the request to a pool member and one hasn’t been chosen (the system failed to select a pool or a pool member), is unreachable (when no route to the target exists), or is non-responsive (fails to respond to a connection request).

    If the target fails to respond to a connection request, the "Maximum Syn Retransmissions" option in the TCP profile will affect the amount of time before LB_FAILED is triggered.

    When a client doesn't receive a response to the SYN, there is a defined algorithm for the specified number of retries. The first retransmission, if there is no response, typically occurs after 3 seconds, and the typical back-off algorithm doubles the wait time after each failed attempt.

     
    ...

    LTM's default tcp profile sets "Maximum Syn Retransmissions" to 4, so with the default setting, LB_FAILED would be triggered if the server didn't respond within 45 seconds:

    1st SYN: 0
    2nd SYN: +3 seconds
    3rd SYN: +6 seconds
    4th SYN: +12 seconds
    5th SYN: +24 seconds
    ======================
    LB_FAILED: 45 seconds
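
    For what it's worth, that doubling back-off can be sanity-checked with a few lines of plain Tcl (a hypothetical helper, not an iRule; it just assumes the 3-second initial retransmit and doubling behaviour described above):

        # Estimate how long until LB_FAILED fires for a given
        # "Maximum Syn Retransmissions" value, assuming a 3 second initial
        # retransmit interval that doubles after each failed attempt.
        proc lb_failed_estimate {max_syn_retrans} {
            set wait 3
            set total 0
            for {set i 0} {$i < $max_syn_retrans} {incr i} {
                incr total $wait
                set wait [expr {$wait * 2}]
            }
            return $total
        }

        # lb_failed_estimate 4  => 45 seconds (the default, matching the table above)
        # lb_failed_estimate 3  => 21 seconds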

    Aaron
  • I have retransmits set to 3, but I think that table is not at all correct. I did a test on a dev system on 10.2.0 and just saw retries (tcpdump) every 3 seconds until it hit the max, so with a fake pool member which would never connect, LB_FAILED always fired at 12001ms (well... ish). No sign of an incremental back-off whatsoever.

    We've found that these blips are apparently all on members on a single physical host (but with multiple IPs), and so far these seem to be only on Solaris 10 v490 boxes, which are a significant minority of the estate, and we can also see all HTTP traffic go AWOL for this brief time period... very strange. Feels ARP-y to me, but who knows... But it doesn't look like an LTM / TMOS issue at heart.

    Can you think of a scenario where these LB_FAILED events would be firing at such vague times, when the members appear to be freezing in some way? I'm thinking it would need to be a RST, as if it were not a RST but just nothing coming back from the server, then the LB_FAILEDs would still be firing based on the retry intervals.