Forum Discussion

pwallace_110041
Nimbostratus
Oct 07, 2011

lb_failed with lb mode rr does not seem to produce expected results

I am trying to use the following rule to make sure that, when a node is down, traffic immediately goes to another node. But when I reboot one node of a two-node pool, the second node starts throwing 404s until the node monitors (HTTP and ICMP) take the failed node out of the pool.

What have I missed in my logic or understanding here?

when CLIENT_ACCEPTED {
   set retries 0
   set max_retries 3
}

when LB_FAILED {
   log local0. "lb failed: $retries"
   if { ($retries < $max_retries) and ($retries < [active_members [LB::server pool]]) } {
      LB::mode rr
      LB::reselect
      incr retries
   } else {
      HTTP::respond 504 content { reached max retries or all members of the pool failed } noserver Connection close
   }
}

Regards,
Pippin

4 Replies

  • Hi Pippin,

    I really don't think that you need to go as complex as you are. If the node failure triggered the LB_FAILED event, then the node is dead:

    LB_FAILED is triggered when LTM is ready to send the request to a pool member and one hasn't been chosen (the system failed to select a pool or a pool member), is unreachable (when no route to the target exists), or is non-responsive (fails to respond to a connection request).

    You can shorten your process by just verifying that there is another server in the pool available to take the traffic. If there is, then you should not need to worry about retries. You can drop the current persistence and do a server reselect:

    when HTTP_REQUEST {
       # The HTTP_REQUEST event here is just for testing...
       log local0. "Initial Server [LB::server name]"
    }
    when LB_FAILED {
       # Check whether there are any available members left after the failure.
       # If less than 1 (zero), redirect the client to a sorry page.
       if { [active_members [LB::server pool]] < 1 } {
          HTTP::redirect "http://www.yahoo.com"
       } else {
          # Drop current persistence and reselect.
          persist none
          LB::reselect
       }
    }

    Note: this code does not take into account a server that passes its health check and is declared "Active" while actually being in an unhealthy state. You can add that functionality within HTTP_RESPONSE by monitoring the HTTP::status codes, as sketched below.
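
    A minimal sketch of that idea (the 5xx threshold and the $request variable name are illustrative choices of mine, not from this thread):

    when HTTP_REQUEST {
       # Save the request so it can be replayed if the response looks unhealthy.
       set request [HTTP::request]
    }
    when HTTP_RESPONSE {
       # Treat a 5xx from a member the monitors still consider "Active" as a
       # failure: mark the member down and replay the request so it is load
       # balanced to another member.
       if { [HTTP::status] >= 500 } {
          LB::down
          HTTP::retry $request
       }
    }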

    Hope this helps.
  • "But when I reboot one node of a two-node pool, the second node starts throwing 404s until the node monitors (HTTP and ICMP) take the failed node out of the pool." Since the response is a 404, I don't think LB_FAILED will be triggered at that point.

    The 404 or Not Found error message is an HTTP standard response code indicating that the client was able to communicate with the server, but the server could not find what was requested. A 404 error should not be confused with "server not found" or similar errors, in which a connection to the destination server could not be made at all. A 404 error indicates that the requested resource may be available again in the future.

    http://en.wikipedia.org/wiki/HTTP_404

    LB_FAILED is triggered when LTM is ready to send the request to a pool member and one hasn't been chosen (the system failed to select a pool or a pool member), is unreachable (when no route to the target exists), or is non-responsive (fails to respond to a connection request).

    http://devcentral.f5.com/wiki/iRules.LB_FAILED.ashx

    I agree with Michael - you may catch the 4xx code instead.
  • I did a quick test; hope it is helpful.

    [root@iris:Active] config  b virtual bar list
    virtual bar {
       snat automap
       pool foo
       destination 172.28.17.33:http
       ip protocol tcp
       persist mysource
       profiles {
          http {}
          tcp {}
       }
    }
    [root@iris:Active] config  b profile mysource list
    profile persist mysource {
       defaults from source_addr
       mode source addr
       timeout indefinite
    }
    [root@iris:Active] config  b pool foo list
    pool foo {
       monitor all http
       members {
          10.10.70.200:http {}
          209.85.175.104:http {}
       }
    }
    
    [root@iris:Active] config  b persist show all
    PERSISTENT CONNECTIONS
    |     Mode source addr   Value 172.28.17.30
    |        virtual 172.28.17.33:http   node 10.10.70.200:http   age 4sec
    
    
    [root@iris:Active] config  curl -I http://10.10.70.200/
    HTTP/1.0 404 Not Found
    Server: BigIP
    Connection: Keep-Alive
    Content-Length: 0
    
    [root@iris:Active] config  b rule myrule list
    rule myrule {
       when HTTP_REQUEST {
            set retries 0
            set request_headers [HTTP::request]
    }
    
    when HTTP_RESPONSE {
            if {[HTTP::status] eq 404} {
                    incr retries
                    LB::down
                    HTTP::retry $request_headers
            }
    }
    }
    
    [root@iris:Active] config  b virtual bar rule myrule
    
    [root@iris:Active] config  curl -I http://172.28.17.33/
    HTTP/1.1 200 OK
    Date: Sat, 08 Oct 2011 08:47:51 GMT
    Expires: -1
    Cache-Control: private, max-age=0
    Content-Type: text/html; charset=ISO-8859-1
    Server: gws
    X-XSS-Protection: 1; mode=block
    Transfer-Encoding: chunked
    
    [root@iris:Active] config  b persist show all
    PERSISTENT CONNECTIONS
    |     Mode source addr   Value 172.28.17.30
    |        virtual 172.28.17.33:http   node 209.85.175.104:http   age 4sec
    [root@iris:Active] config 
    
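    One caveat with the rule in this test: it increments retries but never checks the value, so a pool in which every member answers 404 would keep retrying. A guarded variant might look like this (the cap of 3 is an arbitrary choice of mine):

    when HTTP_REQUEST {
       set retries 0
       set request_headers [HTTP::request]
    }
    when HTTP_RESPONSE {
       # Retry on a 404, but give up after a fixed number of attempts so an
       # all-404 pool cannot retry forever.
       if { [HTTP::status] == 404 and $retries < 3 } {
          incr retries
          LB::down
          HTTP::retry $request_headers
       }
    }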
    
  • The problem is that even though the rule advances the pool to the next member, other instances of the same rule may be doing the same thing, and the round-robin position is advanced globally, not per rule instance.

    i.e.

    pool with 2 members

    member 2 goes down

    LB_FAILED is triggered for connection 1

    iRule instance for connection 1 advances to member 1

    LB_FAILED is triggered for connection 2

    iRule instance for connection 2 advances the pool again to member 2

    iRule instance for connection 1 routes to member 2 (down)

    This will only happen with high load when a high percentage of the pool is down.

    You can avoid this by keeping track of where things are failing in a table and then making sure you don't select that same member again.

    i.e.

    when CLIENT_ACCEPTED {
       # set retry count to 0 to start off
       set retries 0
    }
    when LB_SELECTED {
       # set initial value for server_addr
       set server_addr [LB::server addr]
    }
    when LB_FAILED {
       # retry a limited number of times
       if { $retries < 3 } {
          # only count as a try if we tried a member not previously known to be down
          if { [table lookup -notouch -subtable dont_try $server_addr] != 1 } {
             incr retries
          }
          # remember that this node failed
          table set -subtable dont_try $server_addr 1 10 20
          set loop_tries 0
          # workaround for a bug where LB::server is not updated after LB::reselect
          set new_pick [LB::select]
          set server_addr [getfield $new_pick " " 4]
          # keep looping until we get a server not in the dont_try table, for a maximum of 10 loops
          # (with a small pool you will hit the loop limit quickly; with 2 nodes and 200 concurrent connections I saw 8 at most)
          while { ([table lookup -notouch -subtable dont_try $server_addr] == 1) and ($loop_tries < 10) } {
             incr loop_tries
             set new_pick [LB::select]
             set server_addr [getfield $new_pick " " 4]
             # debug logging to see what is happening in the loop
             log local0. "set addr to $server_addr: loop try: $loop_tries"
          }
          # select the new server based on the values determined above
          eval $new_pick
          LB::reselect
       } else {
          # if all else fails, send a 504 error back to the client and log it
          log local0. "504 virtual: [virtual name], retries: $retries, last_server: $server_addr"
          HTTP::respond 504 content { reached max retries or all members of the pool failed } noserver Connection close
       }
    }
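
    One note on the table call above: in "table set -subtable dont_try $server_addr 1 10 20", the trailing 10 and 20 are the entry's timeout and lifetime in seconds, so a failed member is only blacklisted briefly and becomes eligible again once the entry expires. Tune those values to roughly match how quickly your monitors take a failed member out of the pool.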