Forum Discussion

pwallace_110041
Nimbostratus
Oct 07, 2011

lb_failed with lb mode rr does not seem to produce expected results

I am trying to use the following rule to make sure that, when a node is down, traffic immediately goes to another node. But when I reboot one node of a two-node pool, the second node starts throwing 404s until the node monitors (HTTP and ICMP) take the failed node out of the pool.

What have I missed in my logic or understanding here?

when CLIENT_ACCEPTED {
   set retries 0
   set max_retries 3
}

when LB_FAILED {
   log local0. "lb failed: $retries"
   if { ($retries < $max_retries) and ($retries < [active_members [LB::server pool]]) } {
      LB::mode rr
      LB::reselect
      incr retries
   } else {
      HTTP::respond 504 content { reached max retries or all members of the pool failed } noserver Connection close
   }
}

Regards,
Pippin

4 Replies

  • Hi Pippin,

    I really don't think that you need to go as complex as you are. If the node failure triggered the LB_FAILED event, then the node is dead:

    LB_FAILED is triggered when LTM is ready to send the request to a pool member and one hasn't been chosen (the system failed to select a pool or a pool member), is unreachable (when no route to the target exists), or is non-responsive (fails to respond to a connection request).

    You can shorten your process by just verifying that there is another server in the pool available to take the traffic. If there is, then you should not need to worry about retries. You can drop the current persistence and do a server reselect:

    when HTTP_REQUEST {
       # The HTTP_REQUEST event here is just for testing...
       log local0. "Initial Server [LB::server name]"
    }
    when LB_FAILED {
       # Check whether there are any available members left after the failure.
       # If less than 1 (zero), redirect the client to a sorry page.
       if { [active_members [LB::server pool]] < 1 } {
          HTTP::redirect "http://www.yahoo.com"
       } else {
          # Drop current persistence and reselect.
          persist none
          LB::reselect
       }
    }

    Note: this code does not take into account a server that passes its health check and is declared "Active" while actually being in an unhealthy state. You can add that functionality within HTTP_RESPONSE by monitoring the HTTP::status codes, as sketched below.
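
    A minimal sketch of that idea (the 5xx threshold and the $request variable name are illustrative choices of mine, not from this thread):

    when HTTP_REQUEST {
       # Save the request so it can be replayed if the response looks unhealthy.
       set request [HTTP::request]
    }
    when HTTP_RESPONSE {
       # Treat a 5xx from a member the monitors still consider "Active" as a
       # failure: mark the member down and replay the request so it is load
       # balanced to another member.
       if { [HTTP::status] >= 500 } {
          LB::down
          HTTP::retry $request
       }
    }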

    Hope this helps.
  • "But when I reboot one node of a two-node pool, the second node starts throwing 404s until the node monitors (HTTP and ICMP) take the failed node out of the pool." Since the response is a 404, I don't think LB_FAILED will be triggered at that point.

    The 404 or Not Found error message is an HTTP standard response code indicating that the client was able to communicate with the server, but the server could not find what was requested. A 404 error should not be confused with "server not found" or similar errors, in which a connection to the destination server could not be made at all. A 404 error indicates that the requested resource may be available again in the future.

    http://en.wikipedia.org/wiki/HTTP_404

    LB_FAILED is triggered when LTM is ready to send the request to a pool member and one hasn't been chosen (the system failed to select a pool or a pool member), is unreachable (when no route to the target exists), or is non-responsive (fails to respond to a connection request).

    http://devcentral.f5.com/wiki/iRules.LB_FAILED.ashx

    I agree with Michael - you may catch the 4xx code instead.
  • I did a quick test; hope it is helpful.

    [root@iris:Active] config  b virtual bar list
    virtual bar {
       snat automap
       pool foo
       destination 172.28.17.33:http
       ip protocol tcp
       persist mysource
       profiles {
          http {}
          tcp {}
       }
    }
    [root@iris:Active] config  b profile mysource list
    profile persist mysource {
       defaults from source_addr
       mode source addr
       timeout indefinite
    }
    [root@iris:Active] config  b pool foo list
    pool foo {
       monitor all http
       members {
          10.10.70.200:http {}
          209.85.175.104:http {}
       }
    }
    
    [root@iris:Active] config  b persist show all
    PERSISTENT CONNECTIONS
    |     Mode source addr   Value 172.28.17.30
    |        virtual 172.28.17.33:http   node 10.10.70.200:http   age 4sec
    
    
    [root@iris:Active] config  curl -I http://10.10.70.200/
    HTTP/1.0 404 Not Found
    Server: BigIP
    Connection: Keep-Alive
    Content-Length: 0
    
    [root@iris:Active] config  b rule myrule list
    rule myrule {
       when HTTP_REQUEST {
            set retries 0
            set request_headers [HTTP::request]
    }
    
    when HTTP_RESPONSE {
            if {[HTTP::status] eq 404} {
                    incr retries
                    LB::down
                    HTTP::retry $request_headers
            }
    }
    }
    
    [root@iris:Active] config  b virtual bar rule myrule
    
    [root@iris:Active] config  curl -I http://172.28.17.33/
    HTTP/1.1 200 OK
    Date: Sat, 08 Oct 2011 08:47:51 GMT
    Expires: -1
    Cache-Control: private, max-age=0
    Content-Type: text/html; charset=ISO-8859-1
    Server: gws
    X-XSS-Protection: 1; mode=block
    Transfer-Encoding: chunked
    
    [root@iris:Active] config  b persist show all
    PERSISTENT CONNECTIONS
    |     Mode source addr   Value 172.28.17.30
    |        virtual 172.28.17.33:http   node 209.85.175.104:http   age 4sec
    [root@iris:Active] config 
    
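    One caveat with the rule in this test: it increments retries but never checks the value, so a pool in which every member answers 404 would keep retrying. A guarded variant might look like this (the cap of 3 is an arbitrary choice of mine):

    when HTTP_REQUEST {
       set retries 0
       set request_headers [HTTP::request]
    }
    when HTTP_RESPONSE {
       # Retry on a 404, but give up after a fixed number of attempts so an
       # all-404 pool cannot retry forever.
       if { [HTTP::status] == 404 and $retries < 3 } {
          incr retries
          LB::down
          HTTP::retry $request_headers
       }
    }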
    
  • The problem is that even though the rule advances the pool to the next member, other instances of the same rule may be doing the same thing, and the round-robin position is advanced globally, not per rule instance.

    i.e.

    pool with 2 members

    member 2 goes down

    LB_FAILED is triggered for connection 1

    iRule instance for connection 1 advances to member 1

    LB_FAILED is triggered for connection 2

    iRule instance for connection 2 advances the pool again to member 2

    iRule instance for connection 1 routes to member 2 (down)

    This will only happen with high load when a high percentage of the pool is down.

    You can avoid this by keeping track of where things are failing in a table and then making sure you don't select that same member again.

    i.e.

    when CLIENT_ACCEPTED {
       # set retry count to 0 to start off
       set retries 0
    }
    when LB_SELECTED {
       # set initial value for server_addr
       set server_addr [LB::server addr]
    }
    when LB_FAILED {
       # retry a limited number of times
       if { $retries < 3 } {
          # only count as a try if we tried a member not previously known to be down
          if { [table lookup -notouch -subtable dont_try $server_addr] != 1 } {
             incr retries
          }
          # remember that this node failed
          table set -subtable dont_try $server_addr 1 10 20
          set loop_tries 0
          # workaround for a bug where LB::server is not updated after LB::reselect
          set new_pick [LB::select]
          set server_addr [getfield $new_pick " " 4]
          # keep looping until we get a server not in the dont_try table, for a maximum of 10 loops
          # (with a small pool you will hit the loop limit quickly; with 2 nodes and 200 concurrent connections I saw 8 at most)
          while { ([table lookup -notouch -subtable dont_try $server_addr] == 1) and ($loop_tries < 10) } {
             incr loop_tries
             set new_pick [LB::select]
             set server_addr [getfield $new_pick " " 4]
             # debug logging to see what is happening in the loop
             log local0. "set addr to $server_addr: loop try: $loop_tries"
          }
          # select the new server based on the values determined above
          eval $new_pick
          LB::reselect
       } else {
          # if all else fails, send a 504 error back to the client and log it
          log local0. "504 virtual: [virtual name], retries: $retries, last_server: $server_addr"
          HTTP::respond 504 content { reached max retries or all members of the pool failed } noserver Connection close
       }
    }
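
    One note on the table call above: in "table set -subtable dont_try $server_addr 1 10 20", the trailing 10 and 20 are the entry's timeout and lifetime in seconds, so a failed member is only blacklisted briefly and becomes eligible again once the entry expires. Tune those values to roughly match how quickly your monitors take a failed member out of the pool.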