Eliminating the overhead associated with active health checks without sacrificing availability.

One of the core benefits of cloud computing and application delivery (and a primary purpose of load balancing) is availability. In the simplest of terms, availability is achieved by putting two or more servers (virtual or iron) behind a load balancing device. If one of the servers fails, the load balancer directs users to the remaining server, ensuring the application remains available.

The question then is this: how does the load balancer know when an application is not available? The answer is: health monitoring.

Every load balancer (and clustering solution) can do this at some level. It may be just an ICMP ping or a TCP three-way handshake or determining whether the HTTP and application response received are correct. It may be a combination of a variety of health monitoring options. Regardless of what the health check is doing, it’s getting done and an individual server may be taken out of rotation in the event that its health check response indicates a problem.

Now, interestingly enough, there is more than one way to perform a health check. As you might have guessed, the first way is to communicate out-of-band with the server and/or application. Every <user-configured> time interval, the load balancer performs a check and then acts or doesn’t act upon the response. The advantage of this is that the load balancer can respond very quickly to problems, provided the interval is short enough. The disadvantage is that it takes up resources on the load balancer, the network, and the server. In a service-provider or cloud computing environment, the resources consumed by out-of-band health checks can be devastating to network performance and may well reduce the capacity of the servers.
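To make the cost of active checks concrete, here is a minimal Python sketch (not BIG-IP configuration) of one of the simplest active checks, a TCP connect, along with the arithmetic for how many probes a fixed interval generates. The names `tcp_check` and `probes_per_second` are made up for illustration, and the rate assumes one monitor probing every pool member once per interval.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP three-way handshake with host:port completes."""
    try:
        # create_connection performs the handshake; the socket is closed
        # again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: all count as a failed check.
        return False


def probes_per_second(pool_members: int, interval_seconds: float) -> float:
    """Steady-state probe rate for one monitor checking every pool member."""
    return pool_members / interval_seconds
```

Under these assumptions, a hypothetical pool of 1,000 members checked every 5 seconds generates 200 probes per second from a single monitor, and every one of those is a connection the server has to accept that serves no user.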

What else is there?

INBAND and PASSIVE MONITORING

While inband monitoring is relatively new, passive monitoring was pioneered by F5 many years ago. In fact, leveraging passive monitoring and inband monitoring together provides the means to more quickly address problems as they occur.

Inband monitoring was introduced in BIG-IP v10. Inband monitors can be used with either a Standard or a Performance (Layer 4) type virtual server, and as a bonus can also be used with active monitors. What inband monitoring does is basically eavesdrop on the conversation between a client and the server to determine availability. The monitor, upon an attempt by a client to connect to a pool member, behaves as follows:

  • If the pool member does not respond to a connection request after a user-specified number of tries within a user-specified time period, the monitor marks the pool member as down.
  • After the monitor has marked the pool member as down, and after a user-specified amount of time has passed, the monitor tries again to connect to the pool member (if so configured).
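The behavior in the two points above can be sketched as a small state machine. This is a Python illustration under stated assumptions, not F5's implementation: the class `InbandMonitor` and its parameters `failures`, `failure_interval`, and `retry_time` are hypothetical names for the three user-specified values, and the choice to clear the failure history on a successful connection is an assumption of this sketch.

```python
from collections import deque


class InbandMonitor:
    """Tracks observed connection outcomes for a single pool member."""

    def __init__(self, failures: int = 3, failure_interval: float = 30.0,
                 retry_time: float = 300.0):
        self.failures = failures                  # failed tries that mark the member down
        self.failure_interval = failure_interval  # window (seconds) the tries must fall in
        self.retry_time = retry_time              # seconds to wait before trying again
        self._failed_at = deque()                 # timestamps of recent failed connects
        self._down_since = None                   # None while the member is up

    def record(self, connected: bool, now: float) -> None:
        """Feed in the outcome of one client connection attempt."""
        if connected:
            # Assumption: one successful connection restores the member.
            self._failed_at.clear()
            self._down_since = None
            return
        self._failed_at.append(now)
        # Drop failures that have aged out of the sliding window.
        while now - self._failed_at[0] > self.failure_interval:
            self._failed_at.popleft()
        # Mark down on reaching the threshold, or restart the wait
        # if a retry against an already-down member failed.
        if self._down_since is not None or len(self._failed_at) >= self.failures:
            self._down_since = now

    def eligible(self, now: float) -> bool:
        """True if traffic may be sent: member is up, or the retry time has passed."""
        if self._down_since is None:
            return True
        return now - self._down_since >= self.retry_time
```

Note that nothing here generates traffic of its own. The monitor only learns from connections clients were going to make anyway, which is the whole point.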

What inband monitoring does do, and does well, is eliminate all the extraneous traffic and connections that active monitoring typically imposes on servers and the network. But what it can’t do at this time is inspect or verify the correctness of the response. It operates strictly at layer 4 (TCP), so if the server or application responds, the inband monitor thinks all is well. But we know that a response from a server does not mean that all is well; the content may not be what we expect. What we want is to mitigate the impact of monitoring on the network and servers without sacrificing application availability. That’s where passive monitoring comes in.

Passive monitoring is actually a technique that leverages network-side scripting (in our case F5 iRules) to inspect the content returned by an application and determine whether it is valid or not. If it is not valid, iRules affords the ability to mark the node down and/or resend the request to another (hopefully correctly working) application instance. Here’s a brief example that marks the server down after three failures and otherwise attempts to “retry” the request:

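   rule count_server_down {
      when HTTP_REQUEST {
         # Save the original request once so it can be replayed on failure.
         if { not [info exists orig_request] } {
            set orig_request [HTTP::request]
         }
      }
      when HTTP_RESPONSE {
         # Treat any 5xx status as a failed response from this pool member.
         if { [HTTP::status] >= 500 } {
            # Failures recorded so far for this server; empty if none yet.
            set failures [session lookup dest_addr [LB::server addr]]
            if { $failures eq "" } {
               set failures 0
            }
            if { $failures >= 3 } {
               LB::down
            } else {
               # Record the failure, detach from this server, and retry.
               session add dest_addr [LB::server addr] [incr failures]
               LB::detach
               HTTP::retry $orig_request
            }
         }
      }
   }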

Passive monitoring is real-time: it looks at real requests to determine actual availability and correctness of response. This is even more useful when you start considering how you might respond. The robust nature of iRules allows you to do some interesting manipulation of content and communication channel, so if you can think it up you can probably get it done with an iRule.

By combining inband with passive monitoring you end up with “inband passive monitoring”. This solution eliminates the overhead of active monitoring by eavesdropping on client-server conversations and ensures application availability by inspecting content.
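As a rough way to see how the two signals fit together, the following Python sketch classifies a single observed client-server exchange. It is an illustration only, not BIG-IP behavior: `Observation`, `classify`, and the `expected_text` content check are hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    connected: bool               # did the TCP connection to the pool member succeed?
    status: Optional[int] = None  # HTTP status code, if a response came back
    body: str = ""                # response content, if any


def classify(obs: Observation, expected_text: str) -> str:
    """Label one real exchange using layer 4 and content-level signals."""
    if not obs.connected:
        return "connect_failure"  # visible to an inband (layer 4) monitor
    if obs.status is None or obs.status >= 500:
        return "bad_response"     # the server answered, but with an error
    if expected_text not in obs.body:
        return "bad_content"      # answered, but not with what we expect
    return "healthy"
```

Only the first case is visible at layer 4. The other two require looking at the response itself, which is the gap passive monitoring fills.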

For a great discussion of inband passive monitoring and a detailed scenario of how it might work in conjunction with a real application, check out Alan Murphy’s post on the subject, “BIG-IP v10: Passive Application Monitoring”.