Using an iRule to Sort Out Spiders for Network Computing

In their most recent L4-L7 product review, Network Computing based their testing on the requirements of a real-world IT shop – CMP Media (a link to more on the review appears at the end of this article). The review demonstrated the importance of product flexibility and, more specifically, the power behind F5's programming language, iRules.

The Challenge

Like many communication companies, part of CMP Media's revenue is generated from ads hosted on their site. In order to provide accurate counts for ad impressions and click-throughs, IT must filter out illegitimate traffic, like the spiders and robots that hit a site to index its content.

These sources present an interesting challenge for many shops that must serve all content but identify and track real users separately from spiders/robots. Often, it's not acceptable to simply block certain source IP addresses, because much of this traffic helps get your content listed in search engines like Google.

At CMP, this requirement for advanced user-based routing is made even more complicated by their real-world deployment, in which multiple sites are hosted on the same IP address. Many organizations do the same – serving several websites which are all virtualized by a single address.

In order to meet these requirements, a solution first has to determine whether the user is a spider/robot, and then identify the site being requested.

Configuration Can Get Messy Quickly

This problem can be visualized by thinking of 5 websites being hosted on a single IP address. For each site there are two possible destinations: the real content, which is tracked for billing purposes, and the spider/robot content. This essentially doubles the number of websites: if you had 5 real websites, your traffic management device would have to treat them as 10, replicating pools, nodes, and many other pieces of the configuration, and roughly doubling the administration for every site.

Sorting Out the Traffic with a 9-Line iRule

During the review, Network Computing leveraged F5 resources and DevCentral to create, in just 20 minutes, an iRule that accomplished the required tasks with far greater simplicity. In addition to the speed of development and the performance of the box, the real testament from our perspective is the simplicity of the iRule, and the fact that BIG-IP allowed the customer to forgo the complexity and cost of a redundant configuration to meet their objective.

rule nwc_robot_routing_rule {
   when HTTP_REQUEST {
      if { [HTTP::header User-Agent] eq "" } {
         # No User-Agent header at all: check the source IP against
         # a data group of known-bad client addresses.
         if { [matchclass [IP::remote_addr] equals $::blacklisted_clients] } {
            pool spider_[HTTP::header Host]
         }
      } elseif { [matchclass [HTTP::header User-Agent] contains $::blacklisted_useragents] } {
         # The User-Agent contains one of the known-bad agent strings.
         pool spider_[HTTP::header Host]
      } elseif { [string first "bot" [string tolower [HTTP::header User-Agent]]] >= 0 } {
         # The User-Agent contains "bot" anywhere, in any case.
         pool spider_[HTTP::header Host]
      } else {
         # Everyone else is treated as a real user.
         pool pool_[HTTP::header Host]
      }
   }
}

Step-by-Step Explanation of the NC iRule

This is a rule that separates business logic from configuration.

Walking through the rule line-by-line, we have this:

when HTTP_REQUEST {

This signals that this rule applies to HTTP requests. Rules can be applied to various "events", such as HTTP requests, responses, TCP data, connection establishment, et cetera.
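
As a quick illustration, here is a minimal sketch of a rule hooked to two other events (the rule name and log messages are our own invention, not part of the NC configuration):

rule event_logging_example {
   when CLIENT_ACCEPTED {
      # Runs as soon as the client's TCP connection is established.
      log local0. "New connection from [IP::client_addr]"
   }
   when HTTP_RESPONSE {
      # Runs when the server's HTTP response headers arrive.
      log local0. "Server answered with status [HTTP::status]"
   }
}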

if { [HTTP::header User-Agent] eq "" } {

This checks whether the HTTP User-Agent header is missing or empty (an absent header reads as an empty string). If so, we execute this line of the rule:

if { [matchclass [IP::remote_addr] equals $::blacklisted_clients] } {

The iRule checks the client's IP address against a list of known-bad clients stored in a data group named "blacklisted_clients". If the client did not present a User-Agent header, and its IP address matches one of these known-bad addresses, then we run the next line of the rule (if the address does not match, no pool is selected and the request falls through to the virtual server's default pool):

pool spider_[HTTP::header Host]

This says that we're going to use a set of servers (a "pool" of servers) named "spider_" followed by the value of the HTTP Host header. A spider requesting www.site1.com, for example, lands in the pool spider_www.site1.com.

Next, if the user did present a User-Agent header, it's checked by this line:

} elseif { [matchclass [HTTP::header User-Agent] contains $::blacklisted_useragents] } {

The iRule checks whether the User-Agent contains one of the strings in the list named "blacklisted_useragents".
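
Both lists live in the BIG-IP configuration as data groups (classes), not in the rule itself. One plausible way they might be defined in a v9-era bigip.conf (the entries are invented examples, and the exact syntax varies by version):

class blacklisted_clients {
   host 192.0.2.15
   network 198.51.100.0 mask 255.255.255.0
}

class blacklisted_useragents {
   "EmailSiphon"
   "WebZIP"
}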

If the user-agent does match, the user is sent to this line of the rule, which does the same thing as the earlier appearance of this line:

pool spider_[HTTP::header Host]

Next, if the User-Agent header exists but didn't match the blacklisted user-agent list, this line is evaluated:

} elseif { [string first "bot" [string tolower [HTTP::header User-Agent]]] >= 0 } {

This checks whether the string "bot" appears anywhere in the User-Agent value. Because Tcl's string first comparison is case-sensitive, the header is lowercased first, so "Googlebot", "msnbot", and "BOT" all match.
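
You can try the expression in any Tcl shell; the user-agent value here is just a sample:

% string first "bot" [string tolower "Googlebot/2.1 (+http://www.google.com/bot.html)"]
6

A result of 0 or greater means "bot" was found somewhere in the string; -1 means it was not.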


If it does, the user is sent to the familiar rule line:

pool spider_[HTTP::header Host]

If none of the previous checks turned out to be true, then the user is deemed legitimate, and they are sent to this final line of the rule:

pool pool_[HTTP::header Host]

This sends the user to a pool named "pool_" followed by the Host header value (pool_www.site1.com, to continue the example), which holds the real content tracked for billing.
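
In practice, the "doubled" configuration the rule selects between is just pairs of pools that follow its naming convention. A rough sketch for one of the five sites (the member addresses are invented, and pool syntax varies by version):

pool pool_www.site1.com {
   member 10.10.1.10:80
   member 10.10.1.11:80
}

pool spider_www.site1.com {
   member 10.10.2.10:80
}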


Learn More About the Network Computing Review

You can learn more about the specific review by visiting: http://www.f5.com/communication/articles/2005/article021105.html

More about DevCentral

We welcome you to explore our site further. DevCentral was created as a community for F5 customers and partners to learn and share the iRules that provide valuable solutions.

Published Feb 21, 2005
Version 1.0

3 Comments

  • Puli: Is there any performance impact from looking at each request for the user-agent string?

    Our site gets an average of 100 hits/second.

    Wanted to check if we can apply this to our site.
  • What version of code are you running? If on v10+, you'll want to use the class command instead of matchclass (see the sketch after these comments). Also, if on 9.4+, you'll want to remove the $:: on CMP platforms.
  • And to answer your question: any iRule impacts performance, but this rule, especially at your load, will be negligible. Still, you can trend your CPU before/after to make sure, and you can use the "timing on" command for your iRule to gauge how many CPU cycles the rule consumes.
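
For readers on newer code, a sketch of the class-based form of the first check described in the comment above (v10+ syntax, reusing the data group name from the article's rule):

when HTTP_REQUEST {
   if { [HTTP::header User-Agent] eq "" } {
      # On v10 and later, "class match" replaces matchclass,
      # and the $:: prefix on the data group name is dropped.
      if { [class match [IP::client_addr] equals blacklisted_clients] } {
         pool spider_[HTTP::header Host]
      }
   }
}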