This article discusses a method for regulating HTTP accesses from robots (a.k.a. crawlers, spiders, or bots) using the F5 LineRate Precision Load Balancer.

A growing volume of robot traffic can affect the performance of your web services. Studies show that robots accounted for 35% of web accesses in 2005 [1], rising to 61.5% in 2013 [2]. Many sites employ the de facto Robots Exclusion Protocol standard [3] to regulate access; however, not all robots follow this advisory mechanism [4]. You could filter the disobedient robots at the web servers, but that adds an extra burden to machines that are already heavily loaded. In this use-case scenario, we use LineRate scripting to exclude known robots before they reach the backend servers.

The story is simple. When a request hits LineRate, the script checks the HTTP User-Agent request header. If it matches one of the known robots, LineRate sends a 403 Forbidden response back to the client without forwarding the request to the backend servers. Lists of known robots are available from a number of web sites; in this article, user-agents.org was chosen as the source because it provides an XML-formatted list. The list also contains legitimate user agents, so only the entries marked 'Robot' or 'Spam' are extracted.

Here is the code.

'use strict';
var vsm = require('lrs/virtualServerModule');
var http = require('http');
var event = require('events');
var xml = require('xml2js');

First, include the necessary modules. lrs/virtualServerModule is a LineRate-specific module that handles traffic. http and events are standard Node.js modules: the former is used to access the user-agents.org server as an HTTP client, and the latter is for custom event handling. xml2js is an NPM module that translates an XML-formatted string into a JavaScript object.

function GetRobots () {
    this.vsname = 'vs40';

    this.ops = {
	host:    '80.67.17.172',
	path:    '/allagents.xml',
	headers: {'Host':    'www.user-agents.org',
		  'Accept':  '*/*'}
    };

    this.stat403 = 403;
    this.body403 = 'Forbidden';
    this.head403 = {'Content-Type': 'text/plain',
		    'Content-length': this.body403.length};

    this.xml = '';              // populated by getter()
    this.list = {};             // populated by parser()
};
GetRobots.prototype = new event.EventEmitter;

The GetRobots class stores information such as the HTTP request options and the 403 response message. In order to handle custom events, the class inherits from events.EventEmitter. The class has two methods: GetRobots.prototype.parser() parses the XML string into an object, and GetRobots.prototype.getter() retrieves the XML data.

// Parse XML string into an object
GetRobots.prototype.parser = function() {
    var reg = /[RS]/i;
    var self = this;
    try {
        xml.parseString(self.xml, function(e, res) {
            if (e || ! res)
                console.error('robot: parser error: ' + e);
            else if (! res['user-agents'] || ! res['user-agents']['user-agent'])
                console.error('robot: parser got malformed data.');
            else {
                var array = res['user-agents']['user-agent'];
                for (var i=0; i<array.length; i++) {
                    if (reg.test(array[i].Type))
                        self.list[(array[i].String)[0]] = 1;
                }
                self.emit('parser');
            }
        });
    }
    catch(e) {
        console.error('robot: parser got unknown error ' + e);
    }
};

This is the parser method. The retrieved XML data is structured as <user-agents><user-agent>....</user-agent></user-agents>, where each <user-agent>....</user-agent> element describes one user agent. The tags we are after are <String> and <Type>: <String> holds the value that appears in the HTTP User-Agent header, and <Type> classifies the agent. We want Type R(obot) or S(pam), as reflected in the regular expression in the code. After parsing completes, the method emits the custom 'parser' event.
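For illustration, here is an assumed sketch of the object shape that xml2js produces for this list (by default, every child tag becomes an array of values), together with the same extraction loop as the parser. The sample entries below are made up; the real data comes from user-agents.org:

```javascript
'use strict';
// Hypothetical sample of what xml2js returns for the agent list:
// each child tag is wrapped in an array.
var res = {
    'user-agents': {
        'user-agent': [
            { String: ['Googlebot/2.1'], Type: ['R'] },   // robot
            { String: ['SpamCrawler'],   Type: ['S'] },   // spam
            { String: ['Mozilla/5.0'],   Type: ['B'] }    // browser
        ]
    }
};

// The same extraction logic as GetRobots.prototype.parser():
// keep only entries whose Type contains R(obot) or S(pam).
var reg = /[RS]/i;
var list = {};
var array = res['user-agents']['user-agent'];
for (var i = 0; i < array.length; i++) {
    if (reg.test(array[i].Type))
        list[(array[i].String)[0]] = 1;
}

console.log(Object.keys(list));  // [ 'Googlebot/2.1', 'SpamCrawler' ]
```

Storing the names as object keys makes the per-request lookup in the main section a single hash access.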

// Retrieve the XML formatted user agent list
GetRobots.prototype.getter = function() {
    var self = this;
    try {
        var client = http.request(self.ops, function(res) {
            var data = [];
            res.on('data', function(str) {
                data.push(str);
            });
            res.on('end', function() {
                self.xml = data.join('');
                self.emit('getter');
            });
        }).end();
    }
    catch(e) {
        console.error('robot: getter error: ' + e.message);
    }
};

This snippet is the getter. It sends an HTTP GET request to the server and accumulates the XML response in chunks. Once all the data has arrived, it emits the custom 'getter' event.

// main part
var robo = new GetRobots();
vsm.on('exist', robo.vsname, function(vso) {
    robo.on('getter', function() {
        console.log('robot: got XML file. ' + robo.xml.length + ' bytes.');
        robo.on('parser', function() {
            var num = (Object.keys(robo.list)).length;
            console.log('robot: got ' + num + ' robots.');
            vso.on('request', function(servReq, servResp, cliReq) {
                var agent = servReq.headers['User-Agent'];
                if (robo.list[agent]) {
                    servResp.writeHead(robo.stat403, robo.head403);
                    servResp.end(robo.body403);
                }
                else {
                    cliReq();
                }
            });
        });
        robo.parser();
    });
    robo.getter();
});
console.log('robot: retrieving info from ' + robo.ops.headers['Host']);

Now, combine them together. The code proceeds through the following steps.

  1. Instantiate the class (the object is robo here). Then, log the message (shown at the end of the code).
  2. Check whether the LineRate virtual server exists (vsm.on('exist', ...)). The name of the virtual server is hard-coded in the class definition.
  3. Register the 'getter' event handler on the robo object (robo.on('getter', ...)). Then, run getter() to retrieve the XML-formatted agent list from user-agents.org. You must run getter() AFTER the handler is registered; otherwise, the object may miss the event because nothing is prepared to catch it.
  4. After getter() completes (the 'getter' event has fired), register the 'parser' event handler (robo.on('parser', ...)). Then, run parser() to parse the retrieved XML data. Note that parser() should run only after getter() has completed; otherwise, it would try to parse an empty string because the data is not there yet.
  5. Once the list of robots is ready, register LineRate's 'request' event handler (vso.on('request', ...)) so the script starts processing traffic. The rest of the story is simple: if the User-Agent header matches any of the listed agent names, send the 403 response back to the client (robot); otherwise, pass the request through to the backend servers.

Let's test the script.

Try accessing LineRate with your browser. It should return the backend server's data as if no intermediate processing took place.

Try mimicking a robot (any entry with the R or S mark) using curl, as below.

$ curl -D - -H "User-Agent: DoCoMo/1.0/Nxxxi/c10" 192.168.184.40
HTTP/1.1 403 Forbidden
Content-Type: text/plain
Content-length: 9
Date: Tue, 03-Mar-2015 04:03:56 GMT

Forbidden

The User-Agent string must exactly match a string that appears in the user-agents.org list.

The script leaves the following log messages upon startup.

robot: retrieving info from www.user-agents.org
robot: got XML file. 693519 bytes.
robot: got 1527 robots.

While the script runs fine as-is, a few alterations could make it nicer.

  1. Waterfall the process - The script must run the virtual-server check, getter, parser, and traffic processing in sequence. The NPM async module is handy for such a deeply nested structure. See our "LineRate: HTTP session ID persistence in scripting using memcache" DevCentral article for more details.
  2. Handle the lengthy initialization process nicely - It takes noticeable time for the getter and parser to prepare the list of robots (about 8 s in the author's environment). While the script is preparing this essential data, any incoming request is passed through as if no processing were performed by the proxy, so robots can reach the servers for the first several seconds. If you want to change this behavior, check our "A LineRate script with lengthy initialization process" DevCentral article.
  3. Cater for multiple instances - LineRate may spawn multiple HTTP processing engines (called lb_http) depending on the number of vCPUs (cores/hyper-threads). With the script above, each engine runs its own getter and parser and keeps its own copy of the data. You could instead run the getter and parser just once on a designated instance and make the data available to all the others. Learn the data-sharing methods from "A Variable Is Unset That I Know Set" in our Product Documentation or the "LineRate and Redis pub/sub" DevCentral article.
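As a sketch of alteration 1, the nested getter/parser/request registration could be flattened into a sequence of tasks. The toy helper below merely illustrates the pattern that the NPM async module's async.waterfall provides; the task names are placeholders for the article's real steps:

```javascript
'use strict';

// Toy waterfall: run async tasks in order, each calling `next` when done.
function waterfall(tasks, done) {
    var i = 0;
    (function next(err) {
        if (err || i >= tasks.length) return done(err);
        tasks[i++](next);
    })();
}

var order = [];
waterfall([
    function(next) { order.push('getter'); setImmediate(next); },
    function(next) { order.push('parser'); setImmediate(next); },
    function(next) { order.push('request handler'); setImmediate(next); }
], function(err) {
    if (err) return console.error(err);
    console.log(order.join(' -> '));  // getter -> parser -> request handler
});
```

Each stage starts only after the previous one signals completion, which replaces the manually nested event registrations.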

Please leave a comment or reach out to us with any questions or suggestions and if you're not a LineRate user yet, remember you can try it out for free.


References:
[1] Yang Sun, Ziming Zhuang, and C. Lee Giles: "A large-scale study of robots.txt", Proc. 16th Int. Conf World Wide Web (WWW 2007), 1123-1124 (May 2007).
[2] Igal Zeifman: "Report: Bot traffic is up to 61.5% of all website traffic", Incapsula's Blog (09 Dec 2013).
[3] The Web Robots Pages. The protocol was proposed to IETF by M. Koster in 1996.
[4] C. Lee Giles, Yang Sun, and Isaac G. Councill: "Measuring the web crawler ethics", Proc. 19th Int. Conf. World Wide Web (WWW 2010), 1101-1102 (Apr 2010).