Forum Discussion

nathan_hoffman_
Nimbostratus
Dec 22, 2017

Applying different policies for authorized web scrapers

I'm working on a method for dealing with external web scrapers in my organization. Some web scrapers are allowed, some aren't. My task is to define rules to authorize the "good" bots and block the rest, without impacting normal user traffic.


I'm thinking about implementing a local traffic policy or iRule that routes the "good" bots to one of two security policies specific to those bots, with increased rate ceilings, while still using a default security policy to catch the unauthorized bots. Basically I'm trying to convince the ASM that the good bots AREN'T bots, as long as the bot handlers throttle themselves to stay under the rate ceiling. I think this is possible by turning off Bot Detection and tuning the Session Opening and/or Session Transactions Anomaly settings. Does this sound right?

The main thing I haven't figured out is whether this setup lets me apply the bot-specific security policies ONLY to the specified "good" bots, while everything else still gets the default security policy with more standard anti-scraping settings. Will this iRule accomplish that? Also, does it look like it would be very performance intensive for the ASM?

Portions of the code are shamelessly stolen from others on this site. Any suggestions or criticism are very welcome.

when HTTP_REQUEST {

    # set the start and end of the off-hours window (it spans midnight)
    set start_time "20:00"
    set end_time "05:30"

    # convert start/end times to seconds from the epoch for easier comparisons
    set start [clock scan $start_time]
    set end [clock scan $end_time]

    # get the current time in seconds since the Unix epoch (1970-01-01)
    set now [clock seconds]

    # only do the next section if it's an authorized bot, otherwise the request goes to the default security policy
    # authorized_bots is a data group of addresses known to belong to authorized bot handlers
    # not relying on ASM_REQUEST_VIOLATION or ASM_REQUEST_DONE to decide if it's a bot - just going by IP
    if { [class match [IP::client_addr] equals authorized_bots] } {

        # currently outside business hours?
        # the window crosses midnight, so the test is "after start OR before end", not AND
        if { ($now > $start) || ($now < $end) } {

            # check if the bot is scraping the app it's authorized for
            # have to check the app's URI path as well as its dependencies that aren't under the app root dir
            if { ([HTTP::uri] starts_with "/app") || ([HTTP::uri] starts_with "/dependency1") || ([HTTP::uri] starts_with "/dependency2") } {

                # outside business hours: use the security policy with a higher rate ceiling for bot detection
                ASM::enable /Common/auth_scrape_high_volume
            } else {
                # the URI doesn't match - the bot isn't authorized for this URI
                drop
            }
        } else {
            # if we get here, it's currently within business hours

            # check if the bot is scraping the app it's authorized for
            if { ([HTTP::uri] starts_with "/app") || ([HTTP::uri] starts_with "/dependency1") || ([HTTP::uri] starts_with "/dependency2") } {

                # within business hours: use the security policy with a lower rate ceiling for bot detection
                ASM::enable /Common/auth_scrape_low_volume
            } else {
                # the URI doesn't match - the bot isn't authorized for this URI
                drop
            }
        }
    }
}

2 Replies

  • Hi,

    A dual-policy setup is justified in some cases, but here you can ease your management effort and go with one. When it comes to creating exceptions for "good bots", you can identify those by the User-Agent header.

    For example, these are the top well-known "good bots": https://www.keycdn.com/blog/web-crawlers/

    iRule logic for allowing those specific bots, without globally disabling the violation itself, would work as summarized in the five steps below; a rough sketch follows the list. This will not do rate throttling, but it will help you distinguish well-known bots from the rest.

    1. Create a list of good-bot User-Agent strings in a string-type LTM data group.
    2. Check the client's User-Agent in the HTTP_REQUEST event. If it matches any value in your LTM data group (i.e. it's one of the good bots), set a variable that you can refer to later in ASM-related iRule events (e.g. set goodBot 1).
    3. Catch the occurrence of that bot violation with a simple IF condition (you must enable iRule events in the ASM policy settings).
    4. Check that only one violation was triggered (count the violations) so that people can't exploit this exception and bypass your ASM WAF simply by presenting something like User-Agent: GoogleBot as a request header.
    5. When only that specific violation is triggered, check the variable you set in step 2. If it's set and you know the bot violation occurred for a good bot, disable ASM blocking with the ASM::unblock command. If it isn't set, do nothing.
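
    As a rough, untested sketch of steps 1 to 5: the data-group name good_bot_agents (holding lowercase User-Agent substrings) is a placeholder, and the violation name checked in ASM_REQUEST_DONE is an assumption, so confirm the exact name that ASM::violation names reports on your version before relying on it.

    when HTTP_REQUEST {
        # step 2: flag requests whose User-Agent matches the good-bot data group
        set goodBot 0
        if { [class match [string tolower [HTTP::header User-Agent]] contains good_bot_agents] } {
            set goodBot 1
        }
    }

    when ASM_REQUEST_DONE {
        # steps 3-5: act only when the request came from a flagged good bot
        # and the sole violation raised is the web-scraping/bot one
        if { [info exists goodBot] && $goodBot == 1 } {
            set vioNames [ASM::violation names]
            # NOTE: the violation name below is an assumption; verify it in your event logs
            if { [llength $vioNames] == 1 && [lsearch -exact $vioNames "VIOLATION_WEB_SCRAPING_DETECTED"] >= 0 } {
                # override the blocking action for this request only
                ASM::unblock
            }
        }
    }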

    If you have your own bot/crawler, just make up a User-Agent header for it if it doesn't already have one. I don't have the ability to test at this hour, so treat the sketch above as a starting point rather than something production-ready.

    Hope this will get you started!

    • nathan_hoffman_
      Nimbostratus

      Hi Hannes,


      I really appreciate your reply. I like the idea of checking the User-Agent and checking for just one violation to prevent workarounds. I'll have to check the logs and see what User-Agent our bots are using.

      The main reason I was thinking about doing it this way is that the apps being scraped are behind an auth wall, so the bots have to present credentials to get in, and we aren't getting Google, Bing, etc. The good bots are run by companies acting as agents or intermediaries to pull data on behalf of groups of real users, and we're required to allow them, or the real users who hire these companies can claim we're denying them access to their own data. The problem is when the bots scrape too fast, hence the daytime and nighttime rate ceilings. We have rate requirements that the bot handlers are aware of but don't always respect, so we need a technical control.

      I'll check the user-agents. Assuming the bots DON'T use unique user-agents, does it look like my idea will work? Any obvious alternatives? I'm in a bit of a time crunch and also don't have full admin rights in the ASM (we use a delegated admin model), so trial and error will take some time that I don't have much of.


      One other question: will this iRule fire before the ASM processes any security policy, so that I can choose whether one of the two custom policies gets applied and have the default used otherwise? Do I need to make any config changes to achieve this? I read about the "Trigger ASM iRule event" setting in a security policy, but I want this iRule to run before the default security policy if possible. I'm still very new to ASM, so I don't know whether this is feasible or what the accepted way is of selecting which security policy to use.

      I really appreciate your help! Please feel free to recommend reading material in lieu of an answer if that seems appropriate.


      Nathan