Forum Discussion

Stewart
Altostratus
Mar 04, 2015

ASM to block indexing of sites.

Hi,

 

I need to set up ASM to block any indexing of my web site. I think I can do this by enabling Anomaly Detection for Web Scraping. That will stop most crawlers, but it will still allow Google, Ask, Bing and Yahoo.

 

Would deleting Google, Ask, Bing and Yahoo from ASM's Search Engines section block indexing by these engines as well? Is there anything else that could be impacted by deleting them from the Search Engines section?

 

And finally, is there a better way of doing this?

 

Thanks, Stewart.

 

5 Replies

  • Thanks John, your articles are a great help in picking my way through how to set up ASM.

     

    Do you know if deleting them will impact any areas other than Web Scraping?

     

  • Rather than raising another thread for the same topic, I hope someone will still read and answer here.

     

    We have a requirement that an application should be accessible from the Internet, but should not be listed in any of these search engines. Deleting the pre-defined search engines is a global setting, but I want to do this only for specific policies. How can I achieve this, i.e. which features/functions do I need to enable/configure (FYI: IPI is not licensed)? I guess Bot Detection needs to be enabled anyway. Can I use the default values/thresholds?

     

    Thank you!

     

    Ciao Stefan :)

     

  • While you can configure Web Scraping protection, I don't think it's the best solution to prevent search-engine indexing. I'm in favor of relying on the /robots.txt standard.

    If Google does not receive a response to the /robots.txt request (the request times out), all indexing of your website is discarded, as if the website did not exist at all. If Google receives an HTTP 404 or any other normal response, then your website is subject to indexing. There are firms that will not wait a day to file a lawsuit against Google if they find anything indexed that they don't want indexed, and for that reason Google proposed this as a compromise. Most major search engines today behave the same way.

    For the simplest solution, just make sure requests to /robots.txt will time out, and you're done.
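    As a rough, untested sketch of that timeout approach, an iRule along these lines would do it (adapt the path check and event handling to your own environment):

      when HTTP_REQUEST {
          # Silently discard requests for /robots.txt so the crawler's
          # request times out and the site is treated as not indexable.
          if { [string tolower [HTTP::path]] eq "/robots.txt" } {
              drop
          }
      }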

    1. Many viable solutions here. Personally, I use an LTM policy (default rule: Enable ASM; conditional rule for /robots.txt: Drop; policy strategy: Best Match). A simple iRule that drops requests to /robots.txt, like the sketch above, will work too.

    2. Arguably the cleanest solution is to answer the /robots.txt request with a valid HTTP 200 response whose payload contains the following:

      User-agent: *

      Disallow: /

      (Search engines will respect this directive and understand that no pages are to be indexed.)

      This file, named 'robots.txt', can be hosted on the end server in the WWW root directory (/). It can also be hosted on the BIG-IP as an iFile, or as raw content in an iRule. Regardless of your choice, to answer the /robots.txt request from the BIG-IP you will have to use the 'HTTP::respond 200 content' command (example of an HTML-payload-response iRule here: https://devcentral.f5.com/questions/irule-response-with-static-html-message-when-pool-members-are-down).
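      As a rough, untested sketch of this second option (assuming a plain-text Content-Type is acceptable for the crawlers you care about), the iRule variant could look like this:

        when HTTP_REQUEST {
            # Serve a Disallow-all robots.txt straight from the BIG-IP so
            # well-behaved crawlers skip indexing the whole site.
            if { [string tolower [HTTP::path]] eq "/robots.txt" } {
                HTTP::respond 200 content "User-agent: *\r\nDisallow: /\r\n" "Content-Type" "text/plain"
            }
        }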

    Regards,

  • Hi Hannes,

     

    Thank you for the quick response. But will the 200 OK blocking page from ASM result in the "correct" behavior for the search engines, given that this will not be a timeout? Or should I rather handle this outside ASM and just create a small iRule to drop any request to /robots.txt?

     

    Ciao Stefan :)