Forum Discussion

Stewart
Altostratus
Mar 04, 2015

ASM to block indexing of sites.

Hi,

 

I need to set up ASM to block any indexing of my web site. I think I can do this by enabling Anomaly Detection for Web Scraping. That will stop most crawlers, but it will still allow Google, Ask, Bing and Yahoo.

 

Would deleting Google, Ask, Bing and Yahoo from ASM's Search Engines section block indexing by these engines as well? Is there anything else that could be impacted by deleting them from the Search Engines section?

 

And finally, is there a better way of doing this?

 

Thanks, Stewart.

 

5 Replies

  • Thanks John, your articles are a great help in picking my way through how to set up ASM.

     

    Do you know if deleting them will impact any areas other than Web Scraping?

     

  • Rather than raising another thread for the same topic, I hope someone will still read and answer here.

     

    We have a requirement that an application should be accessible from the Internet, but should not be listed in any of these search engines. Deleting the pre-defined search engines is a global setting, but I want to do this only for specific policies. How can I achieve this, i.e. which features/functions do I need to enable/configure (FYI: IPI is not licensed)? I guess Bot Detection needs to be enabled anyway. Can I use the default values/thresholds?

     

    Thank you!

     

    Ciao Stefan :)

     

  • While you can configure Web Scraping protection, I don't think it's the best solution to prevent search-engine indexing. I'm in favor of relying on the /robots.txt standard.

    If Google does not receive a response to the /robots.txt request (the request times out), all indexing of your website is discarded, as if the website did not exist at all. If Google receives an HTTP 404 or any other normal response, then your website is subject to indexing. There are firms that will not wait a day to file a lawsuit against Google if they find anything indexed that they don't want indexed, and for that reason Google proposed this as a compromise. Most major search engines today behave the same way.

    For the simplest solution, just make sure requests to /robots.txt will time out, and you're done.
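    As a rough, untested sketch of that timeout approach, an iRule along these lines would do it (adapt the path check and event handling to your own environment):

      when HTTP_REQUEST {
          # Silently discard requests for /robots.txt so the crawler's
          # request times out and the site is treated as not indexable.
          if { [string tolower [HTTP::path]] eq "/robots.txt" } {
              drop
          }
      }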

    1. Many viable solutions here. Personally, I use an LTM policy (default rule: Enable ASM; conditional rule for /robots.txt: Drop; policy strategy: Best Match). A simple iRule that drops requests to /robots.txt, like the sketch above, will work too.

    2. Arguably the cleanest solution is to answer the /robots.txt request with a valid HTTP 200 response whose payload contains the following:

      User-agent: *

      Disallow: /

      (Search engines will respect this directive and understand that no pages are to be indexed.)

      This file, named 'robots.txt', can be hosted on the end server in the WWW root directory (/). It can also be hosted on the BIG-IP as an iFile, or as raw content in an iRule. Regardless of your choice, to answer the /robots.txt request from the BIG-IP you will have to use the 'HTTP::respond 200 content' command (example of an HTML-payload-response iRule here: https://devcentral.f5.com/questions/irule-response-with-static-html-message-when-pool-members-are-down).
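      As a rough, untested sketch of this second option (assuming a plain-text Content-Type is acceptable for the crawlers you care about), the iRule variant could look like this:

        when HTTP_REQUEST {
            # Serve a Disallow-all robots.txt straight from the BIG-IP so
            # well-behaved crawlers skip indexing the whole site.
            if { [string tolower [HTTP::path]] eq "/robots.txt" } {
                HTTP::respond 200 content "User-agent: *\r\nDisallow: /\r\n" "Content-Type" "text/plain"
            }
        }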

    Regards,

  • Hi Hannes,

     

    Thank you for the quick response. But will the 200 OK blocking page from ASM result in the "correct" behavior for the search engines, given that this will not be a timeout? Or should I rather handle this outside ASM and just create a small iRule to drop any request to /robots.txt?

     

    Ciao Stefan :)