Forum Discussion

schusb
Dec 14, 2018

Is blocking all HTTP HEAD requests a bad idea?

We are thinking about blocking all HTTP HEAD requests to our web applications (not REST or SOAP) via ASM by returning an HTML response page with HTTP status 200 OK, because most of them come from crawlers.


Are there any known experiences regarding client behavior? Since HTTP 200 is returned, the client assumes the request is valid even if the page doesn't exist. For Office documents that contain invalid web links, the user doesn't get an info popup telling them the resource doesn't exist; instead the web client opens and then sends an HTTP GET to a non-existent resource. To me that doesn't sound like a major drawback. Are there any other known pitfalls?
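
To make the pitfall concrete, here is a minimal client-side sketch (hypothetical URL, using the third-party Python requests library). The 200-for-HEAD result is what the proposed ASM setup would produce, not what example.com actually does:

    # Sketch of the HEAD/GET mismatch described above. Assumes ASM answers
    # every HEAD with 200 OK while the resource itself does not exist.
    import requests

    url = "https://www.example.com/some/missing/page"  # hypothetical dead link

    head = requests.head(url, allow_redirects=False, timeout=5)
    get = requests.get(url, allow_redirects=False, timeout=5)

    print("HEAD status:", head.status_code)  # 200 if ASM intercepts HEAD
    print("GET status: ", get.status_code)   # 404 for the missing resource

    if head.status_code == 200 and get.status_code == 404:
        print("A link checker relying on HEAD would report this dead link as valid.")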


4 Replies

  • Kinda wondering why? You don't exactly want to block, but you want to let them think the request was handled correctly.


  • schusb

    Isn't this the normal behavior? If any client request is blocked by ASM, the blocking page is returned with an HTTP 200 response. Or should the status code be a 4xx (e.g. 403) for blocked requests in general?


  • No, you are right; normally an HTTP 200 with the blocking message is returned.


    So what are you looking to do, and why?


  • It is a bad idea.


    In most cases you should not block HEAD requests. HEAD is crucial for determining metadata such as the status and 'freshness' of a URL without using an expensive GET request, which actually retrieves the content.
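
    As an illustration, here is a minimal sketch (hypothetical URL, third-party Python requests library) of the metadata a single HEAD request yields without transferring the content itself:

        # HEAD returns the status line and metadata headers, but no body.
        import requests

        resp = requests.head("https://www.example.com/reports/annual-report.pdf",
                             timeout=5)

        print("Status:        ", resp.status_code)
        print("Content-Length:", resp.headers.get("Content-Length"))
        print("Last-Modified: ", resp.headers.get("Last-Modified"))
        print("Body bytes received:", len(resp.content))  # 0 -- HEAD has no body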


    Imagine your website has a URL for a PDF file containing your company's annual report to shareholders/investors. This PDF can be quite large (say 10 MB, which is not unusual). Imagine a client has already downloaded that file and has it in its cache. If the same client accesses the same URL again, a HEAD request returns only the metadata (e.g. file size and last-modified date). Since the file has not been modified, the client does not have to issue a GET request to pull a massive 10 MB file again.
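
    A rough sketch of that freshness check (hypothetical URL and cached value, same requests library as above); many real clients achieve the same effect with a conditional GET and If-Modified-Since, but the bandwidth argument is identical:

        # Only re-download the PDF when the HEAD metadata shows it has changed.
        import requests

        url = "https://www.example.com/reports/annual-report.pdf"  # hypothetical
        cached_last_modified = "Mon, 01 Oct 2018 08:00:00 GMT"     # saved from the first download

        head = requests.head(url, timeout=5)
        if head.headers.get("Last-Modified") == cached_last_modified:
            print("Cached copy is still fresh -- no 10 MB GET needed.")
        else:
            pdf = requests.get(url, timeout=5)  # only now pull the full file
            print("File changed, downloaded", len(pdf.content), "bytes")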


    If you block HEAD requests, the client has to issue a full GET request to download resources from your website just to check their size and freshness, costing your website bandwidth and slowing down access for everyone else.


    There is a reason crawlers (e.g. the Google search engine) issue HEAD requests instead of GET: they just determine the status of the content without downloading it. If you interfere with this by returning the same response for all URLs (which is my understanding of what you are proposing), you will break the search results in Google and other search engines, as well as the caches in any corporate proxy servers that access your website...