Forum Discussion

perfmon_109693
Nimbostratus
Oct 14, 2008

Datacenter Failover & Browser DNS Caching

Working across multiple geographic sites presents many challenges. Our F5 products have handled the challenge wonderfully: they balance traffic across the sites, and if we take a VIP down, traffic goes to the other sites. The hosted application behind this infrastructure is browser based and spread across 4 data centers. The users all use IE and are typically in the application all day; once they log in, they don't log out until their shift is over. Sometimes the application experiences issues in one data center and we 'mark' the site down. Our 3DNS then stops answering DNS queries with the down data center. However, the users who were already on that site keep having problems, because their browsers are stuck with the 'down' site's VIP. IE will not re-resolve the hostname for us, so our users are forced to close their browser and re-enter the application.

I was hoping to get some people to post their experiences with this matter and how they've dealt with it. To get things started, here are 3 high-level options:

1. Application layer: add a layer of site awareness

2. Infrastructure layer: utilize IP failover via Route Health Injection (I've seen some discussion here on DevCentral addressing this facet specifically, using ARM and BGP RHI.)

3. Blend of app & infrastructure (1 & 2)
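
To make option 1 a bit more concrete, here is a rough sketch of what "site awareness" could look like, assuming the down site's VIP is still reachable and the application can answer with a redirect. The per-site hostnames (app-dc1.example.com and so on), the `SITE_HOSTS` table, and the `pick_redirect` helper are all hypothetical names for illustration. The idea is that a hostname the browser has not cached yet forces a fresh lookup, which sidesteps the stale entry for the global name.

```python
from typing import Mapping, Optional

# Hypothetical per-site hostnames; each one is assumed to resolve
# only to that site's VIP (not to the global wide IP name).
SITE_HOSTS = {
    "dc1": "app-dc1.example.com",
    "dc2": "app-dc2.example.com",
    "dc3": "app-dc3.example.com",
    "dc4": "app-dc4.example.com",
}


def pick_redirect(local_site: str,
                  site_up: Mapping[str, bool],
                  path: str = "/") -> Optional[str]:
    """Return a redirect URL to a healthy site, or None if no redirect is needed.

    local_site: the site handling the current request.
    site_up:    site name -> True if the site is currently marked up.
    """
    if site_up.get(local_site, False):
        return None  # local site is healthy, serve the request here
    for site, host in SITE_HOSTS.items():
        if site != local_site and site_up.get(site, False):
            return f"https://{host}{path}"
    return None  # nothing healthy to redirect to


if __name__ == "__main__":
    status = {"dc1": False, "dc2": True, "dc3": True, "dc4": True}
    print(pick_redirect("dc1", status, "/login"))
```

This only picks the first healthy site in table order and ignores session state, so users would probably need to log in again unless sessions are shared across sites.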

 

 

What other methodologies have you applied? What have you seen as the pros and cons of each?

2 Replies

  • The default IE DNS cache timeout is 30 minutes (http://support.microsoft.com/default.aspx?scid=KB;en-us;263558), so IE should re-resolve the name once that expires, unless your system cache is holding on to the entries longer. What is the TTL of the wide IP that 3DNS is handing out?
  • Thank you. I'm familiar with the IE DNS cache tweak and I'm planning to employ it as a short-term small win to alleviate some of the 'pain'. Reducing the timeout from 30 minutes to one minute will decrease the probability of end-user impact. However, there is still a significant probability that a group of users will not be helped.
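
For anyone wanting to try the cache tweak mentioned above, here is a minimal sketch that generates a .reg file for it. It assumes the value names described in KB 263558 as I recall them (`DnsCacheTimeout` and `ServerInfoTimeOut`, both DWORDs in seconds under the current user's Internet Settings key), so please verify them against the article before rolling anything out. The `build_reg_file` helper is just an illustrative name.

```python
# Registry key assumed from KB 263558; verify before deploying.
REG_KEY = (r"HKEY_CURRENT_USER\Software\Microsoft\Windows"
           r"\CurrentVersion\Internet Settings")


def build_reg_file(timeout_seconds: int) -> str:
    """Build .reg file text that sets both IE cache timeouts (in seconds)."""
    if not 0 <= timeout_seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must fit in a 32-bit DWORD")
    dword = f"{timeout_seconds:08x}"  # .reg files expect 8 hex digits
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{REG_KEY}]",
        f'"DnsCacheTimeout"=dword:{dword}',
        f'"ServerInfoTimeOut"=dword:{dword}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # One minute, as discussed in the reply above.
    print(build_reg_file(60))
```

Even with this applied, a user can still hold a stale address for up to the configured timeout plus whatever the system resolver cache and the wide IP TTL add on top.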