Forum Discussion

OracleGuru_6934
Feb 23, 2009

F5 Big IP for App Failover between Oracle Data Guarded Clusters

I've got a situation where the loss of a RAC cluster requires the restart of all application servers to repoint them to the Oracle Data Guard failover cluster. In this case there are hundreds of application servers, all with connection pools. Failing over the Oracle Database 10gR2 RAC cluster to the standby takes only minutes before it is fully available, and could be quicker. The application servers, however, don't start returning to online status for at least 1 1/2 hours and as long as three hours, because they must be completely shut down and manually restarted (I know). My objective is an architecture where the connection pools are simply rerouted to the Data Guard failover cluster, which is available within minutes, and if fast-start failover is used, virtually immediately. F5 BIG-IP equipment is used in the DC, and I'm looking for architecture/configuration suggestions on how best to approach this so that application failover is seamless between a failed primary cluster and the Data Guard standby cluster that assumes the primary role on failover. This is a new architecture we're moving towards, so anything is basically on the table.

 

 

TIA

 

 

Bill

9 Replies

  • Hi Bill,

     

     

    I thought RAC should provide seamless failover between cluster nodes. Are you trying to handle the scenario where the complete cluster goes down?

     

     

    It would be good if someone here with more experience with Oracle commented as well, but in concept, I think you could configure a single pool with two groups of members. You could use priority group activation to ensure application requests go to the higher priority group first; if that group isn't available, the lower priority members would be used. You could configure an Oracle DB monitor to check whether the pool members are up.
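    To make that concrete, something like the following tmsh sketch could be a starting point. The pool name, node addresses, and port are all hypothetical, and it assumes an Oracle database monitor (my_oracle_monitor) has already been defined with the appropriate credentials and test query, so treat it as an illustration of priority group activation rather than a tested config:

        # Pool with two priority groups: the primary RAC nodes (priority 10)
        # are preferred; the standby cluster nodes (priority 5) are only used
        # when fewer than one member of the higher priority group is available.
        tmsh create ltm pool oracle_dg_pool \
            min-active-members 1 \
            monitor my_oracle_monitor \
            members add { \
                10.10.1.11:1521 { priority-group 10 } \
                10.10.1.12:1521 { priority-group 10 } \
                10.20.1.11:1521 { priority-group 5 } \
                10.20.1.12:1521 { priority-group 5 } }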

     

     

    If you want LTM to send a reset to the client if the selected pool member goes down, you can configure this using the 'Action On Service Down' option in the pool properties.

     

     

    Aaron
  • Hi Aaron,

     

     

    You're right, the failover of nodes in a RAC 10g and later cluster is seamless. The ability of the app to seamlessly fail over from node to node, depending on how its connectivity is configured, may be another matter, but my question is purely focused on cluster-to-cluster failover. Consider two RAC clusters local to each other, one the primary, the other a local standby, i.e. in the same data center, plus a third cluster in a remote data center. Oracle's Data Guard will handle the failover of the Oracle RAC cluster to either the local standby or the remote standby. The preferred failover target is the local standby, backed up by the remote standby. My challenge is how to seamlessly repoint the 100+ application servers from the primary cluster to the local standby and back again without having to do any type of restart of the app servers, or bounce of the connection pools on the app servers. I've used the F5 BIG-IP and Cisco Catalyst for load balancing of app server connections to my cluster nodes when a thin Java client was in place, but this is different in that their current application, as it exists today, can't manage the persistence or caching of transactional data. As a result, in the event of any complete cluster failure, they are shutting down every app server and manually repointing them after the standby database cluster is up and running. We could use FAN and FCF and an application API to dynamically manage the failover, but their app won't support it. Hence, I'm looking outside the box at alternatives to enable a "rapid application failover" that can execute in minutes without dropping the connection pools, allowing the app servers to "simply" fail over to the standby database cluster as the primary database cluster fails over to that same standby cluster.

     

     

    It's not relevant in a remote DR scenario, as a complete duplicate set of app servers exists and will have to be brought up anyway in the event of a primary "site" failure.

     

     

    Hope this helps clarify what I'm trying to architect.

     

     

    Again, thanks to anyone who can shed some insight on how to configure the F5's to support this.

     

     

    Bill
  • Hi Bill,

     

     

    Thanks for the explanation. I'm not an Oracle expert by any means, so that was useful.

     

     

    If the app client was configured to open connections only to the LTM VIP, LTM could select a new cluster if the first (primary) cluster went down. This would provide near-seamless resilience at the TCP layer. But do the other clusters have real-time mirroring of the first cluster's data? If so, it seems like the failover between clusters could be seamless at both the TCP and application layers.
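    As a rough illustration of that idea (the virtual server name and VIP address below are made up, and the pool is the hypothetical priority-group pool sketched earlier in the thread), the app clients would point their connect descriptors at a single virtual server rather than at any individual database node:

        # Virtual server on the Oracle listener port; clients connect only here.
        # LTM picks a pool member behind it based on monitor status and priority.
        tmsh create ltm virtual oracle_dg_vs \
            destination 10.10.0.100:1521 \
            ip-protocol tcp \
            profiles add { tcp } \
            pool oracle_dg_pool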

     

     

    If the data isn't mirrored between clusters, what would the app client need to get in response from the server end (either LTM or the database) to tell it to restart its session because the existing cluster died? Would a TCP reset suffice? Or a SQL level message? Or something entirely different?

     

     

    Aaron
  • Mike_Schrock_61
    Historic F5 Account
    Hi Bill,

     

     

    We would like to work directly with you on this through F5's Oracle Product Management Engineering and Solution Engineering teams. We need to fully understand your needs, as we have seen similar requests and do not have a public solution yet.

     

     

    Will you please email Randy Cleveland (r.cleveland@f5.com), our Director of Solution Engineering, and me (m.schrock@f5.com)?

     

     

    Thanks,

     

    Mike Schrock

     

    F5 Oracle Alliance and Solution Engineering Manager

     

     

     

     

  • Mike_Schrock_61
    Historic F5 Account
    Absolutely, that is the intent of reaching out to work more directly with you. We will share all solutions with others, including here in this forum.
  • Hi Mike, I will do that when I get back to Dallas from my client site later this evening or tomorrow (depending on how late American is this week).

     

     

    Bill
  • Aaron,

     

     

    I'll try to put it into an architecture diagram and solution once we have it figured out, and make it available.

     

     

    Bill
  • Our application uses ODP.NET and connection pooling. If Oracle FAN is not configured, will a system with F5 work as a failover tool?

     

     

    Thanks,

     

    Yibin
  • Chris_Akker_129
    Historic F5 Account
    Yes. Reading the ODP.NET specs, the connection pool would be made to a BIG-IP virtual server, and you can control the timeout of those pooled connections with a TCP profile on the virtual server. So, for example, if you want the connections held open for an hour, you would create a TCP profile with a timeout value of 3600 seconds and apply it to the client side of the virtual server.

    Now for the database servers, which are in a pool, there are several options for what happens when a pool member goes down. The setting is called "Action on Service Down", and the default is to do nothing, so you would want to change it to "Reject" so the BIG-IP resets all client-side connections immediately. This is done in the UI under the pool's Advanced properties in version 10. The connections that were attached to the failed server then get instant notice that it is unavailable, so they retry their TCP connection, and the BIG-IP automatically selects another database server, giving you the quick failover/reconnect you are looking for. You could also try the Reselect option; depending on your application, that may work.

    More info on the options (with a rough tmsh sketch after the descriptions):

     

     

    Action on Service Down

     

    Specifies how the system should respond when the target pool member becomes unavailable. The default is None.

     

     

    • None: Specifies that the system does not select a different node. Selecting None causes the system to send traffic to the node even if it is down, until the next health check is done.

     

     

    • Reject: Specifies that the system sends an RST or ICMP message.

     

     

    • Drop: Specifies that the system simply cleans up the connection.

     

     

    • Reselect: Specifies that the system selects a different node. Selecting Reselect causes the system to send traffic to a different node after receiving the message that the original node is down.
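
    Putting these settings together, a rough tmsh sketch might look like the following. The profile, virtual server, and pool names are carried over from the hypothetical examples earlier in the thread, the one-hour timeout is just the example value mentioned above, and it assumes tmsh's "reset" value corresponds to the GUI's "Reject" setting:

        # TCP profile that holds idle client-side connections open for an hour
        tmsh create ltm profile tcp oracle_tcp_1hr defaults-from tcp idle-timeout 3600

        # Use the new profile on the client side and the default tcp profile on the server side
        tmsh modify ltm virtual oracle_dg_vs profiles replace-all-with { \
            oracle_tcp_1hr { context clientside } \
            tcp { context serverside } }

        # Reset client connections as soon as their pool member is marked down
        tmsh modify ltm pool oracle_dg_pool service-down-action reset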