Thus far in our article series about running BIG-IP in EC2, we’ve talked about some VPC/EC2 routing and network concepts, and we walked through the basics of running and licensing BIG-IP in this environment.  It’s time now to discuss some more advanced topologies that will provide highly redundant and highly available network services for your applications.

As we touched upon briefly in our last article, failover between BIG-IP devices has typically relied upon L2 networking protocols to reach sub-second failover times. We’ve also hinted over this series of articles as to how your applications might need to change as they move to AWS.  We recognize that while some applications will see the benefit of a rewrite, and will perhaps place fewer requirements on the network for failover, other applications will continue to require stateful mechanisms from the network in order to be highly available.

Below we will walk through 3 different topologies with BIG-IP that may make sense for your particular needs.  We leave a 4th, auto-scaling of BIG-IP (released in version 12.0), for a future article.  Each of the topologies we list has drawbacks and benefits, which may make them more or less useful given your tenancy models, SLAs, and orchestration capabilities.

 

Availability Zones

We've mentioned them before, but when discussing application availability in AWS, it would be negligent to skip over the concept of Availability Zones. At a high level, these are co-located but physically isolated datacenters (separate power, networking, etc.) in which EC2 instances are provisioned. For a more detailed and accurate description, see the official AWS docs:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html

 

Because availability zones are geographically close, the latency between them is very low (2~3 ms).  As a result, they can be treated as one logical data center (latency is low enough for DB tier communication).  AWS recommends deploying services across at least two AZs for high availability.  To distribute services across geographical areas, you can of course leverage AWS Regions, with all the caveats that geographically dispersed datacenters present on the application or database tiers.

 

Let's get down to it, and examine our first model for deploying BIG-IP in a highly available fashion in AWS.  Our first approach will be very simple: deploy BIG-IP within a single zone in a clustered model.  This maps easily to the traditional network environment approach using Device Service Clusters (DSC) that we are used to seeing with BIG-IP.

 

Note: in the following diagrams we have provided detailed IP and subnet annotations.  These are provided for clarity and completeness, but are by no means the only way you may set up your network.  In many cases, we recommend dynamically assigning IP addresses via automation, rather than fixing IP addresses to specific values (this is what the cloud is all about).  We will typically use IP addresses in range 10.0.0.0/255.255.255.0 for the first subnet, 10.0.1.0/255.255.255.0 in the second subnet, and so on. 100.x.x.x/255.0.0.0 denotes publicly routable IPs (either Elastic IPs or Public IPs in AWS).  
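To make the addressing convention concrete, here is a minimal sketch using Python's standard `ipaddress` module. It assumes the per-subnet ranges are /24 networks (netmask 255.255.255.0) carved out of a 10.0.0.0/16 VPC; the /16 VPC range is our assumption for illustration and is not taken from the diagrams.

```python
import ipaddress
from itertools import islice

# Assumed VPC range, used only for illustration.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Take the first two /24 subnets of the VPC: one per subnet in the diagrams.
subnets = list(islice(vpc.subnets(new_prefix=24), 2))

for net in subnets:
    # Print each subnet in address/netmask form.
    print(f"{net.network_address}/{net.netmask}")
```

In an automated deployment you would likely let your tooling pick addresses inside these ranges instead of hard-coding them.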

 

 

Option 1:  HA Cluster in a single AZ

 

 

Benefits:
  • Traditional HA. If a BIG-IP fails, service is "preserved".
Tradeoffs:
  • No HA across Datacenters/AZs. As with a single DC deployment, if the AZ in which your architecture is deployed goes down, the entire service goes down.
HA Summary:
  • Single device failure = heartbeat timeout (approx. 3 sec)  + API call (7-12 sec)
  • AZ failure = entire deployment goes down
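The single-device failure figure above is simply the sum of its two parts. A small sketch of that arithmetic follows; the function and constant names are hypothetical, and the numbers are the approximate values quoted in the summary, not measurements.

```python
# Approximate values quoted in the HA summary above.
HEARTBEAT_TIMEOUT_S = 3.0        # time to detect the failed peer
API_CALL_RANGE_S = (7.0, 12.0)   # time for the AWS API call to take effect

def failover_window(heartbeat_s=HEARTBEAT_TIMEOUT_S, api_range_s=API_CALL_RANGE_S):
    """Return the (best, worst) estimated disruption in seconds."""
    low, high = api_range_s
    return heartbeat_s + low, heartbeat_s + high

best, worst = failover_window()
print(f"Estimated failover window: {best:.0f}-{worst:.0f} s")
```

With the values above, this gives a window of roughly 10 to 15 seconds, which is worth comparing against the sub-second failover that L2-based clustering can achieve in a traditional datacenter.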
 
As mentioned, this approach provides the closest analogue to a traditional BIG-IP deployment in a datacenter.  Because we don’t see the benefits of AWS availability zones in this deployment, this architecture might make most sense when your AWS deployment acts as a disaster recovery site.
A question when examining this architecture might be: “What if we put a cluster in each AZ?”
 

Option 2:  Clusters/HA pair in each AZ

 

Benefits:

  • Smallest service impact for either a device failure or an AZ failure.

  • Shared DB backend but still provides DC/AZ redundancy

  • Similar to a multiple DC deployment, generally provides Active/Active capacity.

Tradeoffs:

  • Cost: both pairs are located in a single region. Pairs are traditionally reserved for "geo/region" availability

  • Extra dependency and cost of DNS/GSLB.

  • Management overhead of maintaining configurations and policies of two separate systems (although this problem might be easily handled via orchestration).  

HA Summary:

  • Single device failure = heartbeat timeout (approx. 3 sec)  + API call (7-12 sec) for 1/2 Traffic

  • AZ failure = DNS/GSLB timeout for 1/2 traffic

 

The above model provides a very high level of redundancy.  For this reason, it seems to make most sense when incorporated into shared-service or multi-tenant models.  The model also raises the question: can we continue to scale out across AZs, and can we do so for applications that do not require that the ADC manage state (e.g. no sticky sessions)?  This leads us to our next approach.

 

Option 3: Standalones in each AZ

 

Benefits:

  • Cost

  • Leverage availability zone concepts

  • Similar to a multiple DC deployment, Active/Active generally adds capacity.

  • Easiest to scale

Tradeoffs:

  • Management overhead of maintaining configuration and policies across two or more separate systems; application state is not shared across systems within a geo/region.

  • Requires DNS/GSLB even though not necessarily "geo-region" HA.

  • Best suited for inbound traffic

    • For outbound use case: you have the distributed gateway issue (i.e. who will be the gateway, how will device/instance failure be handled, etc.)

    • SNAT required (return traffic needs to return to originating device).

    • For Internal LB model: DNS required to distribute traffic between each AZ VIP.

HA Summary:

  • Single device failure = DNS/GSLB timeout for 1/(N Devices) traffic

  • AZ failure = DNS/GSLB timeout for 1/(N Devices) traffic
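To illustrate the 1/(N Devices) figure, the sketch below computes the share of traffic affected by a single failure. It assumes one standalone device per AZ and that DNS/GSLB spreads traffic evenly across devices; both are simplifying assumptions, and the function name is hypothetical.

```python
from fractions import Fraction

def affected_share(n_devices: int) -> Fraction:
    """Share of traffic hit by one device (or AZ) failure, assuming even distribution."""
    if n_devices < 1:
        raise ValueError("need at least one device")
    return Fraction(1, n_devices)

for n in (2, 3, 4):
    share = affected_share(n)
    print(f"{n} devices: {share} of traffic ({float(share):.0%})")
```

The point is that the impact of any single failure shrinks as you add devices, although the affected clients still wait for the DNS/GSLB timeout.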

 

One of the common themes between options 2 and 3 is that orchestration is required to manage the configuration across devices.  In general, the problem is that the network objects (which are bound to layer 3 addresses) cannot be shared due to differing underlying subnets.
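As a rough sketch of what such orchestration has to do, the example below takes one logical service definition and renders a per-AZ virtual server address from each AZ's subnet. Everything here is hypothetical: the AZ names and subnets, the `host_offset` convention, and the output dictionary, which is not BIG-IP configuration syntax.

```python
import ipaddress

# Hypothetical per-AZ subnets; each device lives in a different subnet.
AZ_SUBNETS = {
    "us-east-1a": "10.0.0.0/24",
    "us-east-1b": "10.0.1.0/24",
}

# One logical service definition shared by all devices (hypothetical fields).
SERVICE = {"name": "app_vs", "port": 443, "host_offset": 100}

def render_virtual_servers(service, az_subnets):
    """Derive a per-AZ virtual server address from the shared definition."""
    rendered = {}
    for az, cidr in az_subnets.items():
        net = ipaddress.ip_network(cidr)
        vip = net.network_address + service["host_offset"]
        if vip not in net:
            raise ValueError(f"offset {service['host_offset']} does not fit in {cidr}")
        rendered[az] = {
            "name": service["name"],
            "destination": f"{vip}:{service['port']}",
        }
    return rendered

for az, vs in render_virtual_servers(SERVICE, AZ_SUBNETS).items():
    print(az, vs["destination"])
```

The policy is defined once, while the layer 3 address is derived per subnet, which is why the same object cannot simply be synchronized between devices in different AZs.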

 

Summary:  

We have discussed a number of options for deploying BIG-IP in highly available or horizontally scaled models.  The path you take will depend on your application needs.  For example, if you have an application that requires persistent connections, you'll want one of the architectures that uses device clustering and an Active/Standby approach.  If persistence is managed within your application, you might try one of the horizontally scalable models.
 
Some of the deployment models we discussed are better enabled by the use of configuration management tools to manage the configuration objects across multiple BIG-IPs.  In the next article we'll walk through how the lifecycle of BIG-IP and network services can be fully automated using open-source tools in AWS.  These examples will show the power of using the iControl SOAP and iControl REST APIs to automate your network.