Increase Security in AWS without Rearchitecting your Applications - Part 4: Thursday

Welcome back! On Tuesday we discussed the security concerns we are facing, the different architecture patterns, and why using F5 SSL Orchestrator with AWS Gateway Load Balancer can solve a multi-dimensional security problem. On Wednesday morning we reviewed the configuration items that enable SSLO to address the different security needs, followed by an afternoon investigation into how the AWS objects are structured to enable the solution. Today we will look at scalability, resiliency, common troubleshooting scenarios, and a brief discussion of how to deploy BIG-IP.

Reviewing the End-to-End Solution

When we look at the end-to-end solution we see a complex pattern composed of Application (protected) VPCs, a Security VPC, endpoints, and objects. Yesterday we went through this architecture in depth.

 

Scaling The System

Instance Types

In AWS you can find an array of instance types, some compute or memory optimized, built on different generations of hypervisors. Examples would be an m5.2xlarge (general purpose) or a c6i.4xlarge (compute optimized), representing 5th and 6th generation options with different vCPU counts. My general recommendation is to lean toward the M instances and nothing older than the 5th generation. The other aspect of the instances is how they behave on the network. If you review the AWS documentation you will see that smaller instances are labeled with "up to" bandwidth numbers while larger instances have dedicated bandwidth numbers. This is extremely important in the SSLO use case because these numbers reflect how much bandwidth an instance can send in bursts and at sustained levels. If we use a simple example of a server responding with a 1 Gbps stream (not common, I know, but the math is easy), and the responses are being inspected by SSLO with two security chains, we are sending 3 Gbps of traffic "off" the instance:

 

out-to-sec-1: 1 Gbps
out-to-sec-2: 1 Gbps
out-to-client: 1 Gbps
Total: 3 Gbps

The above example shows how easy it is to end up sending a lot of traffic. So let's take a look at how an instance with an "up to" designation behaves versus one with a dedicated network allowance. In the example below, iperf was used to send traffic until the rate limiter was applied (Run 1), at which point the traffic was stopped and immediately restarted (Run 2) to see how long it would take to impact traffic again. While the graphs are the same size, the numbers along the bottom are seconds: the rate limiter asserts itself much more quickly on the second run.

If we compare that to an m6i.8xlarge you will see that the rate limiter is never applied, since that instance has dedicated network bandwidth.
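If you want to reproduce this kind of test yourself, a plain iperf3 run between two instances is enough to watch the burst allowance get consumed. This is a minimal sketch; the server address is a placeholder and the run lengths are arbitrary.

# On the receiving instance
iperf3 -s

# On the sending instance: a long run with per-second reporting makes it
# easy to spot the moment the "up to" rate limiter kicks in (repeat it
# immediately afterwards to mimic Run 2 above)
iperf3 -c 10.0.1.10 -t 600 -i 1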

For this use case I encourage you to use instances that have dedicated network bandwidth.
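You can check how AWS rates a candidate instance type's network performance before you commit to it. A quick example using the AWS CLI (the instance types listed are just examples):

# Compare the advertised network performance of a few candidate types
aws ec2 describe-instance-types \
  --instance-types m5.2xlarge m6i.8xlarge \
  --query 'InstanceTypes[].[InstanceType,NetworkInfo.NetworkPerformance]' \
  --output table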

Security Functions 

Within the security VPC we have one or more SSL Orchestrators and their associated security appliances. I like to think of them as a disaggregated system with different considerations at the aggregate and component levels; I think of these as "blocks" in my security system. We need to approach scaling the system by looking at the entire security processing of a single SSL Orchestrator instance together with the security devices associated with that deployment.

SSL Orchestrator Scale Block

With SSLO being the first and last hop of the security processing, the metrics below are the ones of interest. We have deployed into the public cloud and pushed all the work onto the instance CPU (FPGAs are no longer available to handle complex SSL operations), including any other L7 processing that needs to happen. Eventually, you may need to add additional parallel blocks (SSLO + security services) to meet your traffic requirements.

 

Resource | Description of Usage | Scale Option
Instance CPU | Processing SSL, network traffic, protocol inspection, access policy, monitoring | Scale up to 16 vCPU, add a parallel deployment block
Instance Bandwidth | AWS instances have network characteristics based on type; all egress bandwidth counts | Scale to a larger network instance, add a parallel deployment block
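Both of these resources can be watched on a running SSL Orchestrator instance through the standard EC2 CloudWatch metrics. A sketch, with a placeholder instance ID and time window:

# CPU utilization and egress bytes for one instance, in 5-minute buckets
aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2023-04-04T00:00:00Z --end-time 2023-04-04T06:00:00Z \
  --period 300 --statistics Average

aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2023-04-04T00:00:00Z --end-time 2023-04-04T06:00:00Z \
  --period 300 --statistics Sum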

Service Chain Instance Scale Block

Instances in the security chain have limits similar to the SSL Orchestrator nodes, but the scaling options differ. You can add additional security service instances as long as the SSLO deployment in the block has spare capacity, or you can redeploy the security service on a larger instance before deploying an additional scale block.

 

Resource | Description of Usage | Scale Option 1 | Scale Option 2
Instance CPU | Processing SSL, network traffic, protocol inspection, access policy, monitoring | Scale up to 16 vCPU, add a parallel deployment block | If the SSL Orchestrator block is below capacity, add an inspection node
Instance Bandwidth | AWS instances have network characteristics based on type; all egress bandwidth counts | Scale to a larger network instance, add a parallel deployment block | If the SSL Orchestrator block is below capacity, add an inspection node
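When you add an inspection node to an existing block, the new instance ultimately shows up as another member of the pool behind the relevant SSL Orchestrator security service. In practice you would add it through the SSL Orchestrator guided configuration; the tmsh sketch below only illustrates the end state, and the pool name and member address are assumptions.

# Hypothetical pool name and member IP - the real names come from your SSLO service
tmsh modify ltm pool example_inspection_pool members add { 10.0.2.50:0 }
tmsh save sys config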

 

Understanding Resiliency

In the hardware world we think of active/standby and high availability via failover. In the cloud world we use the term resiliency, and it is accomplished by deploying N active instances horizontally. Inside the processing chain we can think of this as two layers: resiliency at the SSL Orchestrator layer and resiliency at the security device layer.

Resiliency Item | Single AZ Considerations | Multiple AZ Considerations
SSL Orchestrator | Instance protection, delete protection, N scale blocks per AZ | Repeat the single AZ pattern across 2 or more AZs
Inspection Device | Instance protection, delete protection, N scale blocks per AZ | Repeat the single AZ pattern across 2 or more AZs

If we borrow the same diagram from the scale blocks we can demonstrate how horizontal scale works across three AWS AZs. In this scenario we have the horizontal capacity during normal operations, and should an SSL Orchestrator instance need to be rebooted, traffic will simply be sent to the security processing chain of one of the other scale blocks.

All Active Deployment

When we deploy this solution each block is isolated. GWLB allows us to distribute traffic across N SSLO systems, providing horizontal scale both within and across AZs. Should AWS suffer a full AZ failure, traffic will be sent to the other deployments until the failed AZ is restored. Should you need to take an SSLO instance offline, traffic will again be redistributed. Each deployment in the resiliency pattern is an active participant in processing traffic.
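Because GWLB simply distributes flows across whatever healthy targets are registered, scaling out or recovering is largely a matter of keeping the additional SSL Orchestrator instances registered with the target group. A sketch with a placeholder target group ARN and instance IDs:

# Register SSL Orchestrator instances from multiple AZs with the GWLB target group
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/sslo-tg/0123456789abcdef \
  --targets Id=i-0aaa111bbb222ccc3 Id=i-0ddd444eee555fff6

# Confirm their health before relying on them for failover
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/sslo-tg/0123456789abcdef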

Troubleshooting Packets in the Security VPC

If you need to troubleshoot packets in the security VPC you need to understand the end-to-end flow through it. Let's look at a sample flow from the internet to a protected endpoint; the same pattern could be applied between two internal endpoints. We will use the following IP addresses to build the examples: internet client 1.1.1.100 and protected address 192.168.1.1. We will assume that all traffic is permitted.

Interface Name | Interface Function
geneve-tunnel | VLAN associated with the inner tunnel traffic
out-to-sec-chain | VLAN associated with traffic leaving SSL Orchestrator toward one or more security devices
in-from-sec-chain | VLAN associated with traffic leaving one or more security devices and entering SSL Orchestrator

Example Capture Commands

 

Captured Flow | Interface | Example Command | Notes
1.1.1.100 to 192.168.1.1 (ingress) | geneve-tunnel | tcpdump -ni geneve-tunnel src 1.1.1.100 and dst 192.168.1.1 | Captures both the in and out flow of the ingress processing
1.1.1.100 to 192.168.1.1 (ingress) | out-to-sec-chain | tcpdump -ni out-to-sec-chain src 1.1.1.100 and dst 192.168.1.1 | Captures the flow leaving SSL Orchestrator toward the security chain during ingress processing
1.1.1.100 to 192.168.1.1 (ingress) | in-from-sec-chain | tcpdump -ni in-from-sec-chain src 1.1.1.100 and dst 192.168.1.1 | Captures the flow returning from the inspection device to SSL Orchestrator during ingress processing
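If nothing appears on the geneve-tunnel interface at all, it can also help to capture on the underlying VLAN that hosts the tunnel endpoint and filter on the GENEVE port, to confirm the encapsulated packets are actually arriving from GWLB. The VLAN name below is an assumption; substitute the one in your configuration.

# Outer (encapsulated) view: is GWLB actually delivering GENEVE traffic?
tcpdump -ni tunnel-vlan udp port 6081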

Common Issues

Issue | Validate
GWLB is not sending SSL Orchestrator any packets | Did you open the security group for the SSL Orchestrator VLAN that hosts the tunnel to UDP 6081? Did you use the correct SSL Orchestrator interface in the GWLB service?
SSL Orchestrator/security device is sending packets but my security device/SSL Orchestrator does not receive them | On the security device: did you disable the SRC/DST check? Did you permit 0.0.0.0/0 in your security group?
SSL Orchestrator is not sending packets back down the tunnel | Did you create the ARP entry? Is the pool marked down? Did you disable strictness in the security policy?
TCP handshakes complete but my files do not transfer | Did you create a virtual server for non-TCP and non-UDP traffic? This is required for PMTUD. Did you set the SSL Orchestrator interface MTU to 8500?
I do not see the packets arriving on my application instance | Ensure that your security group allows the source to send traffic to the instance. Validate your routing configuration.
The TCP handshake never completes, but I see the SYN on my app instance | This is an egress routing issue; validate the return route from the application subnet.
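A few of these checks can be addressed directly from the CLI. These are sketches only: the security group ID, instance ID, CIDR, and ARP values are placeholders, and the tmsh syntax should be verified against your BIG-IP version.

# Allow GENEVE (UDP 6081) into the SSL Orchestrator tunnel interface
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol udp --port 6081 --cidr 10.0.0.0/16

# Disable the source/destination check on the inspection device
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check

# Create the static ARP entry on BIG-IP for the tunnel return path
tmsh create net arp geneve-neighbor ip-address 10.0.1.1 mac-address fa:16:3e:00:00:01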

Deployment Options

As you are aware, F5 offers many different options to deploy BIG-IP into AWS. We have example CloudFormation templates and Terraform modules, or you can deploy it manually. As you look at building your initial environment I recommend starting manually. Before automating the deployment, make sure your team has a solid understanding of the F5, AWS, and other security appliance data plane objects in the solution and how you want them applied to your environment. Deploying an F5 AMI in AWS requires that you subscribe to a BYOL-based instance (at the time of this writing we do not offer an hourly SSL Orchestrator enabled system) or upload your own image built with the F5 Image Generator.
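If you are deploying manually, you can locate the available F5 BYOL images in your region with the AWS CLI. The name filter below is an assumption; adjust it to match the marketplace listing you subscribed to.

# List recent F5 BYOL AMIs in the current region
aws ec2 describe-images --owners aws-marketplace \
  --filters "Name=name,Values=*BIGIP*BYOL*" \
  --query 'sort_by(Images,&CreationDate)[-5:].[ImageId,Name]' --output table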

 

General VPC Architecture

From my testing and experience building complex network use cases in AWS I have assembled a list of recommendations.

  1. Dedicate service subnets (and route tables) to objects like GWLB endpoints, TGW attachments, S3 endpoints, and EC2 API endpoints.
  2. To ensure proper East/West traffic insertion, create a dedicated subnet in each AZ and in all VPCs for the GWLB endpoints. My testing showed that placing the GWLB endpoints in dedicated subnets is best; sending traffic from an instance on subnet A through a GWLB endpoint on subnet A to an instance on subnet B resulted in traffic not being forwarded correctly.
  3. AWS defaults to /20 ranges when it creates subnets. This address scope is large if East/West inspection is a concern and should be decreased as appropriate.
  4. Production VPCs should not place management interfaces in a public route table.
  5. When using GWLB to inspect traffic between VPCs, think about the flow. For example, if we want to inspect traffic from VPC A to VPC B and we insert into the network at both VPCs, traffic could be inspected twice. It may make more sense to inspect at VPC A only.
  6. GWLB endpoints are AZ-level constructs. It helps to give them a human-friendly name to ensure you are using the correct one.
  7. All of your load balancers (ALB, NLB, ELB, GWLB) should have cross-zone load balancing enabled (see the sketch after this list).
  8. If you are using Transit Gateway in your topology, set it to appliance mode (also shown below).
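The last two recommendations map to simple API settings; a sketch with a placeholder load balancer ARN and attachment ID:

# Enable cross-zone load balancing on the GWLB
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/gwy/sslo-gwlb/0123456789abcdef \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

# Enable appliance mode on the security VPC's Transit Gateway attachment
aws ec2 modify-transit-gateway-vpc-attachment \
  --transit-gateway-attachment-id tgw-attach-0123456789abcdef0 \
  --options ApplianceModeSupport=enable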

Conclusion

Thank you for spending time with me over the last couple of days. Many customers are iterating on how to further inspect and secure traffic in complex multi-cloud environments, and F5 has solutions that can be leveraged across environments and in unique ways. Please let us know how we can assist you and your organization on your application security journey.

Published Apr 04, 2023
Version 1.0
