Scaling Stateful Network Devices

One of the premises of #SDN and #cloud scalability is that it's easy to simply replicate services - whether they be application or network focused - and distribute traffic across them to scale infinitely.

In theory, this is absolutely the case. In theory, one can continue to add capacity to any layer of the data center and simply distribute requests across the layer to scale out as necessary.

Where reality puts a big old roadblock in the way is when services are stateful. This is the case with many applications - much to the chagrin of cloud and REST purists, by the way - and it is also true with a significant number of network devices. Unfortunately, it is often these devices that proponents of network virtualization target without offering a clear path to addressing the challenges inherent in scaling stateful network devices.

SDN's claims to supporting load balancing, at least at layer 4, are almost certainly based on traditional, dumb layer 4 load balancing. We use the term "dumb" to simply mean that it doesn't care about the payload or the application or anything else other than its destination port and service and does not participate in the flow. In most layer 4 load balancing scenarios for which this is the case, the only time the load balancer examines the traffic is when processing a new connection. The load balancer may buffer enough packets to determine some basic networking details - source and destination IP and TCP ports - and then it establishes a connection between the client and the server. From this point on, generally speaking, the load balancer assumes the role of a simple forwarder. Subsequent packets with the same pattern are simply forwarded on to the destination.
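To make the "decide once, then forward" behavior concrete, here is a minimal sketch in Python. The class name, the round robin selection, and the addresses are all assumptions made for illustration; nothing here describes a particular product.

```python
from itertools import cycle
from typing import Dict, List, Tuple

# (source IP, source port, destination IP, destination port)
FlowKey = Tuple[str, int, str, int]


class SimpleL4Balancer:
    """Hypothetical 'dumb' layer 4 balancer: decide once, then just forward."""

    def __init__(self, backends: List[str]) -> None:
        self._next_backend = cycle(backends)  # round robin is an assumption
        self.flow_table: Dict[FlowKey, str] = {}
        self.decisions = 0  # number of balancing decisions made so far

    def forward(self, flow: FlowKey) -> str:
        backend = self.flow_table.get(flow)
        if backend is None:
            # New connection: the only point at which a decision is made.
            backend = next(self._next_backend)
            self.flow_table[flow] = backend
            self.decisions += 1
        # Known pattern: forwarded without ever looking at the payload.
        return backend


lb = SimpleL4Balancer(["10.0.0.11", "10.0.0.12"])
flow = ("198.51.100.7", 40312, "203.0.113.10", 80)
first = lb.forward(flow)   # new connection, a backend is chosen
again = lb.forward(flow)   # same pattern, simply forwarded
```

Only the first packet of a connection triggers a decision. Everything after that is a table lookup keyed on addresses and ports, which is why this style of load balancing maps so easily onto a forwarding table.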

If you think about it, this is so close to the behavior described by an SDN-enabled network as to be virtually the same. In an SDN-enabled network, a new flow (session if you will, in the load balancing vernacular) would be directed to the SDN controller for processing. The SDN controller would determine its destination and inform the appropriate network components of that decision. Subsequent packets with the same pattern would be forwarded on to the destination according to the information in the FIB (Forwarding Information Base). As the load balancing service was scaled out, inevitably packets would be distributed to components lacking an entry in the FIB. Said components would query the controller, which would simply return the appropriate entry to the device.
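The controller interaction described above can be sketched the same way. This is a toy model under stated assumptions: the Controller and Switch classes, the lookup method, and the round robin choice of destination are hypothetical and do not correspond to any real SDN controller API.

```python
from typing import Dict, List, Tuple

# (source IP, source port, destination IP, destination port)
FlowKey = Tuple[str, int, str, int]


class Controller:
    """Hypothetical central controller that owns every forwarding decision."""

    def __init__(self, destinations: List[str]) -> None:
        self.destinations = destinations
        self.decisions: Dict[FlowKey, str] = {}
        self.queries = 0  # how often any component had to ask

    def lookup(self, flow: FlowKey) -> str:
        self.queries += 1
        if flow not in self.decisions:
            # First time the flow is seen anywhere: pick a destination.
            index = len(self.decisions) % len(self.destinations)
            self.decisions[flow] = self.destinations[index]
        return self.decisions[flow]


class Switch:
    """Hypothetical network component holding a local FIB."""

    def __init__(self, controller: Controller) -> None:
        self.controller = controller
        self.fib: Dict[FlowKey, str] = {}

    def forward(self, flow: FlowKey) -> str:
        if flow not in self.fib:
            # FIB miss: query the controller and install the returned entry.
            self.fib[flow] = self.controller.lookup(flow)
        return self.fib[flow]


controller = Controller(["10.0.0.11", "10.0.0.12"])
switch_a, switch_b = Switch(controller), Switch(controller)
flow = ("198.51.100.7", 40312, "203.0.113.10", 80)
via_a = switch_a.forward(flow)  # miss on A, the controller decides
via_b = switch_b.forward(flow)  # miss on B, the controller returns the same entry
```

Both components end up forwarding the flow to the same destination, at the cost of one controller query per component that sees it.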

In such a way, simple layer 4 load balancing can be achieved via SDN*.

However, the behavior of the layer 4 load balancing service described is stateless. It does not actively manage the flow. Aside from the initial inspection and routing decision, the load balancing service is actually just a bump in the wire, forwarding packets much in the same manner as any other switch in the network.

But what happens when the load balancing service is actively participating in the flow, i.e. when it is stateful?

Scaling Stateful Devices

Stateful devices are those that actively manage a flow. That is, they may inspect, manipulate, or otherwise interact with flows in real-time. These devices are often used for security - both ingress and egress - as well as acceleration and optimization of application exchanges. They are also used for content transformation purposes, such as XML or SOA gateways, API management, and other application-focused scenarios. The most common use of stateful devices is persistent load balancing, aka sticky sessions, aka server affinity. Persistent load balancing requires that the load balancing service (or device) maintain a mapping of user to application instance (or server, in traditional, non-virtualized environments). This mapping is unique to the device, and without it a wide variety of applications break when scaled - VDI being the most recent example of an application relying on persistence of sessions.
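A small sketch shows why a persistence mapping that is unique to the device matters once the service is scaled out. The StickyBalancer class, the user names, and the server names are hypothetical, and the way a server is chosen for a new user is an assumption made only to keep the example deterministic.

```python
from typing import Dict, List


class StickyBalancer:
    """Hypothetical persistent load balancer with a device-local mapping."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = servers
        # user -> application instance, known only to this device
        self.persistence: Dict[str, str] = {}

    def route(self, user: str) -> str:
        if user not in self.persistence:
            # New user: assign the next server in order (an assumption).
            index = len(self.persistence) % len(self.servers)
            self.persistence[user] = self.servers[index]
        return self.persistence[user]


servers = ["app-1", "app-2"]
lb_a = StickyBalancer(servers)
lb_b = StickyBalancer(servers)

alice_on_a = lb_a.route("alice")  # lb_a pins alice to app-1
lb_b.route("bob")                 # lb_b has meanwhile pinned bob to app-1
alice_on_b = lb_b.route("alice")  # lb_b knows nothing of lb_a's mapping
```

Here the same user lands on app-1 through one device and app-2 through the other. For an application that keeps session state on the server, that is a broken session.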

In all these cases, however, one thing is true: the device providing the service is an active participant. The device maintains service-specific information regarding a variety of variables including the user, the device, the traffic, the application, the data. The entire context of the session is often maintained by one or more devices along the traffic chain.

What that means is that, as with stateful, shared-nothing applications, it matters to which device a specific request is directed. The same model used at layer 4 and below, in which a central controller (or really a bank of controllers) maintains this information and doles it out on demand, could certainly be applied. The result, however, is that depending on the distribution algorithm used, every stateful device would end up with the same flows installed. In the interim, the network is frantically applying optimization and acceleration policies to traffic that may be offset by the latency introduced by the need to query the controller for session state information, resulting in a net loss of performance experienced by the end-user.
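The effect of the distribution algorithm can be sketched with a toy simulation. It assumes requests are spread across devices strictly round robin with no regard for which flow they belong to, and that a device fetches session state from the controller whenever it sees a flow for the first time. The simulate function and the flow names are hypothetical.

```python
from typing import List, Set, Tuple


def simulate(num_devices: int, flows: List[str],
             rounds: int) -> Tuple[List[Set[str]], int]:
    """Spread requests round robin and count controller state queries."""
    installed: List[Set[str]] = [set() for _ in range(num_devices)]
    controller_queries = 0
    request_number = 0
    for _ in range(rounds):
        for flow in flows:
            # The distribution ignores the flow entirely.
            device = request_number % num_devices
            request_number += 1
            if flow not in installed[device]:
                # Device has no state for this flow: ask the controller.
                controller_queries += 1
                installed[device].add(flow)
    return installed, controller_queries


installed, queries = simulate(num_devices=3,
                              flows=["f1", "f2", "f3", "f4"],
                              rounds=3)
```

With three devices, four flows and three requests per flow, every one of the twelve requests arrives at a device that has not yet seen its flow, so each one costs a controller query, and all three devices end up holding state for all four flows. A different distribution algorithm would give different numbers; the point is only that nothing in the model keeps a flow on the device that already holds its state.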

And we're not even considering the impact of secured traffic on such a model, where any device needing to make decisions on such traffic must have access to the certificates and keys used to encrypt the traffic in order to decrypt, examine, and usually re-encrypt it. Stateful network devices - application delivery controllers, intrusion prevention and detection systems, secure gateways, etc. - are often required to manage secured content, which means distributing and managing certificates and keys across what may be an ever-expanding set of network devices.

The reality is that stateful network devices are a necessary and integral component of not just networks but applications today. While modern network architectures like SDN bring much needed improvements to provisioning and management of large scale networks, their scaling models are based on the premise of stateless, relatively simple devices not actively participating in flows. For those devices that rely upon deep participation in the flow, this model introduces a variety of challenges. Short of new protocols capable of carrying that state persistently throughout the lifetime of a session, those challenges may not find a solution that fits well with SDN without compromising on performance.

* This does not address the issue of the resources required to maintain said forwarding tables in a given device, which, given the current capacity of commoditized switches supported for such a role, seems unlikely to be realistically achieved.

Published Jan 28, 2013

Version 1.0