
Data centers continue to hold the attention of the innovators and practitioners who build them. With the ever-growing need for more capacity, whether for cloud offerings or enterprise data centers, the need to support both massive scale and maintainability has never been more important. Operators face the ever-present need to build infrastructure that scales; however, the ability to configure, maintain, troubleshoot and resolve runtime problems is just as important as the ability to build.
One recent development in some large-scale data center networks is the focus on building generic, scalable fabrics based on Clos topologies. This type of underlay network allows data centers to scale using battle-tested and simple-to-apply principles. A nice outcome of this model is that, in addition to achieving massive scale, it can also help provide vendor independence for operators. Another decision made by some of these large data center builders is to force simplicity into the protocol layer of the network fabric as well. That decision has led to the wide use of BGP and, in some cases, no underlying IGP at all. It is this second aspect of large-scale data centers that is of interest here.
Why? Well, for years, I recall many architecture and design discussions that sought to add more capability to the network and to improve its functionality. We had IGPs working, ran overlay technologies such as MPLS/VPN, needed QoS, and pushed a raft of improvements into the relevant protocols to round it all out, ironing out the issues we faced by forcing additional complexity into the network.
As we hit the 2010s, a shift was well underway in which we pushed much of the complexity into the overlay network. This left the underlay network relatively unencumbered, providing an opportunity to simplify the design. Even more recent improvements, such as pushing overlay technology purely into the software realm, have also led to much faster progression of overlay capabilities, detaching them from the slow and cumbersome process of upgrading and updating network hardware.
In the most extreme examples, one can find network underlays that run nothing more than a single routing protocol: BGP. It may be accompanied by a way to manage the device, such as SSH and NETCONF, and by LLDP for the needed neighbour discovery. In that model, no IGP is needed or even desired, with the trade-off made in favor of simplicity. This very basic model does lose some capabilities that you get with an IGP; however, in the world of cloud, many of those capabilities aren’t really needed for the underlay, and the larger goals of simplicity and stability are much more important than finesse and speed.
In this extreme example, because BGP was not originally intended for the lone purpose of routing internal traffic with no underlying traditional IGP, we are left with a few gaps in efficiency. This is where Link-State Vector Routing (LSVR) comes in. What is LSVR? Well, it is BGP with some modifications that change how routes are calculated. Anyone using BGP in its existing modes of operation would quickly become familiar with LSVR. From a configuration and behavior perspective, it is BGP. In that world, we keep the simplicity of running as few protocols as possible on the underlay network, but also gain the efficiency of link-state route calculation on the underlay that is normally supplied by a full-fledged IGP. The objective is not to reach the same level of protocol efficiency as an IGP, but to maintain the much more valuable and hard-to-achieve goal of operational simplicity, with added efficiency.
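To make "how routes are calculated" a little more concrete: in the IETF LSVR work, each node runs a shortest-path-first (SPF) computation over link-state information carried in BGP, rather than relying only on BGP's usual best-path selection. The sketch below is only an illustration of that SPF step, not LSVR's actual implementation. The four-node leaf-spine topology, the node names, the link costs, and the function name `shortest_paths` are all invented for this example, and equal-cost multipath is deliberately left out.

```python
import heapq

# Hypothetical leaf-spine links: (node_a, node_b, cost). Illustrative values only.
LINKS = [
    ("leaf1", "spine1", 1),
    ("leaf1", "spine2", 1),
    ("leaf2", "spine1", 1),
    ("leaf2", "spine2", 5),  # pretend this link is less preferred
]


def shortest_paths(links, source):
    """Dijkstra SPF: return {node: (total_cost, next_hop)} as seen from source."""
    # Build an adjacency map from the undirected link list.
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: (0, None)}
    queue = [(0, source, None)]  # (cost so far, node, first hop from source)
    while queue:
        cost, node, first_hop = heapq.heappop(queue)
        if cost > best[node][0]:
            continue  # stale queue entry; a cheaper path was already found
        for neighbor, link_cost in graph.get(node, []):
            new_cost = cost + link_cost
            # The next hop is the first node after the source on this path.
            hop = neighbor if node == source else first_hop
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(queue, (new_cost, neighbor, hop))
    return best


routes = shortest_paths(LINKS, "leaf1")
for node, (cost, next_hop) in sorted(routes.items()):
    print(node, cost, next_hop)
```

With these made-up costs, leaf1 reaches leaf2 through spine1 at a total cost of 2, because the path through spine2 would cost 6. The point is only that every node computes its routes from one shared view of the topology.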
The fewer things that need to be configured, updated and maintained, the better off an operational environment will be. There are fewer things to put into configuration models, fewer hurdles in integrating multiple vendors (only one routing protocol) and fewer protocols to secure and troubleshoot. Another benefit of this approach is that it uses a familiar protocol, BGP, which we need to use anyway for exterior routing. Teams can onboard this new functionality with little new complexity to learn, apart from the organization's architects, who often need and want to understand the nuts and bolts of any protocol. The vast majority of the operational teams, however, can continue to use the same simple model, with under-the-hood improvements that allow them to keep things simple. And simple, for operators, is really the most important goal, as it leads to lower costs and greater stability.
What is LSVR?
