Another Long Break
It’s been a bit over a year since I last wrote an entry. The primary reason is simply that I got very busy with my day job. Part of that has included experimenting with the new Nvidia Bluefield-2 SmartNICs. I’m hoping my employer lets me write some generic-enough technical blog posts about what I’ve been doing with those NICs, so stay tuned for that. But the long and short of it is: I’ve been focused on my role at work and not on publishing anything here. Let’s fix that.
This post will lean a bit more into the opinion side of things, with technical references to help make my point. I’ve often felt that network customization should be pushed as far down the stack as it can be. Generally that means all the way to the server. I understand that makes some folks shift in their seats a bit, network engineers and systems administrators alike. But bear with me while I dig into this a bit more.
This document assumes a few things about your network:
- It’s larger than a couple of racks of servers. Think whole data centers.
- You’ve already migrated your awful L2 spanned network to an L3 infrastructure with VXLAN to connect VLANs together. I’ve written tons on this topic.
- You’re using plenty of different VNIs to separate your traffic out.
- All of the configuration for the second and third bullets is done at the top-of-rack switch.
The fourth bullet in that list is what I’m referring to when I speak of network customization: per-port configuration on your top of rack (TOR) switches that define which VLANs or VRFs your servers are part of.
If we assume a generic Arista TOR, the configuration to put a port into a VLAN and enable VXLAN tunneling for it would include:
interface Ethernet21/1
   switchport access vlan 100
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 100 vni 1000100
!
BGP and EVPN
Of course if we’re deploying VXLAN at data center scale, we’d better be using EVPN as our control plane.
router bgp 65001
   !
   vlan 100
      rd 10.1.2.3:100
      route-target both 65001:100
      redistribute learned
   !
Is L3 Any Better?
What if you’re running BGP between your server and the TOR, but you need to separate that traffic into different VRFs depending on what the server is doing? Is it any better? Not really. I won’t walk through full configuration examples, but the recipe is nearly identical:
- Define the VRF and apply it to the interface
- Define the VRF to VNI mapping in the VXLAN interface
- Define the VRF’s RD and RTs in the BGP section
While I’m a huge proponent of L3 to the server, and in fact the rest of this document is going to assume it, doing so with VRFs and VXLAN at the TOR is just as much customization as VLANs are.
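For the curious, a rough sketch of those three steps on a generic Arista-style TOR might look something like the following. Every name and number here (the VRF name, interface, VNI, addressing, RD, and route targets) is invented purely for illustration:

```
vrf instance PROD
!
interface Ethernet21/1
   no switchport
   vrf PROD
   ip address 10.0.21.1/31
!
interface Vxlan1
   vxlan vrf PROD vni 1000200
!
router bgp 65001
   vrf PROD
      rd 10.1.2.3:200
      route-target both 65001:200
```

Three stanzas instead of two, and every one of them is per-server, per-port state living on the TOR.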
Customizing The Network At The Host
"But they're server folks. They don't know anything about networking!!" -- typical Network Engineer
Does that one sound familiar at all? If you’ve worked in large organizations that have split their networking and systems engineers into different groups, then you’ve absolutely heard something like that in the past. You might have even been guilty of uttering those words. Before you put your defenses up, know that I’m not blaming you or anyone else for that. It’s an unfortunate byproduct of keeping networking and server folks in organizational silos. It continues to happen, in large and small companies alike, regardless of how much friction it causes between those groups.
I can tell you from my nearly thirty(!) years in the industry that: they might not know how to do networking on their server, but they’re willing to learn. So TEACH THEM.
BGP On The Host
The first step to making this customization succeed is getting BGP running on the server. There are tomes written about this, including by myself here, here, and here. Those examples are all FreeBSD based, but the same thing applies to Linux. The key thing here is to get a simple, unnumbered BGP peer (or peers) using IPv6 link local addresses and RFC5549 to forward IPv4 prefixes with IPv6 next-hops. FRR supports that just fine, as do most commercial TORs these days. No IPv4 addressing needs to be done other than the respective loopback interfaces.
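A minimal FRR sketch of such an unnumbered session might look like this (the ASN, interface name, and loopback address are assumptions, not prescriptions):

```
router bgp 65101
 ! a single unnumbered session; FRR finds the peer via the
 ! interface's IPv6 link-local address, and IPv4 prefixes are
 ! carried with IPv6 next-hops per RFC 5549
 neighbor enp1s0 interface remote-as external
 !
 address-family ipv4 unicast
  ! the loopback is the only IPv4 address the server needs
  network 10.11.12.1/32
 exit-address-family
```

Note there isn’t a single IPv4 address on the physical interface anywhere in that configuration.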
Note: Redundant TORs
If your servers are connected to two TORs for redundancy and NOT running BGP with them, it means you’re doing redundant L2 links. In other words, you have some sort of MLAG configured between your TORs, along with an 802.3ad LAG with the server. All of this configuration completely evaporates when you run BGP on the server. No more need for the MLAG side-link between the TORs, and no more bundle configuration with the server. Think about that for a moment while you debate the merits of installing FRR on your server.
VXLAN On The Host
There are a couple of ways of configuring VXLAN on the host. You could configure individual VXLAN interfaces and bridges per VNI. Or you can combine everything into a single VXLAN interface and a single VLAN-aware bridge. The former’s configuration can get a bit out of control if you have a bunch of independent VNIs, but it’s the way I prefer to configure VXLAN on Linux. The latter is more akin to a typical TOR’s VXLAN configuration where there’s a single VXLAN interface and a VLAN-to-VNI mapping. Either way will work; it’s entirely up to you.
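For completeness, here’s a rough sketch of the single VLAN-aware-bridge approach in the same /etc/network/interfaces style used below. The VLAN IDs, VNIs, and loopback address are placeholders, and the `external` (collect-metadata) mode plus `tunnel_info` mappings are one way of doing the VLAN-to-VNI translation on a stock Linux bridge:

```
auto br0
iface br0 inet manual
    # one VLAN-aware bridge in front of a single VXLAN device
    pre-up ip link add br0 type bridge vlan_filtering 1 || true
    up ip link set br0 up
    post-down ip link del br0 || true

auto vxlan0
iface vxlan0 inet manual
    # "external" puts the device in collect-metadata mode so one
    # interface can carry many VNIs
    pre-up ip link add vxlan0 type vxlan dstport 4789 local 10.11.12.1 external || true
    pre-up ip link set vxlan0 master br0
    pre-up bridge link set dev vxlan0 vlan_tunnel on
    # map VLAN 100 -> VNI 100 and VLAN 200 -> VNI 200
    post-up bridge vlan add dev vxlan0 vid 100
    post-up bridge vlan add dev vxlan0 vid 100 tunnel_info id 100
    post-up bridge vlan add dev vxlan0 vid 200
    post-up bridge vlan add dev vxlan0 vid 200 tunnel_info id 200
    up ip link set vxlan0 up
    post-down ip link del vxlan0 || true
```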
VRFs on Linux?
Yes. They work with VXLAN as well. If you need separate VRFs on your Linux host to separate out traffic into different L3 overlays, you can do that. The configuration for the VRFs and VXLAN interfaces might look like this, assuming the usual /etc/network/interfaces file:
auto VRF01
iface VRF01 inet manual
    pre-up ip link add dev VRF01 type vrf table 1100
    up ip link set VRF01 up

auto VRF02
iface VRF02 inet manual
    pre-up ip link add dev VRF02 type vrf table 1101
    up ip link set VRF02 up

auto vx-100
iface vx-100 inet manual
    pre-up ip link add vx-100 type vxlan id 100 dstport 4789 local 10.11.12.1 dev lo:2 || true
    up ip link set vx-100 up
    down ip link set vx-100 down
    post-down ip link del vx-100 || true

auto VNI-100
iface VNI-100 inet manual
    bridge-ports vx-100
    up ip link set dev VNI-100 master VRF01

auto vx-200
iface vx-200 inet manual
    pre-up ip link add vx-200 type vxlan id 200 dstport 4789 local 10.11.12.1 dev lo:2 || true
    up ip link set vx-200 up
    down ip link set vx-200 down
    post-down ip link del vx-200 || true

auto VNI-200
iface VNI-200 inet manual
    bridge-ports vx-200
    up ip link set dev VNI-200 master VRF02
That sets up two VRFs called VRF01 and VRF02, two VXLAN interfaces vx-100 (VNI 100) and vx-200 (VNI 200), and finally two VNI bridges, VNI-100 and VNI-200.
Publishing L2 or L3 Knowledge Into EVPN
We have to tie this all together with EVPN, of course. Again, volumes have been written about using EVPN with FRR on this blog and many other places. L2 VNIs don’t need any configuration in FRR’s EVPN section unless you specifically want to call out route distinguishers and route targets. Otherwise, you can let FRR assign those for you. L3 VNIs do require VRF configuration in FRR, as well as a separate BGP sub-section per VRF.
L3 VNI Example
If we continue with our previous two-VRF configuration above, FRR might look like:
vrf VRF01
 vni 100
exit-vrf
!
vrf VRF02
 vni 200
exit-vrf
!
router bgp 65001
 address-family l2vpn evpn
  neighbor evpn activate
  advertise-all-vni
 exit-address-family
exit
!
router bgp 65001 vrf VRF01
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
 !
 address-family l2vpn evpn
  advertise ipv4 unicast
  rd 10.11.12.1:100
  route-target import 65001:100
  route-target export 65001:100
 exit-address-family
exit
!
router bgp 65001 vrf VRF02
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
 !
 address-family l2vpn evpn
  advertise ipv4 unicast
  rd 10.11.12.1:200
  route-target import 65001:200
  route-target export 65001:200
 exit-address-family
exit
Why Do This?
That’s the big question, of course. And there are actually a few answers that I’ll attempt to summarize here.
The ultimate nirvana of any network engineer should be as simple a network as he or she can build. Freedom from per-port and per-device customization will lead you to that nirvana. Following my examples above, the majority of your TORs will be BGP peering with their downstream servers and that’s it. I’ll touch upon exceptions in a bit, but for the most part, your TOR configurations are stupid-simple: IPv6 LLAs auto-assigned on each interface, and BGP peering with whatever is connected on the other side of that interface. No VLANs. No VRFs. No VXLAN or EVPN. Nothing.
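To put a picture on “stupid-simple”: on a TOR running FRR (or something Cumulus-like), the entire server-facing configuration could plausibly shrink to something like this, with the interface names and ASN invented for the example:

```
router bgp 65000
 ! one unnumbered session per server-facing port; nothing else
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 !
 address-family l2vpn evpn
  ! the TOR only passes EVPN routes along; it holds no VNIs itself
  neighbor swp1 activate
  neighbor swp2 activate
 exit-address-family
```

Adding a new server is one more `neighbor` line, not a VLAN, a VRF, an RD, and a VNI mapping.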
It does mean that guard rails need to be put up. Border routers, or border leafs, or however you want to configure them, will need to exist somewhere in your network. They’ll have to route the VLANs and VRFs that the servers are creating and using out to the rest of the world. But this isn’t done on every rack; it’s done somewhere centralized, and only by a small number (meaning two or four) of devices.
There are going to be devices that simply can’t run FRR properly, or can do BGP but can’t do VXLAN and EVPN. Firewalls, storage appliances, etc. are examples of those. For those devices, you’ll need to have some customization on the ports they connect to.
Another important exception is external customer workloads. If your servers are housing external customer workloads (e.g., VMs, containers, etc.), you’ll want to think twice about including VXLAN and EVPN on the hypervisor or container host. The challenge there is the case of a VM (or container) getting “jailbroken”, giving the attacker access to the hypervisor itself. Should they get that access, they’ll then have full access to your EVPN control plane. I have a solution for that and I hope to share it in follow-up blog entries.
Improved Experience For Server Engineers
Giving the server engineers control over their own networking along with the education on how to do so will ultimately improve their efficiency. Sure, there will be a ramp up time as they learn how to get BGP, EVPN, and VXLAN configured on their servers. But the network engineers should be right there to help them learn that. Once the server engineers have gotten past that learning curve, the sky’s the limit for what they can do on their servers without having to open tickets with the network engineers. This allows the latter to focus on scaling and improving the network without worrying about all the various port and device customization needed for each server.
Automation: They’re Better At It
Another shot across the bow of my network engineer brethren, but, from my extensive experience in the industry, I’ve found this to be a universal truth: the server engineers are vastly better at automating themselves out of a job than the network engineers are. All of the host configuration examples I’ve given can be done on the CLI using ip, bridge, and vtysh -c. Given that, they can easily be stuffed into a series of Ansible playbooks and pushed to any server (or set of servers) that need the customization.
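To illustrate just how automatable this is, the whole VRF01/vx-100 stack from earlier reduces to a handful of idempotent one-liners that drop straight into a playbook or bootstrap script (same names, table numbers, and loopback address as the earlier example; requires root):

```shell
#!/bin/sh
# Build VRF01 and its VNI by hand: VRF, VXLAN device, bridge,
# then stitch them together. "|| true" keeps re-runs harmless.
ip link add dev VRF01 type vrf table 1100 || true
ip link add vx-100 type vxlan id 100 dstport 4789 local 10.11.12.1 || true
ip link add VNI-100 type bridge || true
ip link set vx-100 master VNI-100
ip link set VNI-100 master VRF01
ip link set VRF01 up
ip link set vx-100 up
ip link set VNI-100 up
# finally, tell FRR which VNI belongs to the VRF
vtysh -c 'conf t' -c 'vrf VRF01' -c 'vni 100' -c 'exit-vrf'
```

Wrap each of those in an Ansible task (or even a loop over a host list) and the “customization” becomes a variable file instead of a switch config.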
Hopefully I’ve made it clear that I believe network customization should be pushed to the host whenever and wherever possible. With any luck I’ve convinced you of the same, but I certainly understand if those on the network side of the house are still hesitant. All I can suggest is that you at least consider the possibility and think through all the flexibility it allows.
As mentioned in the opening section, with permission, I’ll be writing about the Nvidia Bluefield-2 SmartNICs next. And a lot of what I’ve written here will play right into that. It’s almost as if… I planned it.