
Server Merge, ZFS Fun, and BGP Routing For Jails

Introduction

About five years ago, I built two new servers in my basement, as outlined here and here.  One server was my general login/mail/web/DNS/etc server, and the other my home NAS.  This entry will document and detail my merging of the two servers into one, and the associated challenges that brought with it.  I also took this opportunity to re-do my router and have it take an active role in this migration.  Let’s get into it.

Joker: Login Server

If you went through the previously linked entry, you saw that joker was: a SuperMicro dual-Xeon board with two 6-core Xeons, 128GB of RAM, two 300GB 10K RPM SATA drives, and four 1TB SATA drives.  The 300GB drives were mirrored together into a ZFS mirror, and served as joker’s boot and OS drive.  The other four 1TB drives were put together into a 2TB ZFS RAID10 volume that I called local.  Remember that name, because it comes into play later.  That volume was where I stored stuff that would be local to joker, versus needing to be shared on the NAS.

Joker’s primary mission was to be my general-purpose server.  Because I have a business-class Internet connection here at home, I also have a small block of public IPs assigned to me.  Joker had a bunch of publicly-facing jails for things like this blog, my authoritative DNS server, my email and IMAP server, etc.  As I documented in previous blog entries such as this one, I also used the server to help me stream to Twitch, YouTube, and even Mixer.

Bane: The NAS

My NAS was built with a single-processor motherboard from SuperMicro, and it had an Intel Core i7 4790 CPU in it.  I also had 16GB of RAM in it, but I replaced that last December with 32GB because I was running into server crashes due to bad RAM.  Importantly, the motherboard sat in a SuperMicro 4RU chassis with eight hot-swappable drive bays, each filled with a 4TB HDD.  Those drives were put together into a 16TB ZFS RAID10 volume and exported via NFS, SMB, and AFP for the various clients throughout the house.

Over the years, I decided to put some private services on that server as well: an internal MySQL DB and an internal PostgreSQL DB.  I also spun up a Plex server so that my living room devices could pull my digitized movies from the server.

Oh, and remember that local name I asked you to remember?  I also had the same ZFS pool name in use on bane.  What could possibly go wrong? …

The Merge

Five years ago, it made sense to me to keep the NAS and login servers separate.  I can’t really say why that made sense back then, it just did.  Over the course of the last few months I’ve been re-thinking that stance, and I finally decided to merge the two.  In a data center and/or production environment, it makes sense to keep your mass storage and your worker servers separate.  But, maybe that’s just a waste of electricity in a home lab.

The Plan

The plan is multi-faceted, and is still underway as I write this.  I’ll explain what I mean by that in a moment.  But roughly:

  1. Take the motherboard/CPUs/RAM/boot drives out of joker and put them into the NAS chassis
  2. Take the motherboard/CPU/RAM out of the NAS chassis, and put them into a new 3RU chassis
  3. Take the boot drive and 10GigE NIC out of my FreeBSD router and put it into the same 3RU chassis
  4. Rack mount the 10GigE switch, new 10GigE router, and new merged server into a small, 12RU rack

The current status as of writing this: the first three steps are done.  The fourth will happen shortly.

The Pre-Work And Storage Danger!

Theoretically, I should have been able to just plug the existing 4TB NAS drives into the dual-Xeon motherboard and boot joker’s old mirrored boot drives.  The NAS’ local ZFS pool should be visible and mount immediately.  However, just in case it didn’t work as planned, I decided to get a fresh backup of the pertinent data off of bane.  I knew this was going to be slow because my very old NAS, a four-bay QNAP, only has GigE on it.  And it’s a slow server, too.  Very, very slow.  But it would suit the purpose.  I spun it up, found four unused 4TB drives, and created a 16TB RAID0 volume on it.  I didn’t need the integrity or redundancy of something like RAID10 or RAID5 for this.  I just wanted speed, because the data wouldn’t be on the device any longer than it needed to be.

I got the backup NAS running, and as soon as I started copying data off of bane, these appeared in bane’s logs:

(ada4:ahcich8:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 f9 5b 40 74 01 00 01 00 00
(ada4:ahcich8:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:ahcich8:0:0:0): Retrying command, 3 more tries remain

Crrrrrrap!  What was worse was that two of the drives were complaining, but /dev/ada4 was definitely the worst of them.  Thankfully they were both part of separate mirrors within the RAID10 volume.  So if I had replacement drives, I could replace them separately.  I didn’t have two replacements, but did have a single 8TB SATA drive that I wasn’t using.  Recall that I bought a Mac Pro back in December of 2019.  I picked up an aftermarket drive rack so that I could put two SATA SSDs in that Mac.  That rack came with an 8TB SATA drive that I had no intentions of using.

Until now.  In it went, and I let the server resilver the volume overnight.  Once it was done, I began the data backup again.  The second bad drive barked a few messages in the logs, but the backup completed.
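For anyone following along, the swap itself is a one-liner once the new drive is physically in the chassis.  A rough sketch (ada4 was the bad drive; the new drive’s device name here is hypothetical):

# tell ZFS to rebuild the failing mirror member onto the new disk
zpool replace local ada4 ada8

# then keep an eye on the resilver
zpool status local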

The Physical Merge

Since those servers were built five years ago, I decided to update the CPU cooling on them.  I picked up three Noctua NH-L9x65 heat-sink/low-profile-fan combo coolers.  I shut the storage server down and pulled the motherboard out of it.  That was set aside for the time being.  On joker, I made a simple mistake before shutting it down.  I knew I was going to be replacing the local zpool with the one from the NAS, so I thought I just had to:

zpool export local

which unmounted the local zpool and all filesystems associated with it.  I thought that would be enough.  It wasn’t.

After mounting the new CPU coolers and struggling to stuff an E-ATX motherboard into that 4RU SuperMicro chassis (much blood was donated to the gods), I got everything connected and booted up.  Joker was renamed to arkham, and you’ll see why in a moment.  It booted up and found the old root mirror without any trouble.  But the local pool?  There was a problem.

ZFS Makes It Easy …

…assuming you know what you’re doing.  In this case, I didn’t exactly.  What I should have done before shutting joker down was:

zpool destroy local

I had no intentions of saving any of the data that was local to joker, and what I needed to do was tell joker’s OS: “Forget about your local pool completely.  It’s gone!”  So what I was left with when joker-now-arkham booted was a very confused OS with a ZFS pool in a bad shape:

arkham# zpool status local
  pool: local
state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
    replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

    NAME                      STATE     READ WRITE CKSUM
    local                     UNAVAIL      0     0     0
      mirror-0                UNAVAIL      0     0     0
        2440557950419120305   UNAVAIL      0     0     0  was /dev/ada2
        16238227418067990627  UNAVAIL      0     0     0  was /dev/ada3
      mirror-1                UNAVAIL      0     0     0
        8029751195750855597   UNAVAIL      0     0     0  was /dev/ada4
        12594972150827200400  UNAVAIL      0     0     0  was /dev/ada5

Um.  No, that’s not how that should look.  It was supposed to have four mirrors of two drives each, striped into a single pool.  That looked like…

Hey, wait a minute!  Once it dawned on me that what I was looking at was joker’s old local pool, I knew what I had to do:

zpool destroy local
zpool import local

The moment I did that, the NAS’ pool was imported along with each of the ZFS filesystems.  Phew!  Close one.  And thankfully: I didn’t need to do the slow restore off the QNAP.

I’ll get back to arkham and its new role in a moment.  But while I’m discussing the physical work, let me address:

The Router

This one went a lot smoother.  I just had to move the boot drive out of the old router, along with the 10GigE NIC it had.  I had a second 10GigE NIC that I pulled from the old bane since arkham already had its own.  Meaning the router now has two 2x10GigE NICs in it.  Ya know, like a real router!

I was sort of foiled by the NICs though.  My intention with one of the NICs was to put copper 10GBASE-T SFPs into it and use them for copper links to the network.  One would be the uplink to my Verizon Internet which is GigE right now.  The second was going to be used as the routed path for my wireless network.  But, the problem is that the NICs I purchased weren’t able to take SFPs.  They’re only able to take DACs.  The moment the OS on the router booted and saw the SFPs in the NICs, it barked “unrecognized media!” errors in the logs.  So for the time being, I ran two more DACs from the 10Gig switch to those interfaces on the router, and all is well.

I intend to overhaul my connection to Verizon in the future, and I’ll detail all of that in another entry.

Arkham Lives

Keeping with my “Batman Rogues” naming convention, I decided to name this server arkham.  Its sole purpose will be to run FreeBSD jails.  No services or anything at all for the house.  Just jails, akin to a hypervisor.  And, since my public jails are all named after the Batman Rogues, Arkham seemed like a proper name.

The Jails

I have a series of jails running on arkham.  In fact, this blog is on one of them.

arkham# jls
   JID  IP Address      Hostname                      Path
     3  108.28.193.217  riddler.lateapex.net          /local/jails/riddler
     4  108.28.193.219  catwoman.lateapex.net         /local/jails/catwoman
     5  108.28.193.212  scarecrow.lateapex.net        /local/jails/scarecrow
     6  192.168.10.2    nas.private.lateapex.net      /local/jails/nas
     7  192.168.10.50   dns.private.lateapex.net      /local/jails/dns
     8  192.168.10.4    dbhost.private.lateapex.net   /local/jails/dbhost
     9  192.168.10.6    media.private.lateapex.net    /local/jails/media
    11  108.28.193.210  joker.lateapex.net            /local/jails/joker
    15  108.28.193.213  madhatter.lateapex.net        /local/jails/madhatter

Some are public, others are private.  I didn’t mix the two sets of jails, though.  Instead, I’m running two FIBs on arkham: the default (FIB 0) and the public (FIB 1).  My 10GigE switch trunks two VLANs, 50 and 200, down to arkham’s 2x10GigE LAG interface, meaning I have lagg0.50 and lagg0.200 interfaces on the FreeBSD server.  Initially, I didn’t put an IP address on lagg0.50; I just put it into FIB 1.  Then each public jail that I started would also end up in FIB 1.
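For reference, here’s roughly what that plumbing looks like in FreeBSD’s config files.  This is a sketch rather than a copy of my exact setup: the physical NIC names (ix0/ix1), the LACP lagg protocol, and the host’s private address are assumptions on my part.

# /boot/loader.conf -- create a second routing table (FIB) at boot
net.fibs="2"

# /etc/rc.conf -- LAG the two 10GigE ports and trunk the two VLANs
ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 up"
vlans_lagg0="50 200"
ifconfig_lagg0_50="up fib 1"                  # public VLAN: no IP yet, lives in FIB 1
ifconfig_lagg0_200="inet 192.168.10.5/24"     # private VLAN (this host IP is hypothetical)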

For instance, my authoritative DNS server:

# scarecrow (external, authoritative DNS)
scarecrow {
	host.hostname = "scarecrow.lateapex.net";
	exec.fib=1;
	ip4.addr += "lagg0.50|108.28.193.212/32";
}

Because I didn’t put a public IP on lagg0.50, I had to set a static interface route in that FIB before I could set a default route for it:

setfib 1 route add -net 108.28.193.208/28 -iface ix1
setfib 1 route add default 108.28.193.222

With those two lines in place in my /etc/rc.local, each public jail could route to/from the Internet.  The private jails were fine since they were part of FIB 0.

Slight FIB Problem

This one drove me a bit batty.  With the public and private jails in place and in different FIBs, they technically should have followed their respective default routes to get to one another.  In other words, if my web server needed to talk to my private dbhost jail for MySQL, its packets should have egressed arkham on lagg0.50, bounced off my router which is running pf to filter traffic, and then ingressed arkham on lagg0.200 destined for the dbhost.  But that’s not what was happening.

What was happening was that the two sets of jails were bypassing the FIB boundaries and talking to each other directly, internally.  And the reason was that each of their respective /32 IPs was showing up in BOTH FIBs!  That’s not supposed to happen.

At least, it isn’t if you have the line

net.add_addr_allfibs=0

in your /boot/loader.conf file.  And I did.  What that’s supposed to do is make absolute sure that addresses in each FIB do NOT get added to all other FIBs.  But, they were.

As it turns out, I’d set that variable back before I was running FreeBSD 12 on my servers.  Apparently, sometime during the transition to version 12, that setting stopped working from loader.conf.  Instead, it has to go in /etc/sysctl.conf.  Which I did.  And so that I didn’t have to reboot:

sysctl -w net.add_addr_allfibs=0

Bam.  The incorrect IP addresses were removed from each FIB and routing between said FIBs followed the correct path.
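For the record, the persistent half of the fix is just that same line, now living in /etc/sysctl.conf instead of loader.conf:

# /etc/sysctl.conf
# FreeBSD 12+: net.add_addr_allfibs is a runtime sysctl, not a loader tunable
net.add_addr_allfibs=0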

Routing “F”un

Before venturing too much further into this section, please reference the “Verizon Sucks” section of my router document.  Following that, page down to the “Proxy Arp” section of the same document and read that, too.  In summary, my router has a series of little interface static routes for all of my public IP addresses, egressing over a privately-IP’d 10Gig interface.  And it’s running proxy ARP, which is …

…fucking awful.  Pardon the language.  But: it is.

The reason I had to come up with this awful solution is documented in the previous blog entry.  In summary: Verizon won’t properly route my public almost-a-/28.  Instead, my public IPs are on their /24 broadcast domain, and I’m expected to set the default route of any machine I put on that public network to their .1 interface and let them route for me.

I… don’t think so.  Sorry Verizon, but I can do this better than you can, and I can do my own network security better than you can, too.  But because it’s all the same /24 broadcast domain, I have to tell my router to answer ARP requests for my public IPs that are hiding behind it.  And because it doesn’t have a route specifically to all of those IPs, it needs a series of statics.
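To make the “before” picture a little more concrete, the old arrangement on the router boiled down to something like this.  It’s a sketch from memory, not a config dump: the interface name and the specific proxy-ARP knob are assumptions, and the full details live in the router entry linked above.

# one interface static route per public /32, pointing down the
# privately-addressed 10Gig link toward the server
route add -host 108.28.193.212 -iface ix1
route add -host 108.28.193.217 -iface ix1
# ...and so on, one per public IP in the block

# plus proxy ARP, so the router answers ARP on Verizon's /24 for the
# addresses hiding behind it (per-host "arp -s ... pub" entries work too)
sysctl net.link.ether.inet.proxyall=1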

Ick.  Let’s try to un-fornicate some of that.

Proxy ARP: Stays

Unfortunately, I can’t do anything about the proxy ARP setup.  So it stays.

Replace Statics With BGP

Come on.  I’m a router guy.  Of course I’m going to try to use dynamic routing, right?  I’d previously installed Free Range Routing (FRR) on my router to use with my VM lab some time ago.  I did some pretty big modifications to its configuration, but I didn’t have to install anything new.  The bgpd.conf looks like this:

router bgp 65000
 bgp router-id 10.0.0.1
 bgp bestpath as-path multipath-relax
 neighbor 10.0.0.2 remote-as 65100
 !
 address-family ipv4 unicast
  neighbor 10.0.0.2 soft-reconfiguration inbound
  neighbor 10.0.0.2 prefix-list arkham_in in
  neighbor 10.0.0.2 prefix-list arkham_out out
  network 0.0.0.0/0
 exit-address-family
!
ip prefix-list arkham_in seq 5 permit 108.28.193.208/28 ge 32
ip prefix-list arkham_in seq 10 deny any
!
ip prefix-list arkham_out seq 5 permit 0.0.0.0/0
ip prefix-list arkham_out seq 10 deny any

Remember that the interface facing my public VLAN has that 10.0.0.1 IP address.

I installed FRR on arkham and set it up like this:

router bgp 65100
 bgp router-id 10.0.0.2
 bgp bestpath as-path multipath-relax
 neighbor 10.0.0.1 remote-as 65000
!
 address-family ipv4 unicast
  redistribute connected
  neighbor 10.0.0.1 soft-reconfiguration inbound
  neighbor 10.0.0.1 prefix-list router_out out
 exit-address-family
!
ip prefix-list router_out seq 5 permit 108.28.193.208/28 ge 32
ip prefix-list router_out seq 10 deny any

The challenge is that I didn’t want FRR to do anything to the default FIB on arkham.  In other words: I didn’t want it accepting a default route from the upstream router, and putting that into the default FIB.  Everything would break.  I needed it focused in FIB 1.  Fortunately, that’s pretty easy with FreeBSD.  I had to add a few extra lines to the /etc/rc.conf, but:

# Get routing running so we can announce the public jails
frr_fib="1"
frr_enable="YES"
frr_daemons="zebra bgpd"
bgpd_fib="1"
zebra_fib="1"

Basically I told it to start zebra and bgpd, and keep them in FIB 1.  I added an RFC1918 IP address to arkham’s public lagg0.50 interface, and kicked FRR into gear.  The moment I did that, the two devices formed a BGP peer.  Arkham could see a 0/0 in its FIB 1, coming from the router.  And the router saw all of arkham’s /32s from the jails because they’re each attached to interface lagg0.50 as secondaries or aliases.

arkham#	setfib 1 netstat -nr
Routing tables (fib: 1)

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.0.0.1           UG1    lagg0.50
10.0.0.0/24        link#7             U      lagg0.50
10.0.0.2           link#7             UHS         lo0
108.28.193.210     link#7             UHS         lo0
108.28.193.210/32  link#7             U      lagg0.50
108.28.193.212     link#7             UHS         lo0
108.28.193.212/32  link#7             U      lagg0.50
108.28.193.213     link#7             UHS         lo0
108.28.193.213/32  link#7             U      lagg0.50
108.28.193.217     link#7             UHS         lo0
108.28.193.217/32  link#7             U      lagg0.50
108.28.193.219     link#7             UHS         lo0
108.28.193.219/32  link#7             U      lagg0.50
127.0.0.1          lo0                UHS         lo0

lateapex-gw# netstat -nr | grep 108.28
default            108.28.193.1       UGS         ix3
108.28.193.0/24    link#4             U           ix3
108.28.193.210/32  10.0.0.2           UG1         ix1
108.28.193.212/32  10.0.0.2           UG1         ix1
108.28.193.213/32  10.0.0.2           UG1         ix1
108.28.193.217/32  10.0.0.2           UG1         ix1
108.28.193.219/32  10.0.0.2           UG1         ix1
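
Beyond the routing tables, vtysh on arkham gives a quick way to sanity-check the peering and what’s actually being advertised.  Commands only here; I’ll spare you the output:

# is the BGP session with the router established?
vtysh -c "show bgp ipv4 unicast summary"

# which prefixes are we announcing to it?
vtysh -c "show bgp ipv4 unicast neighbors 10.0.0.1 advertised-routes"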

That allowed me to remove the series of static routes I had on the router and start relying on BGP to do the right thing.  It also meant that I can’t use public IPs on anything else in my house that doesn’t support BGP.  A little safety mechanism, maybe?  But, there was still a problem…

FRR and FreeBSD Not Playing Well

The idea behind the redistribute connected line in my server’s BGP config was to make sure that when a public jail starts, its new /32 automatically gets announced into BGP because that IP gets attached to lagg0.50.  Conversely, when a public jail is stopped, its respective IP is pulled from lagg0.50 and then from BGP.  Unfortunately, that wasn’t working as planned.

I found that, at least on FreeBSD, when FRR starts it takes a “snapshot”, if you will, of the IP addresses on all interfaces.  If any OS-level IPs are changed in the meantime, FRR doesn’t receive that update, meaning that when an existing /32 was removed from the interface, or a new one added, FRR wouldn’t see the change.  I could verify this by killing a jail, then performing the show interface lagg0.50 command in FRR’s vtysh.  In each case, it would miss the update and continue routing the old IP, or not routing the new one.
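The check itself is quick to reproduce (the jail name here is just an example from my list above):

# stop one of the public jails, which removes its /32 from lagg0.50
service jail stop scarecrow

# the alias is gone from the OS...
ifconfig lagg0.50 | grep 108.28

# ...but FRR's view of the interface still lists it
vtysh -c "show interface lagg0.50"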

Hm.  That’s sub-optimal.  As I found out through discussing this with some of the FRR guys, it’s likely a bug between FRR and FreeBSD.  It works the way it should on Linux, so I’ve opened a bug for it.

In the meantime, I had to whip out my sledgehammer to fix it.  My sledgehammer, in this case, is shell scripting.  I first changed the routing config on my server by removing the redistribute connected line.  Once restarted, FRR wasn’t announcing anything to the upstream router.  I then added this to my /etc/rc.local file, so that it happens with every boot:

for IP in `jls | grep 108.28 | awk '{print $2}'`
do
	vtysh -c "conf t" -c "router bgp 65100" -c "address-family ipv4 unicast" -c "network ${IP}/32"
done

Basically: iterate through the list of publicly-addressed jails and manually stuff their /32s into BGP using vtysh.  This all happens at run time, though, meaning that if the FRR process crashes, or something changes with a jail, the announcements won’t be updated or saved.  But that’s fine for the time being.

I also wrote two more scripts to start and stop jails, and I’ll use them instead of the service jail [start|stop] command:

jail-start

#!/bin/sh

if test $# -ne 1
then
	echo "Usage: jail-start <jail>"
	exit 1
fi

JAIL=${1}

# Let the OS figure out if this is a real jail, and error out if not.
service jail start ${JAIL}

if test $? -ne 0
then
	echo "Jail ${JAIL} does not exist."
	exit 1
fi

JAILIP=`jls -j ${JAIL} | grep -v JID | awk '{print $2}'`

# Is this a public or RFC1918 jail?  If public, add the BGP route.
if test `echo ${JAILIP} | cut -d . -f 1,2` != "192.168"
then
	vtysh -c "conf t" -c "router bgp 65100" -c "address-family ipv4 unicast" -c "network ${JAILIP}/32"
fi

jail-stop

#!/bin/sh

if test $# -ne 1
then
	echo "Usage: jail-stop <jail>"
	exit 1
fi

JAIL=${1}

#
# Make sure this is a real jail
jls -j ${JAIL} 2>/dev/null >/dev/null

if test $? -ne 0
then
	echo "Jail ${JAIL} does not exist."
	exit 1
fi

JAILIP=`jls -j ${JAIL} | grep -v JID | awk '{print $2}'`

# Is this a public or RFC1918 jail?  If public, remove the BGP route.
if test `echo ${JAILIP} | cut -d . -f 1,2` != "192.168"
then
	vtysh -c "conf t" -c "router bgp 65100" -c "address-family ipv4 unicast" -c "no network ${JAILIP}/32"
fi

# Now kill the jail
service jail stop ${JAIL}
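
Usage is exactly what you’d expect; for example, to bounce one of the public jails from the listing above:

arkham# jail-stop scarecrow
arkham# jail-start scarecrow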

 

Conclusion

That wraps up the story, but don’t think I’ll leave you without some pretty pictures.  Here’s what the disaster of my “basement data center” looked like before the merge started:

Left to right: joker, router on top of the switch, the switch, and then bane (NAS) on the far right.

And now, before the rack installation?

Still a work in progress.  I’ll post up another pic later, after I get everything tidied up in a rack.

And yeah: blinky lights are still cool!
