art with code


InfiniBanding, pt. 3, now with ZFS

Latest on the weekend fileserver project: ib_send_bw 3.3 GB/s between two native CentOS boxes. The server has two cards so it should manage 6+ GB/s aggregate bandwidth and hopefully feel like a local NVMe SSD to the clients. (Or more like remote page cache.)

Got a few 10m fiber cables to wire the office. Thinking of J-hooks dropping orange cables from the ceiling and "Hey, how about a Thunderbolt-to-PCIe chassis with an InfiniBand card to get laptops on the IB."

Flashing the firmware to the latest version on the ConnectX-2 cards makes ESXi detect them, which somehow breaks the PCI pass-through. With ESXi drivers, they work as ~20 GbE network cards that can be used by all of the VMs. But trying to use the pass-through from a VM fails with an IRQ error and with luck the entire machine locks up. So, I dropped ESXi from the machines for now.


Been playing with ZFS, with Sanoid for automated hourly snapshots and Syncoid for backup sync. Tested disks getting removed, pools destroyed, pool export and import, disks overwritten with garbage, resilvering to recover, disk replacement, scrubs, rolling back to snapshots, backup to a local replica, backup to a remote server, recovery from backup, per-file recovery from .zfs/snapshot, and hot spares. Backup syncs even seem to work between pools of different sizes, I guess as long as the data fits in the smaller pool. Hacked Sanoid to make it take hourly snapshots only if there are changes on the disk (zfs get written -o value -p -H).
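
For illustration, the check boils down to something like this; tank/work is a made-up dataset name, and the real hack lives inside Sanoid's Perl, this is just the shell version of the idea:

written=$(zfs get -H -p -o value written tank/work)
if [ "$written" -gt 0 ]; then
    zfs snapshot tank/work@hourly-$(date +%Y-%m-%d-%H%M)
fi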

Copied over data from the workstations, then corrupted the live, mounted pool with dd if=/dev/zero over a couple of disks, destroyed the pool, and restored it from the backup server, all without rebooting. The Syncoid restore even brought back the snapshots, A+++.
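
The restore is just Syncoid pointed the other way, roughly like this (host and pool names made up):

syncoid -r root@backupserver:backup/tank tank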

After the successful restore, my Windows laptop bluescreened on me and corrupted the effect file I was working on. Git got me back to a 30-minute-old version, which wasn't great: if losing half an hour of work hurts, hourly snapshots aren't good enough either. Dropbox would've saved me there with its per-file version history.

I'm running three 6TB disks in RAIDZ1. Resilvering 250 gigs takes half an hour. Resilvering a full disk should be somewhere between 12 and 24 hours. During which I pray to the Elder Gods to keep either of the two remaining disks from becoming corrupted by malign forces beyond human comprehension. And if they do, buy new disks and restore from the backup server :(
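
Back-of-the-envelope for that guess:

250 GB / 30 min ≈ 140 MB/s resilver rate
6 TB / 140 MB/s ≈ 12 hours, and closer to 24 if the rate drops under load or fragmentation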

I made a couple of cron jobs. One does hourly Syncoid syncs from production to backup. The others run scrubs: an over-the-weekend scrub for the archive pool, and a nightly scrub on the fast SSD work pool. That is, the fast SSD work pool that doesn't exist yet, since my SSDs are NVMe and my server, well, ain't. And the NVMe-to-PCIe adapters are still on back order.
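
Roughly what the crontab looks like; pool names, paths and times here are illustrative rather than the exact production setup:

# hourly Syncoid sync from production to the backup server
0 * * * *   /usr/local/bin/syncoid -r tank root@backupserver:backup/tank
# nightly scrub on the (eventual) SSD work pool
0 3 * * *   /sbin/zpool scrub fast
# weekend scrub on the archive pool
0 22 * * 5  /sbin/zpool scrub tank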


I'll likely go with two pools: a work pool with two SSDs in RAID-1, and an archive pool with three HDDs in RAIDZ1. The work pool would be backed up to the archive pool, and the archive pool would be backed up to an off-site mirror.
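
As a sketch, creating that layout would look something like this (device names are placeholders):

zpool create fast mirror /dev/disk/by-id/nvme-ssd-a /dev/disk/by-id/nvme-ssd-b
zpool create tank raidz1 /dev/disk/by-id/ata-hdd-a /dev/disk/by-id/ata-hdd-b /dev/disk/by-id/ata-hdd-c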

The reason for the two-volume setup is to get predictable performance out of the SSD volume without the hassle of SLOG/L2ARC.

So, for "simplicity", keep current projects on the work volume. After a few idle months, automatically evict them to the archive and leave behind a symlink. Or do it manually (read: only do it when we run out of SSD.) Or just buy more SSD as the SSD runs out, use the archive volume only as backup.

I'm not sure if parity RAID is the right solution for the archive. By definition, the archive won't be getting a lot of reads and writes, and the off-site mirroring run is over GbE, so performance is not a huge issue (120 MB/s streaming reads is enough). Capacity-wise, a single HDD is 5x the current project archive, so having 10 TB of usable space would go a long way. But doing parity array rebuilds on 6 TB drives, ugh. Maybe just a three-disk mirror.

And write some software to max out the IOPS and bandwidth on the RAM, the disks and the network.


Unified Interconnect

Playing with InfiniBand got me thinking. This thing is basically a PCIe-to-PCIe bridge. The old kit runs at x4 PCIe 3.0 speeds, the new stuff is x16 PCIe. The next generation is x16 PCIe 4.0 and 5.0.

Why jump through all the hoops? Thunderbolt 3 is x4 PCIe over a USB-C connector. What if you bundled four of those and threw in some fiber and transceivers for long distances? You'd get x16 PCIe between two devices.

And once you start thinking of computers as a series of components hanging off a PCIe bus, your system architecture clarifies dramatically. A CPU is a PCIe 3 device with 40 to 64 lanes. DRAM uses around 16 lanes per channel.

GPUs are now hooked up as 16-lane devices, but could saturate 256 to 1024 lanes. Because of that, GPU RAM is on the GPU board. If the GPU had enough lanes, you could hook GPU RAM up to the PCIe bus with 32 lanes per GDDR5 chip. HBM is probably too close to the GPU to separate.

You could build a computer with 1024 lanes, then start filling them up with the mix of GPUs, CPUs, DRAM channels, NVMe and connectivity that you require. Three CPUs with seven DRAM channels? Sure. Need an extra CPU or two? Just plug them in. How about a GPU-only box, with the OS running on a CPU in another node. Or CPUs connected as accelerators over 8 lanes per socket, if you'd rather spend the extra lanes on other stuff. Or a mix of x86 and ARM CPU cards to let you run mixed workloads at max speed and power efficiency.

Think of a rack of servers, sharing a single PCIe bus. It'd be like one big computer with everything hotpluggable. Or a data center, running a single massive OS instance with 4 million cores and 16 PB of RAM.

Appendix: more devices

Then you've got the rest of the devices, and they're pretty well on the bus already. NVMe comes in 4-lane blocks. Thunderbolt 3, Thunderbolt 2 and USB 3.1 are 4-, 2- and 1-lane devices. SAS and SATA are a bit awkward, taking up a bit more than one lane or a bit more than half a lane. I'd replace them with NVMe connectors.

Display connectors could live on the bus as well (given some QoS to keep up the interactivity). HDMI 2.1 uses 6 lanes, HDMI 2 is a bit more than 2 lanes. DisplayPort's next generation might go up to 7-8 lanes.

Existing kit

[Edit] Hey, someone's got products like this already. Dolphin ICS produces a range of PCIe network devices. They've even got an IP-over-PCIe driver.

[Edit #2] Hey, the Gen-Z Interconnect looks a bit like this: "Gen-Z Interconnect Core Specification 1.0 Published".


InfiniBanding, pt. 2

InfiniBand benchmarks with ConnectX-2 QDR cards (PCIe 2.0 x8 -- a very annoying lane spec outside of server gear: either it eats up a PCIe 3.0 x16 slot or you end up running at half speed, and it's too slow to hit the full 32 Gbps of QDR InfiniBand anyway. Oh yes, I plugged one into a slot with only four lanes, and it got half the RDMA bandwidth in tests.)

Ramdisk-to-ramdisk, 1.8 GB/s with Samba and IP-over-InfiniBand.

IPoIB iperf2 does 2.9 GB/s with four threads. Single-threaded iperf3 does 2.5 GB/s if I luck out in the CPU affinity lottery (some cores / NUMA nodes only do 1.8 GB/s...)
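
If you'd rather not play the affinity lottery, iperf3 can pin itself to a core with -A; the address and core numbers here are just an example:

iperf3 -c 10.0.0.1 -A 2,2    # pin the local client to core 2, ask the server to use core 2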

NFS over RDMA, 64k random reads with QD8 and one thread, fio tells me read bw=2527.4MB/s. Up to 2.8 GB/s with four threads. Up to 3 GB/s with 1MB reads.
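
A fio invocation along these lines reproduces the shape of that test; the mount point and file name are placeholders:

fio --name=randread --filename=/mnt/nfs/testfile --size=4G \
    --rw=randread --bs=64k --iodepth=8 --numjobs=1 \
    --ioengine=libaio --direct=1 --runtime=30 --time_based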

The bandwidth limit of PCIe 2.0 x8 that these InfiniBand QDR cards use is around 25.6 Gbps, or 3.2 GB/s. Testing with ib_read_bw, it maxes out at around 3 GB/s.
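
The arithmetic there: PCIe 2.0 is 5 GT/s per lane with 8b/10b encoding, i.e. 4 Gbps (500 MB/s) of data per lane, so x8 is 4 GB/s raw; knock off roughly 20% for TLP headers and other protocol overhead and you land at about 3.2 GB/s, or 25.6 Gbps.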

So. Yeah. There's 200 MB/s of theoretical performance left on the table (might be ESXi PCIe passthrough exposing only 128 byte MaxPayload), but can't complain.

And... There's an upgrade path composed of generations of obsoleted enterprise gear: FDR gets you PCIe 3.0 x4 cards and should also get you the full 4 GB/s bandwidth of the QDR switch. FDR switches aren't too expensive either, for a boost to 5.6 GB/s per link. Then, pick up EDR / 100 GbE kit...

Now the issue (if you can call it that) is that the server is going to have 10 GB/s of disk bandwidth available, which is going to be bottlenecked (if you can call it that) by the 3 GB/s network.

I could run multiple IB adapters, but I'll run out of PCIe slots. Possibilities: bifurcate an x16 slot into two x8 slots for IB or four x4 slots for NVMe. Or bifurcate both slots. Or get a dual-port FDR/EDR card with an x16 connector to get 8 GB/s on the QDR switch. Or screw it and figure out how to make money out of this and use it to upgrade to dual-100 GbE everywhere.

(Yes, so, we go from "set up NAS for office so that projects aren't lying around everywhere" to "let's build a 400-machine distributed pool of memory, storage and compute with GPU-accelerated compute nodes and RAM nodes and storage nodes and wire it up with fiber for 16-48 GB/s per-node bandwidth". Soon I'll plan some sort of data center and then figure out that we can't afford it and go back to making particle flowers in Unity.)


Quick test with old InfiniBand kit

Two IBM ConnectX-2 cards, hooked up to a Voltaire 4036 switch that sounds like a turbocharged hair dryer. CentOS 7, one host on bare metal, the other on top of ESXi 6.5.

Best I've seen thus far: a 3009 MB/s RDMA transfer. Around 2.4 GB/s with iperf3. These things seem to be CPU-capped; top shows 100% CPU use. Made an iSER ramdisk too, it was doing 1.5 GB/s-ish with ext4.

Will examine more next week, with a later kernel, newer firmware and whatnot.

The end goal here would be to get 2+ GB/s file transfers over Samba or NFS. Probably not going to happen, but eh, give it a try.

That switch though. Need a soundproof cabinet.


OpenVPN settings for 1 Gbps tunnel

Here are the relevant parts of the OpenVPN 2.4 server config that got me 900+ Mbps with iperf3 on a GbE LAN. The tunnel was between two PCs with high single-core performance, a Xeon 2450v2 and an i7-3770. OpenVPN uses about 50% of a CPU core on the client and server when the tunnel is busy. For reference, I tried running the OpenVPN server on my WiFi router; it topped out at 60 Mbps.

# Use TCP, I couldn't get good perf out of UDP. 

proto tcp

# tun or tap, roughly same perf
dev tun 

# Use AES-256-GCM:
#  - more secure than 128 bit
#  - GCM has built-in authentication, see https://en.wikipedia.org/wiki/Galois/Counter_Mode
#  - AES-NI accelerated, the raw crypto runs at GB/s speeds per core.

cipher AES-256-GCM

# Don't split the jumbo packets traversing the tunnel.
# This is useful when tun-mtu is different from 1500.
# With default value, my tunnel runs at 630 Mbps, with mssfix 0 it goes to 930 Mbps.

mssfix 0

# Use jumbo frames over the tunnel.
# This reduces the number of packets sent, which reduces CPU load.
# On the other hand, now you need 6-7 MTU 1500 packets to send one tunnel packet. 
# If one of those gets lost, it delays the entire jumbo packet.
# Digression:
#   Testing between two VBox VMs on an i7-7700HQ laptop, MTU 9000 pegs the vCPUs to 100% and the tunnel runs at 1 Gbps.
#   A non-tunneled iperf3 runs at 3 Gbps between the VMs.
#   Upping this to 65k got me 2 Gbps on the tunnel and half the CPU use.

tun-mtu 9000

# Send packets right away instead of bundling them into bigger packets.
# Improves latency over the tunnel.

tcp-nodelay

# Increase the transmission queue length.
# Keeps the TUN busy to get higher throughput.
# Without QoS, you should get worse latency though.

txqueuelen 15000

# Increase the TCP queue size in OpenVPN.
# When OpenVPN overflows the TCP queue, it drops the overflow packets.
# Which kills your bandwidth unless you're using a fancy TCP congestion algo.
# Increase the queue limit to reduce packet loss and TCP throttling.

tcp-queue-limit 256

And here is the client config, pretty much the same except that we only need to set tcp-nodelay on the server:

proto tcp
cipher AES-256-GCM
mssfix 0
tun-mtu 9000
txqueuelen 15000
tcp-queue-limit 256

To test, run iperf3 -s on the server and connect to it over the tunnel from the client with iperf3 -c <server's tunnel address>. For more interesting tests, run the iperf3 server on a different host on the endpoint LAN, or try to access network shares.

I'm still tuning this (and learning about the networking stack) to get a Good Enough connection between the two sites, let me know if you got any tips or corrections.

P.S. Here's the iperf3 output.

$ iperf3 -c
Connecting to host, port 5201
[  4] local port 39590 connected to port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   112 MBytes   942 Mbits/sec    0   3.01 MBytes
[  4]   1.00-2.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   3.00-4.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   6.00-7.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   7.00-8.00   sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
[  4]   8.00-9.00   sec   111 MBytes   933 Mbits/sec    0   3.01 MBytes
[  4]   9.00-10.00  sec   110 MBytes   923 Mbits/sec    0   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.08 GBytes   928 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec                  receiver

iperf Done.


Fast-ish OpenVPN tunnel

500 Mbps OpenVPN throughput over the Internet, nice. Was aiming for 900 Mbps, which seems to work on LAN, but no cigar. [Edit: --tcp-nodelay --tcp-queue-limit 256 got me to 680 Mbps. Which is very close to non-tunneled line speed as measured by a HTTP download.]

OpenVPN performance varies a lot, too. I seem to get different results just from restarting the client or the server.

The setup is two wired machines, each with a 1 Gbps fibre Internet connection. The server is a Xeon E3-1231v3, a 3.4 GHz Haswell Xeon. The client is my laptop with a USB3 GbE adapter and an i7-7700HQ. Both machines get 900+ Mbps on Speedtest, so the Internet connections are fine.

My OpenVPN is set to the TCP protocol (faster and more reliable than UDP in my testing) and uses AES-256-GCM as the cipher. Both machines are capable of pushing multiple gigaBYTES per second over openssl AES-256, so crypto isn't a bottleneck AFAICT. The tun-mtu is set to 9000, which performs roughly as well as 48000 or 60000 but has smaller packets, which seems to be less flaky than the big MTUs. Both mssfix and fragment are set to 0, though fragment shouldn't matter over TCP.
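
The crypto claim is easy to sanity-check; openssl ships a benchmark that reports per-core AES-GCM throughput, and on AES-NI hardware the larger block sizes come out at multiple GB/s:

openssl speed -evp aes-256-gcm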

Over iperf3 I get 500 Mbps between the endpoints. With HTTP, roughly that too. Copying from a remote SMB share on another host goes at 30 MB per second, but the remote endpoint can transfer from the file server at 110 MB per second (protip: mount.cifs -o vers=3.0). Thinking about it a bit, I need to test with a second NIC in the VPN box, since right now VPN traffic might be competing with LAN traffic.
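
For reference, that protip spelled out as a full mount command, with a made-up server and share name:

mount.cifs //fileserver/projects /mnt/projects -o vers=3.0,username=me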


Building a storage pyramid

The office IT infrastructure plan is something like this: build interconnected storage pyramids with compute. The storage pyramids consist of compute hooked up to fast memory, then solid state memory to serve mid-IOPS and mid-bandwidth workloads, then big spinning disks as archive. The different layers of the pyramid are hooked up via interconnects that can be machine-local or over the network.

The Storage Pyramid

Each layer of the storage pyramid has different IOPS and bandwidth characteristics. Starting from the top, you've got GPUs with 500 GB/s memory, connected via a 16 GB/s PCIe bus to the CPU, which has 60 GB/s DRAM. The next layer is also on the PCIe bus: Optane-based NVMe SSDs, which can hit 3 GB/s on streaming workloads and 250 MB/s on random workloads (parallelizable to maybe 3x that). After Optane, you've got flash-based SSDs that push 2-3 GB/s streaming accesses and 60 MB/s random accesses. At the next level, you could have SAS/SATA SSDs which are limited to 1200/600 MB/s streaming performance by the bus. And at the bottom lie the HDDs that can do somewhere between 100 to 240 MB/s streaming accesses and around 0.5-1 MB/s random accesses.

The device speeds guide us in picking the interconnects between them. Each HDD can fill a 120 MB/s GbE port. SAS/SATA SSDs plug into 10GbE ports, with their 1 GB/s performance. For PCIe SSDs and Optane, you'd go with either 40GbE or InfiniBand QDR, and hit 3-4 GB/s. After the SSD layer, the interconnect bottlenecks start rearing their ugly heads.

You could use 200Gb InfiniBand to connect single DRAM channels at 20 GB/s, but even then you're starting to get bottlenecked at high DDR4 frequencies. Plus you have to traverse the PCIe bus, which further knocks you down to 16 GB/s over PCIe 3.0 x16. It's still sort of feasible to hook up a cluster with shared DRAM pool, but you're pushing the limits.

For DRAM-level performance you're usually stuck inside the local node. The other storage layers you can run over the network without losing much performance.

The most unbalanced bottleneck in the system is the CPU-GPU interconnect. The GPU's 500 GB/s memory is hooked to the CPU's 60 GB/s memory via a 16 GB/s PCIe bus. Nvidia's NVLink can hook up two GPUs together at 40 GB/s (up to 150 GB/s for Tesla V100), but there's nothing to get faster GPU-to-DRAM access. This is changing with the advent of PCIe 4.0 and PCIe 5.0, which should be able to push 128 GB/s and create a proper DRAM interconnect between nodes and between the GPU and the CPU. The remaining part of the puzzle would be some sort of 1 TB/s interconnect to link GPU memories together.

The Plan

Capacity-wise, my plan is to get 8 GB of GPU RAM, 64 GB of CPU RAM, 256 GB of Optane, 1 TB of NVMe flash, and 16 TB of HDDs. For a nicer-cleaner-more-satisfying progression, you could throw in a 4 TB SATA flash layer, but SATA flash is kind of DOA as long as you have NVMe and PCIe slots to use -- the price difference between NVMe flash and SATA flash is too small compared to the performance difference.

If I can score an InfiniBand interconnect or 40GbE, I'll stick everything from Optane on down to a storage server. It should perform at near-local speeds and simplify storage management. Shared pool of data that can be expanded and upgraded without having to touch the workstations. Would be cool to have a shared pool of DRAM too but eh.

Now, our projects are actually small enough (half a gig each, maybe 2-3 of them under development at once) that I don't believe we will ever hit disk in daily use. All the daily reads and writes should be to client DRAM, which gets pushed to server DRAM and written down to flash / HDD at some point later. That said, those guys over there *points*, they're doing some video work now...

The HDDs are mirrored to an off-site location over GbE. The HDDs are likely capable of saturating a single GbE link, so 2-3 GbE links would be better for live mirroring. For off-site backup (maybe one that runs overnight), 1 GbE should be plenty.

In addition to the off-site storage mirror, there's some clouds and stuff for storing compiled projects, project code and documents. These either don't need to sync fast or are small enough that it doesn't matter.

Business Value

Dubious. But it's fun. And opens up possible uses that are either not doable on the cloud or way too expensive to maintain on the cloud. (As in, a single month of AWS is more expensive than what I paid for the server hardware...)
