art with code

2018-03-20

InfiniBanding, pt. 3, now with ZFS

Latest on the weekend fileserver project: ib_send_bw does 3.3 GB/s between two bare-metal CentOS boxes. The server has two cards, so it should manage 6+ GB/s aggregate bandwidth and hopefully feel like a local NVMe SSD to the clients. (Or more like remote page cache.)
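For reference, the number comes from the perftest suite, run roughly like this (the mlx4_0 device name and the 10.0.0.1 address are placeholders; use whatever ibstat and your IPoIB setup give you):

    # on the server
    ib_send_bw -d mlx4_0
    # on the client, pointing at the server
    ib_send_bw -d mlx4_0 10.0.0.1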

Got a few 10m fiber cables to wire the office. Thinking of J-hooks dropping orange cables from the ceiling and "Hey, how about a Thunderbolt-to-PCIe chassis with an InfiniBand card to get laptops on the IB."

Flashing the ConnectX-2 cards to the latest firmware makes ESXi detect them, which somehow breaks PCI pass-through. With the ESXi drivers they work as ~20 GbE network cards that all of the VMs can use, but trying to use pass-through from a VM fails with an IRQ error, and with luck the entire machine locks up. So, I dropped ESXi from the machines for now.
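The flash itself is just mstflint pointed at the card's PCI address, roughly like this (the PCI address and firmware filename are placeholders):

    mstflint -d 04:00.0 query                     # check current firmware version
    mstflint -d 04:00.0 -i fw-ConnectX2.bin burn  # burn the new image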

ZFS

Been playing with ZFS, using Sanoid for automated hourly snapshots and Syncoid for backup syncs. Tested disks getting removed, pools destroyed, pool export and import, disks overwritten with garbage, resilvering to recover, disk replacement, scrubs, rolling back to a snapshot, backup to a local replica, backup to a remote server, recovery from backup, per-file recovery from .zfs/snapshot, and hot spares. Backup syncs even seem to work between pools of different sizes, I guess as long as the data fits in the target pool. Hacked Sanoid to take hourly snapshots only if there are changes on the disk (zfs get written -o value -p -H).
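The "only snapshot if something changed" check boils down to looking at the written property before taking a snapshot. The actual hack lives inside Sanoid, but the gist is something like this (tank/work is a made-up dataset name):

    # bytes written to the dataset since its most recent snapshot
    written=$(zfs get written -o value -p -H tank/work)
    if [ "$written" -gt 0 ]; then
        sanoid --take-snapshots
    fi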

Copied over data from the workstations, then corrupted the live mounted pool with dd if=/dev/zero over a couple disks, destroyed the pool, and restored it from the backup server, all without rebooting. The Syncoid restore even restored the snapshots, A+++.
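The restore was just Syncoid run in the other direction, something like this (host and pool names are placeholders):

    # pull everything, snapshots included, back from the backup box
    syncoid -r root@backup:backup/tank tank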

After the successful restore, my Windows laptop bluescreened on me and corrupted the effect file I was working on. Git got me back to a 30-minute-old version, which wasn't so great. So, hourly snapshots aren't good enough. Dropbox would've saved me there with its per-file version history.

I'm running three 6TB disks in RAIDZ1. Resilvering 250 gigs takes half an hour, so resilvering a full disk should take somewhere between 12 and 24 hours. During which I pray to the Elder Gods to keep either of the two remaining disks from becoming corrupted by malign forces beyond human comprehension. And if they do, buy new disks and restore from the backup server :(
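The replacement itself is the usual zpool dance, something like this (device names made up):

    zpool replace tank sdc sdf    # swap the failed disk for the new one
    zpool status -v tank          # watch resilver progress and the ETA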

I made a couple of cron jobs. One does hourly Syncoid syncs from production to backup. The others run scrubs: an over-the-weekend scrub for the archive pool, and a nightly scrub on the fast SSD work pool. That is, the fast SSD work pool that doesn't exist yet, since my SSDs are NVMe and my server, well, ain't. And the NVMe-to-PCIe adapters are still on back order.
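The crontab looks roughly like this (pool names, binary paths, and the backup target are placeholders, and the schedule is approximate):

    # hourly backup sync from production to the backup server
    0 * * * *  /usr/sbin/syncoid -r tank root@backup:backup/tank
    # nightly scrub on the SSD work pool
    0 3 * * *  /sbin/zpool scrub work
    # weekend scrub on the archive pool
    0 2 * * 6  /sbin/zpool scrub archive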

Plans?

I'll likely go with two pools: a work pool with two SSDs in RAID-1, and an archive pool with three HDDs in RAIDZ1. The work pool would be backed up to the archive pool, and the archive pool would be backed up to an off-site mirror.
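In zpool terms the plan is roughly this (device names are placeholders):

    # work pool: two SSDs in a mirror
    zpool create work mirror nvme0n1 nvme1n1
    # archive pool: three 6TB HDDs in RAIDZ1
    zpool create archive raidz sda sdb sdc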

The reason for the two-volume setup is to get predictable performance out of the SSD volume, without the hassle of SLOG/L2ARC.

So, for "simplicity", keep current projects on the work volume. After a few idle months, automatically evict them to the archive and leave behind a symlink. Or do it manually (read: only do it when we run out of SSD.) Or just buy more SSD as the SSD runs out, use the archive volume only as backup.

I'm not sure if parity RAID is the right solution for the archive. By definition, the archive won't be getting a lot of reads and writes, and the off-site mirroring run is over GbE, so performance is not a huge issue (120 MB/s streaming reads is enough). Capacity-wise, a single HDD is 5x the current project archive, so having 10 TB of usable space would go a long way. But doing parity array rebuilds on 6TB drives, ugh. Maybe a three-disk mirror instead.
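With three 6TB disks the trade-off looks like this (either/or, not both):

    # RAIDZ1: ~12 TB usable, rebuild recomputes parity across the survivors
    zpool create archive raidz sda sdb sdc
    # three-way mirror: ~6 TB usable, rebuild is a straight copy, survives two failures
    zpool create archive mirror sda sdb sdc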

And write some software to max out the IOPS and bandwidth on the RAM, the disks and the network.
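Until that software exists, fio is probably the quickest way to get a disk-side baseline (the target directory and sizes here are made up):

    fio --name=randread --directory=/work/bench --size=4G \
        --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
        --numjobs=4 --runtime=30 --time_based --group_reporting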
