art with code

2018-03-11

InfiniBanding, pt. 2

InfiniBand benchmarks with ConnectX-2 QDR cards (PCIe 2.0 x8 -- a very annoying lane spec outside of server gear: either it eats up a PCIe 3.0 x16 slot, or you end up running at half speed, and even at x8 it's too slow to hit the full 32 Gbps of QDR InfiniBand. Oh yes, I plugged one into a physically-x8 slot wired for only four lanes, and it gets half the RDMA bandwidth in tests.)
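
For the back-of-the-envelope math behind that complaint, here's a quick sketch (the ~20% PCIe protocol overhead is my own ballpark assumption; the real efficiency depends on MaxPayload and the traffic pattern):

```python
# Rough link-budget math for a ConnectX-2 QDR card on PCIe 2.0.
# The ~20% PCIe protocol overhead is an assumed ballpark figure.

def pcie2_goodput_gbps(lanes, protocol_overhead=0.20):
    """PCIe 2.0: 5 GT/s per lane, 8b/10b encoding -> 4 Gbps of data per lane."""
    data_rate = lanes * 5.0 * (8 / 10)          # Gbps after encoding
    return data_rate * (1 - protocol_overhead)  # Gbps after TLP/DLLP overhead

qdr_data_gbps = 4 * 10.0 * (8 / 10)  # QDR: 4 lanes x 10 Gbps, 8b/10b -> 32 Gbps

print(f"QDR data rate:      {qdr_data_gbps:.1f} Gbps ({qdr_data_gbps / 8:.1f} GB/s)")
for lanes in (8, 4):
    g = pcie2_goodput_gbps(lanes)
    print(f"PCIe 2.0 x{lanes} usable: {g:.1f} Gbps ({g / 8:.1f} GB/s)")
```

The x8 number is where the 3.2 GB/s figure further down comes from, and the x4 number is why the four-lane slot halves the RDMA bandwidth.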

Ramdisk-to-ramdisk, 1.8 GB/s with Samba and IP-over-InfiniBand.

IPoIB iperf2 does 2.9 GB/s with four threads. The single-threaded iperf3 goes 2.5 GB/s if I luck out in the CPU affinity lottery (some cores / NUMA nodes only do 1.8 GB/s...)
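
If you want to replay the affinity lottery on purpose, something like this sketch runs iperf3 pinned to each NUMA node in turn (the server address and node list are placeholders, and it assumes numactl is installed and an `iperf3 -s` is already running on the other end):

```python
# Sweep NUMA nodes to see which one the IPoIB interface is happiest on.
# Assumes `iperf3 -s` is running on the server and numactl is installed.
import json
import subprocess

SERVER = "10.0.0.1"   # placeholder: IPoIB address of the iperf3 server
NUMA_NODES = [0, 1]   # placeholder: NUMA nodes present on this box

for node in NUMA_NODES:
    cmd = [
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "iperf3", "-c", SERVER, "-t", "10", "-J",   # -J: JSON output
    ]
    result = json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)
    gbytes_per_s = result["end"]["sum_received"]["bits_per_second"] / 8e9
    print(f"NUMA node {node}: {gbytes_per_s:.2f} GB/s")
```

The iperf2 four-thread number is just `iperf -c SERVER -P 4` on top of the same pinning.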

NFS over RDMA: for 64k random reads at QD8 with one thread, fio tells me read bw=2527.4MB/s. Up to 2.8 GB/s with four threads. Up to 3 GB/s with 1MB reads.
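
The fio runs were along these lines. This is a reconstruction rather than the exact job: the mount point, file size and runtime are placeholders.

```python
# Rough reconstruction of the random-read tests against the NFS-over-RDMA
# mount. Mount point, file size and runtime are placeholders, not the
# values used in the post.
import subprocess

def run_fio(bs="64k", iodepth=8, numjobs=1):
    subprocess.run([
        "fio",
        "--name=nfs-rdma-randread",
        "--filename=/mnt/nfs-rdma/testfile",  # placeholder NFS-over-RDMA mount
        "--rw=randread",
        f"--bs={bs}",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        "--ioengine=libaio",
        "--direct=1",
        "--size=8G",
        "--runtime=30", "--time_based",
        "--group_reporting",
    ], check=True)

run_fio()                    # 64k, QD8, one thread
run_fio(numjobs=4)           # four threads
run_fio(bs="1M", numjobs=4)  # 1MB reads
```

direct=1 keeps the client page cache out of the measurement, so the numbers are about the NFS-over-RDMA path rather than local RAM.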

The practical bandwidth limit of the PCIe 2.0 x8 link these QDR cards use is around 25.6 Gbps, or 3.2 GB/s, once protocol overhead is taken into account. Testing with ib_read_bw, the cards max out at around 3 GB/s.

So. Yeah. There's 200 MB/s of theoretical performance left on the table (might be ESXi PCIe passthrough exposing only a 128-byte MaxPayload), but can't complain.
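
A quick sanity check on that MaxPayload theory (the ~24 bytes of per-TLP overhead is an assumption; the exact figure depends on header format and framing):

```python
# How much PCIe goodput a 128-byte vs 256-byte MaxPayload costs.
# ~24 bytes of per-TLP overhead (header + framing + LCRC) is an assumption.
TLP_OVERHEAD = 24            # bytes per transaction layer packet, ballpark
LINK_GBPS = 32.0             # PCIe 2.0 x8 after 8b/10b encoding

for mps in (128, 256):
    efficiency = mps / (mps + TLP_OVERHEAD)
    print(f"MaxPayload {mps}B: {efficiency:.0%} -> "
          f"{LINK_GBPS * efficiency / 8:.2f} GB/s")
```

With the assumed overhead, the two payload sizes differ by roughly 300 MB/s, which is in the same ballpark as the gap above.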

And... there's an upgrade path composed of generations of obsolete enterprise gear: FDR gets you PCIe 3.0 x4 cards and should also get you the full 4 GB/s bandwidth of the QDR switch. FDR switches aren't too expensive either, for a boost to 56 Gbps (roughly 6 GB/s usable) per link. Then, pick up EDR / 100 GbE kit...

Now the issue (if you can call it that) is that the server is going to have 10 GB/s of disk bandwidth available, which is going to be bottlenecked (if you can call it that) by the 3 GB/s network.

I could run multiple IB adapters, but I'll run out of PCIe slots. Possibilities: bifurcate an x16 slot into two x8 slots for IB, or into four x4 slots for NVMe. Or bifurcate both slots. Or get a dual-port FDR/EDR card with an x16 connector to get 8 GB/s on the QDR switch. Or screw it and figure out how to make money out of this and use it to upgrade to dual-100 GbE everywhere.

(Yes, so, we go from "set up NAS for office so that projects aren't lying around everywhere" to "let's build a 400-machine distributed pool of memory, storage and compute with GPU-accelerated compute nodes and RAM nodes and storage nodes and wire it up with fiber for 16-48 GB/s per-node bandwidth". Soon I'll plan some sort of data center and then figure out that we can't afford it and go back to making particle flowers in Unity.)

