art with code

2017-10-08

Fast marching cubes in JavaScript

Marching cubes! Where do they march? What is their tune? The name of their leader, a mystery if any.

Marching cubes works like this:

  1. You have a 3D array of bits and want to create a mesh out of it.
  2. Read a 2x2x2 cube from the array.
  3. Generate a mesh based on the values of the cube.
  4. Repeat for every 2x2x2 cube in the array and concatenate the meshes.

The individual cube meshes work like Lego blocks, they click together to form a seamless mesh.

How to do it kinda fast:

  1. Create cached meshes and normals for each different 2x2x2 bit cube (there are 2^8 of them). You get an array like cubeMeshes[eightBitCubeIndex].
  2. Create a binary array based on the original data. Unroll loops to process in chunks of 8, do it SIMD-like, pass over the original data and spit out ones and zeroes into a Uint8Array. (You could put 8 bits per byte, but it's a hassle.) 
  3. Create a cube index Uint8Array that's going to be filled with the eight-bit cube indexes of each 2x2x2 cube in the data.
  4. Fill the cube index array by marching a 2x2x2 cube over the binary array and converting the read cube values into eight-bit cube indexes. Increment total mesh vertex count by cubeMeshLengths[eightBitCubeIndex].
  5. Allocate Float32Arrays for vertices and normals based on the total mesh vertex count.
  6. Iterate over the cube index array. Write the mesh corresponding to the cube index to the vertex array, offset each vertex with the xyz-coordinates of the cube index. Write the normals corresponding to the cube index to the vertex array.

Source: fastIsosurface.js - demo

This runs in ~150ms on an Intel i7 7700HQ for a 7 million element data array (256x256x109).

Future directions

As you may notice from the source, it's SIMD-friendly, in case you can use SIMD. The algorithm parallelizes easily too.

Web Workers with transferable objects? Transform feedback in WebGL 2 + a reducer kernel to remove empty triangles? Do it in a fragment shader to a float render target? Magic?

Handwaving

The test dataset contains 7 million 16-bit uints which takes about 14 megabytes of RAM. This means that it won't fit in the 7700HQ's 4x1.5MB L3 cache, much less the 4x256kB L2 or the 4x32kB L1.

By compressing the dataset into a bit array, it would fit in 7 megabits, or 875 kB. Processing that with four cores (8 threads) would keep the read operations in the L2 cache. Chunking the processing into 30 kB cubes would keep the reads mostly in the L1 cache.

The output array for the marching cubes step consists of a byte per cube. The original input array has two bytes per element. The bit array has one byte or one bit per element. The output vertex arrays have up to 90 floats, or 360 bytes per cube (but they're very sparse, the average is 1-2 bytes per cube). There's roughly one cube per input array element.

Taking the sum of the above, we get about 1 + 2 + 1 + 1 = 5 bytes per cube. We could process 6000 cubes in a 32kB L1 cache. That might come to 64x10x10 input elements that output 63x9x9 cubes for total memory use of 29406 bytes and 5103 cubes.

How fast would that be? Let's see. You need to read in the input data. That's going to come from RAM at 40 GB/s => something like 0.05 ns per cube. You can crunch it into the binary array as you go: two comparisons, a logical AND, and a store to L1 would work out to 2 ns per input element at 3GHz. For per-cube time, divide by 8 as each element is used by 8 cubes: 0.25ns per cube.

Then read through it with a 2x2x2 window for the cube generation, do a couple multiply-adds. Updating the window requires avg 4 reads per step plus processing to generate the cube indexes, say 4x7 cycles in total.

Then write the vertices to the vertex array. This might take 6 cycles for the array fetch and write.

Add some fudge, 3 GHz clock rate. Each cube takes 4x7 + 6 = 34 cycles. Estimated runtime 12ns per cube (+ 0.25ns for input data processing). Need 10 million cubes for the mesh: 120 ms. Do it in parallel in four L1 caches => 30 ms.

But, uh, it already runs in 150 ms for some meshes. And crunching the input data takes 20 ms of that. In JavaScript. What.

No comments:

Blog Archive