In this second post of taking a closer look at some of the new features coming along in the proposals process for WebAssembly, we'll check out threads!

The threads proposal, currently in stage 2, but already implemented in Chrome¹, provides a new shared memory type and the atomic memory operators necessary to support threads. The creating and joining of threads is handled by the host and not described in the spec.

Let's take a look at that sparkly new shared memory. ✨

As always, we can either import it from JavaScript

let memory = new WebAssembly.Memory({
  initial: 1,
  maximum: 1,
  shared: true,
});
...
;; Import a shared linear memory with minimum and maximum
;; size of 1 page (64KiB)
(import "env" "memory" (memory 1 1 shared))

or create it in the WebAssembly module and export to JavaScript

;; Define an exported shared linear memory with minimum
;; size of 1 page (64KiB)
(memory (export "memory") 1 shared)
// And in the JavaScript side after instantiating the Wasm Module:
WebAssembly.instantiateStreaming(fetch('main.wasm'))
  .then(obj => {
    const mem = obj.instance.memory.buffer;
    // Create a view on the array buffer and do stuff!
    ...
  });

Sharing is caring 🤲

Well, the beauty of shared memory is actually, sharing it!

For a simple illustration of threads and atomic memory access, we'll be doing an embarrassingly parallel task. We'll perform a Monte Carlo integration of a circle to determine its area. Yeah, I know, it's a pretty pointless endeavour considering the simplicity of the circle's area formula determined by analytical integration, but it proves to be a straightforward and verifiable example of threads! I'll leave the parallelised ray tracing Wasm engine for another post 😉.

If you're not familiar with Monte Carlo integration, or numerical calculus techniques in general, we're basically just going to randomly mark points in a unit square that encloses a circle of diameter 1 inscribed in it and determine the ratio of marked points inside the circle to those in the entire square (inside + outside the circle).

That ratio will actually represent an estimate for the area of the circle. By using more random marked points we can approach more accurate results for the area. One nice thing about this simple example is that we know the area of the circle is π/4, so we expect our ratio to approach that as we keep marking more random points.

For extra fun, we are not going to be touching a higher level language to compile to Wasm. We'll write the actual Monte Carlo integration logic directly in the WebAssembly Text Format (.wat 🤷‍♀️). Then we'll use wat2wasm from wabt with threads enabled to do that jump.

The setup 👀

It really couldn't be simpler. We have four files, excluding the generated .wasm:

  • index.html
  • main.js
  • worker.js
  • monte_carlo.wat

I'm basically giving it away that we'll be using Web Workers with worker.js. Web Workers are how browsers allow us to create multi-threaded applications on the Web. In Chrome, each worker has its own V8 instance, called an isolate.

Jumping straight into main.js we make use of the shiny Web Crypto API and specifically Crypto.getRandomValues() to fill a Uint8Array with some random 8 bit unsigned integers. We're going to use these numbers in pairs as random coordinates. As you can see we are creating an array with the maximum length of numbers that the crypto API can generate in one go. There's not enough initial entropy to guarantee good randomness after that limit.

// We need to generate twice as many random numbers as
// coordinate pairs we wish to use.
// Choose so that length of the Uint8Array < 2^16 (65536)
const points = 30000;
// We are using 4 bytes for a 32 bit counter, hence the + 4.
// We'll zero them later.
const randomNumbers = crypto.getRandomValues(
  new Uint8Array(points * 2 + 4),
);

Next we'll need to create some shared Wasm memory which will be backed by a SharedArrayBuffer. You may notice for cases where the memory is not shared, the maximum pages of memory is optional, but for shared memory you must explicitly set it.

const memory = new WebAssembly.Memory({
  // 1 page is 64KiB which is more than enough for this task.
  initial: 1,
  maximum: 1,
  shared: true,
});

We can now initialise the memory with our random numbers:

const wasmUint8 = new Uint8Array(memory.buffer);
wasmUint8.set(randomNumbers);
const wasmUint32 = new Uint32Array(memory.buffer);
wasmUint32[0] = 0; // Set the counter to 0

We couldn't have just passed the Uint8Array view over the Wasm memory into crypto.getRandomValues since it does not allow populating into a SharedArrayBuffer (yet?).

Now the rest of main.js is just to orchestrate the Workers and produce a result for the area. Each worker is going to send back its own "local ratio" it calculated. We're just going to average over them.²

let doneCount = 0;
let finalResult = 0;
let validResults = 0;

const workerCount = 4;
const start = performance.now();

for (let i = 0; i < 4; i++) {
  const worker = new Worker('./worker.js', { type: 'module' });
  worker.onmessage = (e) => {
    doneCount++;
    if (e.data.result !== 0) {
      validResults++;
      finalResult += e.data.result;
    }

    if (doneCount === workerCount) {
      console.log(`Integration completed in ${performance.now() - start} ms.`);
      console.log(`Approximate area of shape: ${finalResult / validResults}`);
    }
  };
  worker.postMessage({ memory });
}

The Web Worker is quite straightforward. We'll do a streaming instantiation of the compiled Wasm module, providing the memory we created into its import object so it can be accessed from Wasm land. Then we use the exported Wasm function integrate() to do the Monte Carlo integration.

onmessage = async (e) => {
  const obj = await WebAssembly.instantiateStreaming(
    fetch('monte_carlo.wasm'),
    {
      env: {
        memory: e.data.memory,
      },
    },
  );

  const ratio = obj.instance.exports.integrate();
  console.log(ratio);
  postMessage({ result: ratio });
};

Wat about Wasm? 🤷‍♂️

As mentioned before, we'll be writing everything in the WebAssembly Text Format in monte_carlo.wat. I've used a mixture of S-expressions and postfix syntax so you can get familiar with both variants.

(module
  (import "env" "memory" (memory 1 1 shared))
  (func (export "integrate") (result f32)
    (local $x f32)
    (local $y f32)
    (local $pair i32)
    (local $in_circle f32)
    (local $in_square f32)

    ;; Set the local variables.
    (local.set $in_circle (f32.const 0))
    (local.set $in_square (f32.const 1))

    (block $block
      (loop
        i32.const 0 ;; At memory offset 0
        i32.const 1 ;; Add 1 atomically to existing i32 value
        i32.atomic.rmw.add
        ;; Set the resulting value as a local for later usage
        local.set $pair

        ;; Get out of the loop if we hit 30000 pairs
        (br_if $block (i32.ge_u (local.get $pair) (i32.const 30000)))
        
        local.get $pair
        call $getPairOffset
        i32.load8_u
        call $transformCoordinate 
        local.set $x

        local.get $pair
        call $getPairOffset
        i32.const 1
        i32.add
        i32.load8_u
        call $transformCoordinate 
        local.set $y

        ;; Check if our point is within the circle:
        ;; i.e. x^2 + y^2 <= 0.25
        local.get $x
        local.get $x
        f32.mul
        local.get $y
        local.get $y
        f32.mul
        f32.add
        f32.const 0.25
        f32.le
        if
          ;; If we are in the circle then make
          ;; sure to increment $in_circle too.
          local.get $in_circle
          f32.const 1
          f32.add
          local.set $in_circle
        end
        local.get $in_square
        f32.const 1
        f32.add
        local.set $in_square

        (br 0)
      )
    )

    ;; Calculate the ratio which will be
    ;; popped off as the result of this
    ;; function invocation.
    local.get $in_circle
    local.get $in_square
    f32.div
  )


  ;; Given the nth pair, we can calculate its offset
  ;; in linear memory.
  (func $getPairOffset (param $pair i32) (result i32)
    local.get $pair
    i32.const 2
    i32.mul
    i32.const 2
    i32.add
  )

  ;; Here we'll transform the coordinates from the range
  ;; [0,255] to the range [-0.5,0.5].
  (func $transformCoordinate (param $coord i32) (result f32)
    local.get $coord
    f32.convert_i32_u
    f32.const 255
    f32.div
    f32.const 0.5
    f32.sub
  )
)

You'll need to compile monte_carlo.wat to monte_carlo.wasm using wabt's wat2wasm with the --enable-threads flag.

wat2wasm --enable-threads monte_carlo.wat

And of course we can't forget the last piece, index.html. You'll need to be running a local server otherwise fetch will not work.

<!DOCTYPE html>
<html>
  <meta charset="utf-8" />
  <head>
    <script src="main.js"></script>
  </head>
</html>

Results

Unsurprisingly this does not really make great use of the threads. It's a fairly simple problem and sometimes the other threads don't even get a chance to work on anything before the first thread is done with everything. On my machine the number of threads that ended up doing work varied across the entire range of threads. The performance didn't really have much of a difference either. We'd need to look at a longer task.

With 30 000 random points on a 255 x 255 point square got me around an area of 0.78 which is close to the actual area of 0.78539816339...

Footnotes

  1. WebAssembly threads depends on SharedArrayBuffer to allow shared memory access. Unfortunately due to Spectre side-channel attacks, back in January 2018 all major browsers disabled SharedArrayBuffer. Chrome was only able to later re-enable it on platforms where site isolation is present, thus mitigating the vulnerability. Interestingly, two new security related headers being introduced will help bring back SharedArrayBuffer to all major browsers by default. These are COOP (Cross-Origin-Opener-Policy) and COEP (Cross-Origin-Embedder-Policy). These can currently be used to allow browsers to opt-in to being cross-origin isolated and will allow usage of SharedArrayBuffer.
  2. Actually, what happens sometimes is that some workers might not even get a chance to get going on the problem in time (considering this is a piece of cake problem), that some might return zero as they were too slow to start working with any random points.