I Built a WebAssembly Runtime in 5 Days Because I Was Tired of Paying for Cloud Run

How a bootstrapped hardware startup, an $8 VPS, and five days of hacker-mode debugging became a working multi-tenant sandbox platform


I co-founded a hardware audio startup. We needed infrastructure for firmware signing, device activation, and OTA delivery. I looked at the hyperscaler options — AWS KMS, Azure signing services — and the pricing made no sense for a bootstrapped company. Hundreds of dollars a month for workloads that consume almost nothing.

My background before this was embedded security. ESP32-S3 hardware secure boot, ARM TrustZone-M, HKDF key derivation on constrained hardware. The mental model from embedded work is: every single layer can fail, so you layer your defenses and verify everything. You never just trust.

WebAssembly seemed like the right tool. It's sandboxed by design, it compiles to a tiny binary, and WASI Preview 2 had just landed real networking support — meaning a WASM module could make HTTPS calls from inside the sandbox without me writing HTTP plumbing as a host function. That's what I wanted: run untrusted code safely, let it talk to the network, charge for compute, bill by instruction count.

The goal: build something like a minimal Cloudflare Workers, self-hosted, on commodity hardware. I called it Badwater.


Day 1 — The API Doesn't Work the Way the Tutorial Says

I had never used Wasmtime before. I knew Rust and I knew Linux, but the WASM component model was completely new to me. The first thing I did was copy a tutorial example that used wasmtime::Func::wrap to register a host function. The example was for core modules. WASM components are different.

When you work with components, Func::wrap just doesn't work. You need wasmtime::component::Linker instead of wasmtime::Linker, and you navigate to the right interface with .instance("my:pkg/iface") before you can wrap anything. None of the first tutorials I found made this distinction obvious — they all showed the core module API, which is shorter and cleaner.
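The shape of the component-model registration is worth showing. This is a hedged sketch assuming the wasmtime crate's component API; the interface name "my:pkg/iface" and the "log" function are illustrative, not Badwater's actual exports:

```rust
use wasmtime::Engine;
use wasmtime::component::Linker;

// Host-function registration for a *component* (not a core module):
// navigate to the interface first, then wrap. With core modules you'd
// use wasmtime::Linker and Func::wrap directly — that path panics or
// fails to link for components.
fn build_linker(engine: &Engine) -> wasmtime::Result<Linker<()>> {
    let mut linker: Linker<()> = Linker::new(engine);
    linker
        .instance("my:pkg/iface")?                  // scope to the WIT interface
        .func_wrap("log", |_store, (msg,): (String,)| {
            println!("guest says: {msg}");
            Ok(())
        })?;
    Ok(linker)
}
```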

Then I got an actual working guest compiled and tried to run it. The guest made an HTTPS call using ureq with rustls — pure Rust TLS, no host HTTP function. It compiled. I hit the endpoint. The host panicked:

thread 'tokio-rt-worker' panicked at wasmtime-wasi-43.0.1/src/runtime.rs:108:15
Cannot start a runtime from within a runtime. This happens because a function (like block_on)
attempted to block the current thread while the thread is being used to drive asynchronous tasks.

The problem: Wasmtime's synchronous WASI implementation internally calls block_on to drive its async internals. If you're already inside a Tokio worker thread (which you are when an Axum handler runs), this panics because Tokio refuses to nest runtimes.

The fix is spawn_blocking. Move the entire Wasmtime invocation off the async thread pool:

let result = tokio::task::spawn_blocking(move || {
    // wasmtime execution here — now we're on a blocking thread
    // can use blocking I/O, no nested runtime panic
    ...
}).await;

Once the WASM execution lives entirely in spawn_blocking, the blocking I/O inside it — TLS handshakes, network calls — runs on a thread pool that's designed for blocking work. The outer async runtime is unaffected.


By end of day 1, I had a guest function that could make an HTTPS request to Cloudflare's trace API from inside a WASM sandbox and return the result. It took the whole day.


Day 2 — Thinking Through What Could Go Wrong

A working WASM runtime is one thing. A multi-tenant WASM runtime is something else. The question is isolation: if tenant A's code crashes, can it affect tenant B? If tenant A has a bug or is doing something malicious, what can they access?

I started thinking about this the way I think about embedded systems. What fails? What's the blast radius? In embedded work you always ask: if this component fails, what does it take down? The answer shapes the architecture.

This led to a design decision that turned out to be correct: two separate binaries.

  • badwater-dispatcher — the HTTP server. Handles requests, fetches WASM from storage, manages timeouts.
  • badwater-runner — the WASM executor. Runs as a completely separate process per request, communicates over a Unix socketpair.

The key insight: if the runner crashes, panics, or is killed by the OS, the dispatcher is completely unaffected. They share no memory. The only interface between them is a typed binary protocol over a socket. The dispatcher sends WASM bytes and a request descriptor; the runner sends back a response. That's it.

But there's a bigger question: what stops the runner from accessing the host filesystem, or seeing other tenants' processes, or escalating privileges? WASM sandboxing is good, but Wasmtime has had security bugs before. You need a second layer.

I looked at options. Docker was too heavy — cold start overhead from the daemon alone. crun was interesting but had container complexity I didn't want to deal with. containerd was overkill. gVisor was too complex to self-host without an existing ops team. Then I found bubblewrap — the same sandbox tool that Flatpak uses for untrusted application isolation. It sets up Linux namespaces (user, pid, ipc, uts, cgroup), drops all capabilities, gives you a fresh tmpfs root. It's a single static binary, no daemon, auditable in an afternoon.

HTTP Request
  │
  ▼
[ badwater-dispatcher ] (Axum, Tokio)
  │
  ├─ fetch .cwasm 
  ├─ socketpair() ──────┐ 
  │                     │ (Unix socket)
  ▼                     ▼
[ child.kill() ]     [ bwrap sandbox (namespaces, --cap-drop ALL) ]
 (Hard timeout)         │
                        ▼
                     [ badwater-runner (PID 2) ]
                        │
                        └─ Wasmtime (Fuel metering, WASI P2)

The architecture for day 2: every request spawns a fresh bwrap sandbox. WASM bytes stream over a socketpair file descriptor that gets dup2'd into the child process before exec. The runner is killed if it exceeds the wall-clock timeout.


Day 3 — bwrap Doesn't Like Being Told Nothing

Getting bubblewrap to actually work took most of day 3. I understood the theory. The practice was different.

First attempt at a sandboxed shell to verify isolation was working:

bwrap --unshare-all --ro-bind / / -- /bin/bash

Output:

bash: /dev/null: Permission denied
bash: /dev/null: Permission denied
bash: /dev/null: Permission denied
... (19 more times)

Twenty lines of the same error. --ro-bind / / bind-mounts the host root directory into the sandbox, but it's not a recursive bind. Your host's /dev is a separate devtmpfs mount that happens to be under /. When you --ro-bind / /, you get an empty /dev directory inside the sandbox. Bash's initialization tries to redirect things to /dev/null and fails every time.

Fix: --dev /dev. This tells bwrap to mount a minimal devtmpfs at /dev inside the sandbox — /dev/null, /dev/zero, /dev/urandom, the basic set. One flag.

Then: network worked in bwrap on the host, but DNS was broken. The sandbox had no /etc/resolv.conf because I hadn't explicitly bound it in. The sandbox can't see any host paths unless you explicitly mount them. Fix: --ro-bind /etc/resolv.conf /etc/resolv.conf. One more flag.
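At this stage the working invocation had accumulated into something like the following — an illustrative sketch, not Badwater's exact production flag set (which binds only what the runner needs rather than the whole root):

```shell
bwrap \
  --unshare-user --unshare-pid --unshare-ipc --unshare-uts --unshare-cgroup \
  --ro-bind / / \
  --dev /dev \
  --ro-bind /etc/resolv.conf /etc/resolv.conf \
  --cap-drop ALL \
  -- /bin/sh -c 'echo "sandbox pid: $$"'
```

Note the network namespace is deliberately not unshared — the guest needs outbound HTTPS — which is exactly the SSRF caveat that comes up later.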

Then I switched from Podman to Docker for building the final scratch image, and hit a completely different wall:

bwrap: creating new namespace failed

Root cause: Docker's default seccomp profile blocks CLONE_NEWUSER — the syscall bwrap needs to create a user namespace. Podman rootless mode allows it because rootless containers run with unprivileged user namespace support enabled by default. Docker runs containers as root and its default seccomp profile is more conservative.

Fix: --privileged on the Docker container, or a custom seccomp profile. For development, --privileged is fine.

Then the FROM scratch image had no /etc/resolv.conf to bind-mount because FROM scratch contains nothing. I added a dirprep build stage that creates the empty directory structure before the final COPY.

Each fix revealed the next problem. Each problem was actually bwrap working correctly — it just needed to be told explicitly about everything it needed. By end of day 3, this was in the logs:

INFO badwater_runner: runner: starting (pid=2)
INFO badwater_runner: runner: socket acquired
INFO badwater_runner: runner: wasm completed - elapsed=555ms, fuel_consumed=182279633

pid=2. Inside the new PID namespace, bwrap itself runs as PID 1 and the runner sees itself as PID 2. That's the isolation. The socketpair fd-passing worked. The whole pipeline ran end to end.


Day 4 — The JIT Issue

First Cloud Run deploy. I'd been testing locally and the numbers looked fine. On Cloud Run, cold starts were showing ~2500ms. Way too slow for a platform that was supposed to be lightweight.

Every single request. Wasmtime was JIT-compiling the WASM component from scratch on each invocation. Cranelift — Wasmtime's code generator — re-ran the entire compilation pipeline every time. For a 1MB WASM binary that includes rustls and the TLS stack, that meant roughly two seconds of compilation before the guest code even started running.

Wasmtime has a solution for this: wasmtime compile produces a .cwasm file — native machine code pre-compiled for the host CPU. Component::deserialize_file loads it in ~20ms, bypassing JIT entirely. When I wrote a test to compare JIT and native WASM execution side by side, it showed the exact difference:

test tests::compare_precompiled_vs_jit ... runner: JIT-compiling component from ./function.wasm
runner: wasm completed - elapsed=2081ms, fuel_consumed=182319268, fuel_remaining=4817680732, fuel_limit=5000000000, utilization=3.6%
runner: loading precompiled component from ./function.cwasm
runner: wasm completed - elapsed=117ms, fuel_consumed=182279609, fuel_remaining=4817720391, fuel_limit=5000000000, utilization=3.6%

The 2081ms dropped to 117ms. Fuel consumption was unchanged at ~182M — same work, same cost, just no recompilation on every request.

I built badwater-build-cwasm, a small tool that compiles .wasm into .cwasm with the same Wasmtime config as the runner. Deployed to Cloud Run.

New cold start: ~100ms. 25x improvement.

Then it broke. Intermittently. Some requests failed with:

compilation setting "has_avx512bitalg" is enabled, but not available on the host

I had compiled the .cwasm on my 7800X3D desktop. Cranelift detects the host CPU's features at compile time and emits native code using whatever extensions are available. My Ryzen supports AVX-512 variants. Cloud Run's instances do not.

The failure was intermittent because Cloud Run has multiple CPU generations in its fleet. The .cwasm loaded fine on newer instances that happened to have AVX-512, and failed on older ones. Same image, different behavior depending on which physical machine the container landed on. This took me a while to figure out because I couldn't reproduce it locally — my desktop always had AVX-512.

Fix: tell Wasmtime to target a generic x86_64 baseline instead of the host CPU:

config.target("x86_64-unknown-linux-musl")?;

That one line makes Cranelift compile for the lowest common x86_64 denominator — no AVX-512, no host-specific extensions. AVX2 is still used; it's universally available on cloud hardware from the last decade.
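Putting Day 4 together, the ahead-of-time compile step inside a tool like badwater-build-cwasm might look like this — a sketch assuming the wasmtime crate, with illustrative file names and only the target() knob shown:

```rust
use wasmtime::{Config, Engine};

// Precompile a component to native code for a *generic* x86_64 baseline,
// so the .cwasm runs on any instance in the fleet, not just the build host.
fn main() -> wasmtime::Result<()> {
    let mut config = Config::new();
    config.target("x86_64-unknown-linux-musl")?; // no host-specific ISA extensions

    let engine = Engine::new(&config)?;
    let wasm = std::fs::read("function.wasm")?;
    let cwasm = engine.precompile_component(&wasm)?; // later loaded via deserialize_file
    std::fs::write("function.cwasm", cwasm)?;
    Ok(())
}
```

The crucial detail is that the runner must be built with a compatible Config, otherwise deserialization is rejected.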

After that: consistent sub-100ms server-side execution, every request, every Cloud Run instance.


Day 5 — $8/Month, Live Domain

Cloud Run is elegant for development but the pricing doesn't make sense at scale. A 4-core, 8GB VPS from OVH US-West costs $7.60/month. Same region as Cloud Run us-west1. The math was obvious.

The VPS doesn't have GCP's Workload Identity, so the GCS authentication model breaks. (Plus, egress fees from GCP to OVH would add up fast.) I needed a different storage backend. Cloudflare R2 is S3-compatible with a generous free tier and global edge distribution. The signing protocol is AWS SigV4 — an HMAC-SHA256 key derivation chain. I didn't want to pull in the AWS SDK for this.

The SigV4 implementation ended up being 242 lines of Rust with no external auth dependencies. The interesting part was the date math — SigV4 needs a formatted date string as part of the signing key, but I didn't want to pull in chrono just for that. Howard Hinnant has an algorithm for computing the calendar date from a raw Unix timestamp using pure integer arithmetic. I implemented that directly. It works with AWS S3, Cloudflare R2, MinIO, Wasabi, DigitalOcean Spaces — any S3-compatible endpoint.

Deployed to OVH. Put Cloudflare Tunnel in front — cloudflared running as a systemd service, outbound-only connection, no open ports on the VPS. Got a domain: badwater.app.

Final numbers from the live deployment:

# Health-check (WASM on disk, no storage fetch)
WASM execution:    1ms
Server-side total: 9ms
End-to-end:        208ms (including Cloudflare TLS + proxy)

# trace.cwasm (4.9MB cold fetch from R2 + outbound HTTPS inside sandbox)
R2 fetch:          14ms
Sandbox startup:   6ms
WASM execution:    162ms (includes full TLS 1.3 handshake + HTTP to Cloudflare)
Server-side total: 182ms
Fuel consumed:     3.6% of limit

No cache. Every request is a fresh sandbox from scratch.

The 9ms server-side on health-check is the floor of the model. A fresh Linux process, new user/pid/ipc namespaces, capability drop, WASM deserialization, and execution — all in under 10ms. The jump to 182ms on trace.cwasm is almost entirely the R2 fetch (14ms) and the guest's own outbound HTTPS call (the TLS handshake inside the sandbox). The sandbox itself is 6ms.


What I Ended Up With

The whole platform is 2,270 lines of Rust. No unnecessary dependencies. The image is 16MB. It runs on an $8 VPS.

The architecture that felt right on Day 2 shares the same general pattern that Cloudflare Workers uses — language-runtime sandbox (Wasmtime/V8) plus process-level isolation. I didn't know this when I designed it. I got there by asking "what fails and what's the blast radius?" That question is the embedded engineer's instinct applied to cloud infrastructure — the same thinking that ran through my ARM TrustZone-M work.

The JIT fix was the biggest single performance win.

The AVX-512 bug is obvious in retrospect. In embedded work you always specify the target CPU — you never compile for "the current machine" and ship to production. Somehow that gets forgotten in cloud deployments. The rule is the same: know your deployment ISA, don't assume the dev machine matches it.


What's Next: The Multi-Tenant Reality Check

I built Badwater to solve my own problem, and it runs my production workloads flawlessly. But a platform isn't truly multi-tenant just because it runs in a sandbox. There are two major architectural hurdles left to solve before I'd let strangers upload arbitrary WASM to my servers.

1. Hiding the Sandbox Overhead (The Warm Pool)

I drove the cold-start floor down to 9ms, but 9ms on the critical path is still an eternity in high-throughput systems. Right now, every single HTTP request triggers a synchronous fork(), a bubblewrap namespace initialization, and a Wasmtime engine boot-up. To achieve the sub-millisecond overhead of commercial FaaS platforms, Badwater needs a warm pool. The dispatcher needs to maintain a queue of pre-spawned, sandboxed runners simply waiting for a payload on their socket, removing the OS-level process creation from the request lifecycle entirely.

2. The SSRF Metadata Problem

Bubblewrap provides excellent process and filesystem isolation, but currently, the sandboxes share the host's ability to route outbound network traffic. On my OVH box, this is fine. But if someone deploys Badwater to GCP or AWS, a malicious tenant could write a WASM function that makes an HTTP GET request to 169.254.169.254 — the cloud provider's internal metadata endpoint. Without strict network isolation, that tenant could easily steal the underlying virtual machine's IAM credentials. Real multi-tenancy requires isolating the network namespaces or implementing strict UID-based packet filtering at the host level to drop those internal routing requests.

3. Tiered Image Caching (Killing the Network on Cold Starts)

Even with a warm pool of initialized runners, a cold start is eventually required — either because a tenant deploys new code or the pool scales up. Right now, a cold start requires pulling the .cwasm binary over the public internet from GCS or R2. In my benchmarks, fetching a 4.9MB component takes 14ms.

At a low request volume, 14ms is fine. At high throughput, saturating the host's network link to repeatedly download the same binary from S3 is architectural malpractice. But on an $8 VPS with only 8GB of RAM, you can't just cache every tenant's binary in memory.

The platform needs a multi-tier LRU (Least Recently Used) caching strategy. When the dispatcher needs a WASM image, it should check an in-memory cache first (Tier 1, sub-millisecond). If there's a cache miss, it falls back to a local SSD cache (Tier 2, ~1ms), and only if that misses does it go out to object storage (Tier 3, 14ms). Building a cache invalidation and eviction policy that respects the strict memory limits of commodity hardware is the next major infrastructure hurdle.

The Code

The runtime is open source under GPL-3.0-or-later.

Source: github.com/peterw22/badwater

Prebuilt images:

docker pull ghcr.io/peterw22/badwater:gcs-0.09  # Google Cloud
docker pull ghcr.io/peterw22/badwater:s3-0.09   # Anywhere else

Live demo:

curl https://us-west.badwater.app/invoke/trace.cwasm

If you're an infrastructure nerd, I'd love for you to try breaking out of the sandbox, or open an issue if you see a flaw in the bubblewrap configuration.