A plan for making Io substantially faster, along two implementation paths: a pure JavaScript interpreter, and a more aggressively optimized WASM VM.
The Performance page tracks the current numbers: roughly 8–13 million slot accesses per second and ~2.3 million instantiations per second on Apple Silicon under wasmtime. Profiles of the current VM show the cost is spread across a few structural sources rather than one hot spot:
Number is a full heap object. 1 + 2 allocates, and arithmetic-heavy loops spend much of their time in the allocator and collector rather than adding.
(tag, slotValue, slotVersion) triple. Polymorphic call sites miss every time the receiver type alternates, falling back to a cuckoo-hash slot walk up the proto chain.

Both paths below attack these same costs with different tools. Progress on either is measured the same way: the correctness suite is the contract, and every change lands as a point on the benchmark history.
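To make the polymorphic-miss cost concrete, here is a minimal sketch of a single-entry inline cache keyed on a (tag, slotValue, slotVersion) triple. Everything in it is an assumption for illustration: the object layout, the names makeObject, lookupSlow and makeCallSite, and the Map-based proto walk standing in for the real cuckoo-hash lookup. It is not the VM's code.

```javascript
// Hypothetical model: a global version counter that slot writes would bump.
let slotVersion = 0;

// Hypothetical object layout: a type tag, own slots, and a proto link.
function makeObject(tag, slots, proto = null) {
  return { tag, slots: new Map(Object.entries(slots)), proto };
}

// Slow path: walk the proto chain until the slot is found.
// (A plain Map stands in for the cuckoo-hash table.)
function lookupSlow(obj, name) {
  for (let o = obj; o !== null; o = o.proto) {
    if (o.slots.has(name)) return o.slots.get(name);
  }
  return undefined;
}

// One call site with a single cached (tag, slotValue, slotVersion) triple.
function makeCallSite(name) {
  const site = { tag: null, slotValue: undefined, slotVersion: -1, hits: 0, misses: 0 };
  site.lookup = (receiver) => {
    if (site.tag === receiver.tag && site.slotVersion === slotVersion) {
      site.hits++;
      return site.slotValue;
    }
    // Miss: take the slow path and overwrite the single cache entry.
    site.misses++;
    site.tag = receiver.tag;
    site.slotValue = lookupSlow(receiver, name);
    site.slotVersion = slotVersion;
    return site.slotValue;
  };
  return site;
}

// Two types whose instances inherit `size` from a per-type proto.
const protoA = makeObject("A", { size: 1 });
const protoB = makeObject("B", { size: 2 });
const a = makeObject("A", {}, protoA);
const b = makeObject("B", {}, protoB);

// Monomorphic site: always sees the same receiver type.
const mono = makeCallSite("size");
for (let i = 0; i < 10; i++) mono.lookup(a);

// Polymorphic site: receiver type alternates, so the entry is evicted every time.
const poly = makeCallSite("size");
for (let i = 0; i < 10; i++) poly.lookup(i % 2 ? a : b);

console.log(mono.hits, mono.misses, poly.hits, poly.misses);
```

In this model the monomorphic site misses once and then hits, while the alternating site never hits, which is the behavior the profile attributes to polymorphic call sites.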
The browser is now Io's primary target, and a JS implementation changes the economics:
The risk is the ceiling: performance depends on staying monomorphic from the engine's point of view. That constraint shapes the whole design.
slots, protos, data) created by one constructor, so the engine assigns a single hidden class to all of them. Slot tables are plain objects or Maps depending on what profiles show; never mix the two.
Number is a raw JS number; Sequence (immutable) is a JS string; true/false/nil are singletons. A typeof check replaces tag dispatch on the hot path, and the engine inlines it.
Coroutines, callcc, and resumable exceptions port directly: the heap-allocated frame state machine from the stackless work is plain data and works identically in JS. No generators, no async/await coloring: the eval loop is a loop.
new Function, guarded by receiver-shape checks. Falls back to the closure interpreter under CSP restrictions that forbid runtime codegen.
WeakLink maps to WeakRef + FinalizationRegistry.

The bench harness already supports this: bench/run.mjs can drive a Node entry point instead of wasmtime, and history entries carry a label per series, so the JS interpreter shows up as its own line on the chart from day one.
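The codegen tier described in the list above (new Function behind a receiver-shape guard, with the closure interpreter as the fallback) could look roughly like the following sketch. The AST format, the shape field, and the names getSlot, compileClosure, emit and compileHot are all hypothetical; the fast path assumes every slot is an own slot of the receiver, which is exactly what the guard checks.

```javascript
// Hypothetical AST: { op: "num", value } | { op: "slot", name } | { op: "add", left, right }

// General slot lookup: walks the proto chain.
function getSlot(obj, name) {
  for (let o = obj; o; o = o.proto) {
    if (Object.prototype.hasOwnProperty.call(o.slots, name)) return o.slots[name];
  }
  return undefined;
}

// Closure compiler: each AST node becomes a JS closure taking the receiver.
function compileClosure(node) {
  switch (node.op) {
    case "num": { const v = node.value; return () => v; }
    case "slot": { const n = node.name; return (self) => getSlot(self, n); }
    case "add": {
      const l = compileClosure(node.left);
      const r = compileClosure(node.right);
      return (self) => l(self) + r(self);
    }
    default: throw new Error("unknown op: " + node.op);
  }
}

// Source emitter for the same AST, specialized to "all slots are own slots".
function emit(node) {
  switch (node.op) {
    case "num": return String(node.value);
    case "slot": return "self.slots[" + JSON.stringify(node.name) + "]";
    case "add": return "(" + emit(node.left) + " + " + emit(node.right) + ")";
    default: throw new Error("unknown op: " + node.op);
  }
}

// Hot-path compiler: generated code behind a shape guard, closures otherwise.
function compileHot(node, expectedShape) {
  const slow = compileClosure(node);
  let fast;
  try {
    fast = new Function("self", "return " + emit(node) + ";");
  } catch (e) {
    return slow; // runtime codegen forbidden (for example by CSP): stay on closures
  }
  return (self) => (self.shape === expectedShape ? fast(self) : slow(self));
}

// x + y + 1, compiled for receivers of the hypothetical shape "point".
const expr = {
  op: "add",
  left: { op: "add", left: { op: "slot", name: "x" }, right: { op: "slot", name: "y" } },
  right: { op: "num", value: 1 },
};
const f = compileHot(expr, "point");

const point = { shape: "point", slots: { x: 3, y: 4 }, proto: null };
const child = { shape: "child", slots: {}, proto: point }; // inherits x and y

console.log(f(point), f(child));
```

Both calls return the same value, but only the first runs the generated code; the second fails the guard and takes the closure path, which is what keeps the specialized code correct for receivers that inherit the slots.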
Two VMs can drift semantically — the correctness suite is the only defense, so it grows with every divergence found (the lazy-argument and special-form bugs found in the stackless evaluator are exactly the kind of thing that bites here). Deopt cliffs in the engine need profiling discipline: one megamorphic site or one accidental shape change can silently halve throughput.
The C VM keeps full control of memory layout and pays no JS-engine unpredictability — the optimizations are classical interpreter engineering, applied under WASM's constraints (no runtime codegen in-module, no computed goto).
wasmtime --profile and browser CPU profiles; every item below is gated on showing up in a profile, and every change lands as a point on the benchmark history.
IONUMBER, CNUMBER, tag checks) plus the collector's mark path.
while, assignment forms. The special-form/lazy-args machinery already identifies most of these sites.
switch with mutually tail-calling opcode handlers removes the dispatch-loop branch mispredictions that dominate interpreter profiles.
slotVersion with per-object shape versions so unrelated slot writes stop invalidating every cache.
-O3 + LTO, a wasm-opt pass over the binary (typically another 5–15%), and relaxed-SIMD for Vector where it helps.
Table.set() is the legal form of code patching. Hot call sites dispatch through call_indirect on a table index; the JIT swaps in the compiled function, and the next send lands in machine code reading the same heap.

The supporting cast: the VM runs in a Web Worker because the main thread refuses synchronous compilation of modules beyond a few KB (workers have no such limit). Inline caches are data-driven (loaded from memory and compared) since instruction-level patching is impossible; this is a few cycles slower than patched ICs, and it is what production wasm-hosted runtimes do anyway. Accumulated one-method modules pay cross-module call overhead and block cross-function inlining, so a background pass periodically recompiles everything hot into one consolidated module (WebAssembly.compile is async) and swaps it at a safe point.
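The table-slot swap can be sketched with the standard WebAssembly JS API. The example below is an illustration under stated assumptions, not the VM's JIT: constModule is a hypothetical helper that hand-assembles a tiny module exporting one function, the two modules stand in for an interpreter stub and a freshly compiled method, and the call goes through table.get from JS rather than through call_indirect inside a module.

```javascript
// Hand-assemble a minimal module exporting "f": () -> i32 that returns `value`.
// `value` must fit one signed LEB128 byte, so it is restricted to 0..63 here.
function constModule(value) {
  if (!Number.isInteger(value) || value < 0 || value > 63) {
    throw new RangeError("value must be an integer in 0..63");
  }
  const bytes = new Uint8Array([
    0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,  // magic + version
    0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f,        // type section: () -> i32
    0x03, 0x02, 0x01, 0x00,                          // function section: func 0 has type 0
    0x07, 0x05, 0x01, 0x01, 0x66, 0x00, 0x00,        // export section: "f" = func 0
    0x0a, 0x06, 0x01, 0x04, 0x00, 0x41, value, 0x0b, // code section: i32.const value; end
  ]);
  // Synchronous compilation is fine for a module this small.
  return new WebAssembly.Instance(new WebAssembly.Module(bytes)).exports.f;
}

// One shared function table; each hot call site owns an index into it.
const table = new WebAssembly.Table({ element: "anyfunc", initial: 1 });
const SITE = 0; // hypothetical table index assigned to one call site

table.set(SITE, constModule(1));  // initial target, standing in for the interpreter stub
const before = table.get(SITE)();

table.set(SITE, constModule(2));  // "JIT" output swapped into the same slot
const after = table.get(SITE)();

console.log(before, after);
```

Callers never change: they keep dispatching through the same index, and the next call after Table.set() reaches the new function.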
The stackless design is what makes the swap — and tiering in general — cheap: all execution state lives in heap frames, so between eval-loop steps the native stack is empty. Attaching a new instance to the same WebAssembly.Memory and continuing the loop is the state migration. The same property gives near-free on-stack replacement (tier up a hot loop by rewriting its heap frame's resume target) and a ready-made deoptimization target (fall back to the frame machine when a guard fails).
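As a rough illustration of tiering by rewriting a heap frame's resume target, the sketch below models a frame as a plain object whose step field is the resume target. The frame fields, the HOT threshold, and the names slowStep, fastStep and runFrames are assumptions made for this example.

```javascript
const HOT = 10; // hypothetical hotness threshold, in interpreter steps

// Baseline tier: one loop iteration per eval-loop step.
function slowStep(frame) {
  if (frame.i >= frame.n) return frame.parent;
  frame.sum += frame.i;
  frame.i++;
  frame.ticks++;
  // Tier up: rewrite the resume target. All state already lives in the frame,
  // so the next eval-loop step simply continues in the faster code.
  if (frame.ticks === HOT) frame.step = fastStep;
  return frame;
}

// Optimized tier: finishes the remaining iterations in a single step.
function fastStep(frame) {
  while (frame.i < frame.n) {
    frame.sum += frame.i;
    frame.i++;
  }
  frame.tier = "fast";
  return frame.parent;
}

// The eval loop is a loop: between steps nothing lives on the native stack.
function runFrames(frame) {
  let steps = 0;
  while (frame !== null) {
    frame = frame.step(frame);
    steps++;
  }
  return steps;
}

const loopFrame = { step: slowStep, parent: null, i: 0, n: 1000, sum: 0, ticks: 0, tier: "slow" };
const steps = runFrames(loopFrame);
console.log(loopFrame.sum, loopFrame.tier, steps);
```

The loop starts in the slow tier, is promoted after HOT steps, and completes in one further step, with no state migration beyond the assignment to frame.step.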
Items 2 and 3 are the foundation and pay for themselves independently; 4–6 build on the bytecode representation; 7 is orthogonal and can proceed in parallel; 9 builds on all of it and only makes sense once the interpreter itself is no longer the bottleneck.
How do the two paths compare, assuming both are pushed hard?
Rough scorecard against the current interpreter on the micro set: a mature JS path lands around 10–30×; a first-generation WASM JIT around 3–8×; a fully mature WASM JIT overlaps the JS path's range, ahead on allocation and numeric work, behind on bridge-heavy work. Per unit of engineering effort it is not close: closure compilation plus new Function codegen is perhaps 15% of the work of bytecode + profiler + type feedback + template JIT + specialization + generational GC + deopt paths. Part of the JS path's value is that it cheaply establishes how fast Io's semantics can go when a world-class JIT does the lifting — the honest bar any custom backend must clear.
Coroutines, callcc, and resumable exceptions are where both JIT stories get uncomfortable, because the optimization that makes calls fast — using the host's native stack — is the same one that makes suspension expensive. The current interpreter is near-optimal on this axis: a coroutine switch is a couple of pointer swaps and callcc capture is O(1), because frames already live on the heap. The bench suite tracks this directly (coroutineSwitches in the micro set, cheapconcurrency in the program set) so neither path can regress it silently.
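The claim that a switch costs a couple of pointer swaps can be illustrated with a toy scheduler in which each coroutine is nothing more than a reference to its heap frame. This is a sketch under assumptions: makeCounter, counterStep, runAll and the yieldRequested flag are invented for the example, and a JS array stands in for the run queue.

```javascript
// Each coroutine is just its heap frame; there is no native stack to save.
function makeCounter(name, limit, log) {
  return { name, i: 0, limit, log, step: counterStep };
}

// One step of a coroutine: record a value, then ask to yield.
function counterStep(frame, sched) {
  if (frame.i >= frame.limit) return null; // coroutine finished
  frame.log.push(frame.name + frame.i);
  frame.i++;
  sched.yieldRequested = true;
  return frame;
}

// Round-robin scheduler: a switch parks one frame reference and loads another.
function runAll(coros) {
  const sched = { yieldRequested: false };
  const queue = coros.slice();
  let switches = 0;
  let current = queue.shift();
  while (current) {
    const next = current.step(current, sched);
    if (next === null) {
      current = queue.shift();  // finished: pick up the next runnable frame
    } else if (sched.yieldRequested) {
      sched.yieldRequested = false;
      queue.push(next);         // park: store one reference
      current = queue.shift();  // resume: load one reference
      switches++;
    } else {
      current = next;
    }
  }
  return switches;
}

const log = [];
const switches = runAll([makeCounter("a", 2, log), makeCounter("b", 2, log)]);
console.log(log.join(","), switches);
```

The two coroutines interleave their output, and the per-switch work does not depend on how deep either coroutine's call chain is, because nothing is copied.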
The options, per path:
signal/withHandler doesn't unwind: handler lookup walks a heap-resident handler stack and the handler runs as an ordinary call. Compilation doesn't disturb it. The expensive citizens are callcc and coroutine switches, and serializable, network-transmittable continuations require the heap representation regardless; no stack-based scheme can deliver them.

The design conclusion is the same for both paths: the heap-frame machine stays the canonical representation; compiled code is an optimization for regions that don't suspend, entered through guards and abandoned via deoptimization when suspension occurs. Actor-style code that switches constantly runs near today's speeds; straight-line and numeric code gets the full JIT win; and code that is both send-dense and suspension-prone keeps perhaps 2–5× rather than 10–30× (in either path) unless WasmFX ships and tilts the board toward the WASM side.
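A minimal sketch of that conclusion, assuming a hypothetical suspendRequested flag as the guard: the compiled region keeps its state in locals while it runs, and on a guard failure it writes that state back into the heap frame and hands control to the frame machine. The names interpStep, compiledStep and drive, and the frame fields, are illustrative only.

```javascript
const sched = { suspendRequested: false };

// Canonical representation: one iteration per eval-loop step, state in the frame.
function interpStep(frame) {
  if (frame.i >= frame.n) return null;
  frame.acc += frame.body(frame.i, sched);
  frame.i++;
  frame.interpSteps++;
  return frame;
}

// "Compiled" region: state lives in locals; the guard is checked every iteration.
function compiledStep(frame) {
  let i = frame.i;
  let acc = frame.acc;
  while (i < frame.n) {
    if (sched.suspendRequested) {
      // Deoptimize: materialize the locals into the heap frame and resume
      // in the interpreter, where suspension is cheap.
      frame.i = i;
      frame.acc = acc;
      frame.step = interpStep;
      return frame;
    }
    acc += frame.body(i, sched);
    i++;
  }
  frame.i = i;
  frame.acc = acc;
  return null;
}

// Eval loop. A real VM would switch coroutines where the flag is cleared.
function drive(frame) {
  while (frame !== null) {
    frame = frame.step(frame);
    if (frame !== null && sched.suspendRequested) sched.suspendRequested = false;
  }
}

const region = {
  step: compiledStep, i: 0, n: 10, acc: 0, interpSteps: 0,
  // The body asks to suspend once, at iteration 4.
  body: (i, s) => { if (i === 4) s.suspendRequested = true; return i; },
};
drive(region);
console.log(region.acc, region.interpSteps);
```

The first five iterations run in the compiled region and the remaining five in the frame machine, and the result matches what either tier would compute alone.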
The paths aren't exclusive — the JS interpreter is also the best way to discover how fast Io's semantics can go when the world's best dynamic-language JITs do the heavy lifting, which sets an honest target for the WASM VM. If one must come first: the JS path has the higher ceiling in the browser and deletes the most code (collector, bridge, WASI shim); the WASM path keeps a single implementation and runs everywhere wasmtime does.