Stackless Evaluator Report

← Stackless
Io

Design and status notes on the heap-allocated frame-based evaluator.

Overview

The stackless branch replaces Io's recursive C-stack-based evaluator with a heap-allocated frame-based iterative evaluator. Every piece of execution state — message pointer, target, locals, arguments, control flow — lives in GC-managed frame objects on the heap rather than in C stack frames.

Goals

  1. First-class continuationscallcc captures the entire execution state as a serializable, network-transmittable object. Impossible with C stack recursion. Note: callcc is disabled by default (behind #ifdef IO_CALLCC) because undelimited continuations can rewind past cleanup code and corrupt handler stacks. Enable with -DIO_CALLCC at build time.
  2. Portable coroutines — No platform-specific assembly, no setjmp/longjmp, no ucontext, no fibers. Coroutine switching is just swapping a frame pointer.
  3. No stack overflow — Tail call optimization reuses frames. Deep recursion is bounded by heap, not C stack depth.
  4. Performance — The iterative evaluator with object pooling and inline caching is 1.1x–4x faster than the recursive evaluator across all benchmarks.

Non-Goals

  • Changing the Io language semantics. All existing Io code runs unmodified.
  • JIT compilation or bytecode. The evaluator still interprets the message tree directly.

Architecture

The Eval Loop

A single C function, IoState_evalLoop_(), runs a while(1) loop that processes frames from a linked stack. Each frame has a state field (a state machine enum) that tells the loop what to do next. There is no C recursion for message evaluation — argument evaluation, block activation, and control flow are all driven by pushing/popping frames and transitioning states.

IoState_evalLoop_

while(1) {
    frame = state->currentFrame
    switch(frame->state) {
        START → LOOKUP_SLOT → ACTIVATE
            ├── CFunction: call directly
            └── Block: push child frame
        CONTINUE_CHAIN → next message or RETURN
        RETURN → pop frame, store result
        IF_*, WHILE_*, FOR_*, ... (control flow)
    }
}

Frame Structure

Frames are full IoObjects managed by the garbage collector (typedef IoObject IoEvalFrame). Their data payload is IoEvalFrameData:

FieldPurpose
messageCurrent message being evaluated (instruction pointer)
targetReceiver object (self)
localsEnclosing scope for slot lookup
resultAccumulated result of this frame
stateCurrent state machine state
parentParent frame (for returning results)
argValues / inlineArgs[4]Pre-evaluated argument values
blockLocalsBlock's local scope (if block activation)
call / savedCallCall introspection objects
controlFlowUnion of per-primitive state (for/while/if/foreach/etc.)

Frame State Machine

The evaluator has 24 states organized by function:

Core evaluation: STARTLOOKUP_SLOTEVAL_ARGSACTIVATECONTINUE_CHAINRETURN

Control flow (each primitive has its own states):

  • if: IF_EVAL_CONDITIONIF_CONVERT_BOOLEANIF_EVAL_BRANCH
  • while: WHILE_EVAL_CONDITIONWHILE_CHECK_CONDITIONWHILE_DECIDEWHILE_EVAL_BODY
  • for: FOR_EVAL_SETUPFOR_EVAL_BODYFOR_AFTER_BODY
  • loop: LOOP_EVAL_BODYLOOP_AFTER_BODY
  • foreach: FOREACH_EVAL_BODYFOREACH_AFTER_BODY
  • callcc: CALLCC_EVAL_BLOCK
  • coroutines: CORO_WAIT_CHILD, CORO_YIELDED
  • do/doString/doFile: DO_EVAL, DO_WAIT

No C Stack Manipulation

The entire design avoids platform-specific stack tricks:

  • No setjmp/longjmp — errors set state->errorRaised = 1 and return normally. The eval loop unwinds frames.
  • No ucontext or fibers — coroutines save/restore a frame pointer.
  • No assembly — continuations copy frame chains on the heap.

This makes the VM fully portable to any platform with a C99 compiler.

New Language Features

First-Class Continuations

callcc(block) captures the current execution state into a Continuation object:

result := callcc(block(escape,
    list(1, 2, 3) foreach(v,
        if(v == 3, escape invoke("found it"))
    )
    "not found"
))
result println  // => "found it"

Continuation API:

MethodDescription
invoke(value)Restore captured frames, return value at callcc site
copyDeep-copy the frame chain (enables multi-shot use)
isInvokedReturns true if this continuation has been invoked
frameCountNumber of captured frames
frameStatesList of state names per frame
frameMessagesList of current messages per frame
asMap / fromMapSerialize/deserialize continuation state

Continuations are one-shot by default. Use copy to create a fresh continuation for multi-shot patterns (generators, backtracking). See docs/IoContinuationsExamples.md for detailed examples.

Frame Introspection

Live execution frames are exposed to Io code for debugging and metaprogramming:

f := Coroutine currentCoroutine currentFrame
while(f != nil,
    f description println
    f = f parent
)

EvalFrame methods: message, target, locals, state, parent, result, depth, call, blockLocals, description

Resumable Exceptions

signal is the resumable counterpart to raise. A handler installed with withHandler can inspect an exception and supply a replacement value, resuming execution at the signal site:

result := withHandler(Exception,
    block(e, resume, resume invoke(42)),
    Exception signal("need value") + 1
)
result println  // => 43

How it works:

  • withHandler(proto, handler, body) pushes a handler entry onto the current coroutine's handlerStack, evaluates the body, then pops the handler.
  • signal(error) clones the exception, walks handlerStack from the current coroutine up through parentCoroutine (matching by isKindOf), and calls the first matching handler as a regular function. The handler's return value becomes signal's return value.
  • raise(error) is unchanged — non-resumable, unwinds frames via errorRaised.
  • If signal finds no matching handler, it falls back to raise.

Implementation: Entirely in Io (libs/iovm/io/Exception.io), no C changes. The body is evaluated directly via doMessage rather than in a child coroutine, which avoids a known interaction between continuation capture and nested C eval loops. Handler lookup walks the parentCoroutine chain, so handlers installed outside a try are visible inside it. See docs/IoContinuationsExamples.md for detailed examples.

Portable Coroutines

Coroutines work by saving and restoring the frame pointer — no C stack switching:

o := Object clone
o work := method(
    for(i, 1, 5, i println; yield)
)
o @@work
for(i, 1, 5, yield)

A suspended coroutine's entire state is a single pointer to its saved frame chain. Switching coroutines is O(1).

Performance Optimizations

Object Pooling

Four pools eliminate allocation overhead in hot paths:

PoolSizeWhat's Reused
Frame pool256GC-managed EvalFrame objects
Block locals pool8PHash-allocated locals for block activation
Call pool8IoCallData-allocated Call objects
Number data freelist512IoObjectData allocations for Numbers

All pooled objects are GC-marked through IoCoroutine_mark() to prevent premature collection.

Inline Argument Buffer

95% of method calls have 4 or fewer arguments. These use a stack-allocated inlineArgs[4] buffer instead of heap-allocating an argument array.

Monomorphic Inline Cache

Each IoMessage has a one-entry cache for slot lookups:

if (tag matches && slotVersion matches && no local shadow)
    → return cached value (skip proto chain walk)

The cache stores (tag, slotValue, context, version) and only caches proto-chain hits. A local-slot shadow guard prevents stale results when different objects of the same type have overriding local slots (e.g., false.isTrue shadowing Object.isTrue).

Special Form Detection

Each CFunction that needs lazy argument evaluation carries an isLazyArgs flag, set at VM init time. This includes control flow primitives (if, while, for, loop, callcc), block constructors (method, block), iteration (foreach, foreachSlot), and others. Since Io's getSlot returns the same CFunction object, aliases automatically inherit the flag (e.g., false.elseif := Object getSlot("if")). The result is cached per-message-site for fast subsequent lookups.

Cached Literal Fast Paths

When a control flow body is a single cached literal (nil, number, or string), the evaluator skips frame allocation entirely and uses the cached value directly. For-loops with literal bodies (for(i, 1, 1000000, nil)) run as tight C loops.

Tail Call Optimization

Two complementary mechanisms keep frame stacks flat:

  1. Direct TCO: When a Block call is the last message in a block body, activateBlockTCO_() reuses the current frame instead of pushing a new one.
  2. TCO through if: When if() is the last message in a chain, the selected branch evaluates in-place without a child frame. This enables idiomatic Io recursion:
factorial := method(n, acc,
    if(n <= 1, acc, factorial(n - 1, acc * n))
)
factorial(100000, 1)  // no stack overflow

Boolean Singleton Fast Path

if and while conditions that are true, false, or nil skip the asBoolean frame push — the singleton is used directly.

Number Cache

Pre-allocated Number objects for the range [-10, 1024] eliminate allocation for most loop counters and arithmetic.

Hybrid Reference Counting (Optional)

An optional RC layer (disabled by default, enable with #define COLLECTOR_USE_REFCOUNT 1 in CollectorMarker.h) promptly reclaims short-lived objects. The existing mark/sweep GC remains as backup for cycles.

When enabled, for-loop counter Numbers are reclaimed immediately via RC drain, keeping the freed list populated and avoiding calloc. This gives a 1.5x speedup on for(i, 1, 1000000, i + 1) at the cost of ~7% regression on method-heavy workloads from the refcount increment on every IOREF.

Benchmarks

All benchmarks on macOS, Release build. Times are best-of-3 wall clock.

Stackless vs Master (Recursive Evaluator)

BenchmarkMasterStacklessSpeedup
for(i, 1, 1M, nil)0.32s0.08s4.0x
for(i, 1, 1M, i+1)0.69s0.42s1.6x
x = x + 1 (500K)0.41s0.29s1.4x
500K method(y, y+1)0.74s0.28s2.6x
while(i < 1M)0.70s0.64s1.1x
fib(30) recursive3.10s1.77s1.8x
List ops (100K)1.74s1.14s1.5x
Test suite (239 tests)0.93s0.83s1.1x

Stackless is faster on every benchmark. Method calls and tight for-loops benefit most from the iterative eval loop, frame pooling, and inline caches. Recursive workloads like fib(30) benefit from reduced frame allocation overhead.

With Optional RC Enabled

BenchmarkWithout RCWith RCChange
for(i, 1, 1M, i+1)0.42s0.35s1.2x faster
500K method(y, y+1)0.28s0.32s12% slower
fib(30)1.77s1.96s11% slower

RC is a targeted optimization for allocation-heavy for-loops. It trades ~10% overhead on general workloads for prompt reclamation of loop temporaries.

Test Results

  • 30/30 C tests (TCO, continuations, exceptions, coroutines, ? operator, asMap)
  • 249/249 Io tests via run.io (230 original + 9 EvalFrame introspection + 10 resumable exception tests)
  • SwitchTest: 6 pre-existing failures (same on master, not in run.io suite)

Key Files

FilePurpose
libs/iovm/source/IoState_iterative.cIterative eval loop, state machine, control flow
libs/iovm/source/IoEvalFrame.h / .cFrame structure, state enum, introspection methods
libs/iovm/source/IoObject_flow.cControl flow primitives (if, while, for, loop, etc.)
libs/iovm/source/IoContinuation.h / .cContinuation capture, invoke, copy, serialization
libs/iovm/source/IoCoroutine.cFrame-based coroutine switching
libs/iovm/source/IoState_eval.cEntry points (doCString, runCLI)
libs/iovm/source/IoState_inline.hInline helpers, pre-eval arg access
libs/iovm/source/IoState.hVM state, pools, cached symbols
libs/iovm/tests/correctness/EvalFrameTest.ioFrame introspection tests
libs/iovm/tests/correctness/ResumableExceptionTest.ioResumable exception tests
libs/garbagecollector/source/Collector_inline.hRC increment/decrement (optional)
agents/C_STACK_ELIMINATION_PLAN.mdArchitecture design document
agents/CONTINUATIONS_TODO.mdPhase tracker and implementation notes