Codegen Pipeline

This page walks through each stage of anvil's code generation in detail. For the implementation specifics, see NumericalFunction Internals.

Stage 1: Tracing

When you access a NumericalFunction's codegen properties for the first time, anvil creates symbolic input tensors via Tensor.empty(shape, dtype) and calls the wrapped function on them. tinygrad internally builds a UOp (micro-operation) graph representing the computation.

flowchart LR
    A["Tensor.empty(1024)"] --> B["fn(x)"]
    B --> C["UOp graph<br/>(BUFFER → ops → output)"]
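The idea behind this stage can be shown with a stripped-down tracer in plain Python (`Sym`, `trace`, and the op names here are invented for illustration; the real trace uses tinygrad Tensors and UOps):

```python
# Minimal stand-in for the tracing step (not anvil's implementation):
# calling the user function on placeholder values records an op graph
# instead of computing numbers, much like tinygrad's UOp graph.
class Sym:
    def __init__(self, op, srcs=()):
        self.op, self.srcs = op, srcs

    def __add__(self, other):
        return Sym("ADD", (self, other))

    def __mul__(self, other):
        return Sym("MUL", (self, other))

def trace(fn, n_inputs):
    # symbolic inputs play the role of Tensor.empty(shape, dtype)
    params = [Sym(f"BUFFER{i}") for i in range(n_inputs)]
    return fn(*params)

out = trace(lambda x: x * x + x, 1)   # root op is ADD(MUL(x, x), x)
```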

Two fixups are applied to trace outputs:

  • Multi-dim outputs get .contiguous() to guarantee C-contiguous buffer layout (prevents PERMUTE chains from producing non-row-major kernels)
  • Input-aliased outputs (where output is a view of an input, e.g. lambda x: x[:2]) get + 0 to force a separate output buffer
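Why the second fixup matters can be seen with plain-Python buffer aliasing, where `array`/`memoryview` stand in for tinygrad buffers (this is an analogy, not anvil's code):

```python
from array import array

x = array('d', [1.0, 2.0, 3.0, 4.0])
out = memoryview(x)[:2]       # the "output" aliases x's buffer, like x[:2]
out[0] = 9.0                  # writing the output mutates the input
# forcing a copy (the role `+ 0` plays in the trace) decouples them
copied = array('d', out)
copied[1] = 7.0               # x[1] is untouched
```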

Stage 2: Callification and scheduling

The traced UOp graph is transformed into tinygrad's scheduling form:

flowchart LR
    A["UOp graph"] --> B["symbolic_simple<br/>(simplification)"]
    B --> C["transform_to_call<br/>(CALL/PARAM form)"]
    C --> D["complete_create_schedule<br/>(linear ExecItems)"]

  1. Symbolic simplification: constant folding, algebraic identities
  2. Callification: rewrites the graph into a top-level CALL op with PARAM placeholders for inputs. Decides which outputs need fresh buffers and which constants need COPY ops.
  3. Scheduling: lowers the CALL graph into a linear list of ExecItems, each representing a kernel invocation or memory copy.

Note

Scheduling operates on separate tensor copies to avoid mutating the original trace outputs, which vmap and other transforms need unmodified.
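The lowering to a linear list can be sketched as a topological sort over a small op DAG (the node names and three-op example here are made up; the real scheduler emits ExecItems with buffers attached):

```python
# Hypothetical sketch: flattening an op DAG into a linear execution order,
# analogous to complete_create_schedule producing a list of ExecItems.
from graphlib import TopologicalSorter

# each op maps to the ops it depends on
dag = {
    "copy_const": [],                 # COPY of captured constant into workspace
    "kernel_mul": ["copy_const"],     # compute kernel reading the constant
    "kernel_add": ["kernel_mul"],     # compute kernel reading the intermediate
}
schedule = list(TopologicalSorter(dag).static_order())
```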

Stage 3: Kernel rendering

Each ExecItem is rendered to platform-specific source code:

flowchart LR
    A["ExecItem<br/>(SINK op)"] --> B["get_program<br/>(tinygrad)"]
    B --> C["Renderer<br/>(Clang/Metal/CUDA)"]
    C --> D["RenderedKernel<br/>(source + metadata)"]

A RenderedKernel contains:

  • src: the rendered C/Metal/CUDA source for the kernel
  • ast: the UOp AST
  • globals: all buffer references
  • ins, outs: input and output buffers
  • global_size, local_size: GPU launch dimensions (None for CPU)
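As a rough Python picture of that record (the field names follow the bullet list above; the concrete types are assumptions):

```python
# Hypothetical mirror of the RenderedKernel record described above.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class RenderedKernel:
    src: str                              # rendered C/Metal/CUDA source
    ast: Any                              # UOp AST
    globals: list                         # all buffer references
    ins: list                             # input buffers
    outs: list                            # output buffers
    global_size: Optional[tuple] = None   # GPU launch dims; None on CPU
    local_size: Optional[tuple] = None

k = RenderedKernel(src="void k0() {}", ast=None, globals=[], ins=[], outs=[])
```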

COPY operations (for constants and data movement) are rendered as std::memcpy calls rather than compute kernels.
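Rendering a COPY amounts to string substitution along these lines (the destination and source names below are made up, not anvil's actual output):

```python
# Sketch: a COPY item renders to a std::memcpy line rather than a kernel.
def render_copy(dst: str, src: str, nbytes: int) -> str:
    return f"std::memcpy({dst}, {src}, {nbytes});"

line = render_copy("ws.const0", "CONST0_DATA", 4096)
```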

Stage 4: Final assembly

Jinja2 templates combine everything into complete C++ source:

flowchart TB
    A["RenderedKernel[]"] --> E["Jinja2 templates"]
    B["Buffer metadata<br/>(shapes, dtypes)"] --> E
    C["Constant data"] --> E
    D["Codegen constants"] --> E
    E --> F[".hpp header<br/>(types, declarations)"]
    E --> G[".cpp source<br/>(kernels, init_ws, call)"]
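The assembly step boils down to substituting rendered pieces into source templates. A toy stand-in using `str.format` (the real pipeline uses Jinja2 templates with a much richer context; all names below are illustrative):

```python
# Tiny stand-in for the template-assembly step: splice declarations,
# kernel sources, and the dispatch sequence into one C++ source string.
CPP_TEMPLATE = """\
{decls}

{kernels}

void call() {{
{dispatch}
}}
"""

src = CPP_TEMPLATE.format(
    decls="using IN0_t = Buffer<float, 1024>;",
    kernels="void k0(float* out, const float* in) { /* ... */ }",
    dispatch="  k0(out0, in0);",
)
```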

The generated code contains:

  • Header: Buffer<T, Ns...> template, typed input/output/workspace aliases, sparse metadata (for SparseNumericalFunction), function declarations, codegen constants
  • Source: kernel implementations, static constexpr constant arrays, init_ws() (workspace allocation + constant copy), call() (kernel dispatch sequence)

Buffer categories

flowchart TB
    subgraph Inputs
        I["IN0_t, IN1_t, ..."]
    end
    subgraph Outputs
        O["OUT0_t, OUT1_t, ..."]
    end
    subgraph Workspace["Workspace (WS_t)"]
        K["Constants<br/>(copied in init_ws)"]
        T["Intermediates<br/>(kernel temporaries)"]
    end

    I --> |"call()"| Kernels["Kernel 1 → Kernel 2 → ..."]
    K --> Kernels
    T --> Kernels
    Kernels --> O

Four buffer categories exist:

  1. Input: from the function signature. Passed by reference to call().
  2. Output: fresh buffers allocated by the caller. Written by kernels.
  3. Constant: captured closure data (matrices, weights). Embedded as static constexpr arrays and copied into the workspace during init_ws().
  4. Intermediate: temporaries shared between kernels. Packed into the workspace with 16-byte alignment.
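The 16-byte packing for category 4 can be sketched as follows (the buffer sizes are made up; only the alignment arithmetic is the point):

```python
# Illustrative sketch: packing intermediate buffers into one workspace,
# rounding each offset up to a 16-byte boundary.
def align16(offset: int) -> int:
    return (offset + 15) & ~15

sizes = [100, 64, 7]          # byte sizes of three intermediates
offsets, cursor = [], 0
for s in sizes:
    cursor = align16(cursor)  # round up to the next 16-byte boundary
    offsets.append(cursor)
    cursor += s
# offsets gives each intermediate's position; cursor is the total size
```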