How anvil Works

anvil intercepts tinygrad's compilation pipeline at the point where kernels would normally be executed, and instead emits standalone C++ source code. This page gives a high-level view of the pipeline and compares it with tinygrad's native execution flow.

The anvil pipeline

```mermaid
flowchart LR
    A["Python function<br/>(Tensor ops)"] --> B["Trace<br/>(UOp graph)"]
    B --> C["Schedule<br/>(ExecItems)"]
    C --> D["Render kernels<br/>(C/Metal/CUDA)"]
    D --> E["Assemble C++<br/>(Jinja2 templates)"]
    E --> F1[".hpp + .cpp<br/>(AOT)"]
    E --> F2[".dylib/.so<br/>(JIT)"]
```

  1. Trace: The Python function is called on symbolic Tensor.empty() inputs. tinygrad builds a UOp graph representing the computation.

  2. Schedule: The UOp graph is lowered into a linear sequence of ExecItems -- each representing either a compute kernel or a memory copy. This step decides buffer allocation and data movement.

  3. Render kernels: Each ExecItem with a compute kernel is rendered to source code using tinygrad's renderer (ClangRenderer for CPU, MetalRenderer for Metal, CUDARenderer for CUDA).

  4. Assemble C++: Jinja2 templates combine the rendered kernels with buffer type declarations, workspace management code, and constant data to produce complete C++ source.

  5. Output: Either standalone .hpp/.cpp files (AOT) or a compiled shared library loaded via ctypes (JIT).
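To make the output of these steps concrete, here is a minimal sketch of what a generated AOT header could look like for a 4-element elementwise add. All names here (the kernel name `E_4`, the `run` entry point) are illustrative assumptions, not anvil's actual output.

```cpp
#include <cstddef>

// Sketch of a generated AOT header (illustrative, not anvil's actual
// output). One rendered kernel plus an entry point that invokes the
// kernels in schedule order.

// Kernel in the style of tinygrad's C renderer: elementwise add over
// 4 floats, writing into data0.
static void E_4(float* data0, const float* data1, const float* data2) {
  for (std::size_t i = 0; i < 4; ++i) {
    data0[i] = data1[i] + data2[i];
  }
}

// Entry point: calls every ExecItem's kernel in schedule order. A real
// generated file would also set up a shared workspace for intermediates.
inline void run(float* out, const float* a, const float* b) {
  E_4(out, a, b);
}
```

With only one kernel the entry point is trivial, but the shape is the same for larger programs: one function per rendered kernel, one `run` that sequences them.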

Comparison with tinygrad

tinygrad's native pipeline executes computations eagerly on the current device. anvil diverges after the scheduling step:

```mermaid
flowchart TB
    subgraph both["Shared steps"]
        A["Tensor operations"] --> B["UOp graph"]
        B --> C["Schedule (ExecItems)"]
    end

    subgraph tg["tinygrad (native)"]
        C --> D1["Compile kernels<br/>(clang/Metal/CUDA)"]
        D1 --> E1["Execute on device"]
        E1 --> F1["Result in memory"]
    end

    subgraph anvil_path["anvil (codegen)"]
        C --> D2["Render kernel source"]
        D2 --> E2["Assemble C++ via<br/>Jinja2 templates"]
        E2 --> F2a["AOT: .hpp/.cpp files"]
        E2 --> F2b["JIT: compile + ctypes"]
    end
```

Key differences:

|               | tinygrad                   | anvil                                  |
|---------------|----------------------------|----------------------------------------|
| Goal          | Execute now                | Generate code for later                |
| Compiler      | Built-in (compiles + runs) | Renders source only                    |
| Buffers       | Runtime device memory      | Compile-time typed `Buffer<T, Ns...>`  |
| Constants     | In-memory                  | Embedded as `static constexpr` arrays  |
| Intermediates | Per-kernel allocation      | Shared workspace with pre-computed layout |
| Output        | Tensor in memory           | C++ files or shared library            |
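The "compile-time typed buffer" row is worth a concrete illustration. Below is a minimal sketch of what a `Buffer<T, Ns...>` template could look like; it shows the idea (shape baked into the type, size checked at compile time), and is an assumption about the concept rather than anvil's actual definition.

```cpp
#include <array>
#include <cstddef>

// Minimal sketch of a statically shaped buffer: element type and
// dimensions are template parameters, so the total size is a
// compile-time constant rather than a runtime value.
template <typename T, std::size_t... Ns>
struct Buffer {
  static constexpr std::size_t size = (Ns * ... * 1);  // fold over dims
  std::array<T, size> data{};

  T& operator[](std::size_t i) { return data[i]; }
  const T& operator[](std::size_t i) const { return data[i]; }
};

// A 3x4 float buffer holds exactly 12 elements, known at compile time.
static_assert(Buffer<float, 3, 4>::size == 12);
```

Because every size is known at compile time, constants can be embedded the same way, as `static constexpr` arrays, and mismatched shapes become compile errors instead of runtime failures.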

What anvil adds on top

Beyond code generation, anvil layers several features that tinygrad doesn't natively provide:

  • Automatic differentiation at the UOp graph level (JVP and VJP), with sparse Jacobian support via graph coloring
  • Vectorizing transforms (vmap) via UOp graph rewriting
  • SQP solver template that generates a complete optimization loop with PIQP as the QP backend
  • Multistage problem structure that exploits block-tridiagonal sparsity
  • Workspace management with compile-time buffer layout and alignment
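The workspace bullet can be sketched as compile-time offset arithmetic: each intermediate gets an aligned offset into one shared allocation, and the total size is known before the generated code ever runs. This is a toy illustration of the layout math under an assumed 64-byte alignment, not anvil's implementation.

```cpp
#include <cstddef>

// Round n up to the next multiple of alignment a.
constexpr std::size_t align_up(std::size_t n, std::size_t a) {
  return (n + a - 1) / a * a;
}

struct Layout {
  std::size_t off_a, off_b, total;
};

// Lay out two intermediates of a_bytes and b_bytes in one shared
// workspace, each starting on a 64-byte boundary.
constexpr Layout layout(std::size_t a_bytes, std::size_t b_bytes) {
  std::size_t off_a = 0;
  std::size_t off_b = align_up(off_a + a_bytes, 64);
  std::size_t total = align_up(off_b + b_bytes, 64);
  return {off_a, off_b, total};
}

// The whole layout is resolved at compile time:
static_assert(layout(100, 40).off_b == 128);
static_assert(layout(100, 40).total == 192);
```

At runtime the generated code then makes a single allocation of `total` bytes and hands each kernel a pointer at its pre-computed offset, instead of allocating per kernel.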

See Codegen Pipeline for a detailed walkthrough of each stage, and JIT vs AOT for how the two execution modes differ.