How anvil Works

anvil intercepts tinygrad's compilation pipeline at the point where kernels would normally be executed, and instead emits standalone C++ source code. This page gives a high-level view of the pipeline and compares it with tinygrad's native execution flow.

The anvil pipeline

```mermaid
flowchart LR
    A["Python function<br/>(Tensor ops)"] --> B["Trace<br/>(UOp graph)"]
    B --> C["Schedule<br/>(ExecItems)"]
    C --> D["Render kernels<br/>(C/Metal/CUDA)"]
    D --> E["Assemble C++<br/>(Jinja2 templates)"]
    E --> F1[".hpp + .cpp<br/>(AOT)"]
    E --> F2[".dylib/.so<br/>(JIT)"]
```

  1. Trace: The Python function is called on symbolic Tensor.empty() inputs. tinygrad builds a UOp graph representing the computation.

  2. Schedule: The UOp graph is lowered into a linear sequence of ExecItems -- each representing either a compute kernel or a memory copy. This step decides buffer allocation and data movement.

  3. Render kernels: Each ExecItem with a compute kernel is rendered to source code using tinygrad's renderer (ClangRenderer for CPU, MetalRenderer for Metal, CUDARenderer for CUDA).

  4. Assemble C++: Jinja2 templates combine the rendered kernels with buffer type declarations, workspace management code, and constant data to produce complete C++ source.

  5. Output: Either standalone .hpp/.cpp files (AOT) or a compiled shared library loaded via ctypes (JIT).
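To make the output of these steps concrete, here is a minimal sketch of what a generated AOT header could look like for a 4-element elementwise add. All names here (the kernel name `E_4`, the `run` entry point) are illustrative assumptions, not anvil's actual output.

```cpp
#include <cstddef>

// Sketch of a generated AOT header (illustrative, not anvil's actual
// output). One rendered kernel plus an entry point that invokes the
// kernels in schedule order.

// Kernel in the style of tinygrad's C renderer: elementwise add over
// 4 floats, writing into data0.
static void E_4(float* data0, const float* data1, const float* data2) {
  for (std::size_t i = 0; i < 4; ++i) {
    data0[i] = data1[i] + data2[i];
  }
}

// Entry point: calls every ExecItem's kernel in schedule order. A real
// generated file would also set up a shared workspace for intermediates.
inline void run(float* out, const float* a, const float* b) {
  E_4(out, a, b);
}
```

With only one kernel the entry point is trivial, but the shape is the same for larger programs: one function per rendered kernel, one `run` that sequences them.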

Comparison with tinygrad

tinygrad's native pipeline executes computations eagerly on the current device. anvil diverges after the scheduling step:

```mermaid
flowchart TB
    subgraph both["Shared steps"]
        A["Tensor operations"] --> B["UOp graph"]
        B --> C["Schedule (ExecItems)"]
    end

    subgraph tg["tinygrad (native)"]
        C --> D1["Compile kernels<br/>(clang/Metal/CUDA)"]
        D1 --> E1["Execute on device"]
        E1 --> F1["Result in memory"]
    end

    subgraph anvil_path["anvil (codegen)"]
        C --> D2["Render kernel source"]
        D2 --> E2["Assemble C++ via<br/>Jinja2 templates"]
        E2 --> F2a["AOT: .hpp/.cpp files"]
        E2 --> F2b["JIT: compile + ctypes"]
    end
```

Key differences:

|               | tinygrad                   | anvil                                  |
|---------------|----------------------------|----------------------------------------|
| Goal          | Execute now                | Generate code for later                |
| Compiler      | Built-in (compiles + runs) | Renders source only                    |
| Buffers       | Runtime device memory      | Compile-time typed `Buffer<T, Ns...>`  |
| Constants     | In-memory                  | Embedded as `static constexpr` arrays  |
| Intermediates | Per-kernel allocation      | Shared workspace with pre-computed layout |
| Output        | Tensor in memory           | C++ files or shared library            |
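The "compile-time typed buffer" row is worth a concrete illustration. Below is a minimal sketch of what a `Buffer<T, Ns...>` template could look like; it shows the idea (shape baked into the type, size checked at compile time), and is an assumption about the concept rather than anvil's actual definition.

```cpp
#include <array>
#include <cstddef>

// Minimal sketch of a statically shaped buffer: element type and
// dimensions are template parameters, so the total size is a
// compile-time constant rather than a runtime value.
template <typename T, std::size_t... Ns>
struct Buffer {
  static constexpr std::size_t size = (Ns * ... * 1);  // fold over dims
  std::array<T, size> data{};

  T& operator[](std::size_t i) { return data[i]; }
  const T& operator[](std::size_t i) const { return data[i]; }
};

// A 3x4 float buffer holds exactly 12 elements, known at compile time.
static_assert(Buffer<float, 3, 4>::size == 12);
```

Because every size is known at compile time, constants can be embedded the same way, as `static constexpr` arrays, and mismatched shapes become compile errors instead of runtime failures.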

What anvil adds on top

Beyond code generation, anvil layers several features that tinygrad doesn't natively provide:

  • Automatic differentiation at the UOp graph level (JVP and VJP), with sparse Jacobian support via graph coloring
  • Vectorizing transforms (vmap) via UOp graph rewriting
  • SQP solver template that generates a complete optimization loop with PIQP as the QP backend
  • Multistage problem structure that exploits block-tridiagonal sparsity
  • Workspace management with compile-time buffer layout and alignment
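The workspace bullet can be sketched as compile-time offset arithmetic: each intermediate gets an aligned offset into one shared allocation, and the total size is known before the generated code ever runs. This is a toy illustration of the layout math under an assumed 64-byte alignment, not anvil's implementation.

```cpp
#include <cstddef>

// Round n up to the next multiple of alignment a.
constexpr std::size_t align_up(std::size_t n, std::size_t a) {
  return (n + a - 1) / a * a;
}

struct Layout {
  std::size_t off_a, off_b, total;
};

// Lay out two intermediates of a_bytes and b_bytes in one shared
// workspace, each starting on a 64-byte boundary.
constexpr Layout layout(std::size_t a_bytes, std::size_t b_bytes) {
  std::size_t off_a = 0;
  std::size_t off_b = align_up(off_a + a_bytes, 64);
  std::size_t total = align_up(off_b + b_bytes, 64);
  return {off_a, off_b, total};
}

// The whole layout is resolved at compile time:
static_assert(layout(100, 40).off_b == 128);
static_assert(layout(100, 40).total == 192);
```

At runtime the generated code then makes a single allocation of `total` bytes and hands each kernel a pointer at its pre-computed offset, instead of allocating per kernel.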

See Codegen Pipeline for a detailed walkthrough of each stage, and JIT vs AOT for how the two execution modes differ.