# How anvil Works
anvil intercepts tinygrad's compilation pipeline at the point where kernels would normally be executed, and instead emits standalone C++ source code. This page gives a high-level view of the pipeline and compares it with tinygrad's native execution flow.
## The anvil pipeline
```mermaid
flowchart LR
    A["Python function<br/>(Tensor ops)"] --> B["Trace<br/>(UOp graph)"]
    B --> C["Schedule<br/>(ExecItems)"]
    C --> D["Render kernels<br/>(C/Metal/CUDA)"]
    D --> E["Assemble C++<br/>(Jinja2 templates)"]
    E --> F1[".hpp + .cpp<br/>(AOT)"]
    E --> F2[".dylib/.so<br/>(JIT)"]
```
1. **Trace:** The Python function is called on symbolic `Tensor.empty()` inputs. tinygrad builds a `UOp` graph representing the computation.
2. **Schedule:** The UOp graph is lowered into a linear sequence of `ExecItem`s -- each representing either a compute kernel or a memory copy. This step decides buffer allocation and data movement.
3. **Render kernels:** Each `ExecItem` with a compute kernel is rendered to source code using tinygrad's renderer (`ClangRenderer` for CPU, `MetalRenderer` for Metal, `CUDARenderer` for CUDA).
4. **Assemble C++:** Jinja2 templates combine the rendered kernels with buffer type declarations, workspace management code, and constant data to produce complete C++ source.
5. **Output:** Either standalone `.hpp`/`.cpp` files (AOT) or a compiled shared library loaded via `ctypes` (JIT).
## Comparison with tinygrad
tinygrad's native pipeline executes computations eagerly on the current device. anvil diverges after the scheduling step:
```mermaid
flowchart TB
    subgraph both["Shared steps"]
        A["Tensor operations"] --> B["UOp graph"]
        B --> C["Schedule (ExecItems)"]
    end
    subgraph tg["tinygrad (native)"]
        C --> D1["Compile kernels<br/>(clang/Metal/CUDA)"]
        D1 --> E1["Execute on device"]
        E1 --> F1["Result in memory"]
    end
    subgraph anvil_path["anvil (codegen)"]
        C --> D2["Render kernel source"]
        D2 --> E2["Assemble C++ via<br/>Jinja2 templates"]
        E2 --> F2a["AOT: .hpp/.cpp files"]
        E2 --> F2b["JIT: compile + ctypes"]
    end
```
Key differences:
| | tinygrad | anvil |
|---|---|---|
| Goal | Execute now | Generate code for later |
| Compiler | Built-in (compiles + runs) | Renders source only |
| Buffers | Runtime device memory | Compile-time typed `Buffer<T, Ns...>` |
| Constants | In-memory | Embedded as `static constexpr` arrays |
| Intermediates | Per-kernel allocation | Shared workspace with pre-computed layout |
| Output | Tensor in memory | C++ files or shared library |
## What anvil adds on top
Beyond code generation, anvil layers several features that tinygrad doesn't natively provide:
- Automatic differentiation at the UOp graph level (JVP and VJP), with sparse Jacobian support via graph coloring
- Vectorizing transforms (`vmap`) via UOp graph rewriting
- SQP solver template that generates a complete optimization loop with PIQP as the QP backend
- Multistage problem structure that exploits block-tridiagonal sparsity
- Workspace management with compile-time buffer layout and alignment
See Codegen Pipeline for a detailed walkthrough of each stage, and JIT vs AOT for how the two execution modes differ.