# Codegen Pipeline
This page walks through each stage of anvil's code generation in detail. For the implementation specifics, see NumericalFunction Internals.
## Stage 1: Tracing
When you access a `NumericalFunction`'s codegen properties for the first time, anvil creates symbolic input tensors via `Tensor.empty(shape, dtype)` and calls the wrapped function on them. tinygrad internally builds a UOp (micro-operation) graph representing the computation.
```mermaid
flowchart LR
A["Tensor.empty(1024)"] --> B["fn(x)"]
B --> C["UOp graph<br/>(BUFFER → ops → output)"]
```
Two fixups are applied to trace outputs:

- Multi-dim outputs get `.contiguous()` to guarantee a C-contiguous buffer layout (prevents PERMUTE chains from producing non-row-major kernels)
- Input-aliased outputs (where the output is a view of an input, e.g. `lambda x: x[:2]`) get `+ 0` to force a separate output buffer
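The tracing step can be sketched with a minimal operator-overloading tracer. This is a toy stand-in for tinygrad's real UOp machinery — the class and op names below are invented for illustration:

```python
from dataclasses import dataclass

# Toy stand-in for tinygrad's UOp graph: each node records an op name and its
# operands, so calling fn(x) on a symbolic tensor builds an expression DAG
# instead of computing values. Class and op names here are invented.
@dataclass(frozen=True)
class SymTensor:
    op: str
    srcs: tuple = ()

    def __add__(self, other):
        return SymTensor("ADD", (self, as_sym(other)))

    def __mul__(self, other):
        return SymTensor("MUL", (self, as_sym(other)))

def as_sym(v):
    # Wrap Python scalars as constant nodes.
    return v if isinstance(v, SymTensor) else SymTensor(f"CONST({v})")

def trace(fn, n_inputs):
    # Analogous to calling the wrapped function on Tensor.empty inputs.
    params = [SymTensor(f"PARAM{i}") for i in range(n_inputs)]
    return fn(*params)

graph = trace(lambda x: x * x + 1, 1)
print(graph.op)          # ADD
print(graph.srcs[0].op)  # MUL
```

The real graph additionally carries shapes, dtypes, and buffer references, which is what makes the later scheduling and rendering stages possible.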
## Stage 2: Callification and scheduling
The traced UOp graph is transformed into tinygrad's scheduling form:
```mermaid
flowchart LR
A["UOp graph"] --> B["symbolic_simple<br/>(simplification)"]
B --> C["transform_to_call<br/>(CALL/PARAM form)"]
C --> D["complete_create_schedule<br/>(linear ExecItems)"]
```
- Symbolic simplification: constant folding, algebraic identities
- Callification: rewrites the graph into a top-level `CALL` op with `PARAM` placeholders for inputs. Decides which outputs need fresh buffers and which constants need `COPY` ops.
- Scheduling: lowers the `CALL` graph into a linear list of `ExecItem`s, each representing a kernel invocation or memory copy.
!!! note
    Scheduling operates on separate tensor copies to avoid mutating the original trace outputs, which `vmap` and other transforms need unmodified.
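The lowering to a linear schedule boils down to a topological sort of the op graph. A toy sketch, using plain dicts instead of tinygrad's real UOp/`ExecItem` types:

```python
# Toy sketch of scheduling: lower a small op DAG into a flat execution order
# via depth-first topological sort, analogous to how complete_create_schedule
# emits a linear list of ExecItems. Plain dicts stand in for real UOp nodes.
def linearize(root):
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for src in node.get("srcs", ()):
            visit(src)          # dependencies are scheduled first
        order.append(node["op"])
    visit(root)
    return order

param = {"op": "PARAM0"}
mul = {"op": "MUL", "srcs": [param, param]}
sink = {"op": "SINK", "srcs": [mul]}
print(linearize(sink))  # ['PARAM0', 'MUL', 'SINK']
```

The key property is that every node appears after all of its dependencies, so the resulting list can be executed front to back.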
## Stage 3: Kernel rendering
Each `ExecItem` is rendered to platform-specific source code:
```mermaid
flowchart LR
A["ExecItem<br/>(SINK op)"] --> B["get_program<br/>(tinygrad)"]
B --> C["Renderer<br/>(Clang/Metal/CUDA)"]
C --> D["RenderedKernel<br/>(source + metadata)"]
```
A `RenderedKernel` contains:

- `src`: the rendered C/Metal/CUDA source for the kernel
- `ast`: the UOp AST
- `globals`: all buffer references
- `ins`, `outs`: input and output buffers
- `global_size`, `local_size`: GPU launch dimensions (`None` for CPU)
`COPY` operations (for constants and data movement) are rendered as `std::memcpy` calls rather than compute kernels.
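As a rough picture, the fields above could be modeled as a dataclass; the types here are assumptions for illustration, not anvil's actual definition:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# The RenderedKernel fields listed above, modeled as a dataclass. Field types
# are assumptions for illustration; anvil's real class may differ.
@dataclass
class RenderedKernelSketch:
    src: str                                     # rendered C/Metal/CUDA source
    ast: Any                                     # UOp AST
    globals: list = field(default_factory=list)  # all buffer references
    ins: list = field(default_factory=list)      # input buffers
    outs: list = field(default_factory=list)     # output buffers
    global_size: Optional[tuple] = None          # GPU launch dims (None on CPU)
    local_size: Optional[tuple] = None

k = RenderedKernelSketch(src="void k0(float* out, const float* in) {}", ast=None)
print(k.global_size)  # None -> a CPU kernel carries no launch dimensions
```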
## Stage 4: Final assembly
Jinja2 templates combine everything into complete C++ source:
```mermaid
flowchart TB
A["RenderedKernel[]"] --> E["Jinja2 templates"]
B["Buffer metadata<br/>(shapes, dtypes)"] --> E
C["Constant data"] --> E
D["Codegen constants"] --> E
E --> F[".hpp header<br/>(types, declarations)"]
E --> G[".cpp source<br/>(kernels, init_ws, call)"]
```
The generated code contains:
- Header: `Buffer<T, Ns...>` template, typed input/output/workspace aliases, sparse metadata (for `SparseNumericalFunction`), function declarations, codegen constants
- Source: kernel implementations, `static constexpr` constant arrays, `init_ws()` (workspace allocation + constant copy), `call()` (kernel dispatch sequence)
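The assembly step can be sketched with the stdlib's `string.Template` standing in for the real Jinja2 templates. The template text below is illustrative; only the `init_ws`/`call` names mirror the generated-code layout described above:

```python
from string import Template

# Sketch of final assembly: splice rendered kernels and dispatch calls into a
# C++ source skeleton. string.Template stands in for anvil's Jinja2 templates;
# the template text itself is invented for illustration.
cpp_tmpl = Template("""\
$kernels

void init_ws(WS_t& ws) { /* copy constexpr constants into workspace */ }

void call(const IN0_t& in0, OUT0_t& out0, WS_t& ws) {
$dispatch
}
""")

kernels = ["void k0(float* out, const float* in) { /* ... */ }"]
dispatch = ["  k0(out0.data(), in0.data());"]
src = cpp_tmpl.substitute(kernels="\n".join(kernels),
                          dispatch="\n".join(dispatch))
print(src)
```

The real templates additionally emit the header, the `static constexpr` constant arrays, and the sparse metadata, but the shape is the same: metadata in, complete compilable source out.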
## Buffer categories
```mermaid
flowchart TB
subgraph Inputs
I["IN0_t, IN1_t, ..."]
end
subgraph Outputs
O["OUT0_t, OUT1_t, ..."]
end
subgraph Workspace["Workspace (WS_t)"]
K["Constants<br/>(copied in init_ws)"]
T["Intermediates<br/>(kernel temporaries)"]
end
I --> |"call()"| Kernels["Kernel 1 → Kernel 2 → ..."]
K --> Kernels
T --> Kernels
Kernels --> O
```
Four buffer categories exist:
- Input: from the function signature. Passed by reference to `call()`.
- Output: fresh buffers allocated by the caller. Written by kernels.
- Constant: captured closure data (matrices, weights). Embedded as `static constexpr` arrays and copied into the workspace during `init_ws()`.
- Intermediate: temporaries shared between kernels. Packed into the workspace with 16-byte alignment.
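The 16-byte workspace packing for constants and intermediates can be sketched as a simple offset calculation (`pack_workspace` is a hypothetical helper, not anvil's actual code):

```python
# Sketch of workspace packing: each constant/intermediate buffer gets an
# offset rounded up to 16-byte alignment, as described above. pack_workspace
# is a hypothetical helper, not anvil's actual code.
def pack_workspace(sizes, align=16):
    offsets, off = [], 0
    for size in sizes:
        off = (off + align - 1) // align * align  # round up to next boundary
        offsets.append(off)
        off += size
    total = (off + align - 1) // align * align     # pad overall size too
    return offsets, total

offsets, total = pack_workspace([24, 100, 8])
print(offsets, total)  # [0, 32, 144] 160
```

Aligning every buffer to 16 bytes keeps SIMD loads and stores inside each kernel safe regardless of which temporaries precede it in the workspace.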