NumericalFunction internals¶
The NumericalFunction class is implemented as a frozen dataclass with cached properties and no other methods (in particular, no methods that take inputs).
See Frozen dataclasses with cached properties for details on the structure and its tradeoffs.
For the JIT compilation pipeline (how __call__ dispatches to native code), see CPU JIT compilation.
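The frozen-dataclass-with-cached-properties pattern can be sketched with stdlib tools alone. `Pipeline` and its fields below are hypothetical stand-ins for illustration, not the real `NumericalFunction` attributes:

```python
from dataclasses import dataclass
from functools import cached_property

# Hypothetical stand-in illustrating the frozen-dataclass +
# cached_property pattern; the class and field names are invented.
@dataclass(frozen=True)
class Pipeline:
    source: str

    @cached_property
    def tokens(self) -> list[str]:
        # Computed once on first access, then memoized. cached_property
        # writes straight into the instance __dict__, which bypasses the
        # frozen dataclass's __setattr__ guard, so no thawing is needed.
        return self.source.split()

p = Pipeline("a b c")
assert p.tokens == ["a", "b", "c"]
assert p.tokens is p.tokens  # second access hits the cache
```

Because every derived value is a cached property of immutable fields, the object behaves like a pure function of its constructor arguments, which is what makes the lazy staging below safe.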
Codegen stages¶
The code generation of a NumericalFunction is divided into the following stages:
- **tracing**: happens lazily when calling the underlying function on a set of symbolic input tensors created with `Tensor.empty`. tinygrad internally constructs the AST as a `UOp` graph that you can query from the tensors returned by the function call. All of this happens in `NumericalFunction.function_graph`. Two fixups are applied to the trace outputs before building the graph body:
    - multi-dim outputs get `.contiguous()` to guarantee C-contiguous buffer layout (prevents `PERMUTE` view chains from producing non-row-major kernels).
    - input-aliased outputs (where the output is a view of an input buffer, e.g. `lambda x: x[:2]`) get `+ 0` to force tinygrad to allocate a separate output buffer and generate a copy kernel.
- **callification + scheduling** (`_schedule_and_output_uops`): before tinygrad can schedule the graph, we first run a lightweight symbolic simplification pass, then `transform_to_call(...)`. This rewrites the traced output graph into a top-level `CALL` whose body is a function graph with `PARAM` ops as placeholder inputs. During this step, tinygrad also decides which outputs need fresh buffers and which captured constants need explicit `COPY` ops. We then lower that `CALL` graph into a linear list of `ExecItem`s with `complete_create_schedule_with_vars(...)`. Important: scheduling operates on separate tensor copies (`sched_outputs`) to avoid mutating `_trace_outputs`, which `svmap`/`vmap` needs unmodified.
- **kernel rendering**: platform-specific optimizations are performed on the kernels (e.g. beam search, devectorization, unrolling) and source code is generated for each kernel. This happens in `NumericalFunction.rendered_kernels` by calling a mix of tinygrad functions and custom functions (to have more control over the codegen).
- **final rendering**: we assemble the global header and source code that defines the appropriate type for each input/output buffer, `static constexpr`s for constant buffers embedded in the function definition, and a public function that calls all the kernels sequentially.
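The idea behind the tracing stage can be modeled in a few lines of stdlib Python: calling the user function on placeholder values records an op graph instead of computing numbers. All names here (`Sym`, `trace`, `render`) are invented for the sketch; the real pipeline traces through tinygrad's `Tensor.empty` tensors and inspects the resulting `UOp` graph.

```python
from dataclasses import dataclass

# Toy tracer sketching the tracing stage: operator overloads on a
# placeholder type build an AST as a side effect of calling the
# user function. Sym/trace/render are invented names.
@dataclass(frozen=True)
class Sym:
    op: str
    srcs: tuple = ()

    def __add__(self, other): return Sym("ADD", (self, _lift(other)))
    def __mul__(self, other): return Sym("MUL", (self, _lift(other)))

def _lift(x):
    # Wrap plain Python numbers as constant nodes.
    return x if isinstance(x, Sym) else Sym(f"CONST({x})")

def trace(fn, n_inputs):
    # Placeholder inputs play the role of the Tensor.empty tensors.
    params = [Sym(f"PARAM{i}") for i in range(n_inputs)]
    return params, fn(*params)

def render(node):
    if not node.srcs:
        return node.op
    return f"{node.op}({', '.join(render(s) for s in node.srcs)})"

params, out = trace(lambda x, y: x * y + 2, 2)
print(render(out))  # ADD(MUL(PARAM0, PARAM1), CONST(2))
```

The `PARAM` placeholders here loosely mirror the `PARAM` ops that `transform_to_call(...)` substitutes into the function graph body during callification.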
Buffer types¶
There are four buffer categories in NumericalFunction:
- **input**: the buffers associated with the `Tensor.empty` placeholders we create for tracing.
- **output**: after callification, outputs that require materialization are remapped to fresh CPU buffers. `NumericalFunction` applies that remap only to its own output tensors, so shared constants captured by multiple functions are left untouched. `output_bufs` reads from the scheduled UOps (stored separately in `_schedule_and_output_uops`), not from `_trace_outputs`.
- **constant**: captured closure data appears in the schedule as `COPY` sources from non-CPU devices such as `PYTHON`, `NPY`, or `DISK`. These source buffers are embedded as `static constexpr` arrays in the generated C++.
- **intermediate**: all scheduled kernel globals that are neither function arguments nor constant sources. This includes ordinary temporaries and the CPU-side destinations of constant-copy kernels.
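The classification rule can be sketched as a small set computation over a toy schedule. The data model here (kernels as pairs of write/read buffer-id sets, explicit `const_sources`) is invented for illustration and does not match the real scheduler structures:

```python
# Toy sketch of the four buffer categories. A "schedule" is modeled as
# kernels that write and read integer buffer ids; all names are invented.
def classify(kernels, inputs, outputs, const_sources):
    """kernels: list of (writes, reads) pairs of buffer-id sets."""
    seen = set()
    for writes, reads in kernels:
        seen |= writes | reads
    args = inputs | outputs
    # constants: COPY sources living off-CPU (e.g. PYTHON/NPY/DISK).
    constants = const_sources & seen
    # intermediates: everything the kernels touch that is neither a
    # function argument nor a constant source -- note this includes
    # the CPU-side destinations of constant-copy kernels.
    intermediates = seen - args - constants
    return {"input": inputs, "output": outputs,
            "constant": constants, "intermediate": intermediates}

kernels = [({10}, {100}),    # COPY: const source 100 -> CPU buffer 10
           ({20}, {0, 10}),  # temp 20 = f(input 0, const copy 10)
           ({30}, {1, 20})]  # output 30 = g(input 1, temp 20)
cats = classify(kernels, inputs={0, 1}, outputs={30}, const_sources={100})
# buffer 10 lands in "intermediate" even though it holds constant data,
# because only the off-CPU source (100) counts as a constant buffer.
```

The key subtlety this encodes is the last bullet above: a constant's CPU-side copy destination is an intermediate, while only the off-CPU `COPY` source is embedded as a `static constexpr` array.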