NumericalFunction internals

The NumericalFunction class is implemented as a frozen dataclass with cached properties and no other methods (in particular, no methods that take arguments). See Frozen dataclasses with cached properties for details on the structure and its tradeoffs.
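The pattern itself can be sketched in a few lines. This is a minimal illustration of the general idiom, not the real class (the field and property names here are hypothetical): all state is fixed at construction, and every derived value is a `cached_property` computed lazily on first access. Note that `functools.cached_property` works on a frozen dataclass because it writes the memoized value directly into the instance `__dict__`, bypassing the frozen `__setattr__` guard.

```python
from dataclasses import dataclass
from functools import cached_property

@dataclass(frozen=True)
class FrozenStage:
    """Sketch of the idiom: immutable inputs, lazily cached derived values."""
    source: str  # hypothetical field; the real class holds the traced function etc.

    @cached_property
    def upper(self) -> str:
        # Computed once on first access; the result is stored in
        # self.__dict__, which is allowed even on a frozen dataclass.
        return self.source.upper()
```

A consequence of this design is that each property behaves like a memoized pure function of the constructor arguments, which is what makes the staged codegen pipeline below safe to evaluate lazily and out of order.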

For the JIT compilation pipeline (how __call__ dispatches to native code), see CPU JIT compilation.

Codegen stages

The code generation of a NumericalFunction is divided into the following stages:

  1. tracing: happens lazily when the underlying function is called on a set of symbolic input tensors created with Tensor.empty. tinygrad internally constructs the AST as a UOp graph that can be queried from the tensors returned by the function call. All of this happens in NumericalFunction.function_graph. Two fixups are applied to the trace outputs before building the graph body:
     - multi-dim outputs get .contiguous() to guarantee a C-contiguous buffer layout (this prevents PERMUTE view chains from producing non-row-major kernels).
     - input-aliased outputs (where the output is a view of an input buffer, e.g. lambda x: x[:2]) get + 0 to force tinygrad to allocate a separate output buffer and generate a copy kernel.
  2. callification + scheduling (_schedule_and_output_uops): before tinygrad can schedule the graph, we first run a lightweight symbolic simplification pass, then transform_to_call(...). This rewrites the traced output graph into a top-level CALL whose body is a function graph with PARAM ops as placeholder inputs. During this step, tinygrad also decides which outputs need fresh buffers and which captured constants need explicit COPY ops. We then lower that CALL graph into a linear list of ExecItems with complete_create_schedule_with_vars(...). Important: scheduling operates on separate tensor copies (sched_outputs) to avoid mutating _trace_outputs, which svmap/vmap needs unmodified.
  3. kernel rendering: platform-specific optimizations are performed on the kernels (e.g. beam search, devectorization, unrolling) and source code is generated for each kernel. This happens in NumericalFunction.rendered_kernels, which calls a mix of tinygrad functions and custom functions (to retain more control over the codegen).
  4. final rendering: we assemble the global header and source code, which define an appropriate type for each input/output buffer, static constexprs for the constant buffers embedded in the function definition, and a public function that calls all the kernels sequentially.
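The staged structure above maps naturally onto the frozen-dataclass-with-cached-properties design: each stage is a cached property that reads the one before it. The sketch below shows only that shape, under loud assumptions: the property names `function_graph` and `rendered_kernels` come from the text, but the bodies are placeholder strings standing in for the real tinygrad graph, schedule, and kernel objects, and `schedule`/`final_source` are hypothetical names.

```python
from dataclasses import dataclass
from functools import cached_property

@dataclass(frozen=True)
class CodegenSketch:
    """Shape-only sketch of the four codegen stages; bodies are placeholders."""
    fn_name: str

    @cached_property
    def function_graph(self) -> str:
        # stage 1: tracing (lazy; real version builds a UOp graph)
        return f"trace({self.fn_name})"

    @cached_property
    def schedule(self) -> str:
        # stage 2: callification + scheduling (real version produces ExecItems)
        return f"schedule({self.function_graph})"

    @cached_property
    def rendered_kernels(self) -> list[str]:
        # stage 3: per-kernel optimization and source generation
        return [f"kernel0<{self.schedule}>", f"kernel1<{self.schedule}>"]

    @cached_property
    def final_source(self) -> str:
        # stage 4: assemble header, constants, and the public entry point
        return "\n".join(self.rendered_kernels)
```

Because every stage is memoized, asking for `final_source` triggers the whole chain exactly once, and asking for an early stage (e.g. just the traced graph) never pays for the later ones.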

Buffer types

There are four buffer categories in NumericalFunction:

  1. input: the buffers associated with the Tensor.empty placeholders we create for tracing.
  2. output: after callification, outputs that require materialization are remapped to fresh CPU buffers. NumericalFunction applies that remap only to its own output tensors, so shared constants captured by multiple functions are left untouched. output_bufs reads from the scheduled UOps (stored separately in _schedule_and_output_uops), not from _trace_outputs.
  3. constant: captured closure data appears in the schedule as COPY sources from non-CPU devices such as PYTHON, NPY, or DISK. These source buffers are embedded as static constexpr arrays in the generated C++.
  4. intermediate: all scheduled kernel globals that are neither function arguments nor constant sources. This includes ordinary temporaries and the CPU-side destinations of constant-copy kernels.
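The four-way split above amounts to a simple precedence check over the scheduled buffers. The helper below is hypothetical (`classify` and its parameters are not the real API); it only mirrors the classification logic described in this section, using plain strings in place of real buffer objects.

```python
def classify(buf: str, inputs: set[str], outputs: set[str],
             const_sources: set[str]) -> str:
    """Hypothetical sketch of the buffer categorization; not the real API."""
    if buf in inputs:
        return "input"         # Tensor.empty placeholders created for tracing
    if buf in outputs:
        return "output"        # fresh CPU buffers remapped after callification
    if buf in const_sources:
        return "constant"      # COPY sources on non-CPU devices (PYTHON/NPY/DISK)
    return "intermediate"      # temporaries and constant-copy destinations
```

Note that the CPU-side *destination* of a constant-copy kernel deliberately falls through to "intermediate": only the non-CPU source buffer is embedded as a static constexpr array.

```python
inputs, outputs, const_sources = {"x"}, {"y"}, {"w_src"}
labels = {b: classify(b, inputs, outputs, const_sources)
          for b in ["x", "y", "w_src", "w_cpu", "tmp0"]}
```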