JIT compilation

NumericalFunction.__call__ and DenseSQPFunction.__call__ dispatch to natively compiled C++ via lazy JIT compilation. On first call, the function's C++ source is generated, compiled to a shared library with clang++, loaded via ctypes, and cached on disk. Subsequent calls go straight to native code.

GPU backends (Metal, CUDA) follow the same lazy JIT pipeline — the only differences are the rendered kernel source (a GPU shading language instead of C++), the host-side template (GPU API calls instead of direct function calls), and the compiler flags.

Architecture

__call__(np.ndarray args)
    ├─ _jit_call_fn  ← ctypes function pointer (cached)
    │      │
    │      ├─ _jit_module  ← ctypes.CDLL wrapping the .dylib/.so (cached)
    │      │      │
    │      │      ├─ jit_source  ← single self-contained .cpp string (cached)
    │      │      │      │
    │      │      │      └─ generate_jit_source()  ← collects deps, renders templates, mangles keywords
    │      │      │
    │      │      └─ compile_to_shared_lib()  ← clang++ -shared -O2, cached in ~/.cache/anvil/jit/
    │      │
    │      └─ _jit_ws  ← workspace pointer (heap-allocated via init_ws, freed via weakref.finalize)
    └─ result: np.ndarray

Files

File                                        Role
src/anvil/codegen/jit.py                    generate_jit_source, compile_to_shared_lib, JitModule, C++ keyword mangling
templates/jit_module.j2                     Single-file .cpp combining Buffer template + namespace + all functions
templates/jit_shim_numerical_function.j2    extern "C" wrappers: init_ws, free_ws, call
templates/jit_shim_sqp_function.j2          extern "C" wrappers: init_ws, deinit_ws, call, get_x/lam_*

NumericalFunction JIT flow

  1. __call__ converts inputs to contiguous numpy arrays with the declared dtype
  2. Allocates output arrays with np.empty(shape, dtype)
  3. Passes all data pointers + workspace pointer to the native call() via ctypes
  4. Returns the output array(s)
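The four steps above can be sketched at the ctypes level. The shim signature and the Python callback standing in for the compiled call() are illustrative assumptions; the real shim is the compiled extern "C" function from the .dylib/.so:

```python
import ctypes
import numpy as np

# Hypothetical shim signature mirroring the flow above:
#   void call(const double* in, double* out, void* ws)
# A Python callback stands in for the compiled shared library here.
CALL_T = ctypes.CFUNCTYPE(None, ctypes.POINTER(ctypes.c_double),
                          ctypes.POINTER(ctypes.c_double), ctypes.c_void_p)

def _fake_native_call(inp, out, ws):
    out[0] = inp[0] * 2.0  # the real kernel body lives in compiled C++

native_call = CALL_T(_fake_native_call)

def jit_call(x):
    # 1. contiguous input with the declared dtype
    x = np.ascontiguousarray(x, dtype=np.float64)
    # 2. allocate the output buffer
    out = np.empty(x.shape, dtype=np.float64)
    # 3. pass raw data pointers (+ workspace pointer, None in this sketch)
    native_call(x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                out.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                None)
    # 4. return the output array
    return out
```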

SQP JIT flow

  1. __call__ prepares input arrays (zeros for None, validates parameter shape)
  2. Passes settings as flat scalars, array pointers, and output scalar pointers to native call()
  3. After the call, reads results via accessor functions (get_x, get_lam_bounds, etc.) that return pointers into the workspace struct
  4. Copies results to numpy arrays and returns SQPResult

Output buffer correctness

The JIT __call__ allocates C-contiguous numpy arrays and passes their data pointers to the C++ kernel. The kernel must write to these buffers in C-contiguous order. Three cases require special handling in function_graph and _schedule_and_output_uops:

Multi-dimensional outputs (.contiguous())

Problem: tinygrad's scheduler may produce kernels that write 2D+ outputs in non-C-contiguous order. This happens when the output tensor has PERMUTE view chains (e.g., a Jacobian built by stacking columns then transposing). The flat buffer is written column-major, but numpy interprets it row-major.

Example:

jac_fn = jacobian(fn, argnum=1)  # output shape (3, 2)
# UOp chain: BUFFER(6) → RESHAPE(2,3) → PERMUTE(1,0) → output(3,2)
# Kernel writes in column-major order: [a00, a10, a20, a01, a11, a21]

Fix: In function_graph, trace outputs with ndim > 1 get .contiguous():

elif t.uop.op is not Ops.CONTIGUOUS and len(t.shape) > 1:
    trace_outputs += (t.contiguous(),)

This is a no-op for already-contiguous tensors (.contiguous() returns the same UOp identity). For non-contiguous views, it inserts a CONTIGUOUS op that makes the scheduler produce a C-order copy kernel.
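The misinterpretation is easy to reproduce with plain numpy: a flat buffer written in column-major order, read back with numpy's default row-major assumption, yields a scrambled matrix:

```python
import numpy as np

# The kernel wrote the (3, 2) Jacobian column-major into the flat buffer:
flat = np.array([0.0, 1.0, 2.0, 0.1, 1.1, 2.1])  # [a00,a10,a20,a01,a11,a21]

wrong = flat.reshape(3, 2)             # numpy assumes C order: rows scrambled
right = flat.reshape(3, 2, order="F")  # what the buffer actually contains

# wrong[0] is [0.0, 1.0], but the true first row is [0.0, 0.1]
```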

Input-aliased outputs (+ 0)

Problem: When a function returns a view of its input (e.g., lambda x: x[:2]), tinygrad reuses the same Buffer object for both input and output. This causes two issues:

  1. arg_bufs contains the same Buffer twice, producing duplicate C++ parameter names
  2. The scheduler generates no kernel (the data is "already in the right place")

Example:

fn = NumericalFunction("slice", lambda x: x[:2], (Arg(5),))
# output UOp: SHRINK(BUFFER) — a view of the input buffer
# input_bufs and output_bufs share the same Buffer object

Fix: In function_graph, detect when an output's base buffer is an input buffer and inject + 0:

if is_aliased:
    trace_outputs += (t + 0,)

The + 0 creates an ADD UOp that forces tinygrad to allocate a new output buffer and generate a kernel that copies the data. This survives symbolic_simple because the operands have different UOp identities (unlike CONST + 0 which folds).
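The same aliasing and the same cure can be demonstrated with numpy as an analogy (numpy views share memory with their base array, and + 0 forces a fresh buffer, just like the injected ADD UOp):

```python
import numpy as np

x = np.arange(5.0)
view = x[:2]       # analogous to lambda x: x[:2] — a view, no copy
copied = view + 0  # forces a fresh buffer, like the injected ADD UOp

# view aliases x; copied does not
```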

CONST outputs (Python bridge fallback)

Problem: Functions that return compile-time constants (lambda x: Tensor(42.0)) have no buffer after scheduling. The + 0 trick doesn't work because symbolic_simple folds CONST + 0 → CONST. The .contiguous() approach also fails — the scheduler produces no kernel for CONTIGUOUS(CONST).

Fix: The case is detected after scheduling via _has_const_output, and __call__ falls back to the tinygrad Python bridge:

if self._has_const_output:
    result = self.fn(*tensor_args)
    return result.numpy()

This only affects degenerate functions that ignore their inputs entirely — not encountered in real codegen workloads.

Non-destructive scheduling

Problem: The scheduling pipeline (_schedule_and_output_uops) modifies trace output tensors' UOps as a side effect (pre-simplification, callification, buffer remapping). This breaks svmap/vmap which reads the original unmodified function graph for vectorization.

Fix: Scheduling operates on separate tensor copies (sched_outputs), never touching fg._trace_outputs:

sched_outputs = tuple(t.contiguous() if ... else Tensor(t.uop) for t in fg._trace_outputs)
sink = UOp.sink(*[x.uop for x in sched_outputs])
# ... all scheduling mutations happen on sched_outputs ...
scheduled_uops = tuple(t.uop for t in sched_outputs)

output_bufs reads from the scheduled_uops (which carry the post-scheduling buffer references) instead of fg._trace_outputs.
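The copy-before-mutate pattern can be reduced to a toy sketch (the Node class and the string mutation are purely illustrative; in the real code the mutations are pre-simplification, callification, and buffer remapping on UOps):

```python
class Node:
    """Toy stand-in for a traced output tensor."""
    def __init__(self, op):
        self.op = op

trace_outputs = (Node("PERMUTE"), Node("RESHAPE"))

# scheduling-style mutations happen on fresh wrappers, never the originals,
# so vmap/svmap can still read the pristine function graph afterwards
sched_outputs = tuple(Node(n.op) for n in trace_outputs)
for n in sched_outputs:
    n.op = f"SCHEDULED({n.op})"
```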

C++ keyword mangling

Problem: NumericalFunction names like "double" or "float" produce invalid C++ namespace declarations (namespace double { ... }).

Fix: generate_jit_source post-processes the rendered source with regex replacements:

source = re.sub(r"\b" + re.escape(fn.name) + r"::", mangled + "::", source)
source = re.sub(r"namespace\s+" + re.escape(fn.name) + r"\s*\{", f"namespace {mangled} {{", source)

The word boundary \b prevents false matches (e.g., _fn_double:: is not re-mangled). The module namespace is also mangled via _mangle(module_name) before rendering.

Caching

Compiled shared libraries are cached in ~/.cache/anvil/jit/ keyed by {name}_{sha256(source)}.{dylib|so}. Compilation writes to a .tmp file first, then atomically renames, so concurrent processes don't corrupt the cache. A cache hit skips compilation entirely.
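The caching scheme fits in a few lines. compile_cached and the compile_fn callable are hypothetical stand-ins; the real logic lives in compile_to_shared_lib, with compile_fn standing in for the clang++ invocation:

```python
import hashlib
import os

def cache_path(cache_dir, name, source, ext="so"):
    # key: {name}_{sha256(source)}.{dylib|so}
    digest = hashlib.sha256(source.encode()).hexdigest()
    return os.path.join(cache_dir, f"{name}_{digest}.{ext}")

def compile_cached(cache_dir, name, source, compile_fn, ext="so"):
    path = cache_path(cache_dir, name, source, ext)
    if os.path.exists(path):      # cache hit: skip compilation entirely
        return path
    tmp = path + ".tmp"
    compile_fn(source, tmp)       # stands in for clang++ -shared -O2
    os.replace(tmp, path)         # atomic rename: concurrent processes
                                  # never observe a half-written library
    return path
```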

Workspace lifecycle

  • NumericalFunction: init_ws() allocates a Buffer<char, N> (flat workspace for intermediate buffers). free_ws() frees it. Cleanup registered via weakref.finalize.
  • SQP: init_ws() allocates the full Ws struct (PIQP workspace, QP data, all intermediate/working buffers). deinit_ws() calls piqp_cleanup and frees all buffers. Also cleaned up via weakref.finalize. The SQP shared library links against libpiqpc with -Wl,-rpath for runtime discovery.
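The weakref.finalize pattern used for cleanup can be sketched as follows (the JitHandle class and the fake pointer are illustrative; in anvil the callback is the native free_ws()/deinit_ws() shim):

```python
import weakref

freed = []

def free_ws(ptr):
    # stands in for the native free_ws()/deinit_ws() shim
    freed.append(ptr)

class JitHandle:
    def __init__(self):
        self._jit_ws = 0xDEAD  # stands in for the pointer from init_ws()
        # Tie native cleanup to this object's lifetime. finalize runs at
        # garbage collection or interpreter exit, and crucially holds no
        # reference back to self (a __del__-style cycle would leak).
        self._finalizer = weakref.finalize(self, free_ws, self._jit_ws)

h = JitHandle()
del h  # the finalizer fires once the object is collected
```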

GPU backends

NumericalFunction accepts device="METAL" or device="CUDA" to target GPU execution. The public call() C++ interface is identical to CPU — GPU dispatch is fully encapsulated.

Rendering pipeline

  1. Tracing + scheduling: Unchanged. Happens on CPU symbolic tensors. The UOp graph and schedule are device-agnostic.
  2. Kernel rendering: render_kernel(ast, bufs, renderer) uses _CustomMetalRenderer() or _CustomCUDARenderer(arch) instead of CustomClangRenderer. These are thin wrappers around tinygrad's MetalRenderer/CUDARenderer with a no-op compiler (we only need the rendered source — actual GPU kernel compilation happens at runtime in the generated C++ init_ws()). Returns RenderedKernel with GPU kernel source + launch dimensions (global_size, local_size).
  3. Code generation: GPU-specific Jinja2 templates produce Objective-C++ (Metal) or C++ with CUDA driver API code. GPU kernel sources are embedded as C++ raw string literals (R"METAL(...)METAL" / R"CUDA(...)CUDA").
  4. Compilation: compile_to_shared_lib uses -x objective-c++ -framework Metal -framework Foundation (Metal) or -I/usr/local/cuda/include -lcuda -lnvrtc (CUDA).

GPU workspace and call() flow

init_ws() performs one-time setup:

  • Creates the GPU context (Metal device+queue / CUDA context+stream)
  • Compiles each kernel from embedded source at runtime (Metal newLibraryWithSource: / NVRTC nvrtcCompileProgram → PTX → cuModuleLoadData)
  • Allocates GPU buffers for inputs, outputs, intermediates, and constants
  • Copies constant data to GPU buffers and executes COPY kernels

call() on each invocation:

  1. memcpy inputs from host to GPU buffers
  2. Dispatch each kernel (Metal command buffer+encoder / cuLaunchKernel)
  3. Synchronize (Metal waitUntilCompleted / cuStreamSynchronize)
  4. memcpy outputs from GPU buffers back to host

Files

File                                        Role
templates/numerical_function_gpu_metal.j2   Metal template: kernel embedding, WS_t struct, init_ws/call with Metal API
templates/numerical_function_gpu_cuda.j2    CUDA template: kernel embedding, WS_t struct, init_ws/call with CUDA driver API + NVRTC
numerical_function.py                       _CustomMetalRenderer, _CustomCUDARenderer, _NoOpCompiler, _detect_cuda_arch(), GPU cached properties (gpu_kernel_sources, gpu_buf_nbytes, gpu_kernel_buf_indices)

Dtype handling

Metal does not support float64. A ValueError is raised at NumericalFunction construction time if any Arg has dtype=float64 with device="METAL". CUDA supports float64 natively.
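The construction-time check can be sketched as follows. The namedtuple Arg and validate_dtypes are illustrative stand-ins for anvil's Arg type and the validation inside NumericalFunction.__init__:

```python
from collections import namedtuple
import numpy as np

Arg = namedtuple("Arg", "shape dtype")  # stand-in for anvil's Arg

def validate_dtypes(args, device):
    # Metal has no float64 support, so reject it before codegen
    if device == "METAL":
        for i, arg in enumerate(args):
            if np.dtype(arg.dtype) == np.float64:
                raise ValueError(
                    f"Arg {i} has dtype float64, which Metal does not "
                    "support; use float32 or device='CUDA'")

validate_dtypes((Arg((5,), np.float32),), "METAL")  # OK
validate_dtypes((Arg((5,), np.float64),), "CUDA")   # OK: CUDA has float64
```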

Known limitations

  • Constants must be float32 on Metal: Captured Tensor constants in closures must be float32 — a float64 constant produces a Metal kernel with double* parameters which Metal rejects. The dtype validation only checks Arg inputs, not captured constants.
  • One GPU backend per compilation unit: All GPU functions in a single SQP solver (or module) must use the same backend. Mixed Metal + CUDA is not supported.
  • SQP with GPU dependencies: Not yet implemented. The SQP loop runs on CPU; GPU support for individual NumericalFunction dependencies within SQP is planned.