# JIT compilation
`NumericalFunction.__call__` and `DenseSQPFunction.__call__` dispatch to natively compiled C++ via lazy JIT compilation. On first call, the function's C++ source is generated, compiled to a shared library with clang++, loaded via ctypes, and cached on disk. Subsequent calls go straight to native code.

GPU backends (Metal, CUDA) follow the same lazy JIT pipeline — the only differences are the rendered kernel source (GPU shading language instead of C), the host-side template (GPU API calls instead of direct function calls), and the compiler flags.
## Architecture

```text
__call__(np.ndarray args)
│
├─ _jit_call_fn ← ctypes function pointer (cached)
│  │
│  ├─ _jit_module ← ctypes.CDLL wrapping the .dylib/.so (cached)
│  │  │
│  │  ├─ jit_source ← single self-contained .cpp string (cached)
│  │  │  │
│  │  │  └─ generate_jit_source() ← collects deps, renders templates, mangles keywords
│  │  │
│  │  └─ compile_to_shared_lib() ← clang++ -shared -O2, cached in ~/.cache/anvil/jit/
│  │
│  └─ _jit_ws ← workspace pointer (heap-allocated via init_ws, freed via weakref.finalize)
│
└─ result: np.ndarray
```
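The compute-once behavior of this chain can be sketched with `functools.cached_property`. This is a hypothetical stand-in class, not the real implementation: the compile/load step is stubbed with a dict so the sketch runs without clang++ or a shared library.

```python
from functools import cached_property

class LazyJit:
    """Sketch of the lazy JIT chain. Each stage is computed on first
    access and cached; later calls go straight to the cached callable."""
    compile_count = 0  # tracks how often the "compile" stage runs

    @cached_property
    def jit_source(self) -> str:
        # stands in for generate_jit_source(): render templates into one .cpp string
        return "// generated C++ source"

    @cached_property
    def _jit_module(self):
        # stands in for compile_to_shared_lib() + ctypes.CDLL(path)
        _ = self.jit_source
        LazyJit.compile_count += 1
        return {"call": lambda *args: sum(args)}

    @cached_property
    def _jit_call_fn(self):
        # stands in for looking up the exported call() symbol on the CDLL
        return self._jit_module["call"]

f = LazyJit()
f._jit_call_fn(1, 2)  # first access: generate, "compile", load
f._jit_call_fn(3, 4)  # cached: goes straight to the function pointer
```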
## Files

| File | Role |
|---|---|
| `src/anvil/codegen/jit.py` | `generate_jit_source`, `compile_to_shared_lib`, `JitModule`, C++ keyword mangling |
| `templates/jit_module.j2` | Single-file `.cpp` combining `Buffer` template + namespace + all functions |
| `templates/jit_shim_numerical_function.j2` | `extern "C"` wrappers: `init_ws`, `free_ws`, `call` |
| `templates/jit_shim_sqp_function.j2` | `extern "C"` wrappers: `init_ws`, `deinit_ws`, `call`, `get_x`/`lam_*` |
## NumericalFunction JIT flow

1. `__call__` converts inputs to contiguous numpy arrays with the declared dtype
2. Allocates output arrays with `np.empty(shape, dtype)`
3. Passes all data pointers + workspace pointer to the native `call()` via ctypes
4. Returns the output array(s)
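The four steps can be sketched at the numpy/ctypes level. This is a hypothetical helper, not anvil code: `ctypes.memmove` stands in for the compiled `call()` symbol (which here just copies the input).

```python
import ctypes
import numpy as np

def jit_call(x: np.ndarray) -> np.ndarray:
    """Sketch of the JIT call flow; memmove stands in for the native call()."""
    # 1. coerce input to a C-contiguous array with the declared dtype
    x = np.ascontiguousarray(x, dtype=np.float64)
    # 2. allocate the output buffer
    out = np.empty(x.shape, dtype=np.float64)
    # 3. hand raw data pointers to "native" code
    ctypes.memmove(out.ctypes.data, x.ctypes.data, x.nbytes)
    # 4. return the output array(s)
    return out

y = jit_call(np.array([[1.0, 2.0], [3.0, 4.0]]).T)  # non-contiguous input is fine
```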
## SQP JIT flow

1. `__call__` prepares input arrays (zeros for `None`, validates parameter shape)
2. Passes settings as flat scalars, array pointers, and output scalar pointers to native `call()`
3. After the call, reads results via accessor functions (`get_x`, `get_lam_bounds`, etc.) that return pointers into the workspace struct
4. Copies results to numpy arrays and returns `SQPResult`
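Step 3's accessor-pointer pattern can be sketched with plain ctypes. Hypothetical names throughout: the real `get_x` is an `extern "C"` symbol in the shared library, and a ctypes array stands in for the workspace struct field.

```python
import ctypes
import numpy as np

n = 3
ws = (ctypes.c_double * n)(1.0, 2.0, 3.0)  # stands in for a Ws struct field

def get_x(ws_obj):
    # stands in for the native accessor returning a pointer into the workspace
    return ctypes.cast(ws_obj, ctypes.POINTER(ctypes.c_double))

ptr = get_x(ws)
# view the workspace memory, then copy so the result owns its data
x = np.ctypeslib.as_array(ptr, shape=(n,)).copy()
```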
## Output buffer correctness

The JIT `__call__` allocates C-contiguous numpy arrays and passes their data pointers to the C++ kernel. The kernel must write to these buffers in C-contiguous order. Three cases require special handling in `function_graph` and `_schedule_and_output_uops`:
### Multi-dimensional outputs (`.contiguous()`)

Problem: tinygrad's scheduler may produce kernels that write 2D+ outputs in non-C-contiguous order. This happens when the output tensor has PERMUTE view chains (e.g., a Jacobian built by stacking columns then transposing). The flat buffer is written column-major, but numpy interprets it row-major.
Example:

```python
jac_fn = jacobian(fn, argnum=1)  # output shape (3, 2)
# UOp chain: BUFFER(6) → RESHAPE(2,3) → PERMUTE(1,0) → output(3,2)
# Kernel writes in column-major order: [a00, a10, a20, a01, a11, a21]
```
Fix: In `function_graph`, trace outputs with `ndim > 1` get `.contiguous()`. This is a no-op for already-contiguous tensors (`.contiguous()` returns the same UOp identity); for non-contiguous views, it inserts a CONTIGUOUS op that makes the scheduler produce a C-order copy kernel.
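The ordering mismatch is easy to reproduce with plain numpy (a numpy-only illustration of the bug, not anvil code): a flat buffer written column-major gets scrambled when read back row-major.

```python
import numpy as np

# Flat buffer a kernel wrote in column-major order for a (3, 2) Jacobian:
flat = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])  # [a00, a10, a20, a01, a11, a21]

wrong = flat.reshape(3, 2)             # numpy's default row-major read: scrambled
right = flat.reshape(3, 2, order="F")  # what the kernel actually meant
```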
### Input-aliased outputs (`+ 0`)

Problem: When a function returns a view of its input (e.g., `lambda x: x[:2]`), tinygrad reuses the same Buffer object for both input and output. This causes two issues:

1. `arg_bufs` contains the same Buffer twice, producing duplicate C++ parameter names
2. The scheduler generates no kernel (the data is "already in the right place")
Example:

```python
fn = NumericalFunction("slice", lambda x: x[:2], (Arg(5),))
# output UOp: SHRINK(BUFFER) — a view of the input buffer
# input_bufs and output_bufs share the same Buffer object
```
Fix: In `function_graph`, detect when an output's base buffer is an input buffer and inject `+ 0`. The `+ 0` creates an ADD UOp that forces tinygrad to allocate a new output buffer and generate a kernel that copies the data. This survives `symbolic_simple` because the operands have different UOp identities (unlike `CONST + 0`, which folds).
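The same aliasing effect, and why `+ 0` breaks it, shows up at the numpy level (an analogy only; tinygrad operates on UOps, but the buffer-sharing behavior is comparable):

```python
import numpy as np

x = np.arange(5.0)
view = x[:2]       # a slice is a view: same underlying buffer as x
copied = view + 0  # arithmetic allocates a fresh output buffer
```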
### CONST outputs (Python bridge fallback)

Problem: Functions that return compile-time constants (`lambda x: Tensor(42.0)`) have no buffer after scheduling. The `+ 0` trick doesn't work because `symbolic_simple` folds `CONST + 0` → `CONST`. The `.contiguous()` approach also fails — the scheduler produces no kernel for `CONTIGUOUS(CONST)`.

Fix: The case is detected after scheduling via `_has_const_output`, and the call falls back to the tinygrad Python bridge. This only affects degenerate functions that ignore their inputs entirely — not encountered in real codegen workloads.
## Non-destructive scheduling

Problem: The scheduling pipeline (`_schedule_and_output_uops`) modifies trace output tensors' UOps as a side effect (pre-simplification, callification, buffer remapping). This breaks `svmap`/`vmap`, which read the original unmodified function graph for vectorization.

Fix: Scheduling operates on separate tensor copies (`sched_outputs`), never touching `fg._trace_outputs`:
```python
sched_outputs = tuple(t.contiguous() if ... else Tensor(t.uop) for t in fg._trace_outputs)
sink = UOp.sink(*[x.uop for x in sched_outputs])
# ... all scheduling mutations happen on sched_outputs ...
scheduled_uops = tuple(t.uop for t in sched_outputs)
```
`output_bufs` reads from `scheduled_uops` (which carry the post-scheduling buffer references) instead of `fg._trace_outputs`.
## C++ keyword mangling

Problem: NumericalFunction names like `"double"` or `"float"` produce invalid C++ namespace declarations (`namespace double { ... }`).

Fix: `generate_jit_source` post-processes the rendered source with regex replacements:

```python
source = re.sub(r"\b" + re.escape(fn.name) + r"::", mangled + "::", source)
source = re.sub(r"namespace\s+" + re.escape(fn.name) + r"\s*\{", f"namespace {mangled} {{", source)
```

The word boundary `\b` prevents false matches (e.g., `_fn_double::` is not re-mangled). The module namespace is also mangled via `_mangle(module_name)` before rendering.
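A standalone run of the two substitutions makes the word-boundary guard concrete. The `_mangle` definition here is a stand-in matching the `_fn_double` form mentioned above; the real scheme may differ.

```python
import re

def _mangle(name: str) -> str:
    return f"_fn_{name}"  # stand-in matching the _fn_double form above

name = "double"  # C++ keyword used as a function name
source = 'namespace double {\n  double_t call();\n}\nauto y = double::call();'

mangled = _mangle(name)
source = re.sub(r"\b" + re.escape(name) + r"::", mangled + "::", source)
source = re.sub(r"namespace\s+" + re.escape(name) + r"\s*\{",
                f"namespace {mangled} {{", source)
# a second pass is a no-op: \b does not match inside _fn_double::
source = re.sub(r"\b" + re.escape(name) + r"::", mangled + "::", source)
```

Note that `double_t` is untouched: `double` there is followed by a word character, so neither pattern matches.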
## Caching

Compiled shared libraries are cached in `~/.cache/anvil/jit/`, keyed by `{name}_{sha256(source)}.{dylib|so}`. Compilation writes to a `.tmp` file first, then atomically renames, so concurrent processes don't corrupt the cache. A cache hit skips compilation entirely.
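The key-then-atomic-rename scheme can be sketched as follows. `cached_lib_path` is a hypothetical helper; writing placeholder bytes stands in for invoking `clang++ -shared -O2`.

```python
import hashlib
import os
import sys
from pathlib import Path

def cached_lib_path(name: str, source: str, cache_dir: Path) -> Path:
    """Sketch of the cache layout above; the compile step is stubbed."""
    ext = "dylib" if sys.platform == "darwin" else "so"
    key = hashlib.sha256(source.encode()).hexdigest()
    path = cache_dir / f"{name}_{key}.{ext}"
    if path.exists():
        return path                      # cache hit: skip compilation entirely
    tmp = Path(str(path) + ".tmp")
    tmp.write_bytes(b"compiled output")  # clang++ -shared -O2 in reality
    os.replace(tmp, path)                # atomic rename: safe under concurrency
    return path
```

`os.replace` is atomic on the same filesystem, so a concurrent process either sees no cache entry or a complete one, never a half-written library.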
## Workspace lifecycle

- **NumericalFunction**: `init_ws()` allocates a `Buffer<char, N>` (flat workspace for intermediate buffers). `free_ws()` frees it. Cleanup is registered via `weakref.finalize`.
- **SQP**: `init_ws()` allocates the full `Ws` struct (PIQP workspace, QP data, all intermediate/working buffers). `deinit_ws()` calls `piqp_cleanup` and frees all buffers. Also cleaned up via `weakref.finalize`. The SQP shared library links against libpiqpc with `-Wl,-rpath` for runtime discovery.
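The cleanup registration pattern can be sketched with `weakref.finalize`. `NativeOwner` is a hypothetical class; appending to a list stands in for calling `free_ws`/`deinit_ws` through ctypes.

```python
import weakref

class NativeOwner:
    """Sketch of the lifecycle: allocate on construction, free on collection."""
    freed = []  # records which workspaces were released

    def __init__(self, name: str):
        self._ws = object()  # stands in for the init_ws() workspace pointer
        # finalize runs the callback when the object is garbage-collected,
        # without the reference-cycle pitfalls of defining __del__ directly
        weakref.finalize(self, NativeOwner.freed.append, name)

f = NativeOwner("my_fn")
del f  # CPython refcounting triggers the finalizer immediately
```

Note the callback must not reference the instance itself, or the finalizer would keep it alive forever.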
## GPU backends

NumericalFunction accepts `device="METAL"` or `device="CUDA"` to target GPU execution. The public `call()` C++ interface is identical to CPU — GPU dispatch is fully encapsulated.
### Rendering pipeline

- **Tracing + scheduling**: Unchanged. Happens on CPU symbolic tensors. The UOp graph and schedule are device-agnostic.
- **Kernel rendering**: `render_kernel(ast, bufs, renderer)` uses `_CustomMetalRenderer()` or `_CustomCUDARenderer(arch)` instead of `CustomClangRenderer`. These are thin wrappers around tinygrad's `MetalRenderer`/`CUDARenderer` with a no-op compiler (we only need the rendered source — actual GPU kernel compilation happens at runtime in the generated C++ `init_ws()`). Returns `RenderedKernel` with GPU kernel source + launch dimensions (`global_size`, `local_size`).
- **Code generation**: GPU-specific Jinja2 templates produce Objective-C++ (Metal) or C++ with CUDA driver API code. GPU kernel sources are embedded as C++ raw string literals (`R"METAL(...)METAL"` / `R"CUDA(...)CUDA"`).
- **Compilation**: `compile_to_shared_lib` uses `-x objective-c++ -framework Metal -framework Foundation` (Metal) or `-I/usr/local/cuda/include -lcuda -lnvrtc` (CUDA).
### GPU workspace and `call()` flow

`init_ws()` performs one-time setup:

- Creates the GPU context (Metal device+queue / CUDA context+stream)
- Compiles each kernel from embedded source at runtime (Metal `newLibraryWithSource:` / NVRTC `nvrtcCompileProgram` → PTX → `cuModuleLoadData`)
- Allocates GPU buffers for inputs, outputs, intermediates, and constants
- Copies constant data to GPU buffers and executes COPY kernels

`call()` on each invocation:

1. memcpy inputs from host to GPU buffers
2. Dispatch each kernel (Metal command buffer+encoder / `cuLaunchKernel`)
3. Synchronize (Metal `waitUntilCompleted` / `cuStreamSynchronize`)
4. memcpy outputs from GPU buffers back to host
### Files

| File | Role |
|---|---|
| `templates/numerical_function_gpu_metal.j2` | Metal template: kernel embedding, `WS_t` struct, `init_ws`/`call` with Metal API |
| `templates/numerical_function_gpu_cuda.j2` | CUDA template: kernel embedding, `WS_t` struct, `init_ws`/`call` with CUDA driver API + NVRTC |
| `numerical_function.py` | `_CustomMetalRenderer`, `_CustomCUDARenderer`, `_NoOpCompiler`, `_detect_cuda_arch()`, GPU cached properties (`gpu_kernel_sources`, `gpu_buf_nbytes`, `gpu_kernel_buf_indices`) |
### Dtype handling

Metal does not support `float64`. A `ValueError` is raised at NumericalFunction construction time if any `Arg` has `dtype=float64` with `device="METAL"`. CUDA supports `float64` natively.
### Known limitations

- **Constants must be float32 on Metal**: Captured `Tensor` constants in closures must be `float32` — a `float64` constant produces a Metal kernel with `double*` parameters, which Metal rejects. The dtype validation only checks `Arg` inputs, not captured constants.
- **One GPU backend per compilation unit**: All GPU functions in a single SQP solver (or module) must use the same backend. Mixed Metal + CUDA is not supported.
- **SQP with GPU dependencies**: Not yet implemented. The SQP loop runs on CPU; GPU support for individual `NumericalFunction` dependencies within SQP is planned.