# Tinygrad Findings & Optimization Notes
This file documents findings regarding the tinygrad compilation pipeline, IR (UOps), and potential optimizations.
## Loop Peeling

### Status
Loop peeling is partially implemented in tinygrad under the name "loop splitting" or "range cutting".
### Implementation Details

In `tinygrad/codegen/simplify.py`, the function `cut_store_range` (part of `pm_split_store`) implements a form of loop peeling by splitting a range based on constant comparisons within the loop body.
```python
# From tinygrad/codegen/simplify.py
def cut_store_range(ctx, store:UOp, r:UOp):
  # only cut ranges on CPU for now
  if r.src[0].op is not Ops.CONST or ctx!="CPU": return None
  if not (cuts:=[c.src[1].arg for c in store.get_consumer_map()[r] if c.op is Ops.CMPLT and r is c.src[0] and c.src[1].op is Ops.CONST]): return None
  cuts = sorted(dedup([0] + cuts + [r.src[0].arg]))
  ranges = [UOp.range((end-start), *(r.arg[0:-1]+(i,r.arg[-1]))) for i,(start,end) in enumerate(zip(cuts[:-1], cuts[1:]))]
  return UOp.group(*[store.substitute({r: new_r+start}).end(new_r) for new_r, start in zip(ranges, cuts[:-1])])
```
- **Logic:** If it finds a `STORE` inside a `RANGE` `r`, and there are `CMPLT` (less-than) comparisons of `r` against constants, it splits the loop into multiple segments at those constants.
- **Constraint:** It is currently restricted to the `CPU` device (`ctx == "CPU"`).
### How to implement a general version

To implement a generic loop peeler (e.g., peel the first \(N\) iterations), one could add a rule to a `PatternMatcher` that:

1. Matches an `Ops.RANGE`.
2. Splits it into a constant range `0..N` and a remaining range `N..End`.
3. Uses `UOp.group` to emit both loops.
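Tinygrad aside, the transformation itself is easy to state in plain Python: peeling splits one loop into a constant-trip-count prefix and a remainder loop. A stand-alone sketch (hypothetical helper, not tinygrad code) showing the equivalence:

```python
def run_peeled(body, end, n):
    """Execute body(i) for i in range(end), with the first n iterations
    peeled into a separate loop. Behaviorally identical to the unpeeled
    loop; the point is that a compiler can now unroll or specialize the
    constant-trip-count prefix independently of the remainder."""
    prefix = min(n, end)
    for i in range(prefix):        # peeled prefix: trip count known up front
        body(i)
    for i in range(prefix, end):   # remainder loop
        body(i)

# Equivalence check against the plain loop:
out_a, out_b = [], []
for i in range(7): out_a.append(i * i)
run_peeled(lambda i: out_b.append(i * i), 7, 3)
assert out_a == out_b
```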
### Integration

You can inject custom transformations without modifying tinygrad source by using `Renderer.pre_matcher`. This `PatternMatcher` is executed during the `full_rewrite_to_sink` pass, right before the kernel is finalized for rendering.
## Register Reuse and SSA Pressure

### The Problem

Tinygrad's IR is primarily SSA-based. In very complex kernels, this results in a massive number of unique variable names (`alu0`, `alu1`, ...). While C++ compilers are good at register allocation, extremely large SSA blocks can lead to:
1. Slow compilation times (register allocation can be \(O(N^2)\) or worse).
2. Increased register pressure if the live ranges are long.
3. Spilling to the stack if the allocator gives up.
### Potential Improvements

- **Sinking/Hoisting:** Use `pre_matcher` to move `LOAD`s closer to their first use, or `STORE`s closer to their last definition, shortening live ranges.
- **Sub-kernel splitting:** If a kernel is too complex, it may be better to split it into two smaller kernels at the `Schedule` level, although this introduces memory round-trips.
- **Grouping:** Group related operations that use the same inputs so the compiler can see the reuse pattern more easily.
- **Linearizer scheduling:** Register pressure depends heavily on the order of UOps. Tinygrad's `linearize` pass (in `tinygrad/codegen/late/linearizer.py`) determines this order; adjusting its topological-sort heuristic can significantly affect liveness.
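The effect of ordering on liveness can be demonstrated without tinygrad. Below, a tiny straight-line program is scheduled two ways, and `max_live` counts how many values are simultaneously live, a rough proxy for register pressure. The value names and the DAG are made up for illustration:

```python
def max_live(schedule, uses):
    """Peak number of simultaneously live values, given an execution order
    and a map from each value to the values it reads."""
    last_use = {}
    for t, v in enumerate(schedule):
        for src in uses[v]:
            last_use[src] = t  # final position at which src is read
    live, peak = set(), 0
    for t, v in enumerate(schedule):
        live.add(v)
        peak = max(peak, len(live))
        live -= {src for src in uses[v] if last_use[src] == t}  # operands die here
        if last_use.get(v, t) == t:  # value never read later: dead immediately
            live.discard(v)
    return peak

# Four independent loads, each pair feeding an add, then a final sum:
uses = {"l0": [], "l1": [], "l2": [], "l3": [],
        "a0": ["l0", "l1"], "a1": ["l2", "l3"], "s": ["a0", "a1"]}
wide   = ["l0", "l1", "l2", "l3", "a0", "a1", "s"]  # all loads hoisted to the top
narrow = ["l0", "l1", "a0", "l2", "l3", "a1", "s"]  # loads sunk near their use
print(max_live(wide, uses), max_live(narrow, uses))  # → 5 4
```

Both orders are valid topological sorts of the same DAG; only the sunk-loads order keeps fewer values alive at the peak.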
## Debugging Schedule Errors

- **Check tensor spec first:** Use `type_verify(sink, tensor_spec)` to see if UOps pass validation.
- **Trace rangeify:** Use `run_rangeify(sink, debug=True)` to see range assignments.
- **Inspect UOp tree:** Check `tensor.uop.op` and traverse `.src` to find problematic ops.
- **Look for EXPANDs:** Empty tensors often become `EXPAND` after rangeify.
## `spjacobian` with Zero Columns

When a sparse Jacobian has 0 columns (the input dimension is 0), `spjacobian` should return an empty tensor of shape `(0,)`, representing 0 nonzero values. The fix in `jacobian.py` handles this case explicitly.
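The actual patch isn't reproduced here. As an illustration of the convention only, here is a hypothetical finite-difference stand-in (plain Python lists instead of tensors, not anvil's implementation) that returns an empty result in the zero-column case:

```python
def spjacobian(f, x, eps=1e-6):
    """Sketch: sparse Jacobian of f at x in COO-like form (rows, cols, vals).
    With zero input columns there are no nonzero entries, so return empty
    containers instead of building indices from an empty column range."""
    n = len(x)
    if n == 0:
        return [], [], []  # the (0,)-shaped "no nonzeros" case
    f0 = f(x)
    rows, cols, vals = [], [], []
    for j in range(n):  # one forward difference per input column
        xp = list(x)
        xp[j] += eps
        for i, (a, b) in enumerate(zip(f(xp), f0)):
            d = (a - b) / eps
            if d != 0.0:
                rows.append(i); cols.append(j); vals.append(d)
    return rows, cols, vals
```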
## Missing `.end()` for RANGE loops in custom kernels

Problem: When using `UOp.range()` in custom kernels (like `eq_constraints_jac_assembler`), you must call `.end(range_uop)` on the grouped stores before calling `.sink()`. Otherwise, the generated C++ code will have a `for` loop without its closing brace `}`, causing compilation errors like "function definition is not allowed here".
Symptoms:
- C++ compilation error: "function definition is not allowed here"
- Mismatched braces in generated kernel code (more `{` than `}`)
- Next function definition appears inside the previous function body
Incorrect:

```python
r = UOp.range(N - 1, 0)
all_copies = [output[base_offset + ...].store(input[...]) for ...]  # uses r
return UOp.group(*all_copies).sink(arg=KernelInfo(name="my_kernel"))  # WRONG: missing .end(r)
```

Correct:

```python
r = UOp.range(N - 1, 0)
all_copies = [output[base_offset + ...].store(input[...]) for ...]  # uses r
return UOp.group(*all_copies).end(r).sink(arg=KernelInfo(name="my_kernel"))  # CORRECT
```
Key insight: The `RANGE` op generates a `for (...) {` in C++, and `END` generates the closing `}`. Without `.end()`, tinygrad's `ClangRenderer` emits the opening brace but never the closing one.
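One way to catch this before the compiler error is to sanity-check the rendered source yourself. A quick hypothetical check (naive: it would miscount braces inside string literals, which generated kernels rarely contain):

```python
def brace_balance(src: str) -> int:
    """Return count('{') - count('}') for a rendered kernel. A positive
    balance suggests a RANGE whose END (the .end() call) was never emitted."""
    return src.count("{") - src.count("}")

# Toy stand-ins for rendered kernels, with and without the closing brace:
good = "void k(float* a) { for (int r = 0; r < 4; r++) { a[r] = 0.0f; } }"
bad  = "void k(float* a) { for (int r = 0; r < 4; r++) { a[r] = 0.0f; }"
print(brace_balance(good), brace_balance(bad))  # → 0 1
```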
## Devectorizer `get_idx` assertion with large sparse tensors

Problem: `fold_expanded_index` in `tinygrad/codegen/late/devectorizer.py` asserts `self.dtype.scalar() is dtypes.index` on a node that has a non-index dtype. This happens when realizing sparse Jacobian/Hessian tensors produced by `svmap` at certain horizon sizes.
Symptoms:
- Assertion failure at `devectorizer.py:76` inside `fold_expanded_index`.
Trigger pattern: Sparse tensors produced by `svmap`-based `MultistageProblem` functions (e.g. `ineq_constraints_fn`, `cost_hessian_fn`). The failure is N-dependent with a period-4 pattern: for N >= 12, it fails when `N % 4 ∈ {0, 1}` and succeeds when `N % 4 ∈ {2, 3}`. This suggests a SIMD vector-width alignment issue in the devectorizer.
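The observed failure set can be written down as a predicate, handy for filtering horizon sizes when applying the workaround below. Note this encodes the empirically observed pattern, not a proven rule:

```python
def devectorizer_bug_expected(N: int) -> bool:
    """True for horizon sizes empirically observed to trip the assertion."""
    return N >= 12 and N % 4 in (0, 1)

print([N for N in range(10, 25) if devectorizer_bug_expected(N)])
# → [12, 13, 16, 17, 20, 21, 24]
```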
Reproduction (anvil):
```python
from anvil import Tensor, numerical_function
from anvil.opti import MultistageProblem, SparseSQPFunction, SQPSettings, HessianApproximation

nz, nx_dim = 10, 4
for N in range(10, 25):
    nhorizon = N
    nparam = nx_dim

    @numerical_function(f"eq_i_{N}", (nz, nparam), (8,))
    def eq_i(z, p): return Tensor.cat(z[:4] - p[:4], z[6:])

    @numerical_function(f"eq_d_{N}", (nz, nz, nparam), (8,))
    def eq_d(z, zn, p): return Tensor.cat(z[:4] - zn[:4], zn[6:])

    @numerical_function(f"ci_{N}", (nz, nparam), ((),))
    def ci(z, p): return z[:4].dot(z[:4])

    @numerical_function(f"iq_{N}", (nz, nparam), (2,))
    def iq(z, p): return Tensor.stack(z[0], z[1])

    prob = MultistageProblem(
        stage_dim=nz, horizon_size=nhorizon,
        z_lb=Tensor.full((nz * (N + 1),), -10.0),
        z_ub=Tensor.full((nz * (N + 1),), 10.0),
        eq_rhs=Tensor.zeros(8 + N * 8),
        eq_initial_fn=eq_i, eq_interstage_fn=eq_d,
        ineq_stage_fn=iq, ineq_terminal_fn=iq,
        ineq_lb=Tensor.full((2 * N,), -5.0),
        ineq_ub=Tensor.full((2 * N,), 5.0),
        cost_initial_fn=ci, cost_stage_fn=ci, cost_terminal_fn=ci,
    )
    solver = SparseSQPFunction(
        name=f"s{N}", problem=prob,
        settings=SQPSettings(max_iter=2, hessian_approximation=HessianApproximation.EXACT_NO_CONSTRAINTS),
    )
    try:
        solver(initial_x=Tensor.zeros(nz * (N + 1)), p=Tensor.zeros(nx_dim))
        print(f"N={N}: OK")
    except AssertionError:
        print(f"N={N}: FAIL")
```
Status: Unresolved tinygrad bug. Workaround: use N values where N % 4 ∈ {2, 3}, or use codegen (the bug is in the interpreted-mode kernel compiler).