Tinygrad Findings & Optimization Notes

This file documents findings regarding the tinygrad compilation pipeline, IR (UOps), and potential optimizations.

Loop Peeling

Status

Loop peeling is partially implemented in tinygrad under the name "loop splitting" or "range cutting".

Implementation Details

In tinygrad/codegen/simplify.py, the function cut_store_range (part of pm_split_store) implements a form of loop peeling by splitting a range based on constant comparisons within the loop body.

# From tinygrad/codegen/simplify.py

def cut_store_range(ctx, store:UOp, r:UOp):
  # only cut ranges on CPU for now
  if r.src[0].op is not Ops.CONST or ctx!="CPU": return None
  if not (cuts:=[c.src[1].arg for c in store.get_consumer_map()[r] if c.op is Ops.CMPLT and r is c.src[0] and c.src[1].op is Ops.CONST]): return None
  cuts = sorted(dedup([0] + cuts + [r.src[0].arg]))
  ranges = [UOp.range((end-start), *(r.arg[0:-1]+(i,r.arg[-1]))) for i,(start,end) in enumerate(zip(cuts[:-1], cuts[1:]))]

  return UOp.group(*[store.substitute({r: new_r+start}).end(new_r) for new_r, start in zip(ranges, cuts[:-1])])

  • Logic: If a STORE inside a RANGE r is consumed by CMPLT (less-than) comparisons of r against constants, the loop is split into one segment per interval between those constants.
  • Constraint: It is currently restricted to the CPU device (ctx == "CPU").
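Stripped of the UOp machinery, the cut arithmetic reduces to splitting [0, extent) at the collected CMPLT constants. A plain-Python sketch (not tinygrad code):

```python
def segment_range(extent, cuts):
    # Mirrors the sorted(dedup([0] + cuts + [r.src[0].arg])) step above:
    # every segment is a half-open interval between adjacent cut points.
    points = sorted(set([0] + list(cuts) + [extent]))
    return list(zip(points[:-1], points[1:]))

print(segment_range(10, [4]))        # one cut -> two segments
print(segment_range(10, [7, 4, 4]))  # duplicates and ordering don't matter
```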

How to implement a general version

To implement a generic loop peeler (e.g., peel the first \(N\) iterations), one could add a rule to a PatternMatcher that:

  1. Matches an Ops.RANGE.
  2. Splits it into a constant range 0..N and a remaining range N..End.
  3. Uses UOp.group to emit both loops.
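The semantics of that transformation, in plain Python with no tinygrad APIs — the offset in the second loop plays the role of substitute({r: new_r + start}) in cut_store_range:

```python
def run_loop(body, extent):
    # reference: one loop over the full extent
    return [body(i) for i in range(extent)]

def run_peeled(body, extent, n):
    # peeled version: first n iterations in their own loop, then the
    # remainder with an offset induction variable (new_r + start)
    prefix = [body(i) for i in range(n)]
    rest = [body(j + n) for j in range(extent - n)]
    return prefix + rest

square = lambda i: i * i
print(run_loop(square, 10) == run_peeled(square, 10, 3))  # True
```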

Integration

You can inject custom transformations without modifying tinygrad source by using Renderer.pre_matcher. This PatternMatcher is executed during the full_rewrite_to_sink pass, right before the kernel is finalized for rendering.


Register Reuse and SSA Pressure

The Problem

Tinygrad's IR is primarily SSA-based. In very complex kernels, this results in a massive number of unique variable names (alu0, alu1, ...). While C++ compilers are good at register allocation, extremely large SSA blocks can lead to:

  1. Slow compilation times (register allocation can be \(O(N^2)\) or worse).
  2. Increased register pressure if the live ranges are long.
  3. Spilling to the stack if the allocator gives up.

Potential Improvements

  1. Sinking/Hoisting: Use pre_matcher to move LOADs closer to their first use, or STOREs closer to the definition of the value they write, shortening live ranges.
  2. Sub-kernel splitting: If a kernel is too complex, it might be better to split it into two smaller kernels at the Schedule level, although this introduces memory round-trips.
  3. Grouping: Group related operations that use the same inputs to allow the compiler to see the reuse pattern more easily.
  4. Linearizer Scheduling: The register pressure is heavily dependent on the order of UOps. Tinygrad's linearize pass (in tinygrad/codegen/late/linearizer.py) determines this order. Adjusting the topological sort heuristic in the linearizer can significantly affect liveness.
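Point 4 can be made concrete without touching tinygrad: for the same DAG, peak liveness depends only on the linear order chosen. A toy liveness counter over hypothetical (name, srcs) tuples (not real UOps):

```python
def max_live(schedule):
    # schedule: list of (name, srcs) in execution order; a value dies
    # right after the op that is its last consumer
    last_use = {}
    for pos, (_, srcs) in enumerate(schedule):
        for s in srcs:
            last_use[s] = pos
    live, peak = set(), 0
    for pos, (name, srcs) in enumerate(schedule):
        live.add(name)
        peak = max(peak, len(live))
        for s in srcs:
            if last_use[s] == pos:
                live.discard(s)
    return peak

# same DAG: s3 = (a+b) + (c+d), two different linear orders
loads_first = [("a", ()), ("b", ()), ("c", ()), ("d", ()),
               ("s1", ("a", "b")), ("s2", ("c", "d")), ("s3", ("s1", "s2"))]
interleaved = [("a", ()), ("b", ()), ("s1", ("a", "b")),
               ("c", ()), ("d", ()), ("s2", ("c", "d")), ("s3", ("s1", "s2"))]
print(max_live(loads_first), max_live(interleaved))
```

Interleaving the two sums keeps fewer values live at once than issuing all loads up front.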

Debugging Schedule Errors

  1. Check tensor spec first: Use type_verify(sink, tensor_spec) to see if UOps pass validation
  2. Trace rangeify: Use run_rangeify(sink, debug=True) to see range assignments
  3. Inspect UOp tree: Check tensor.uop.op and traverse .src to find problematic ops
  4. Look for EXPANDs: Empty tensors often become EXPAND after rangeify
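Step 3 is an ordinary walk over a DAG of nodes exposing .op and .src, sketched here on mock nodes rather than real tinygrad UOps:

```python
from collections import namedtuple

# Mock stand-in for a UOp: real tinygrad nodes also expose .op and .src
Node = namedtuple("Node", ["op", "src"])

def find_ops(root, wanted):
    # iterative depth-first search for nodes whose op name is in `wanted`
    seen, stack, hits = set(), [root], []
    while stack:
        n = stack.pop()
        if id(n) in seen:
            continue
        seen.add(id(n))
        if n.op in wanted:
            hits.append(n)
        stack.extend(n.src)
    return hits

leaf = Node("CONST", ())
tree = Node("STORE", (Node("EXPAND", (leaf,)), Node("LOAD", (leaf,))))
print(sorted(n.op for n in find_ops(tree, {"EXPAND", "LOAD"})))
```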

spjacobian with Zero Columns

When a sparse jacobian has 0 columns (input dimension is 0), spjacobian should return an empty tensor of shape (0,) representing 0 nonzero values. The fix in jacobian.py:

if shape[1] == 0:
    return Tensor.empty((0,), dtype=dtype)  # 1D tensor, not 2D (shape[0], 0)

Missing .end() for RANGE loops in custom kernels

Problem: When using UOp.range() in custom kernels (like eq_constraints_jac_assembler), you must call .end(range_uop) on the grouped stores before calling .sink(). Otherwise, the generated C++ code will have a for loop without its closing brace }, causing compilation errors like "function definition is not allowed here".

Symptoms:

  • C++ compilation error: "function definition is not allowed here"
  • Mismatched braces in the generated kernel code (more { than })
  • The next function definition appears inside the previous function body

Incorrect:

r = UOp.range(N - 1, 0)
all_copies = [output[base_offset + ...].store(input[...]) for ...]  # uses r
return UOp.group(*all_copies).sink(arg=KernelInfo(name="my_kernel"))  # WRONG - missing .end(r)

Correct:

r = UOp.range(N - 1, 0)
all_copies = [output[base_offset + ...].store(input[...]) for ...]  # uses r
return UOp.group(*all_copies).end(r).sink(arg=KernelInfo(name="my_kernel"))  # CORRECT

Key insight: The RANGE op generates a for (...) { in C++, and END generates the closing }. Without .end(), tinygrad's ClangRenderer only generates the opening brace but never the closing one.
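A toy rendering of why the brace goes missing — not the real ClangRenderer, just the RANGE-opens / END-closes pairing described above:

```python
def render(uops):
    # Each RANGE emits "for (...) {" and each END emits the matching "}".
    lines = []
    for op in uops:
        if op == "RANGE":
            lines.append("for (int r = 0; r < N; r++) {")
        elif op == "END":
            lines.append("}")
        else:
            lines.append(f"  {op.lower()}(r);")
    return "\n".join(lines)

good = render(["RANGE", "STORE", "END"])
bad = render(["RANGE", "STORE"])          # no END -> unbalanced braces
print(good.count("{") - good.count("}"))  # balanced
print(bad.count("{") - bad.count("}"))    # one unclosed brace
```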

Devectorizer get_idx assertion with large sparse tensors

Problem: fold_expanded_index in tinygrad/codegen/late/devectorizer.py asserts self.dtype.scalar() is dtypes.index on a node that has a non-index dtype. This happens when realizing sparse Jacobian/Hessian tensors produced by svmap at certain horizon sizes.

Symptoms:

AssertionError: Can only call get_idx on index dtype
at devectorizer.py:76 inside fold_expanded_index.

Trigger pattern: Sparse tensors produced by svmap-based MultistageProblem functions (e.g. ineq_constraints_fn, cost_hessian_fn). The failure is N-dependent with a period-4 pattern: for N >= 12, it fails when N % 4 ∈ {0, 1} and succeeds when N % 4 ∈ {2, 3}. This suggests a SIMD vector-width alignment issue in the devectorizer.
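The observed pattern, expressed as a predicate (derived purely from the N-sweep above; it encodes the observation, not the root cause):

```python
def devectorizer_bug_expected(n):
    # Empirically: for N >= 12 the get_idx assertion fires when
    # N % 4 is 0 or 1; smaller N and N % 4 in {2, 3} were fine.
    return n >= 12 and n % 4 in (0, 1)

print([n for n in range(10, 25) if devectorizer_bug_expected(n)])
```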

Reproduction (anvil):

from anvil import Tensor, numerical_function
from anvil.opti import MultistageProblem, SparseSQPFunction, SQPSettings, HessianApproximation

nz, nx_dim = 10, 4
for N in range(10, 25):
    nhorizon = N
    nparam = nx_dim

    @numerical_function(f"eq_i_{N}", (nz, nparam), (8,))
    def eq_i(z, p): return Tensor.cat(z[:4] - p[:4], z[6:])
    @numerical_function(f"eq_d_{N}", (nz, nz, nparam), (8,))
    def eq_d(z, zn, p): return Tensor.cat(z[:4] - zn[:4], zn[6:])
    @numerical_function(f"ci_{N}", (nz, nparam), ((),))
    def ci(z, p): return z[:4].dot(z[:4])
    @numerical_function(f"iq_{N}", (nz, nparam), (2,))
    def iq(z, p): return Tensor.stack(z[0], z[1])

    prob = MultistageProblem(
        stage_dim=nz, horizon_size=nhorizon,
        z_lb=Tensor.full((nz * (N + 1),), -10.0),
        z_ub=Tensor.full((nz * (N + 1),), 10.0),
        eq_rhs=Tensor.zeros(8 + N * 8),
        eq_initial_fn=eq_i, eq_interstage_fn=eq_d,
        ineq_stage_fn=iq, ineq_terminal_fn=iq,
        ineq_lb=Tensor.full((2 * N,), -5.0),
        ineq_ub=Tensor.full((2 * N,), 5.0),
        cost_initial_fn=ci, cost_stage_fn=ci, cost_terminal_fn=ci,
    )
    solver = SparseSQPFunction(
        name=f"s{N}", problem=prob,
        settings=SQPSettings(max_iter=2, hessian_approximation=HessianApproximation.EXACT_NO_CONSTRAINTS),
    )
    try:
        solver(initial_x=Tensor.zeros(nz * (N + 1)), p=Tensor.zeros(nx_dim))
        print(f"N={N}: OK")
    except AssertionError:
        print(f"N={N}: FAIL")

Status: Unresolved tinygrad bug. Workaround: use N values where N % 4 ∈ {2, 3}, or use codegen (the bug is in the interpreted-mode kernel compiler).