GPU Acceleration
`NumericalFunction` supports GPU execution via Metal (macOS) and CUDA (Linux). The Python and C++ interfaces are identical to the CPU path; GPU dispatch is fully encapsulated.
Usage
Pass the `device` parameter when creating a `NumericalFunction`:
```python
import anvil as av

# Metal (macOS)
fn_metal = av.NumericalFunction(
    "my_fn", lambda x: x.square(),
    (av.Arg(1024, dtype=av.dtypes.float32),),
    device="METAL",
)

# CUDA
fn_cuda = av.NumericalFunction(
    "my_fn", lambda x: x.square(),
    (av.Arg(1024),),
    device="CUDA",
)
```
Calling and code generation work exactly the same as on the CPU:
```python
import numpy as np

result = fn_metal(np.random.randn(1024).astype(np.float32))
av.generate_module("gpu_module", [fn_metal])
```
Device constraints
Metal
- `float32` only: Metal does not support `float64`. All `Arg` inputs and captured constants must use `dtypes.float32`.
- macOS only: requires the Metal framework.
CUDA
- Supports both `float32` and `float64`.
- Requires CUDA toolkit 12.0+ with `nvcc` and `nvrtc`.
- Architecture is auto-detected via `nvidia-smi`, falling back to `sm_80`.
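The auto-detection described above can be approximated as follows. This is an illustrative sketch, not anvil's actual implementation; the `compute_cap` query assumes a reasonably recent `nvidia-smi`:

```python
import subprocess

def cap_to_arch(cap: str) -> str:
    """Convert a compute capability string like '8.6' to an arch flag like 'sm_86'."""
    major, minor = cap.strip().split(".")
    return f"sm_{major}{minor}"

def detect_arch(fallback: str = "sm_80") -> str:
    """Query nvidia-smi for the GPU's compute capability; fall back on any failure."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout
        return cap_to_arch(out.splitlines()[0])
    except Exception:
        return fallback
```

Falling back rather than failing keeps code generation usable on machines without a visible GPU, at the cost of possibly targeting a mismatched architecture.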
Generated code
For GPU targets, `generate_module` produces a single `.hpp` file (instead of `.hpp` + `.cpp`) containing:
- GPU kernel source embedded as C++ raw string literals
- Host-side code for runtime kernel compilation (Metal `newLibraryWithSource:` / NVRTC)
- Buffer allocation on GPU and host-GPU data transfers
- The same `init_ws()` / `call()` interface as CPU
The `call()` function handles, in order:
- Copying inputs from host to GPU
- Dispatching kernels
- Synchronizing
- Copying outputs from GPU back to host
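The sequence above can be sketched with a minimal host-side stub. The device and kernel here are stand-ins for illustration only; the real generated code performs these steps through the Metal or CUDA runtime APIs:

```python
class FakeDevice:
    """Stand-in for a GPU runtime; records the order of host-side operations."""
    def __init__(self):
        self.log = []
        self.buffers = {}

    def upload(self, name, host_data):
        self.buffers[name] = list(host_data)   # host -> GPU copy
        self.log.append(f"upload:{name}")

    def dispatch(self, kernel, name):
        self.buffers[name] = [kernel(v) for v in self.buffers[name]]
        self.log.append("dispatch")

    def synchronize(self):
        self.log.append("sync")                # wait for in-flight kernels

    def download(self, name):
        self.log.append(f"download:{name}")    # GPU -> host copy
        return self.buffers[name]

def call(device, kernel, host_input):
    device.upload("x", host_input)             # 1. copy inputs to GPU
    device.dispatch(kernel, "x")               # 2. dispatch kernel
    device.synchronize()                       # 3. synchronize
    return device.download("x")                # 4. copy outputs back

dev = FakeDevice()
out = call(dev, lambda v: v * v, [1.0, 2.0, 3.0])
```

The explicit synchronize before the download matters: kernel dispatch is asynchronous on both backends, so reading results without it would race the GPU.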
Known limitations
- SQP with GPU: not yet supported. The SQP loop runs on CPU; support for GPU `NumericalFunction` dependencies within SQP is planned.
- One GPU backend per module: all functions in a single module must use the same device. Mixing Metal and CUDA is not supported.
- Captured constants on Metal: must be `float32`. The dtype validation only checks `Arg` inputs, not captured `Tensor` constants.
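Until validation covers captured constants, a caller-side guard can catch the Metal dtype mismatch early. The helper below is hypothetical, not part of anvil; it accepts any array-like that exposes a `dtype` attribute (e.g. NumPy arrays or anvil tensors):

```python
def check_metal_dtypes(*arrays):
    """Raise if any array-like carries a dtype other than float32 (Metal requirement).

    Objects without a dtype attribute are skipped.
    """
    for a in arrays:
        dtype = str(getattr(a, "dtype", "float32"))
        if dtype != "float32":
            raise TypeError(f"Metal requires float32, got {dtype}")
```

Calling `check_metal_dtypes(...)` on every captured constant before constructing a Metal `NumericalFunction` turns a silent precision problem into an immediate error.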