GPU Acceleration
`NumericalFunction` supports GPU execution via Metal (macOS) and CUDA (Linux). The Python and C++ interfaces are identical to the CPU path; GPU dispatch is fully encapsulated.
Usage
Pass the `device` parameter when creating a `NumericalFunction`:
```python
import anvil as av

# Metal (macOS)
fn_metal = av.NumericalFunction(
    "my_fn", lambda x: x.square(),
    (av.Arg(1024, dtype=av.dtypes.float32),),
    device="METAL",
)

# CUDA
fn_cuda = av.NumericalFunction(
    "my_fn", lambda x: x.square(),
    (av.Arg(1024),),
    device="CUDA",
)
```
Calling and code generation work exactly the same as on the CPU:
```python
import numpy as np

result = fn_metal(np.random.randn(1024).astype(np.float32))
av.generate_module("gpu_module", [fn_metal])
```
Device constraints
Metal
- `float32` only: Metal does not support `float64`. All `Arg` inputs and captured constants must use `dtypes.float32`.
- macOS only: requires the Metal framework.
CUDA
- Supports both `float32` and `float64`.
- Requires CUDA toolkit 12.0+ with `nvcc` and `nvrtc`.
- Architecture is auto-detected via `nvidia-smi`, falling back to `sm_80`.
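The auto-detection described above can be approximated as follows. This is an illustrative sketch, not anvil's actual implementation; the `compute_cap` query assumes a reasonably recent `nvidia-smi`:

```python
import subprocess

def cap_to_arch(cap: str) -> str:
    """Convert a compute capability string like '8.6' to an arch flag like 'sm_86'."""
    major, minor = cap.strip().split(".")
    return f"sm_{major}{minor}"

def detect_arch(fallback: str = "sm_80") -> str:
    """Query nvidia-smi for the GPU's compute capability; fall back on any failure."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout
        return cap_to_arch(out.splitlines()[0])
    except Exception:
        return fallback
```

Falling back rather than failing keeps code generation usable on machines without a visible GPU, at the cost of possibly targeting a mismatched architecture.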
Generated code
For GPU targets, `generate_module` produces a single `.hpp` file (instead of `.hpp` + `.cpp`) containing:
- GPU kernel source embedded as C++ raw string literals
- Host-side code for runtime kernel compilation (Metal `newLibraryWithSource:` / NVRTC)
- Buffer allocation on GPU and host-GPU data transfers
- The same `init_ws()` / `call()` interface as CPU
The `call()` function handles, in order:
- Copying inputs from host to GPU
- Dispatching kernels
- Synchronizing
- Copying outputs from GPU back to host
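The sequence above can be sketched with a minimal host-side stub. The device and kernel here are stand-ins for illustration only; the real generated code performs these steps through the Metal or CUDA runtime APIs:

```python
class FakeDevice:
    """Stand-in for a GPU runtime; records the order of host-side operations."""
    def __init__(self):
        self.log = []
        self.buffers = {}

    def upload(self, name, host_data):
        self.buffers[name] = list(host_data)   # host -> GPU copy
        self.log.append(f"upload:{name}")

    def dispatch(self, kernel, name):
        self.buffers[name] = [kernel(v) for v in self.buffers[name]]
        self.log.append("dispatch")

    def synchronize(self):
        self.log.append("sync")                # wait for in-flight kernels

    def download(self, name):
        self.log.append(f"download:{name}")    # GPU -> host copy
        return self.buffers[name]

def call(device, kernel, host_input):
    device.upload("x", host_input)             # 1. copy inputs to GPU
    device.dispatch(kernel, "x")               # 2. dispatch kernel
    device.synchronize()                       # 3. synchronize
    return device.download("x")                # 4. copy outputs back

dev = FakeDevice()
out = call(dev, lambda v: v * v, [1.0, 2.0, 3.0])
```

The explicit synchronize before the download matters: kernel dispatch is asynchronous on both backends, so reading results without it would race the GPU.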
Known limitations
- SQP with GPU: not yet supported. The SQP loop runs on CPU; support for GPU `NumericalFunction` dependencies within SQP is planned.
- One GPU backend per module: all functions in a single module must use the same device. Mixing Metal and CUDA is not supported.
- Captured constants on Metal: must be `float32`. The dtype validation only checks `Arg` inputs, not captured `Tensor` constants.
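Until validation covers captured constants, a caller-side guard can catch the Metal dtype mismatch early. The helper below is hypothetical, not part of anvil; it accepts any array-like that exposes a `dtype` attribute (e.g. NumPy arrays or anvil tensors):

```python
def check_metal_dtypes(*arrays):
    """Raise if any array-like carries a dtype other than float32 (Metal requirement).

    Objects without a dtype attribute are skipped.
    """
    for a in arrays:
        dtype = str(getattr(a, "dtype", "float32"))
        if dtype != "float32":
            raise TypeError(f"Metal requires float32, got {dtype}")
```

Calling `check_metal_dtypes(...)` on every captured constant before constructing a Metal `NumericalFunction` turns a silent precision problem into an immediate error.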