GPU Acceleration

NumericalFunction supports GPU execution via Metal (macOS) and CUDA (Linux). The Python and C++ interfaces are identical to the CPU path; GPU dispatch is fully encapsulated.

Usage

Pass the device parameter when creating a NumericalFunction:

import anvil as av

# Metal (macOS)
fn_metal = av.NumericalFunction(
    "my_fn", lambda x: x.square(),
    (av.Arg(1024, dtype=av.dtypes.float32),),
    device="METAL",
)

# CUDA
fn_cuda = av.NumericalFunction(
    "my_fn", lambda x: x.square(),
    (av.Arg(1024),),
    device="CUDA",
)

Calling and code generation work exactly as they do on CPU:

import numpy as np
result = fn_metal(np.random.randn(1024).astype(np.float32))

av.generate_module("gpu_module", [fn_metal])

Device constraints

Metal

  • float32 only: Metal does not support float64. All Arg inputs and captured constants must use dtypes.float32.
  • macOS only, requires Metal framework.
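Because Metal is float32-only, inputs prepared with NumPy need an explicit cast: np.random.randn produces float64 by default, which would violate the Arg dtype declared above. A minimal preparation sketch (NumPy only, no anvil required):

```python
import numpy as np

# np.random.randn returns float64; Metal kernels only accept float32,
# so cast (and make contiguous) before passing data to the function.
x = np.random.randn(1024)
x32 = np.ascontiguousarray(x, dtype=np.float32)

assert x32.dtype == np.float32
# result = fn_metal(x32)  # safe: matches the Arg(1024, dtype=float32) declaration
```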

CUDA

  • Supports both float32 and float64.
  • Requires CUDA toolkit 12.0+ with nvcc and nvrtc.
  • Architecture is auto-detected via nvidia-smi, falling back to sm_80.
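The auto-detect-with-fallback behavior can be illustrated with a small helper. This is only a sketch of how such detection might work, not the library's actual implementation; the detect_arch name and the exact nvidia-smi query are assumptions.

```python
import subprocess

def detect_arch(cmd=("nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader")):
    """Return an sm_XY string for the first visible GPU, else fall back to sm_80."""
    try:
        out = subprocess.run(list(cmd), capture_output=True, text=True, timeout=5)
        cap = out.stdout.strip().splitlines()[0]  # e.g. "8.6"
        major, minor = cap.split(".")
        return f"sm_{major}{minor}"
    except Exception:
        # nvidia-smi missing, timed out, or produced unexpected output
        return "sm_80"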

Generated code

For GPU targets, generate_module produces a single .hpp file (instead of .hpp + .cpp) containing:

  • GPU kernel source embedded as C++ raw string literals
  • Host-side code for runtime kernel compilation (Metal newLibraryWithSource: / NVRTC)
  • Buffer allocation on GPU and host-GPU data transfers
  • The same init_ws() / call() interface as CPU

The call() function handles:

  1. Copying inputs from host to GPU
  2. Dispatching kernels
  3. Synchronizing
  4. Copying outputs from GPU to host
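The four phases above can be modeled with a plain-Python mock that shows the data-movement contract; the names here are illustrative and do not correspond to the generated API.

```python
def mock_gpu_call(kernel, host_input):
    # 1. Copy input from host to "device" (modeled as a plain list copy)
    device_input = list(host_input)
    # 2. Dispatch: one kernel invocation per element
    device_output = [kernel(v) for v in device_input]
    # 3. Synchronize: a real backend would wait on the command queue here (no-op in the mock)
    # 4. Copy output from "device" back to the host
    return list(device_output)

squared = mock_gpu_call(lambda v: v * v, [1.0, 2.0, 3.0])
# squared == [1.0, 4.0, 9.0]
```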

Known limitations

  • SQP with GPU: Not yet supported. The SQP loop runs on the CPU; support for GPU NumericalFunction dependencies within SQP is planned.
  • One GPU backend per module: All functions in a single module must use the same device. Mixed Metal + CUDA is not supported.
  • Captured constants on Metal: Must be float32. The dtype validation only checks Arg inputs, not captured Tensor constants.