CUDA Graphs
The TensorRT compiler can sweep through the graph to choose the best kernel for each operation and available GPU.
Crucially, it can also identify patterns in the graph where multiple operations are good candidates for being fused into a single kernel.
This reduces the required amount of memory movement and the overhead of launching multiple GPU kernels.
TensorRT also compiles the graph of operations into a single CUDA Graph that can be launched all at once, further reducing kernel launch overhead.
What is a CUDA Graph?
CUDA graphs are a feature introduced in CUDA 10 to help accelerate applications with iterative workflows where the same series of GPU operations is repeated many times.
Graphs reduce the CPU overhead of launching GPU kernels, especially for short kernels where the launch overhead is significant compared to the kernel runtime.
A CUDA graph is a series of operations (kernel launches, memory copies, etc.) connected by dependencies. The graph is defined separately from its execution, enabling a "define once, run repeatedly" usage pattern.
Because definition is separated from execution, a graph can be defined once, instantiated, and then executed repeatedly with very low overhead.
Finally, graphs also enable additional optimizations by the CUDA runtime and driver since the entire workflow is visible ahead of time.
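The "define once, run repeatedly" pattern can be sketched with stream capture. This is a minimal illustration, not production code: the `scale` kernel and problem size are placeholders, and the 5-argument `cudaGraphInstantiate` signature is the pre-CUDA-12 form.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: scales a vector in place.
__global__ void scale(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Define once: record the stream work into a graph instead of executing it.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n, 2.0f);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once: validation and setup costs are paid here, not at launch.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Run repeatedly: each launch submits the whole captured workload
    // with a single, cheap CPU-side call.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

With only one short kernel inside, the benefit here is small; the savings grow with the number of operations captured into the graph.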
Use Cases and Benefits
The main use case is accelerating repetitive workflows built from short GPU kernels, where launch overhead is a significant portion of the end-to-end time
Reduces CPU overhead of scheduling and launching work, freeing up CPU cycles
Enables optimizations based on a global view of the workflow
Provides more modularity and clearer expression of workflow than stream-based code
Heterogeneous - graphs can include GPU kernels, memory copies, CPU callbacks
Supports multi-GPU synchronization and execution
NVIDIA Presentation on CUDA Graphs
Key Concepts
A CUDA graph consists of nodes representing GPU operations connected by edges representing their dependencies.
Graph nodes can be kernel launches, memory copies, CPU function calls, or sub-graphs.
A graph is defined separately from its execution. The graph definition specifies the operations and dependencies, but not when it executes.
An instance of the graph is created which can then be launched repeatedly with low overhead since all the setup work only needs to be done once.
CUDA graphs can be constructed explicitly using APIs like cudaGraphCreate, cudaGraphAddKernelNode, cudaGraphAddMemcpyNode, etc.
Graphs can also be implicitly captured from stream-based code using the cudaStreamBeginCapture and cudaStreamEndCapture APIs.
While stream-based code can be mapped to a graph, the two models are complementary and suit different use cases. Graphs excel at accelerating repetitive workflows.
Constructing and Executing CUDA Graphs
Define the graph's nodes and dependencies. This can be done:
Explicitly using APIs like cudaGraphAddKernelNode, specifying each node's type (kernel, memcpy, etc.), parameters, and dependencies, OR
Implicitly by surrounding existing stream-based code with cudaStreamBeginCapture/cudaStreamEndCapture to capture the operations into a graph
Instantiate the graph with cudaGraphInstantiate. This creates an executable graph instance and performs setup and optimizations.
Launch the instantiated graph repeatedly using cudaGraphLaunch. The same graph instance can be launched many times with low CPU overhead.
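The explicit construction path looks roughly like this. The `step` kernel is a placeholder, error checking is omitted, and the 5-argument `cudaGraphInstantiate` signature is the pre-CUDA-12 form:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel used as a single graph node.
__global__ void step(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // 1. Define the graph explicitly, node by node.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Kernel node parameters: function, launch configuration, arguments.
    void *args[] = { &d_x, (void *)&n };
    cudaKernelNodeParams kp = {};
    kp.func = (void *)step;
    kp.gridDim = dim3((n + 255) / 256);
    kp.blockDim = dim3(256);
    kp.sharedMemBytes = 0;
    kp.kernelParams = args;

    // No dependencies: this node is a root of the graph.
    cudaGraphNode_t kernelNode;
    cudaGraphAddKernelNode(&kernelNode, graph, nullptr, 0, &kp);

    // 2. Instantiate an executable graph; setup happens once here.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // 3. Launch the same instance repeatedly with low CPU overhead.
    for (int i = 0; i < 10; ++i)
        cudaGraphLaunch(exec, 0);
    cudaStreamSynchronize(0);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d_x);
    return 0;
}
```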
Graph nodes can represent:
Kernel launches - specify the kernel function, grid/block dimensions, shared memory, kernel parameters
Memory copies - e.g. device-to-device, host-to-device, etc.
Memory sets - initialize memory to a fixed value
Host (CPU) nodes - execute a CPU function, can be used for synchronization
Child graph nodes - nest one graph inside another
Empty nodes - no-ops, used for synchronization and organizing dependencies
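Several of the node types above can be combined in one graph. A sketch, with illustrative names and sizes: a memset node zeroes a buffer, a kernel node that depends on it does the work, and a host node that depends on the kernel runs a CPU callback.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(unsigned int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

// Host node callback: runs on a CPU thread once its dependencies complete.
void CUDART_CB report(void *userData) {
    std::printf("graph iteration done\n");
}

int main() {
    const int n = 256;
    unsigned int *d_buf;
    cudaMalloc(&d_buf, n * sizeof(unsigned int));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Memset node (root): initialize the buffer to zero.
    cudaMemsetParams mp = {};
    mp.dst = d_buf;
    mp.value = 0;
    mp.elementSize = sizeof(unsigned int);
    mp.width = n;
    mp.height = 1;
    cudaGraphNode_t memsetNode;
    cudaGraphAddMemsetNode(&memsetNode, graph, nullptr, 0, &mp);

    // Kernel node: depends on the memset completing first.
    void *args[] = { &d_buf, (void *)&n };
    cudaKernelNodeParams kp = {};
    kp.func = (void *)increment;
    kp.gridDim = dim3(1);
    kp.blockDim = dim3(n);
    kp.kernelParams = args;
    cudaGraphNode_t kernelNode;
    cudaGraphAddKernelNode(&kernelNode, graph, &memsetNode, 1, &kp);

    // Host node: depends on the kernel, runs the CPU callback.
    cudaHostNodeParams hp = {};
    hp.fn = report;
    hp.userData = nullptr;
    cudaGraphNode_t hostNode;
    cudaGraphAddHostNode(&hostNode, graph, &kernelNode, 1, &hp);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, 0);
    cudaStreamSynchronize(0);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d_buf);
    return 0;
}
```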
Interoperability with Graphics APIs
CUDA graphs are very useful for applications that combine graphics and compute work, such as:
Scientific visualization where CUDA computes vertex data that is rendered via OpenGL/DirectX/Vulkan
Image/video processing applications where frames are captured and rendered via graphics APIs but enhanced by CUDA
Generating procedural content in games/media with CUDA that is then rendered
CUDA 10 added support for sharing memory and synchronization primitives between CUDA and Vulkan/DX12
Enables direct sharing of memory allocations and semaphores between the APIs
Complements existing support for OpenGL and DX11
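The CUDA side of Vulkan sharing can be sketched as below. This assumes the file descriptors `fd_mem` and `fd_sem` were already exported from Vulkan (via `VK_KHR_external_memory_fd` and `VK_KHR_external_semaphore_fd`); the opaque-fd handle type is Linux-specific, and error handling is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical inputs, exported from the Vulkan side:
//   fd_mem - POSIX fd for a VkDeviceMemory allocation
//   fd_sem - POSIX fd for a VkSemaphore
//   size   - size of the shared allocation in bytes
void cudaSideInterop(int fd_mem, int fd_sem, size_t size, cudaStream_t stream) {
    // Import the Vulkan allocation as CUDA external memory.
    cudaExternalMemoryHandleDesc memDesc = {};
    memDesc.type = cudaExternalMemoryHandleTypeOpaqueFd;
    memDesc.handle.fd = fd_mem;
    memDesc.size = size;
    cudaExternalMemory_t extMem;
    cudaImportExternalMemory(&extMem, &memDesc);

    // Map a CUDA device pointer onto the shared allocation.
    cudaExternalMemoryBufferDesc bufDesc = {};
    bufDesc.offset = 0;
    bufDesc.size = size;
    void *d_ptr;
    cudaExternalMemoryGetMappedBuffer(&d_ptr, extMem, &bufDesc);

    // Import the Vulkan semaphore for cross-API synchronization.
    cudaExternalSemaphoreHandleDesc semDesc = {};
    semDesc.type = cudaExternalSemaphoreHandleTypeOpaqueFd;
    semDesc.handle.fd = fd_sem;
    cudaExternalSemaphore_t extSem;
    cudaImportExternalSemaphore(&extSem, &semDesc);

    // Wait until Vulkan signals, run CUDA work on the shared buffer,
    // then signal back so Vulkan can render the result.
    cudaExternalSemaphoreWaitParams waitParams = {};
    cudaWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);
    // ... launch CUDA kernels operating on d_ptr here ...
    cudaExternalSemaphoreSignalParams sigParams = {};
    cudaSignalExternalSemaphoresAsync(&extSem, &sigParams, 1, stream);
}
```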