NCCL

NVIDIA Collective Communication Library (NCCL) Documentation Notes

Introduction

  • NCCL (pronounced "Nickel") is a library of multi-GPU collective communication primitives.

  • It is designed to be topology-aware and easily integrated into applications.

  • NCCL focuses on accelerating collective communication primitives; it is not a full-blown parallel programming framework.

  • It removes the need for developers to optimize their applications for specific machines.

Supported Collectives

  • AllReduce: Reduces data from all processes and distributes the result back to all processes.

  • Broadcast: Broadcasts data from one process to all other processes.

  • Reduce: Reduces data from all processes to a single process.

  • AllGather: Gathers data from all processes and distributes the combined data to all processes.

  • ReduceScatter: Reduces data from all processes and scatters the result so that each process receives an equal-sized chunk (the corresponding C entry points are sketched after this list).
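
For reference, each of these collectives maps onto a C entry point in nccl.h. The prototypes below are paraphrased from the NCCL 2.x headers and may differ slightly between releases; every call is asynchronous and is enqueued on the supplied CUDA stream.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Reduce values across all ranks and leave the result on every rank. */
ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
                           ncclDataType_t datatype, ncclRedOp_t op,
                           ncclComm_t comm, cudaStream_t stream);

/* Copy 'count' elements from the root rank to all ranks. */
ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count,
                           ncclDataType_t datatype, int root,
                           ncclComm_t comm, cudaStream_t stream);

/* Reduce values across all ranks onto the root rank only. */
ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count,
                        ncclDataType_t datatype, ncclRedOp_t op, int root,
                        ncclComm_t comm, cudaStream_t stream);

/* Concatenate each rank's 'sendcount' elements on every rank. */
ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           ncclDataType_t datatype,
                           ncclComm_t comm, cudaStream_t stream);

/* Reduce across all ranks, then give each rank an equal 'recvcount' chunk. */
ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff,
                               size_t recvcount, ncclDataType_t datatype,
                               ncclRedOp_t op, ncclComm_t comm,
                               cudaStream_t stream);
```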

Key Features

  • Implements collectives in a single kernel, handling both communication and computation operations.

  • Allows for fast synchronization and minimizes resources needed to reach peak bandwidth.

  • Provides fast collectives over multiple GPUs both within and across nodes.

  • Supports various interconnect technologies: PCIe, NVLink, InfiniBand Verbs, and IP sockets.

  • Automatically adapts its communication strategy to match the system's underlying GPU interconnect topology.

Ease of Use

  • Simple C API that can be easily accessed from various programming languages.

  • Closely follows the popular collectives API defined by MPI (Message Passing Interface).

  • Uses a "stream" argument for direct integration with the CUDA programming model (see the stream-ordering sketch after this list).

  • Compatible with virtually any multi-GPU parallelization model (single-threaded, multi-threaded, multi-process).
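
A minimal sketch of the stream argument in practice, assuming a device buffer and an already-initialised communicator (both names below are hypothetical): the collective is ordered on the same CUDA stream as the rest of the GPU work, so the host only synchronises when it actually needs the result.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* 'grads' is a device buffer of 'count' floats; 'comm' is an initialised
   NCCL communicator for this GPU (see the Usage section below). */
void allreduce_on_stream(float* grads, size_t count,
                         ncclComm_t comm, cudaStream_t stream)
{
    /* Runs after any kernels already enqueued on 'stream'; work enqueued
       afterwards on the same stream waits for the collective to finish. */
    ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);

    /* Block the host only when the reduced values are needed. */
    cudaStreamSynchronize(stream);
}
```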

Applications

  • Widely used in deep learning frameworks for efficient scaling of neural network training.

  • Heavily utilized for the AllReduce collective, which synchronizes gradients in data-parallel neural network training.

Prerequisites

  • Software Requirements:

    • glibc 2.17 or higher

    • CUDA 10.0 or higher

  • Hardware Requirements:

    • Supports all CUDA devices with a compute capability of 3.5 and higher.

Installation

  • Requires registration for the NVIDIA Developer Program.

  • Available for download from the NVIDIA NCCL home page.

  • Installation process varies based on the Linux distribution (Ubuntu, RHEL/CentOS, or other distributions).

  • Detailed installation steps are provided for each supported distribution.

Usage

  • Similar to using any other library in your code.

  • Requires linking to the NCCL library and including the nccl.h header file.

  • Involves creating a communicator and using the NCCL API for collective operations (a minimal end-to-end sketch follows this list).

  • Refer to the NCCL API documentation for guidance on maximizing performance.
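
A minimal single-process, multi-GPU sketch of that flow, assuming the convenience initialiser ncclCommInitAll (all local devices joined into one communicator). Error handling is omitted and the buffer size is arbitrary; build by linking against -lnccl and the CUDA runtime.

```c
#include <nccl.h>
#include <cuda_runtime.h>

#define MAX_DEV 8

int main(void)
{
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev > MAX_DEV) nDev = MAX_DEV;

    ncclComm_t comms[MAX_DEV];
    cudaStream_t streams[MAX_DEV];
    float* buf[MAX_DEV];
    const size_t count = 1 << 20;             /* 1M floats per GPU */

    /* One communicator per local device, created in a single call
       (a NULL device list means devices 0..nDev-1). */
    ncclCommInitAll(comms, nDev, NULL);

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
    }

    /* Sum the buffers across all GPUs; every GPU ends up with the result.
       Group calls are needed because one thread drives several devices. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```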

Migration from NCCL 1.x to NCCL 2.x

  • APIs have changed slightly between NCCL 1.x and NCCL 2.x.

  • NCCL 2.x supports all collectives from NCCL 1.x with slight modifications to the API.

  • NCCL 2.x requires the use of the Group API when a single thread manages NCCL calls for multiple GPUs (illustrated after this list).

  • Other changes affect initialization, communicator usage, element counts, in-place operation, the order of AllGather arguments, datatypes, and error codes.
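
A sketch of the Group API requirement called out above: in NCCL 2.x, when one host thread manages several GPUs, per-device calls are wrapped in ncclGroupStart()/ncclGroupEnd() so they are submitted as a unit rather than each call blocking while it waits for ranks the thread has not reached yet. The helper function below is hypothetical; the NCCL calls are the documented 2.x API.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* One thread owning 'nDev' local GPUs, which form ranks 0..nDev-1 of a
   communicator identified by 'id' (obtained earlier via ncclGetUniqueId). */
void init_and_allreduce(int nDev, ncclUniqueId id, ncclComm_t* comms,
                        float** buf, size_t count, cudaStream_t* streams)
{
    /* Communicator creation for multiple ranks from a single thread must
       be grouped in NCCL 2.x; otherwise each ncclCommInitRank would block
       waiting for ranks this thread has not created yet. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        ncclCommInitRank(&comms[i], nDev, id, i);
    }
    ncclGroupEnd();

    /* The same rule applies to collectives issued for several devices
       from one thread. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();
}
```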

Troubleshooting and Support

  • Register for the NVIDIA Developer Program to report bugs, issues, and make feature requests.

  • Refer to the NCCL open source documentation for additional support.

NCCL is a powerful library that simplifies and accelerates collective communication operations across multiple GPUs.

It abstracts away the complexities of optimising for specific hardware topologies and provides a simple API for efficient multi-GPU communication.

With its wide adoption in deep learning frameworks and support for various interconnect technologies, NCCL has become a crucial component in scaling neural network training on multi-GPU systems.
