NCCL

NVIDIA Collective Communication Library (NCCL) Documentation Notes

Introduction

  • NCCL (pronounced "Nickel") is a library of multi-GPU collective communication primitives.

  • It is designed to be topology-aware and easily integrated into applications.

  • NCCL focuses on accelerating collective communication primitives; it is not a full-blown parallel programming framework.

  • It removes the need for developers to optimize their applications for specific machines.

Supported Collectives

  • AllReduce: Reduces data from all processes and distributes the result back to all processes.

  • Broadcast: Broadcasts data from one process to all other processes.

  • Reduce: Reduces data from all processes to a single process.

  • AllGather: Gathers data from all processes and distributes the combined data to all processes.

  • ReduceScatter: Reduces data from all processes and scatters the result so that each process receives an equal-sized chunk (the corresponding C entry points are sketched after this list).
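
For reference, each of these collectives maps onto a C entry point in nccl.h. The prototypes below are paraphrased from the NCCL 2.x headers and may differ slightly between releases; every call is asynchronous and is enqueued on the supplied CUDA stream.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Reduce values across all ranks and leave the result on every rank. */
ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
                           ncclDataType_t datatype, ncclRedOp_t op,
                           ncclComm_t comm, cudaStream_t stream);

/* Copy 'count' elements from the root rank to all ranks. */
ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count,
                           ncclDataType_t datatype, int root,
                           ncclComm_t comm, cudaStream_t stream);

/* Reduce values across all ranks onto the root rank only. */
ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count,
                        ncclDataType_t datatype, ncclRedOp_t op, int root,
                        ncclComm_t comm, cudaStream_t stream);

/* Concatenate each rank's 'sendcount' elements on every rank. */
ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           ncclDataType_t datatype,
                           ncclComm_t comm, cudaStream_t stream);

/* Reduce across all ranks, then give each rank an equal 'recvcount' chunk. */
ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff,
                               size_t recvcount, ncclDataType_t datatype,
                               ncclRedOp_t op, ncclComm_t comm,
                               cudaStream_t stream);
```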

Key Features

  • Implements collectives in a single kernel, handling both communication and computation operations.

  • Allows for fast synchronization and minimizes resources needed to reach peak bandwidth.

  • Provides fast collectives over multiple GPUs both within and across nodes.

  • Supports various interconnect technologies: PCIe, NVLink, InfiniBand Verbs, and IP sockets.

  • Automatically adapts its communication strategy to match the system's underlying GPU interconnect topology.

Ease of Use

  • Simple C API that can be easily accessed from various programming languages.

  • Closely follows the popular collectives API defined by MPI (Message Passing Interface).

  • Uses a "stream" argument for direct integration with the CUDA programming model (see the stream-ordering sketch after this list).

  • Compatible with virtually any multi-GPU parallelization model (single-threaded, multi-threaded, multi-process).
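
A minimal sketch of the stream argument in practice, assuming a device buffer and an already-initialised communicator (both names below are hypothetical): the collective is ordered on the same CUDA stream as the rest of the GPU work, so the host only synchronises when it actually needs the result.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* 'grads' is a device buffer of 'count' floats; 'comm' is an initialised
   NCCL communicator for this GPU (see the Usage section below). */
void allreduce_on_stream(float* grads, size_t count,
                         ncclComm_t comm, cudaStream_t stream)
{
    /* Runs after any kernels already enqueued on 'stream'; work enqueued
       afterwards on the same stream waits for the collective to finish. */
    ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);

    /* Block the host only when the reduced values are needed. */
    cudaStreamSynchronize(stream);
}
```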

Applications

  • Widely used in deep learning frameworks for efficient scaling of neural network training.

  • Heavily utilized for the AllReduce collective, which synchronizes gradients in data-parallel neural network training.

Prerequisites

  • Software Requirements:

    • glibc 2.17 or higher

    • CUDA 10.0 or higher

  • Hardware Requirements:

    • Supports all CUDA devices with a compute capability of 3.5 and higher.

Installation

  • Requires registration for the NVIDIA Developer Program.

  • Available for download from the NVIDIA NCCL home page.

  • Installation process varies based on the Linux distribution (Ubuntu, RHEL/CentOS, or other distributions).

  • Detailed installation steps are provided for each supported distribution.

Usage

  • Similar to using any other library in your code.

  • Requires linking to the NCCL library and including the nccl.h header file.

  • Involves creating a communicator and using the NCCL API for collective operations (a minimal end-to-end sketch follows this list).

  • Refer to the NCCL API documentation for guidance on maximizing performance.
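
A minimal single-process, multi-GPU sketch of that flow, assuming the convenience initialiser ncclCommInitAll (all local devices joined into one communicator). Error handling is omitted and the buffer size is arbitrary; build by linking against -lnccl and the CUDA runtime.

```c
#include <nccl.h>
#include <cuda_runtime.h>

#define MAX_DEV 8

int main(void)
{
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev > MAX_DEV) nDev = MAX_DEV;

    ncclComm_t comms[MAX_DEV];
    cudaStream_t streams[MAX_DEV];
    float* buf[MAX_DEV];
    const size_t count = 1 << 20;             /* 1M floats per GPU */

    /* One communicator per local device, created in a single call
       (a NULL device list means devices 0..nDev-1). */
    ncclCommInitAll(comms, nDev, NULL);

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
    }

    /* Sum the buffers across all GPUs; every GPU ends up with the result.
       Group calls are needed because one thread drives several devices. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```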

Migration from NCCL 1.x to NCCL 2.x

  • APIs have changed slightly between NCCL 1.x and NCCL 2.x.

  • NCCL 2.x supports all collectives from NCCL 1.x with slight modifications to the API.

  • NCCL 2.x requires the use of the Group API when a single thread manages NCCL calls for multiple GPUs (illustrated after this list).

  • Other changes affect initialization, communicator usage, element counts, in-place operation, the order of AllGather arguments, datatypes, and error codes.
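
A sketch of the Group API requirement called out above: in NCCL 2.x, when one host thread manages several GPUs, per-device calls are wrapped in ncclGroupStart()/ncclGroupEnd() so they are submitted as a unit rather than each call blocking while it waits for ranks the thread has not reached yet. The helper function below is hypothetical; the NCCL calls are the documented 2.x API.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* One thread owning 'nDev' local GPUs, which form ranks 0..nDev-1 of a
   communicator identified by 'id' (obtained earlier via ncclGetUniqueId). */
void init_and_allreduce(int nDev, ncclUniqueId id, ncclComm_t* comms,
                        float** buf, size_t count, cudaStream_t* streams)
{
    /* Communicator creation for multiple ranks from a single thread must
       be grouped in NCCL 2.x; otherwise each ncclCommInitRank would block
       waiting for ranks this thread has not created yet. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        ncclCommInitRank(&comms[i], nDev, id, i);
    }
    ncclGroupEnd();

    /* The same rule applies to collectives issued for several devices
       from one thread. */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();
}
```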

Troubleshooting and Support

  • Register for the NVIDIA Developer Program to report bugs, issues, and make feature requests.

  • Refer to the NCCL open source documentation for additional support.

NCCL is a powerful library that simplifies and accelerates collective communication operations across multiple GPUs.

It abstracts away the complexities of optimising for specific hardware topologies and provides a simple API for efficient multi-GPU communication.

With its wide adoption in deep learning frameworks and support for various interconnect technologies, NCCL has become a crucial component in scaling neural network training on multi-GPU systems.
