CUDA Architecture

From Stephen Jones, one of the architects of CUDA

This page is based on a talk by Stephen Jones, one of the architects of CUDA, about the CUDA programming model for GPU computing.

The main theme of the talk is explaining why CUDA is designed the way it is, and how its design is fundamentally shaped by the laws of physics and the need to maximise performance and efficiency on the GPU hardware.

Key points and advice from the talk

  1. Memory bandwidth is the primary limiting factor for GPU performance, not the raw computational power (FLOPS). Efficiently utilising memory bandwidth is crucial.

  2. Memory access patterns have a huge impact on performance due to the physics of how DRAM works (capacitance, row/column access). Random access can reduce effective bandwidth by up to 92% compared to linear access. Strive for coalesced, linear memory access patterns.

  3. The CUDA programming model with grids, blocks, and threads is designed to enable efficient memory access. Threads within a warp (32 threads) should access adjacent memory locations for optimal performance, and blocks should have at least 128 threads to maximise memory throughput (a short sketch after this list illustrates these access patterns).

  4. Resource utilisation and occupancy on the Streaming Multiprocessors (SMs) are the second biggest factor for performance after memory throughput. The key limiters are threads per SM, registers per thread, and shared memory per block. Carefully tune these to maximise occupancy (the number of active threads).

  5. Aim to keep the GPU oversubscribed with work by exposing concurrency and independence between operations. Use CUDA streams to express dependencies and allow the hardware to optimise scheduling and resource utilisation.

  6. Focus on memory layout, SM occupancy, and exposing concurrency as the top priorities for optimisation. Getting these right is far more impactful than micro-optimising code.

  7. Achieving 50% of peak GPU performance is still a massive speedup for most applications. Perfect optimisation is not necessary.
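
To make the coalescing and block-size advice concrete, here is a minimal CUDA sketch. The kernel and buffer names are illustrative, not from the talk: one kernel copies with a coalesced pattern, where the 32 threads of a warp touch 32 consecutive floats, and the other uses a strided pattern that wastes most of each memory transaction.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so a 32-thread warp reads
// 32 consecutive floats and each memory transaction is fully used.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads are 'stride' floats apart, so a warp's loads
// scatter across many memory segments and effective bandwidth collapses.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride)
{
    long long i = (long long)(blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    const int block = 128;                      // at least 128 threads per block
    const int grid  = (n + block - 1) / block;
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided  <<<grid, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```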

Here are some additional insights and key takeaways about CUDA and how to approach using it effectively:

Mindset

Approach CUDA with the understanding that its design is fundamentally shaped by the physical realities and constraints of the GPU hardware.

Embrace the need to work with the hardware, not against it. Have a performance-oriented mindset but balance it with the pragmatic realisation that achieving a significant fraction of peak theoretical performance is still a huge win.

Mental model

Build an accurate mental model of how the GPU works - the memory hierarchy, the streaming multiprocessors, warps, blocks, grids, etc.

Understand the key limiters (memory bandwidth, SM resources) and let that guide your optimisation efforts. Think in terms of parallel threads but be aware of warp execution.
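
As a concrete reference for that hierarchy, here is a minimal sketch (the kernel is purely illustrative) showing how a thread locates itself within the grid, its block, and its warp:

```cuda
#include <cstdio>

// Each thread can identify its position at every level of the hierarchy.
__global__ void whoami()
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x; // position in the grid
    int warp_id   = threadIdx.x / warpSize;                // warp within the block
    int lane_id   = threadIdx.x % warpSize;                // position within the warp (0-31)

    if (lane_id == 0)  // print one line per warp to keep the output readable
        printf("block %d, warp %d, first global thread %d\n",
               blockIdx.x, warp_id, global_id);
}

int main()
{
    whoami<<<4, 128>>>();   // 4 blocks of 128 threads = 16 warps
    cudaDeviceSynchronize();
    return 0;
}
```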

Memory is king

Treat memory layout and access patterns as the highest priority. Structure your data to enable linear, coalesced access by threads in a warp. Avoid random access. Be willing to revamp your data structures for this.
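
A common way to "revamp your data structures" for coalescing is to switch from an array of structures to a structure of arrays. The sketch below uses a hypothetical particle example: in the SoA version, consecutive threads touch consecutive floats, so each warp's accesses coalesce into full-width memory transactions.

```cuda
#include <cuda_runtime.h>

// Array of structures: x, y, z of one particle sit next to each other, so a
// warp's 32 loads of p[i].x are 12 bytes apart and span several transactions.
struct ParticleAoS { float x, y, z; };

__global__ void scale_aos(ParticleAoS* p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x *= s; p[i].y *= s; p[i].z *= s; }
}

// Structure of arrays: all x values are contiguous, so the warp's loads of
// x[i] are adjacent and coalesce.
struct ParticlesSoA { float *x, *y, *z; };

__global__ void scale_soa(ParticlesSoA p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] *= s; p.y[i] *= s; p.z[i] *= s; }
}

int main()
{
    const int n = 1 << 20;
    ParticleAoS* aos;
    ParticlesSoA soa;
    cudaMalloc(&aos,   n * sizeof(ParticleAoS));
    cudaMalloc(&soa.x, n * sizeof(float));
    cudaMalloc(&soa.y, n * sizeof(float));
    cudaMalloc(&soa.z, n * sizeof(float));

    const int block = 128, grid = (n + block - 1) / block;
    scale_aos<<<grid, block>>>(aos, 2.0f, n);
    scale_soa<<<grid, block>>>(soa, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(aos); cudaFree(soa.x); cudaFree(soa.y); cudaFree(soa.z);
    return 0;
}
```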

Pack those SMs

After optimising memory, focus on occupancy - maximising the number of active threads per SM. Balance usage of shared memory, registers and threads. Use occupancy as a key metric to gauge optimisation.
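
The CUDA runtime can report occupancy directly, which makes it easy to use as a tuning metric. Here is a minimal sketch, assuming a placeholder kernel, built on the runtime's cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxActiveBlocksPerMultiprocessor helpers:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data, int n)   // placeholder kernel to query
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    // Ask the runtime which block size maximises occupancy for this kernel.
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel,
                                       /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);

    // How many blocks of that size fit on one SM, given the kernel's
    // register and shared-memory usage?
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                  block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocks_per_sm * block_size) /
                      prop.maxThreadsPerMultiProcessor;

    printf("block size %d, %d blocks/SM, occupancy %.0f%%\n",
           block_size, blocks_per_sm, occupancy * 100.0f);
    return 0;
}
```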

Oversubscribe

Don't starve the GPU. Give it lots of work by exposing concurrency and independence in your application. Use streams judiciously. Let the hardware pick the optimal scheduling and packing.
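
Here is a minimal sketch of that idea (the kernel, chunk size and stream count are arbitrary): the work is split into chunks, and each chunk's host-to-device copy, kernel and device-to-host copy are issued on their own stream so the hardware is free to overlap transfers with compute.

```cuda
#include <cuda_runtime.h>

__global__ void process(float* d, int n)        // placeholder per-chunk kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int n_streams = 4;
    const int chunk     = 1 << 20;
    const int total     = n_streams * chunk;

    float *h, *d;
    cudaMallocHost(&h, total * sizeof(float));  // pinned memory enables async copies
    cudaMalloc(&d, total * sizeof(float));

    cudaStream_t streams[n_streams];
    for (int s = 0; s < n_streams; ++s)
        cudaStreamCreate(&streams[s]);

    // Each stream owns one chunk; operations in different streams are
    // independent, so copies and kernels can overlap.
    for (int s = 0; s < n_streams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d + offset, h + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 127) / 128, 128, 0, streams[s]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();

    for (int s = 0; s < n_streams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```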

Iterate and analyse

CUDA profiling tools are your friends. Measure, experiment, and tweak in an iterative fashion. Focus on the high-order bits first (memory, occupancy, concurrency) before micro-optimisations.

Some amazing CUDA facts

  1. A single modern GPU can provide memory bandwidth on the order of 1 TB/s - that's an astounding amount of data throughput!

  2. GPUs can manage and schedule thousands of threads concurrently across dozens of SMs. The scale of parallelism is mind-boggling.

  3. The CUDA warp size of 32 threads is not arbitrary - it is carefully chosen based on GPU hardware constraints such as memory bandwidth and cache line size.

  4. CUDA has enabled GPUs to be used for an incredible diversity of applications beyond graphics - scientific computing, AI/ML, computational finance, bioinformatics, and more. It has truly democratised high-performance parallel computing.

  5. CUDA has been evolving for over a decade in tight harmony with GPU hardware. The close coupling of hardware and software has enabled remarkable synergies and optimizations.

In summary, CUDA is a fascinating technology that showcases the art of the possible in parallel computing on specialised hardware.

Mastering it requires a performance-oriented mindset, a solid grasp of the underlying hardware, and a focus on the key optimisations: memory layout, occupancy and concurrency. The rewards, in terms of achievable performance, can be tremendous.
