
Stream Multiprocessors: The Heart of GPU Computing

Graphics Processing Units (GPUs) have emerged as a powerful tool for accelerating a wide range of applications, from gaming and computer graphics to scientific research and artificial intelligence.

At the heart of a GPU's parallel processing capabilities lie Stream Multiprocessors (SMs), which are the individual processing units responsible for executing tasks concurrently.

SMs are designed to handle thousands of small threads simultaneously, making GPUs highly efficient at parallel processing tasks.

Each SM contains multiple processing cores, such as Arithmetic Logic Units (ALUs) and Floating-Point Units (FPUs), which are optimised for mathematical and arithmetic operations. These cores work together to execute instructions in parallel, enabling GPUs to achieve high computational throughput.

One of the key architectural features of SMs is their Single Instruction, Multiple Thread (SIMT) execution model.

In SIMT, a single instruction is executed across multiple threads simultaneously, with each thread operating on different data.

This allows for efficient parallel execution of identical operations on large datasets.

Threads are grouped into "warps," typically consisting of 32 threads, which execute in lockstep, sharing the same program counter and executing the same instruction at the same time. This enables efficient utilisation of SM resources.
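
To make the SIMT model concrete, here is a minimal CUDA sketch (not taken from the TensorRT-LLM sources; the kernel name and sizes are illustrative). Every thread executes the same scaling instruction, but each operates on its own array element, and a block of 256 threads is scheduled on an SM as 8 warps of 32 threads.

```cuda
// Minimal SIMT sketch: one instruction stream, many threads, each on its own data.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(const float* in, float* out, float factor, int n)
{
    // blockIdx/threadIdx give each thread its own element while all threads
    // in a warp execute the same instruction in lockstep.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * factor;   // same instruction, different data per thread
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // 256 threads per block = 8 warps of 32 threads, distributed across the SMs.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The launch configuration in `<<<blocks, threads>>>` is what determines how the work is split into warps and scheduled onto the available SMs.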

SMs also handle resource allocation and management for threads. They have their own set of registers, shared memory, and cache to store data and intermediate results. The shared memory allows threads within a warp to communicate and collaborate efficiently, while the cache hierarchy helps in reducing memory access latency. SMs manage the scheduling and execution of warps, ensuring optimal utilization of processing resources.
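
As an illustration of the on-chip shared memory mentioned above (again a hedged sketch rather than anything from the TensorRT-LLM codebase), the block-level reduction below loads a tile of the input into `__shared__` memory and lets the threads of a block co-operate on it, synchronising with `__syncthreads()`.

```cuda
// Sketch of per-block shared memory use: each block reduces its tile on-chip.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void block_sum(const float* in, float* block_results, int n)
{
    __shared__ float tile[256];            // on-chip shared memory, visible to the block

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    tile[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                       // all threads must finish loading first

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_results[blockIdx.x] = tile[0];  // one partial sum per block
}

int main()
{
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, threads>>>(in, partial, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int i = 0; i < blocks; ++i) total += partial[i];
    printf("sum = %f (expected %d)\n", total, n);

    cudaFree(in);
    cudaFree(partial);
    return 0;
}
```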

Modern GPUs often include specialised processing units within SMs to further enhance their computational capabilities. For example, Tensor Cores are designed to accelerate deep learning workloads, while RT Cores enable real-time ray tracing for advanced graphics rendering. These specialised cores complement the general-purpose CUDA cores found in SMs, allowing GPUs to excel in specific domains.
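
Tensor Cores are exposed to CUDA C++ through the warp matrix (WMMA) API. The sketch below is purely illustrative and assumes a GPU of compute capability 7.0 or newer, compiled with an appropriate architecture flag (for example `-arch=sm_70`): a single warp multiplies two 16x16 half-precision matrices, and the multiply-accumulate itself runs on a Tensor Core.

```cuda
// Illustrative WMMA sketch: one warp drives a 16x16x16 Tensor Core matrix multiply.
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_16x16x16(const half* a, const half* b, float* c)
{
    // Fragments are per-warp register tiles that the Tensor Core operates on.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);                 // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);        // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

int main()
{
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    cudaMallocManaged(&b, 16 * 16 * sizeof(half));
    cudaMallocManaged(&c, 16 * 16 * sizeof(float));
    for (int i = 0; i < 16 * 16; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }

    wmma_16x16x16<<<1, 32>>>(a, b, c);   // exactly one warp of 32 threads
    cudaDeviceSynchronize();

    printf("c[0] = %f (expected 16.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```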

The scalability of SMs is another key aspect of GPU architecture.

The number of SMs in a GPU can vary depending on the specific model and intended use case. High-end GPUs designed for demanding workloads typically feature a larger number of SMs, providing greater parallel processing power. This scalability allows GPUs to handle increasingly complex and computationally intensive tasks.
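
The SM count of a particular device can be read at runtime through the standard CUDA runtime API; the short program below (an illustrative sketch) prints it alongside the compute capability and the per-SM thread limit.

```cuda
// Sketch: query the number of SMs and related limits of the current device.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    printf("Device            : %s\n", prop.name);
    printf("SM count          : %d\n", prop.multiProcessorCount);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```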

To harness the power of SMs, developers rely on compute APIs such as CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language).

These APIs provide programming models and frameworks that enable developers to write parallel code and leverage the parallel processing capabilities of GPUs. By efficiently mapping algorithms and data structures to the SM architecture, developers can achieve significant performance gains compared to traditional CPU-based implementations.
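
As a sketch of what mapping a data structure onto the SM architecture looks like in CUDA (the matrix sizes and kernel name here are illustrative assumptions), the example below tiles a 2-D matrix over a grid of 16x16 thread blocks, so that each block, and therefore each SM, works on an independent tile.

```cuda
// Sketch: map a 2-D matrix onto the grid/block hierarchy of the CUDA programming model.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matrix_add(const float* a, const float* b, float* c,
                           int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        int i = row * cols + col;          // row-major layout
        c[i] = a[i] + b[i];
    }
}

int main()
{
    const int rows = 1024, cols = 1024;
    const size_t bytes = size_t(rows) * cols * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < rows * cols; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // 16x16 = 256 threads per block (8 warps); the grid covers the whole matrix.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    matrix_add<<<grid, block>>>(a, b, c, rows, cols);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```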

However, achieving optimal performance on SMs requires careful consideration of various factors.

Developers need to optimise thread organisation, memory access patterns, and resource utilisation to maximise the efficiency of SM execution.

Techniques such as coalesced memory accesses, minimising branch divergence, and ensuring high occupancy (the ratio of active threads to the maximum possible threads) are crucial for extracting maximum performance from SMs.
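
Occupancy can be estimated directly with the CUDA occupancy API. The sketch below (illustrative, with an arbitrary placeholder kernel) asks how many 256-thread blocks of the kernel can be resident on one SM and converts that into an occupancy ratio; the kernel's access pattern is also coalesced, since consecutive threads touch consecutive elements.

```cuda
// Sketch: estimate occupancy for a kernel at a chosen block size.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_inplace(float* data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;   // coalesced: adjacent threads access adjacent floats
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int block_size = 256;
    int max_blocks_per_sm = 0;
    // How many blocks of this kernel can be resident on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                  scale_inplace, block_size, 0);

    int active_threads = max_blocks_per_sm * block_size;
    float occupancy = float(active_threads) / prop.maxThreadsPerMultiProcessor;

    printf("Blocks per SM : %d\n", max_blocks_per_sm);
    printf("Occupancy     : %.2f (active threads / max threads per SM)\n", occupancy);
    return 0;
}
```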

In conclusion, Stream Multiprocessors are the powerhouses behind the parallel processing capabilities of GPUs.

Their ability to execute thousands of threads concurrently, coupled with their specialised processing units and efficient resource management, makes them well-suited for a wide range of parallel computing tasks.

As GPUs continue to evolve and incorporate more advanced SM architectures, they will undoubtedly play a crucial role in pushing the boundaries of high-performance computing, enabling breakthroughs in fields such as scientific simulations, machine learning, and beyond.

Summary Points

Parallel processing

SMs are designed to execute thousands of small threads concurrently, making GPUs highly efficient at parallel processing tasks. This parallelism is achieved through the use of multiple processing cores within each SM.

SIMT architecture

SMs employ a Single Instruction, Multiple Thread (SIMT) architecture. In SIMT, a single instruction is executed across multiple threads simultaneously, with each thread operating on different data. This allows for efficient parallel execution of identical operations on large datasets.

Warp execution

Threads are grouped into "warps," which are the basic units of execution in SMs. Warps typically consist of 32 threads that execute in lockstep, meaning they share the same program counter and execute the same instruction at the same time. This enables efficient utilisation of SM resources.

Resource management

SMs handle resource allocation and management for threads. They have their own set of registers, shared memory, and cache to store data and intermediate results. SMs also manage the scheduling and execution of warps, ensuring efficient utilisation of processing resources.

Specialised cores

In addition to general-purpose CUDA cores, modern SMs often include specialised processing units such as Tensor Cores for accelerating deep learning workloads and RT Cores for real-time ray tracing. These specialised cores further enhance the computational capabilities of GPUs for specific domains.

Scalability

The number of SMs in a GPU can vary depending on the specific GPU model and architecture. High-end GPUs tend to have a larger number of SMs, allowing for greater parallel processing power. The scalability of SMs enables GPUs to handle increasingly complex and demanding workloads.

Compute APIs

SMs support various compute APIs, such as CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language). These APIs provide programming models and frameworks that allow developers to write parallel code and leverage the parallel processing capabilities of GPUs.

Performance optimisation

To achieve optimal performance on SMs, developers need to consider factors such as thread organisation, memory access patterns, and efficient utilisation of SM resources. Techniques like coalesced memory accesses, minimising branch divergence, and maximising occupancy can help in extracting maximum performance from SMs.
