NVCC: The NVIDIA CUDA Compiler

NVCC, which stands for NVIDIA CUDA Compiler, is a proprietary compiler by NVIDIA that compiles CUDA C/C++ code for execution on CUDA-enabled GPUs (Graphics Processing Units).

NVCC acts as a compiler driver, controlling the compilation flow and linking process, while delegating the actual code generation to other tools like the host compiler and the CUDA backend compiler.

CUDA Programming Model

CUDA follows a heterogeneous programming model in which the host code runs on the CPU and the device code, also known as kernels, runs on the GPU.

The host code is responsible for memory allocation on the device, data transfer between host and device, and launching kernels on the GPU. Kernels are C++ functions marked with the __global__ keyword, indicating that they are callable from the host and execute on the device.
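
A minimal sketch of this split might look like the following .cu file (the kernel name and buffer sizes are purely illustrative):

#include <cuda_runtime.h>
#include <cstdio>

// Device code: a kernel marked __global__, launched from the host and executed on the GPU
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    float *host = new float[n]();   // zero-initialised host buffer
    float *device = nullptr;

    // Host code: allocate device memory, copy data across, launch the kernel, copy results back
    cudaMalloc(&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(device, n);
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host[0] = %f\n", host[0]);
    cudaFree(device);
    delete[] host;
    return 0;
}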

NVCC Workflow

NVCC processes CUDA source files (typically with a .cu extension) and separates the device code from the host code.

It then compiles the device code using the CUDA backend compiler, which generates a PTX (Parallel Thread Execution) assembly file or a cubin (CUDA binary) object file.

The host code is modified to include the necessary CUDA runtime function calls and is then passed to a standard C++ compiler for compilation.
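
These intermediate forms can also be produced directly from the command line, which is useful for inspecting what NVCC generates (the file names and the sm_80 target are illustrative):

nvcc -ptx kernel.cu -o kernel.ptx                    # device code as PTX assembly
nvcc -cubin -arch=sm_80 kernel.cu -o kernel.cubin    # device code as a cubin for one architecture
nvcc --keep kernel.cu -o kernel                      # full build, keeping all intermediate files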

Supported Host Compilers

NVCC relies on a host compiler for preprocessing, parsing, and code generation of the host code.

It supports various host compilers such as GCC, Clang, and Microsoft Visual C++ (MSVC) on different platforms. The specific host compiler used can be specified using the -ccbin option followed by the path to the compiler executable.
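
For example, to compile against a specific host compiler (the paths below are illustrative and platform-dependent):

nvcc -ccbin /usr/bin/g++-12 -c kernel.cu -o kernel.o
nvcc -ccbin clang++ -c kernel.cu -o kernel.o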

CUDA Compilation Trajectory

The CUDA compilation trajectory involves several stages:

  1. Preprocessing: The CUDA source files are preprocessed to handle includes, macros, and conditional compilation.

  2. Compilation:

    • Device code is compiled to PTX assembly or cubin object files.

    • Host code is modified and compiled using the host compiler.

  3. Linking:

    • Device object files are linked together using nvlink.

    • The resulting device code is embedded into the host object files.

    • Host object files are linked using the host linker to create an executable.
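
The individual stages can be inspected without running them by asking NVCC for a dry run (the output is verbose and varies by CUDA toolkit version):

nvcc --dryrun kernel.cu -o kernel    # print every sub-command of the trajectory without executing it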

NVCC Compiler Options

NVCC provides a wide range of compiler options to control the compilation process.

Some key options include:

  • -gpu-architecture (-arch): Specifies the target GPU architecture (e.g., compute_80 for NVIDIA Ampere).

  • -gpu-code (-code): Specifies the target GPU code (e.g., sm_80 for NVIDIA Ampere).

  • -rdc: Enables relocatable device code, allowing separate compilation and linking of device code.

  • -dc: Compiles each input file into an object file containing relocatable device code (shorthand for --relocatable-device-code=true --compile), ready for a later device-link step.

  • -Xcompiler: Passes options directly to the host compiler.

  • -Xlinker: Passes options directly to the host linker.
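
Putting a few of these together in one invocation (the source file name is illustrative):

nvcc -arch=compute_80 -code=sm_80 \
     -Xcompiler -Wall \
     kernel.cu -o app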

Separate Compilation and Linking

NVCC supports separate compilation and linking of device code.

This allows device code to be split across multiple files and linked together using nvlink.

To enable separate compilation, the -rdc option is used to generate relocatable device code.

The compiled objects can then be linked using nvlink, and the resulting device code is embedded into the host executable.
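
A minimal sketch of this workflow, with two device-code translation units (file names illustrative):

# Compile each unit to an object file containing relocatable device code
nvcc -dc -arch=sm_80 a.cu -o a.o
nvcc -dc -arch=sm_80 b.cu -o b.o

# Device-link (via nvlink) and host-link in one step, producing the final executable
nvcc -arch=sm_80 a.o b.o -o app

# Alternatively, perform the device-link step explicitly before handing off to the host linker
nvcc -dlink -arch=sm_80 a.o b.o -o device_link.o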

Optimisations

NVCC provides various optimisation options to improve the performance of CUDA code. Some notable options include:

  • -O3: Enables aggressive optimisations.

  • -ftz: Flushes denormal values to zero.

  • -prec-div: Controls the precision of division operations.

  • -use_fast_math: Enables fast math optimisations.
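
For example (which of these are safe depends on the numerical accuracy the application can tolerate):

nvcc -O3 -use_fast_math kernel.cu -o app
nvcc -O3 -ftz=true -prec-div=false kernel.cu -o app   # finer-grained control than -use_fast_math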

Code Generation

NVCC generates device code in two forms: PTX assembly and cubin object files.

PTX is a low-level virtual machine and instruction set architecture that provides a stable interface for CUDA code across different GPU architectures.

PTX code is just-in-time compiled to binary code by the CUDA driver when the application is loaded, allowing for portability and forward compatibility.

Cubin, on the other hand, is a pre-compiled binary format specific to a particular GPU architecture.

Virtual Architectures and Just-in-Time Compilation

To enable forward compatibility and optimisation for specific GPU architectures, NVCC introduces the concept of virtual architectures.

Virtual architectures (compute_) define a set of features and capabilities that are common across a range of physical architectures (sm_).

NVCC compiles device code to a virtual architecture, which is then compiled to binary code for a specific physical architecture at runtime through Just-in-Time (JIT) compilation.

This allows CUDA applications to run on newer GPU architectures without recompilation.
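
A common pattern is therefore to embed binary code for the architectures targeted today, plus PTX for a recent virtual architecture so that future GPUs can JIT-compile it (the architectures shown are examples):

nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     kernel.cu -o app

The first two -gencode clauses embed cubins for Volta and Ampere; the last embeds compute_80 PTX that newer architectures can JIT-compile at load time.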

Debugging and Profiling

NVCC provides options for debugging and profiling CUDA code.

The -g option enables host-side debugging symbols, and the -G option generates debug information for device code, allowing source-level debugging using tools like cuda-gdb.

The -lineinfo option generates line number information for device code, enabling profiling and performance analysis using tools such as NVIDIA Nsight Systems and Nsight Compute (the successors to the NVIDIA Visual Profiler).
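
Typical invocations (note that -G disables most device-code optimisations and should not be used for performance measurement):

nvcc -g -G kernel.cu -o app_debug     # host (-g) and device (-G) debug symbols for cuda-gdb
nvcc -O3 -lineinfo kernel.cu -o app   # source line info for profilers, without disabling optimisation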

Conclusion

NVCC is a powerful compiler that simplifies the process of compiling and linking CUDA C/C++ code for execution on NVIDIA GPUs.

It handles the intricate details of separating device code from host code, compiling device code to PTX or cubin, and linking everything together into a final executable.

With its wide range of compiler options, optimisations, and support for separate compilation and linking, NVCC provides developers with the tools necessary to write efficient and high-performance CUDA applications.
