Building TensorRT-LLM

Building TensorRT-LLM from source is recommended for users needing optimal performance, debugging capabilities, or compatibility with the GNU C++11 ABI (Application Binary Interface).

The GNU C++11 ABI (Application Binary Interface)

What is an ABI?

An Application Binary Interface (ABI) is a set of rules and conventions that define how different components of a program interact with each other at the binary level.

It specifies the low-level details of how functions are called, how parameters are passed, how data structures are laid out in memory, and how the program interacts with the operating system.

The ABI is crucial for ensuring compatibility between different parts of a program, such as the application code, libraries, and the operating system. It allows compiled object code to be linked together and executed correctly, even if the components were compiled separately or with different compilers.

Some key aspects of an ABI include:

Calling conventions: The rules for how functions are called, including how parameters are passed (e.g., through registers or the stack) and how return values are handled.

Data type representation: The size, alignment, and layout of data types in memory, such as integers, floating-point numbers, and structures.

Name mangling: The scheme used to encode function and variable names in the compiled binary to avoid naming conflicts between different modules (illustrated in the example below).

Object file format: The format used for storing compiled object code, such as ELF (Executable and Linkable Format) on Linux and PE (Portable Executable) on Windows.
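As an illustration of name mangling and object files, the hypothetical snippet below compiles a small C++ function with GCC on Linux and inspects the resulting symbol with nm and c++filt; the mangled name shown in the comment follows the Itanium C++ ABI encoding used on that platform.

    echo 'namespace demo { int add(int a, double b) { return a + (int)b; } }' > demo.cpp
    g++ -c demo.cpp -o demo.o
    nm demo.o              # lists a mangled symbol such as _ZN4demo3addEid
    nm demo.o | c++filt    # demangles it back to demo::add(int, double)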

ABIs are specific to a particular architecture, operating system, and programming language.

For example, the ABI for C++ on Linux x86-64 is different from the ABI for C++ on Windows x86-64 or the ABI for C on Linux x86-64.

When an ABI change occurs, such as when a new version of a compiler or library introduces incompatible changes, it can cause issues with existing compiled code.

To minimise disruption, compiler and library authors often provide dual ABI support, allowing users to choose between the old and new ABIs during a transition period.

Tools and guidelines are also provided to help manage the transition and ensure compatibility between different components.
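For the GNU C++ standard library, this dual ABI is controlled by the _GLIBCXX_USE_CXX11_ABI macro. As a minimal sketch (the file name is illustrative), code that has to link against a library built with the pre-C++11 ABI can be compiled like this:

    # Select the old (pre-C++11) libstdc++ ABI for std::string and std::list:
    g++ -D_GLIBCXX_USE_CXX11_ABI=0 -c my_extension.cpp -o my_extension.o
    # Omit the flag (or set it to 1) to build against the default C++11 ABI.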

Understanding ABIs is essential for developers working on low-level software components, libraries, or cross-language interoperability. It helps ensure the correct integration and execution of compiled code across different parts of a program.

There are two options for building TensorRT-LLM

Build TensorRT-LLM in One Step

This option uses a single Make command to create a Docker image with TensorRT-LLM built inside it.

You can optionally specify the CUDA architectures to target; restricting the build to only the GPU architectures you need helps reduce compilation time.

Once the image is built, you can run the Docker container using another Make command.
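As a sketch, assuming the Makefile targets shipped in the docker/ directory of the TensorRT-LLM repository (target names and the CUDA_ARCHS syntax may differ between releases):

    # Build a Docker image with TensorRT-LLM compiled inside it:
    make -C docker release_build
    # Optionally restrict the target CUDA architectures to shorten compilation,
    # for example Ampere (SM 80) and Hopper (SM 90) only:
    make -C docker release_build CUDA_ARCHS="80-real;90-real"
    # Run a container based on the freshly built image:
    make -C docker release_run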

Build Step-by-Step

This option is more flexible: you create a development container and then build TensorRT-LLM yourself inside it.

The process involves creating a Docker image for development, running the container, and then building TensorRT-LLM inside the container using a Python script (build_wheel.py).

The script supports various options, such as incremental builds, cleaning the build directory, and restricting the compilation to specific CUDA architectures.

The build_wheel.py script also compiles the library containing the C++ runtime of TensorRT-LLM.
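A sketch of the step-by-step flow under the same assumptions (the build_wheel.py options shown, such as --clean and --cuda_architectures, exist in the script but the exact set may vary by version):

    # Create the development image and start a development container:
    make -C docker build
    make -C docker run
    # Inside the container, build the TensorRT-LLM wheel and the C++ runtime library:
    python3 ./scripts/build_wheel.py --clean --cuda_architectures "80-real;90-real"
    # Re-running the script without --clean performs an incremental build.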
