TensorRT-LLM Dockerfile

This is a multi-stage Dockerfile that builds and packages the TensorRT-LLM library.

Let's break it down section by section:

Base image and environment setup

ARG BASE_IMAGE=nvcr.io/nvidia/pytorch
ARG BASE_TAG=24.02-py3
ARG DEVEL_IMAGE=devel

FROM ${BASE_IMAGE}:${BASE_TAG} as base

ENV BASH_ENV=${BASH_ENV:-/etc/bash.bashrc}
ENV ENV=${ENV:-/etc/shinit_v2}
SHELL ["/bin/bash", "-c"]

This section selects the NVIDIA PyTorch container from NGC as the base image and pins its tag, with both exposed as build arguments so they can be overridden at build time.

It also sets environment variables pointing at the bash configuration and shell-initialization files, and the SHELL instruction makes Bash the shell used for subsequent RUN instructions.
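Both values can be overridden without editing the Dockerfile. A minimal sketch (the tag shown is the file's own default, used here purely for illustration):

docker build \
    --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch \
    --build-arg BASE_TAG=24.02-py3 \
    --target devel .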

Development stage

FROM base as devel

COPY docker/common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh

COPY docker/common/install_cmake.sh install_cmake.sh
RUN bash ./install_cmake.sh && rm install_cmake.sh

COPY docker/common/install_ccache.sh install_ccache.sh
RUN bash ./install_ccache.sh && rm install_ccache.sh

This stage builds on the base image and installs the core build dependencies, CMake, and ccache.

Each installation script is copied into the image, executed with a RUN instruction, and deleted in the same step so the image stays small.
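This pattern matters because every RUN instruction produces a new image layer: deleting the script in the same RUN that executes it keeps the file out of the layer entirely, whereas deleting it in a later RUN would only hide a file already baked into an earlier layer. A sketch of the contrast:

# Script never persists in any layer:
RUN bash ./install_base.sh && rm install_base.sh

# Script remains in the first layer even though a later step "removes" it:
RUN bash ./install_base.sh
RUN rm install_base.sh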

TensorRT installation

ARG TRT_VER
ARG CUDA_VER
ARG CUDNN_VER
ARG NCCL_VER
ARG CUBLAS_VER

COPY docker/common/install_tensorrt.sh install_tensorrt.sh
RUN bash ./install_tensorrt.sh \
    --TRT_VER=${TRT_VER} \
    --CUDA_VER=${CUDA_VER} \
    --CUDNN_VER=${CUDNN_VER} \
    --NCCL_VER=${NCCL_VER} \
    --CUBLAS_VER=${CUBLAS_VER} && \
    rm install_tensorrt.sh

This section installs TensorRT using the provided installation script.

The script receives the versions of TensorRT, CUDA, cuDNN, NCCL, and cuBLAS as flags; each is declared as an ARG so it can be supplied as a build argument to the Dockerfile.
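In the upstream repository these versions are normally supplied by the Makefile that drives the Docker build, but they can also be passed by hand. A sketch; the version numbers below are placeholders, not a validated combination:

docker build \
    --build-arg TRT_VER=9.2.0.5 \
    --build-arg CUDA_VER=12.3 \
    --build-arg CUDNN_VER=8.9.7 \
    --build-arg NCCL_VER=2.19 \
    --build-arg CUBLAS_VER=12.3 \
    --target devel .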

Additional dependencies

COPY docker/common/install_polygraphy.sh install_polygraphy.sh
RUN bash ./install_polygraphy.sh && rm install_polygraphy.sh

COPY docker/common/install_mpi4py.sh install_mpi4py.sh
RUN bash ./install_mpi4py.sh && rm install_mpi4py.sh

This section installs additional dependencies, namely Polygraphy and mpi4py, using their respective installation scripts.
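A quick sanity check that both packages are importable in the resulting image (an illustrative command, not part of the Dockerfile; substitute your own image tag):

docker run --rm <image> python3 -c "import polygraphy, mpi4py; print('deps ok')"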

PyTorch installation

ARG TORCH_INSTALL_TYPE="skip"
COPY docker/common/install_pytorch.sh install_pytorch.sh
RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh

This section installs PyTorch using the provided installation script. The TORCH_INSTALL_TYPE argument selects the installation mode; the default, "skip", leaves the PyTorch build that already ships with the NGC base image untouched.
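To force a different mode, override the argument at build time. A sketch; the value shown is one of the modes accepted by install_pytorch.sh in recent checkouts (a source build using the C++11 ABI), but verify it against your version of the script:

docker build --build-arg TORCH_INSTALL_TYPE=src_cxx11_abi --target devel .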

Wheel building stage

FROM ${DEVEL_IMAGE} as wheel

WORKDIR /src/tensorrt_llm

COPY benchmarks benchmarks
COPY cpp cpp
COPY scripts scripts
COPY tensorrt_llm tensorrt_llm
COPY 3rdparty 3rdparty
COPY setup.py requirements.txt requirements-dev.txt ./

ARG BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt --python_bindings --benchmarks"
RUN python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}

This stage builds the TensorRT-LLM wheel package.

It starts from the development image and sets the working directory to /src/tensorrt_llm.

The necessary source files and directories are copied into the image.

The build_wheel.py script is run with the specified build arguments to create the wheel package.
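BUILD_WHEEL_ARGS can be overridden the same way, for example to limit compilation to a single GPU architecture and shorten build times. A sketch; build_wheel.py accepts a --cuda_architectures flag in recent releases, but check the exact spelling and values against your checkout:

docker build \
    --build-arg BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt --python_bindings --benchmarks --cuda_architectures 90-real" \
    --target wheel .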

Release stage

FROM ${DEVEL_IMAGE} as release

WORKDIR /app/tensorrt_llm

COPY --from=wheel /src/tensorrt_llm/build/tensorrt_llm*.whl .
RUN pip install tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com && \
    rm tensorrt_llm*.whl

COPY README.md ./
COPY docs docs
COPY cpp/include include

RUN ln -sv $(python3 -c 'import site; print(f"{site.getsitepackages()[0]}/tensorrt_llm/libs")') lib && \
    test -f lib/libnvinfer_plugin_tensorrt_llm.so && \
    ln -sv lib/libnvinfer_plugin_tensorrt_llm.so lib/libnvinfer_plugin_tensorrt_llm.so.9 && \
    echo "/app/tensorrt_llm/lib" > /etc/ld.so.conf.d/tensorrt_llm.conf && \
    ldconfig

ARG SRC_DIR=/src/tensorrt_llm
COPY --from=wheel ${SRC_DIR}/benchmarks benchmarks

ARG CPP_BUILD_DIR=${SRC_DIR}/cpp/build
COPY --from=wheel \
    ${CPP_BUILD_DIR}/benchmarks/bertBenchmark \
    ${CPP_BUILD_DIR}/benchmarks/gptManagerBenchmark \
    ${CPP_BUILD_DIR}/benchmarks/gptSessionBenchmark \
    benchmarks/cpp/

COPY examples examples
RUN chmod -R a+w examples && \
    rm -v \
    benchmarks/cpp/bertBenchmark.cpp \
    benchmarks/cpp/gptManagerBenchmark.cpp \
    benchmarks/cpp/gptSessionBenchmark.cpp \
    benchmarks/cpp/CMakeLists.txt

ARG GIT_COMMIT
ARG TRT_LLM_VER
ENV TRT_LLM_GIT_COMMIT=${GIT_COMMIT} \
    TRT_LLM_VERSION=${TRT_LLM_VER}

The release stage starts from the development image and sets the working directory to /app/tensorrt_llm.

It copies the built wheel package from the wheel stage, installs it with pip (using pypi.nvidia.com as an extra index for NVIDIA-hosted dependencies), and then deletes the wheel file.

The README, documentation, and include files are copied into the image.

The RUN instruction symlinks the installed package's libs directory to /app/tensorrt_llm/lib, checks that the TensorRT-LLM plugin library is present, adds a versioned alias for it, registers the directory in /etc/ld.so.conf.d, and runs ldconfig so the dynamic linker can find the libraries.
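One way to confirm the linker now resolves the plugin library inside a running container (an illustrative check, not part of the Dockerfile):

ldconfig -p | grep libnvinfer_plugin_tensorrt_llm
# expected output: entries pointing into /app/tensorrt_llm/lib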

The benchmark binaries and examples are copied from the wheel stage into the release image. The C++ benchmark sources and their CMakeLists.txt, redundant now that the compiled binaries are in place, are removed, and the examples directory is made writable.

Finally, the GIT_COMMIT and TRT_LLM_VER build arguments populate the TRT_LLM_GIT_COMMIT and TRT_LLM_VERSION environment variables, recording the build's provenance in the image.

This multi-stage Dockerfile allows for efficient building and packaging of the TensorRT-LLM library, separating the development dependencies from the final release image.
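Putting it all together, a release image might be built and smoke-tested as follows. This is a sketch that assumes the Dockerfile lives at docker/Dockerfile.multi, as in the upstream repository; adjust the path and image tag for your setup:

docker build -f docker/Dockerfile.multi --target release \
    --build-arg GIT_COMMIT=$(git rev-parse HEAD) \
    -t tensorrt_llm:release .

docker run --gpus all --rm tensorrt_llm:release \
    python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"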
