Copyright Continuum Labs - 2023

Persistence

To keep the TensorRT-LLM library and any installed models available across container runs, you need to set up persistent storage for the container or save the container's state as a new image.

This involves a few key steps:

Using Docker Volumes for Persistence

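When you run a Docker container and install software or make changes within it, those changes live in the container's writable layer. They survive a stop and restart, but they are lost once the container is removed. With the --rm flag the container is removed automatically as soon as it exits, so nothing is kept unless you've set up a persistent storage solution.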

Docker volumes are the preferred way to persist data in Docker containers.

  • Mount a Docker Volume: When you run your container, mount a volume to a specific path inside the container. This volume will store all the data you want to persist, such as the TensorRT-LLM source tree, the built wheel, and models.

Example command:

docker run --rm -it \
           --gpus=all \
           --volume my_tensorrt_llm_data:/path/in/container \
           tensorrt_llm/devel:latest

Replace my_tensorrt_llm_data with your volume name and /path/in/container with the path where you want to persist data inside the container.
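
As a minimal sketch of the substitution described above, the snippet below composes the run command from three variables and prints it for review without executing anything. The mount path /workspace/persist is a hypothetical choice, not a path the image requires; the volume and image names are the ones used in the example command.

```shell
# Hypothetical values; substitute your own volume name, mount path, and image.
VOLUME_NAME="${VOLUME_NAME:-my_tensorrt_llm_data}"
MOUNT_PATH="${MOUNT_PATH:-/workspace/persist}"   # assumed path inside the container
IMAGE="${IMAGE:-tensorrt_llm/devel:latest}"

# Compose the docker run command with the named volume mounted at MOUNT_PATH.
run_cmd() {
  echo "docker run --rm -it --gpus=all --volume ${VOLUME_NAME}:${MOUNT_PATH} ${IMAGE}"
}

# Print the command for review; nothing is executed here.
run_cmd
```

Docker creates a named volume automatically the first time it is mounted. After the first run, docker volume ls and docker volume inspect my_tensorrt_llm_data on the host should show the volume and where its data is stored.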

Building and Installing TensorRT-LLM Inside the Container

Once you have your development container running with the mounted volume, proceed to build and install TensorRT-LLM.

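  • Build TensorRT-LLM: Use the provided build_wheel.py script to compile TensorRT-LLM from source. Run it from a source checkout located in the directory that is mounted to your Docker volume, so that the build output, including the wheel, is written to the volume.

python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

  • Install TensorRT-LLM: After building, install the wheel using pip. The wheel file itself stays on the volume, but by default pip installs the package into the container's Python environment, which is not part of the volume. In a new container started from the original image, repeat this step using the wheel that is already built.

pip install ./build/tensorrt_llm*.whl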

Persisting the Environment

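Every time you want to work with TensorRT-LLM, make sure to run the container with the volume attached. The built wheel and any models stored on the volume will then be available in subsequent sessions. In a container started from the original image, reinstall the wheel with pip before use, since the installed package lives in the container's filesystem rather than on the volume.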
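
One way to make each new session usable quickly is a small startup check, sketched below under the assumption that the wheel was built into a directory on the mounted volume (./build relative to the source checkout by default). The function names find_wheel and ensure_tensorrt_llm and the WHEEL_DIR variable are hypothetical helpers, not part of TensorRT-LLM.

```shell
# Assumption: the wheel was built under a directory that lives on the mounted volume.
WHEEL_DIR="${WHEEL_DIR:-./build}"

# Print the path of the first TensorRT-LLM wheel found in a directory.
# Returns 1 and prints nothing if no wheel is present.
find_wheel() {
  for f in "$1"/tensorrt_llm*.whl; do
    [ -e "$f" ] && { echo "$f"; return 0; }
  done
  return 1
}

# Install the persisted wheel only when the package is not already importable.
ensure_tensorrt_llm() {
  if python3 -c "import tensorrt_llm" 2>/dev/null; then
    echo "tensorrt_llm is already installed"
    return 0
  fi
  wheel=$(find_wheel "$WHEEL_DIR") || {
    echo "no wheel in $WHEEL_DIR; run build_wheel.py first" >&2
    return 1
  }
  pip install "$wheel"
}
```

Calling ensure_tensorrt_llm from the source directory inside the container skips the lengthy build and only reinstalls the wheel when the package is missing.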

Committing Changes to a New Docker Image (Optional)

Alternatively, you can commit the changes made in your container to a new Docker image. This way, the environment with the installed TensorRT-LLM is saved in a new image, and you can use this image directly in the future.

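  • Commit the Container to a New Image: After installing TensorRT-LLM in the container, leave the container running (a container started with --rm is deleted as soon as it exits). Open a new terminal on the host, find the container's ID with docker ps, and use the docker commit command to create a new image from the container's current state. Data stored in mounted volumes is not included in the committed image.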

docker commit [CONTAINER_ID] my_tensorrt_llm_image:latest
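
The commit step can be scripted so the container ID does not have to be copied by hand. The sketch below assumes the development container was started from tensorrt_llm/devel:latest and is still running; the function name commit_running_container is a hypothetical helper. It takes the first container that docker ps lists for that image.

```shell
BASE_IMAGE="${BASE_IMAGE:-tensorrt_llm/devel:latest}"
NEW_IMAGE="${NEW_IMAGE:-my_tensorrt_llm_image:latest}"

# Commit the first running container started from BASE_IMAGE to NEW_IMAGE.
# Returns 1 if no such container is found.
commit_running_container() {
  # List running containers whose image is BASE_IMAGE and keep the first ID.
  cid=$(docker ps --filter "ancestor=${BASE_IMAGE}" --format '{{.ID}}' | head -n 1)
  if [ -z "$cid" ]; then
    echo "no running container from ${BASE_IMAGE}" >&2
    return 1
  fi
  docker commit "$cid" "$NEW_IMAGE"
}
```

Run commit_running_container on the host while the container is still open in the other terminal. Afterwards, docker run --rm -it --gpus=all my_tensorrt_llm_image:latest starts a container from the saved image.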

Using the New Image or Mounted Volume

For future use, you can either run a container from the new image you created or keep using the original image with the volume mounted.

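The committed image retains everything installed in the container's filesystem, including the TensorRT-LLM library. The volume retains whatever was written under its mount path, such as the built wheel and models.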

By following these steps, you will create a Docker environment where the TensorRT-LLM library and its models persist between container runs, ensuring that your setup is saved and reusable.
