Running with persistent volumes

The command provided for running the Docker container does not create a "persistent volume" in the Docker sense but mounts a host directory into the container at runtime.

The --volume (or -v) flag is used to mount a file or directory from the host into the container.

This allows for sharing files between the host and the container. Here's a breakdown of the docker run command and its implications for persistence:

docker run --rm -it \
           --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
           --volume ${PWD}:/code/tensorrt_llm \
           --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest
  • --rm: Automatically removes the container when it exits. This means that any data or changes made inside the container that are not in a volume or bind-mounted directory will be lost.

  • -it: Runs the container in interactive mode with a tty, allowing you to interact with the container via the command line.

  • --ipc=host, --ulimit memlock=-1, --ulimit stack=67108864, --gpus=all: --ipc=host shares the host's IPC namespace with the container (useful for shared-memory communication), --ulimit memlock=-1 removes the limit on locked memory, --ulimit stack=67108864 raises the stack size limit to 64 MB, and --gpus=all exposes all of the host's GPUs to the container (this requires the NVIDIA Container Toolkit). A quick way to verify GPU access is shown after this list.

  • --volume ${PWD}:/code/tensorrt_llm: Mounts the current working directory (${PWD}) on the host to /code/tensorrt_llm inside the container. This is where the persistence aspect comes into play, but it's based on a host directory, not a Docker-managed volume.

  • --workdir /code/tensorrt_llm: Sets the working directory inside the container to /code/tensorrt_llm.
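
As a quick sanity check (not part of the original instructions), you can confirm from inside a running container that the GPUs and the mounted directory are actually visible. This assumes the NVIDIA Container Toolkit is installed on the host and the container was started with the command above:

nvidia-smi                # should list every host GPU exposed via --gpus=all
ls /code/tensorrt_llm     # should show the contents of ${PWD} from the host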

Persistence Explained

The persistence of data using the --volume flag is tied to the lifecycle of the host directory being mounted. This means:

  • Within the Container: Any changes made inside the container to the contents of /code/tensorrt_llm are reflected in ${PWD} on the host, and vice versa. This allows a persistent development workflow in which changes made on the host are immediately visible and testable inside the container.

  • Across Container Sessions: Because the data lives on the host and is merely mounted into the container, it persists across multiple docker run invocations, as long as you mount the same host directory each time.
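
To make the cross-session behaviour concrete, here is a minimal sketch (not from the original guide) of data surviving between container runs. It reuses the tensorrt_llm/devel:latest image and the same bind mount as above; the GPU and ulimit flags are omitted because they are not needed to demonstrate persistence, and notes.txt is just an example file name:

# first run: create a file inside the mounted directory; the container exits and is removed (--rm)
docker run --rm \
           --volume ${PWD}:/code/tensorrt_llm --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest \
           bash -c "echo 'engine build notes' > notes.txt"

# on the host: the file survives, because it lives in ${PWD}, not in the deleted container
cat notes.txt

# second run: a brand-new container sees the same file through the same bind mount
docker run --rm \
           --volume ${PWD}:/code/tensorrt_llm --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest \
           cat notes.txt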

Docker Volumes vs. Bind Mounts

The approach described above uses a bind mount (mounting a host directory), not a Docker-managed volume.

Docker volumes, by contrast, are created and managed by Docker itself and are generally the more robust option for persisting data generated and used by containers.

They are stored in a part of the host filesystem that Docker controls (/var/lib/docker/volumes/ on Linux).

Because Docker manages them end to end, volumes abstract away the underlying host filesystem layout, providing a more portable and encapsulated way to handle data persistence than bind mounts.
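
If a Docker-managed volume is preferred, a sketch of the equivalent commands is shown below. The volume name trtllm_code is an arbitrary example, not something defined by the TensorRT-LLM project:

# create a named volume managed by Docker
docker volume create trtllm_code

# mount it at the same path inside the container
docker run --rm -it \
           --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
           --volume trtllm_code:/code/tensorrt_llm \
           --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest

# inspect or remove the volume later
docker volume inspect trtllm_code
docker volume rm trtllm_code

Note that, unlike the bind mount, a named volume does not expose your checked-out source tree: on first use it is populated from whatever the image already contains at /code/tensorrt_llm (or starts empty), so this pattern is better suited to build artifacts and caches than to live source editing.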
