TensorRT-LLM Tutorial
Introduction
TensorRT-LLM is an open-source library developed by NVIDIA that accelerates and optimizes inference performance for large language models (LLMs) on NVIDIA GPUs.
It incorporates various optimization techniques and provides a user-friendly Python API for defining and building new models.
In this tutorial, we will walk through the steps to get started with TensorRT-LLM, including installation, model compilation, local execution, and deployment with NVIDIA Triton Inference Server.
Prerequisites
Docker installed on your system
NVIDIA GPU with CUDA support
Access to the NVIDIA NGC catalog (for Triton Inference Server)
Step 1: Retrieving the Model Weights
Before using TensorRT-LLM, you need to obtain the trained model weights. You can either use your own weights or download pretrained weights from repositories such as the Hugging Face Hub. In this tutorial, we will use the weights for the 7B-parameter Llama 2 model.
Note: Agree to the terms and authenticate with Hugging Face to download the necessary files.
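For example, assuming you have Git LFS and the huggingface-cli tool available, one way to fetch the weights is sketched below; the repository name and local directory are assumptions chosen to match the paths used later in this tutorial.
# Log in with a Hugging Face access token that has accepted the Llama 2 license
huggingface-cli login
# Clone the chat-tuned 7B checkpoint into the directory used in the later steps
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf ./llama-2-7b-chat-hf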
Step 2: Installing TensorRT-LLM
Launch a Docker container and install the TensorRT-LLM library using the following commands:
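The snippet below is a minimal sketch: the NGC container image tag is an assumption, and TensorRT-LLM is installed from NVIDIA's PyPI index as in the project's public install instructions.
# Start an interactive container with GPU access, mounting the current directory
docker run --rm -it --gpus all --ipc=host -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/pytorch:24.02-py3
# Inside the container, install the TensorRT-LLM wheel from NVIDIA's package index
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
# Quick sanity check
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"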
Step 3: Compiling the Model
The next step is to compile the model into a TensorRT engine using the model weights and the TensorRT-LLM Python API.
The API provides predefined model architectures, including the Llama model definition.
First, run the convert_checkpoint.py script:

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16

--model_dir ./llama-2-7b-chat-hf: specifies the directory path where the pre-trained model checkpoint is located.
--output_dir ./tllm_checkpoint_1gpu_bf16: specifies the directory path where the converted checkpoint will be saved. Here the directory name indicates that the converted checkpoint is intended for a single GPU using the bfloat16 data type.
--dtype bfloat16: specifies the data type to be used for the converted checkpoint. bfloat16 is a 16-bit floating-point format that provides a balance between precision and performance; using it can reduce memory usage and improve inference speed compared to the default 32-bit floating-point format (float32).
When you run this command, the convert_checkpoint.py script will:
Load the pre-trained LLaMA model checkpoint from the specified --model_dir.
Convert the model weights and other necessary components to a format compatible with TensorRT-LLM.
Apply any specified optimizations, such as using the bfloat16 data type.
Save the converted checkpoint to the specified --output_dir.
After the conversion process is complete, you will have a new checkpoint in the ./tllm_checkpoint_1gpu_bf16 directory that is ready to be loaded and used with TensorRT-LLM.
Next, build the TensorRT engine from the converted checkpoint with the trtllm-build tool:

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 --output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16

--checkpoint_dir ./tllm_checkpoint_1gpu_bf16: specifies the directory path where the converted checkpoint (produced by convert_checkpoint.py) is located.
--output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu: specifies the directory path where the built TensorRT engine will be saved.
--gpt_attention_plugin bfloat16: specifies the data type to be used for the GPT attention plugin in the TensorRT engine; here it is set to bfloat16.
--gemm_plugin bfloat16: specifies the data type to be used for the GEMM (General Matrix Multiplication) plugin in the TensorRT engine; here it is also set to bfloat16.
When you run this command, the trtllm-build tool will:
Load the converted checkpoint from the specified --checkpoint_dir.
Build and optimize the TensorRT engine using the specified plugins and data types.
Save the built TensorRT engine to the specified --output_dir.
After the build process is complete, you will have a TensorRT engine file in the ./tmp/llama/7B/trt_engines/bf16/1-gpu directory that can be used for efficient inference with TensorRT-LLM.
For more detailed information about the trtllm-build command and its available options, refer to the documentation or run it with the --help flag.
The compilation process optimizes the model graph, fuses operations into efficient kernels, and generates a compiled engine file.
Output
The output analyzed below is from the execution of the trtllm-build command-line tool. The command builds a TensorRT engine for the LLaMA-7B model using bfloat16 precision on a single GPU. Let's analyze the output in detail:
Command and Version
The trtllm-build command is executed with specific arguments to build the engine. The TensorRT-LLM version is displayed as "0.10.0.dev2024041600".
Plugin Configuration
The output shows the configuration of various plugins used by TensorRT-LLM.
The gpt_attention_plugin and gemm_plugin are set to use bfloat16 precision. Other plugins, such as bert_attention_plugin, nccl_plugin, and moe_plugin, are set to their default values.
Warnings
Several warnings are displayed regarding the mismatch of data types between inputs of certain layers (e.g., IElementWiseLayer).
The warnings indicate that the first input has type BFloat16, while the second input has type Float.
These warnings are related to the internal workings of TensorRT and the model architecture.
Memory Usage
The output tracks the memory usage changes during the engine building process.
It shows the memory usage on both the CPU and GPU at different stages of the process.
The memory usage is reported in MiB (mebibytes).
Engine Building
The TensorRT engine named "Unnamed Network 0" is being built.
Warnings about unused inputs (e.g., "position_ids") are displayed.
The global timing cache is in use, and profiling results will be stored.
The approximate region cut reduction algorithm is called for graph reduction.
The number of input and output tensors in the network is detected.
Memory Allocation
The total host persistent memory, device persistent memory, and scratch memory are reported.
The block assignment algorithm (ShiftNTopDown) is used to assign block shifts to nodes.
The total activation memory and weights memory are displayed.
Engine Generation
The engine generation process is completed, and the time taken is reported (11.9708 seconds in this case).
Peak memory usage of TensorRT CPU and GPU memory allocators is shown.
Serialization
The generated engine is serialized, and the serialization process is timed.
The code generator cache, compilation cache, and timing cache entries are serialized.
The timing cache is saved to a file named "model.cache".
The serialized engine is saved to "./tmp/llama/7B/trt_engines/bf16/1-gpu/rank0.engine".
Total Time
The total time taken for building all engines is reported (00:00:36 in this case).
The output provides detailed information about the engine building process, including the configuration of plugins, memory usage, warnings encountered, and the time taken for different stages of the process.
It helps in understanding the progress and performance of the TensorRT-LLM engine building for the LLaMA-7B model using bfloat16 precision on a single GPU.
Step 4: Running the Model Locally
To execute the compiled model locally, use the TensorRT-LLM C++ runtime, which handles token sampling, KV cache management, and request batching.
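For example, the TensorRT-LLM repository ships an examples/run.py script that wraps the runtime; a minimal sketch of invoking it against the engine built above is shown below (the script location and flags follow the public examples and may differ between releases).
# Generate text from the compiled engine using the example runner
python3 examples/run.py \
    --engine_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu \
    --tokenizer_dir ./llama-2-7b-chat-hf \
    --max_output_len 100 \
    --input_text "What is TensorRT-LLM?"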
Step 5: Deploying with Triton Inference Server
For production-ready deployment, you can use NVIDIA Triton Inference Server with the TensorRT-LLM backend. First, create a model repository with the necessary artifacts:
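One way to assemble the repository is to start from the templates shipped with the tensorrtllm_backend project; the sketch below follows that repository's inflight-batching layout, and the directory names are assumptions to verify against the version you use.
# Clone the Triton TensorRT-LLM backend, which ships template model configurations
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
# Copy the inflight-batching ensemble templates into a fresh model repository
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm triton_model_repo
# Place the compiled engine where the tensorrt_llm model expects to find it
cp ./tmp/llama/7B/trt_engines/bf16/1-gpu/* triton_model_repo/tensorrt_llm/1/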
Modify the configuration files to specify the compiled model engine, tokenizer, and memory allocation settings:
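The backend repository includes a fill_template.py helper for filling in the config.pbtxt files; the parameter names below are assumptions based on its templates and should be checked against the version you are using.
# Point the tensorrt_llm model at the compiled engine and set batching/KV-cache options
python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    engine_dir:./tmp/llama/7B/trt_engines/bf16/1-gpu,triton_max_batch_size:8,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9
# Point the tokenizer used for pre- and post-processing at the Hugging Face checkpoint
python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:./llama-2-7b-chat-hf,triton_max_batch_size:8
python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:./llama-2-7b-chat-hf,triton_max_batch_size:8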
Launch the Triton Inference Server container and start the server:
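A minimal sketch, assuming the Triton image with the TensorRT-LLM backend from NGC and the launch script shipped in tensorrtllm_backend (the image tag is an assumption):
# Start a Triton container with GPU access and the working directory mounted
docker run --rm -it --gpus all --net host -v $(pwd):/workspace -w /workspace \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
# Inside the container, launch Triton with a single GPU worker
python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size 1 --model_repo /workspace/triton_model_repo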
Step 6: Sending Requests
To interact with the running Triton Inference Server, you can use the client libraries or send HTTP requests to the generate endpoint.
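For example, a request to the ensemble model's generate endpoint over HTTP can be sketched as follows (the endpoint and field names follow the tensorrtllm_backend examples and are assumptions to verify against your deployment):
# Send a prompt to the ensemble model on Triton's default HTTP port
curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'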
Conclusion
In this tutorial, we covered the steps to get started with TensorRT-LLM, including installation, model compilation, local execution, and deployment using NVIDIA Triton Inference Server.
TensorRT-LLM provides a powerful toolkit for optimizing and deploying LLMs efficiently on NVIDIA GPUs. By leveraging its capabilities, you can harness the potential of these models and build innovative AI-driven applications.
For more information and resources, refer to the following:
TensorRT-LLM GitHub repository: github.com/NVIDIA/TensorRT-LLM
NVIDIA NeMo: End-to-end framework for generative AI deployments
TensorRT-LLM documentation and sample code on GitHub