TensorRT-LLM Tutorial

Introduction

TensorRT-LLM is an open-source library developed by NVIDIA that accelerates and optimizes inference performance for large language models (LLMs) on NVIDIA GPUs.

It incorporates various optimization techniques and provides a user-friendly Python API for defining and building new models.

In this tutorial, we will walk through the steps to get started with TensorRT-LLM, including installation, model compilation, local execution, and deployment with NVIDIA Triton Inference Server.

Prerequisites

  • Docker installed on your system

  • NVIDIA GPU with CUDA support

  • Access to the NVIDIA NGC catalog (for Triton Inference Server)

Step 1: Retrieving the Model Weights

Before using TensorRT-LLM, you need to obtain the trained model weights. You can either use your own weights or download pretrained weights from repositories like the Hugging Face Hub. In this tutorial, we will use the weights for the Llama 2 7B chat model.

git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Note: Agree to the terms and authenticate with Hugging Face to download the necessary files.
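
If you prefer not to use git-lfs, the same files can be fetched with the huggingface_hub Python package. The following is a minimal sketch; it assumes you have already accepted the licence and logged in with a valid Hugging Face token.

# Sketch: download the Llama 2 7B chat weights with huggingface_hub
# (assumes the licence has been accepted and `huggingface-cli login` has been run).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="./llama-2-7b-chat-hf",
)
print(f"Model files downloaded to {local_dir}")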

Step 2: Installing TensorRT-LLM

Launch a Docker container and install the TensorRT-LLM library using the following commands:

docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

python3 -c "import tensorrt_llm"
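
To confirm the wheel installed correctly, you can also run a slightly longer sanity check. This sketch assumes the standard __version__ attribute and the PyTorch dependency pulled in by tensorrt_llm.

# Sanity check: report the installed TensorRT-LLM version and whether CUDA is visible.
import tensorrt_llm
import torch  # installed as a dependency of tensorrt_llm

print("TensorRT-LLM version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())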

Step 3: Compiling the Model

The next step is to compile the model into a TensorRT engine using the model weights and the TensorRT-LLM Python API.

The API provides predefined model architectures, including the Llama model definition.

huggingface-cli login --token *****
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf \
                              --output_dir ./tllm_checkpoint_1gpu_bf16 \
                              --dtype bfloat16

Let's break down the arguments passed to the convert_checkpoint.py script:

--model_dir ./llama-2-7b-chat-hf

This argument specifies the directory path where the pre-trained model checkpoint is located.

--output_dir ./tllm_checkpoint_1gpu_bf16

This argument specifies the directory path where the converted checkpoint will be saved. In this case, the output directory is ./tllm_checkpoint_1gpu_bf16, indicating that the converted checkpoint is intended for a single GPU and uses the bfloat16 data type.

--dtype bfloat16

This argument specifies the data type to be used for the converted checkpoint. In this case, the data type is set to bfloat16, a 16-bit floating-point format that balances precision and performance. Using bfloat16 can reduce memory usage and improve inference speed compared to the default 32-bit floating-point format (float32).
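
To make the memory saving concrete, here is a rough back-of-the-envelope calculation for the weights alone (a sketch; 7 billion is an approximation of the model's parameter count):

# Approximate weight memory for a ~7B-parameter model at different precisions.
params = 7e9          # approximate parameter count
gib = 1024 ** 3       # bytes per GiB

print(f"float32 : {params * 4 / gib:.1f} GiB")   # roughly 26 GiB of weights
print(f"bfloat16: {params * 2 / gib:.1f} GiB")   # roughly 13 GiB of weights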

When you run this command, the convert_checkpoint.py script will:

  1. Load the pre-trained LLaMA model checkpoint from the specified --model_dir.

  2. Convert the model weights and other necessary components to a format compatible with TensorRT-LLM.

  3. Apply any specified optimizations, such as using the bfloat16 data type.

  4. Save the converted checkpoint to the specified --output_dir.

After the conversion process is complete, you will have a new checkpoint in the ./tllm_checkpoint_1gpu_bf16 directory that is ready to be loaded and used with TensorRT-LLM.
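
You can quickly inspect the output directory to confirm the conversion succeeded. This is a minimal sketch; it assumes the checkpoint layout of a config.json plus per-rank weight files written by convert_checkpoint.py.

# Sketch: list the converted checkpoint and print the data type recorded in its config.
# Assumes convert_checkpoint.py wrote a config.json alongside per-rank weight files.
import json
from pathlib import Path

ckpt_dir = Path("./tllm_checkpoint_1gpu_bf16")

for f in sorted(ckpt_dir.iterdir()):
    print(f"{f.name}  ({f.stat().st_size / 1024**2:.1f} MiB)")

config = json.loads((ckpt_dir / "config.json").read_text())
print("dtype:", config.get("dtype"))

With the checkpoint in place, build the TensorRT engine with trtllm-build: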

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

--checkpoint_dir ./tllm_checkpoint_1gpu_bf16

This specifies the directory path where the converted checkpoint (obtained from running convert_checkpoint.py) is located.

--output_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu

This specifies the directory path where the built TensorRT engine will be saved.

--gpt_attention_plugin bfloat16

This argument specifies the data type to be used for the GPT attention plugin in the TensorRT engine. In this case, it is set to bfloat16.

--gemm_plugin bfloat16

This argument specifies the data type to be used for the GEMM (General Matrix Multiplication) plugin in the TensorRT engine. In this case, it is also set to bfloat16.

When you run this command, the trtllm-build tool will:

  1. Load the converted checkpoint from the specified --checkpoint_dir.

  2. Build and optimize the TensorRT engine using the specified plugins and data types.

  3. Save the built TensorRT engine to the specified --output_dir.

After the build process is complete, you will have a TensorRT engine file in the ./tmp/llama/7B/trt_engines/bf16/1-gpu directory that can be used for efficient inference with TensorRT-LLM.

For more detailed information about the trtllm-build command and its available options, refer to the documentation or run the command with the --help flag.

The compilation process optimises the model graph, fuses operations into efficient kernels, and generates a compiled engine file.
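
Once the build finishes, a quick check that the engine file exists can save a confusing debugging session later. This sketch assumes the single-GPU engine is written as rank0.engine, matching the serialization step described in the build output below.

# Sketch: verify the built engine exists and report its size.
# Assumes a single-GPU build that produces rank0.engine in the output directory.
from pathlib import Path

engine_dir = Path("./tmp/llama/7B/trt_engines/bf16/1-gpu")
engine_file = engine_dir / "rank0.engine"

if engine_file.exists():
    print(f"Engine built: {engine_file} ({engine_file.stat().st_size / 1024**3:.2f} GiB)")
else:
    print("Engine not found - check the trtllm-build logs for errors.")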

Output

The output produced by the trtllm-build command shows the engine being built for the LLaMA-7B model using bfloat16 precision on a single GPU. Let's analyze the output in detail:

Command and Version

  • The trtllm-build command is executed with specific arguments to build the engine.

  • The TensorRT-LLM version is displayed as "0.10.0.dev2024041600".

Plugin Configuration

  • The output shows the configuration of various plugins used by TensorRT-LLM.

  • The gpt_attention_plugin and gemm_plugin are set to use bfloat16 precision.

  • Other plugins like bert_attention_plugin, nccl_plugin, moe_plugin, etc., are set to their default values.

Warnings

  • Several warnings are displayed regarding the mismatch of data types between inputs of certain layers (e.g., IElementWiseLayer).

  • The warnings indicate that the first input has type BFloat16, while the second input has type Float.

  • These warnings are related to the internal workings of TensorRT and the model architecture.

Memory Usage

  • The output tracks the memory usage changes during the engine building process.

  • It shows the memory usage on both the CPU and GPU at different stages of the process.

  • The memory usage is reported in MiB (mebibytes).

Engine Building

  • The TensorRT engine named "Unnamed Network 0" is being built.

  • Warnings about unused inputs (e.g., "position_ids") are displayed.

  • The global timing cache is in use, and profiling results will be stored.

  • The approximate region cut reduction algorithm is called for graph reduction.

  • The number of input and output tensors in the network is detected.

Memory Allocation

  • The total host persistent memory, device persistent memory, and scratch memory are reported.

  • The block assignment algorithm (ShiftNTopDown) is used to assign block shifts to nodes.

  • The total activation memory and weights memory are displayed.

Engine Generation

  • The engine generation process is completed, and the time taken is reported (11.9708 seconds in this case).

  • Peak memory usage of TensorRT CPU and GPU memory allocators is shown.

Serialization

  • The generated engine is serialized, and the serialization process is timed.

  • The code generator cache, compilation cache, and timing cache entries are serialized.

  • The timing cache is saved to a file named "model.cache".

  • The serialized engine is saved to "./tmp/llama/7B/trt_engines/bf16/1-gpu/rank0.engine".

Total Time

  • The total time taken for building all engines is reported (00:00:36 in this case).

The output provides detailed information about the engine building process, including the configuration of plugins, memory usage, warnings encountered, and the time taken for different stages of the process.

It helps in understanding the progress and performance of the TensorRT-LLM engine building for the LLaMA-7B model using bfloat16 precision on a single GPU.

Step 4: Running the Model Locally

To execute the compiled model locally, use the TensorRT-LLM C++ runtime, which handles token sampling, KV cache management, and request batching.

python3 examples/llama/run.py \
    --engine_dir ./tmp/llama/7B/trt_engines/bf16/1-gpu \
    --max_output_len 100 \
    --tokenizer_dir meta-llama/Llama-2-7b-chat-hf \
    --input_text "How do I count to nine in French?"

Step 5: Deploying with Triton Inference Server

For production-ready deployment, you can use NVIDIA Triton Inference Server with the TensorRT-LLM backend. First, clone the backend repository, create a model repository, and copy the compiled engine files into it:

cd ..
git clone -b release/0.8.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/examples/llama/out/*   all_models/inflight_batcher_llm/tensorrt_llm/1/

Modify the configuration files to specify the compiled model engine, tokenizer, and memory allocation settings:

python3 tools/fill_template.py --in_place \
      all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
      decoupled_mode:true,engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\
max_tokens_in_paged_kv_cache:,batch_scheduler_policy:guaranteed_completion,kv_cache_free_gpu_mem_fraction:0.2,\
max_num_sequences:4

python tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_type:llama,tokenizer_dir:meta-llama/Llama-2-7b-chat-hf

python tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_type:llama,tokenizer_dir:meta-llama/Llama-2-7b-chat-hf

Launch the Triton Inference Server container and start the server:

docker run -it --rm --gpus all --network host --shm-size=1g \
     -v $(pwd)/all_models:/all_models \
     -v $(pwd)/scripts:/opt/scripts \
     nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

huggingface-cli login --token *****
pip install sentencepiece protobuf

python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1
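
Once the server starts, you can confirm it is serving requests by polling Triton's standard HTTP health endpoint. This sketch assumes the default HTTP port 8000 (exposed here via --network host) and uses the requests library.

# Sketch: poll Triton's /v2/health/ready endpoint until the server reports ready.
import time
import requests

for _ in range(30):
    try:
        if requests.get("http://localhost:8000/v2/health/ready", timeout=2).status_code == 200:
            print("Triton is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    print("Triton did not become ready in time")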

Step 6: Sending Requests

To interact with the running Triton Inference Server, you can use the client libraries or send HTTP requests to the generate endpoint.

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How do I count to nine in French?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'
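
The same request can also be sent from Python. This sketch uses the requests library against the generate endpoint shown above; the field names mirror the curl payload, and the generated text is typically returned under text_output.

# Sketch: call the Triton ensemble generate endpoint used in the curl example above.
import requests

payload = {
    "text_input": "How do I count to nine in French?",
    "parameters": {"max_tokens": 100, "bad_words": [""], "stop_words": [""]},
}

resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json().get("text_output"))  # generated text (field name may vary by backend version)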

Conclusion

In this tutorial, we covered the steps to get started with TensorRT-LLM, including installation, model compilation, local execution, and deployment using NVIDIA Triton Inference Server.

TensorRT-LLM provides a powerful toolkit for optimizing and deploying LLMs efficiently on NVIDIA GPUs. By leveraging its capabilities, you can harness the potential of these models and build innovative AI-driven applications.

For more information and resources, refer to the following:

  • TensorRT-LLM GitHub repository: https://github.com/NVIDIA/TensorRT-LLM

  • NVIDIA NeMo: End-to-end framework for generative AI deployments

  • TensorRT-LLM documentation and sample code on GitHub

  • Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available (NVIDIA Technical Blog)