The TensorRT-LLM Process
TensorRT-LLM is a toolkit designed to help users create optimised solutions for Large Language Model (LLM) inference.
It provides a Python API that allows users to define models and compile efficient TensorRT engines for NVIDIA GPUs.
The toolkit also includes Python and C++ components for building runtimes to execute these engines, as well as backends for the Triton Inference Server, making it easy to create web-based services for LLMs.
TensorRT-LLM supports multi-GPU and multi-node configurations through MPI.
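For orientation, the following is a minimal sketch of what the end-to-end flow can look like with the high-level Python LLM API shipped in recent releases; the model name is only an example, and import paths and argument names may differ between versions.

```python
# Minimal sketch of the high-level Python API (recent TensorRT-LLM
# releases); the model name below is just an example.
from tensorrt_llm import LLM, SamplingParams

# Downloads the Hugging Face checkpoint, builds a TensorRT engine for the
# local GPU, and loads it into a runtime in a single step.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)
for output in llm.generate(["What is TensorRT-LLM?"], sampling):
    print(output.outputs[0].text)
```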
The process of creating an inference solution with TensorRT-LLM involves the following steps:
Model Definition
Users can either define their own model or choose from a list of pre-defined network architectures supported by TensorRT-LLM.
Model Training
If using a custom model, users must train it with an external training framework (training is not part of TensorRT-LLM). For pre-defined models, users can download checkpoints from providers such as the Hugging Face Hub, which hosts models trained with NVIDIA NeMo or PyTorch.
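As an illustration, a pretrained checkpoint can be fetched with the huggingface_hub client; the repository id and local directory below are placeholders, not values prescribed by TensorRT-LLM.

```python
# Sketch of pulling a pretrained checkpoint from the Hugging Face Hub;
# the repository id and local directory are placeholders.
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",  # any supported architecture
    local_dir="./llama-2-7b-hf",
)
print("Checkpoint downloaded to", checkpoint_dir)
```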
Model Recreation
With the model definition and weights, users utilise TensorRT-LLM's Python API to recreate the model in a format that can be compiled by TensorRT into an efficient engine.
TensorRT-LLM supports several standard model architectures out of the box for ease of use.
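The sketch below illustrates this step for one of the pre-defined architectures (LLaMA), following the Python build workflow of recent releases; the class names, BuildConfig fields, and paths are assumptions that may vary between versions.

```python
# Sketch of recreating a pre-defined architecture (LLaMA here) from a
# Hugging Face checkpoint and compiling it into a TensorRT engine.
# Names follow the Python build workflow of recent releases and may
# differ between versions; paths are placeholders.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

# Load the downloaded weights into TensorRT-LLM's model definition.
model = LLaMAForCausalLM.from_hugging_face("./llama-2-7b-hf")

# Compile the network into an engine for the local GPU.
config = BuildConfig(max_input_len=1024, max_batch_size=8)
engine = build(model, config)
engine.save("./llama-2-7b-engine")
```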
Runtime Creation
TensorRT-LLM provides users with components to create a runtime that executes the compiled TensorRT engine. The runtime offers features such as beam search and extensive sampling functionality (e.g., top-K and top-P sampling). The C++ runtime is the recommended choice.
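A rough sketch of driving a prebuilt engine from Python is shown below; the ModelRunner method names and generate() keyword arguments mirror the example scripts shipped with the toolkit and may differ between versions, and the paths are placeholders.

```python
# Rough sketch of executing a prebuilt engine with the Python runtime
# components; names follow the toolkit's example scripts and may vary
# by version. Paths are placeholders.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-hf")
runner = ModelRunner.from_dir(engine_dir="./llama-2-7b-engine")

input_ids = tokenizer("Explain in-flight batching.", return_tensors="pt").input_ids
outputs = runner.generate(
    batch_input_ids=[input_ids[0]],  # one token sequence per batch entry
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    top_k=50,                        # top-K sampling
    top_p=0.95,                      # top-P (nucleus) sampling
    num_beams=1,                     # values > 1 enable beam search
)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```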
Triton Inference Server Integration
TensorRT-LLM includes Python and C++ backends for the NVIDIA Triton Inference Server, allowing users to assemble solutions for online LLM serving. The C++ backend is recommended because it implements in-flight batching for optimised performance.
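As a hedged example, a deployment based on the backend's example ensemble model can be queried over Triton's HTTP generate endpoint; the model name and the text_input/text_output/max_tokens field names come from that example configuration and may differ in a custom setup.

```python
# Sketch of querying a running Triton Inference Server that serves a
# TensorRT-LLM engine through the TensorRT-LLM backend. The "ensemble"
# model name and field names follow the backend's example configuration.
import requests

response = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "What is in-flight batching?", "max_tokens": 64},
)
response.raise_for_status()
print(response.json()["text_output"])
```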
To use TensorRT-LLM, users need to supply a set of trained weights.
These weights can come from the user's own model trained in a framework such as NVIDIA NeMo, or be pulled from a repository of pretrained weights such as the Hugging Face Hub.
In summary, TensorRT-LLM simplifies the process of creating optimised LLM inference solutions by providing a Python API for model definition, components for runtime creation, and backends for the Triton Inference Server.
Users can leverage pre-defined models and pretrained weights or use their own custom models to build efficient and scalable LLM inference solutions.