# TensorRT-LLM

TensorRT-LLM is a framework for executing Large Language Model (LLM) inference on NVIDIA GPUs.

It provides a Python API for <mark style="color:yellow;">defining models and compiling them into efficient TensorRT engines, and includes both Python and C++ components for runtime execution</mark>.

Additionally, it provides <mark style="color:yellow;">a backend for the Triton Inference Server,</mark> facilitating the deployment of web-based large language model services.

The toolkit supports multi-GPU and multi-node setups through MPI.

TensorRT-LLM integrates with the TensorRT deep learning compiler and includes optimised kernels, as well as pre- and post-processing steps.

It also incorporates multi-GPU/multi-node communication primitives.

The software aims to deliver high performance without requiring users to have deep knowledge of C++ or CUDA.

<figure><img src="https://1726855934-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FtDk18PoauZh6y1imIvHj%2Fuploads%2FeDnVMIZij3xf9AboBXm4%2FTRT_LLM_v0-5-0_H100vA100_1st.png?alt=media&#x26;token=fced4b39-6463-4656-9088-2a9717d162d7" alt="" width="563"><figcaption><p>H100 FP8 increases max throughput, decreases 1st token latency, and reduces memory consumption. At peak, TensorRT-LLM on H100 can achieve >10K token/s or &#x3C;10ms to first token.</p></figcaption></figure>

### <mark style="color:blue;">Python API</mark>

TensorRT-LLM offers a <mark style="color:yellow;">modular Python API</mark> that is easy to use and quick to customise. It enables you to define, optimise, and execute new language model architectures as they evolve.
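As a quick taste of the Python API, the sketch below follows the high-level `LLM` interface from recent TensorRT-LLM releases. It requires a supported NVIDIA GPU and the `tensorrt_llm` package, and exact names may differ in your version; treat it as an untested illustration rather than a definitive recipe.

```python
# Sketch of TensorRT-LLM's high-level LLM API (assumes a supported GPU and
# a recent tensorrt_llm release; names may vary across versions).
from tensorrt_llm import LLM, SamplingParams

# Builds or loads a TensorRT engine for the given Hugging Face model.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```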

### <mark style="color:blue;">Features and Optimisations</mark>

* <mark style="color:blue;">**Streaming of Tokens:**</mark> Returns output tokens to the client as they are generated, rather than waiting for the full completion.
* <mark style="color:blue;">**In-flight Batching:**</mark> Optimised scheduling that evicts finished sequences and admits queued requests mid-batch to manage dynamic loads.
* <mark style="color:blue;">**Paged attention:**</mark> Manages the attention key-value cache in fixed-size blocks, avoiding memory over-allocation in large models.
* <mark style="color:blue;">**Quantization:**</mark> Supports reduced-precision inference for higher throughput and lower memory use.
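The paged-attention idea can be illustrated with a toy sketch (hypothetical names, not TensorRT-LLM's actual data structures): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than reserved up front for the maximum sequence length.

```python
# Toy sketch of a paged KV cache: a pool of fixed-size physical blocks
# plus a per-sequence block table. Illustrative only.
BLOCK_SIZE = 4  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block if needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:       # current block full (or none yet)
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id, pos):
        """Map a logical token position to (physical block id, offset)."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]
```

A sequence of 6 tokens consumes only two 4-token blocks, and frees them the moment it finishes, so many sequences of varying lengths can share one pool.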

### <mark style="color:blue;">Performance Improvements</mark>

TensorRT-LLM, when used with NVIDIA Hopper architecture, significantly accelerates LLM inference.

For example, it can increase throughput by 8x compared to the A100 GPU, and it delivers a 4.6x speedup on Meta's Llama 2 model.

### <mark style="color:blue;">TCO and Energy Efficiency</mark>

The software not only improves computational efficiency but also substantially reduces the total cost of ownership (TCO) and energy consumption.

An 8x performance speedup results in a 5.3x reduction in TCO and a 5.6x reduction in energy costs compared to the A100 baseline.

### <mark style="color:blue;">Advanced Scheduling Technique: In-flight Batching</mark>

TensorRT-LLM includes an optimised scheduling feature called "in-flight batching," which allows the <mark style="color:yellow;">**runtime to begin executing new requests before the previous batch has completed**</mark>. This enables better utilisation of GPU resources.
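The idea can be sketched as a toy step-level scheduler (illustrative only; TensorRT-LLM's actual batch manager is far more sophisticated): at every generation step, finished sequences leave the batch and queued requests are admitted immediately, instead of the batch draining completely before new work starts.

```python
from collections import deque

def inflight_batching(requests, max_batch=2):
    """Toy in-flight batching: requests are (id, num_decode_steps) pairs.
    Returns the step at which each request finishes."""
    queue = deque(requests)
    active = {}        # request id -> remaining decode steps
    finished_at = {}
    step = 0
    while queue or active:
        # Admit new requests as soon as a batch slot is free.
        while queue and len(active) < max_batch:
            rid, steps = queue.popleft()
            active[rid] = steps
        step += 1
        for rid in list(active):   # one decode step for the whole batch
            active[rid] -= 1
            if active[rid] == 0:   # finished: evict immediately
                finished_at[rid] = step
                del active[rid]
    return finished_at

# Short requests are not held hostage by long ones sharing their batch:
print(inflight_batching([("long", 5), ("short", 1), ("mid", 2)]))
```

Here "short" finishes at step 1 and its slot goes straight to "mid"; with static batching, both would wait for "long" before the next batch could start.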

### <mark style="color:blue;">Quantization and FP8 Support</mark>

NVIDIA H100 GPUs with TensorRT-LLM support a new 8-bit floating-point format (FP8) that allows for more efficient memory usage during inference without sacrificing accuracy.

This is enabled by NVIDIA's Hopper Transformer Engine technology.
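A rough feel for the number format (here the E4M3 variant of FP8: 4 exponent bits, 3 mantissa bits) can be had with a small round-trip quantizer in pure Python. This is a sketch of the format only, covering normal numbers and saturation; it is not the Transformer Engine's actual scaling machinery, which also handles subnormals and per-tensor scale factors.

```python
import math

def quantize_e4m3(x):
    """Round x to the nearest FP8 E4M3 value (sketch: normal numbers only,
    saturating at the format's maximum of 448)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)              # E4M3 max representable value
    exp = math.floor(math.log2(mag))
    exp = max(exp, -6)                    # smallest normal exponent (bias 7)
    mantissa = mag / 2**exp               # in [1, 2) for normal numbers
    mantissa = round(mantissa * 8) / 8    # keep 3 mantissa bits
    return sign * mantissa * 2**exp

# Precision is relative: small values keep fine steps, large ones coarse.
print([quantize_e4m3(v) for v in [0.1, 1.0, 3.3, 1000.0]])
```

With only 8 bits per value, weights and activations take a quarter of the memory of FP32, which is where the efficiency gain in the section above comes from.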

### <mark style="color:blue;">Conclusion and Future Implications</mark>

The growing ecosystem of LLMs requires efficient solutions for deployment and scaling, and TensorRT-LLM aims to meet this need.

The software provides a robust, scalable, and cost-effective solution for businesses looking to deploy large language models.

In summary, TensorRT-LLM is a significant leap forward for anyone working with large language models, offering a range of features and optimisations to streamline deployment, improve performance, and reduce costs.
