# TensorRT-LLM

TensorRT-LLM is a framework for executing Large Language Model (LLM) inference on NVIDIA GPUs.

It provides a Python API for <mark style="color:yellow;">defining models and compiling them into efficient TensorRT engines, and includes both Python and C++ components for runtime execution</mark>.

Additionally, it provides <mark style="color:yellow;">a backend for the Triton Inference Server,</mark> facilitating the deployment of web-based large language model services.

The toolkit supports multi-GPU and multi-node setups through MPI.

TensorRT-LLM integrates with the TensorRT deep learning compiler and includes optimised kernels, as well as pre- and post-processing steps.

It also incorporates multi-GPU/multi-node communication primitives.

The software aims to deliver high performance without requiring users to have deep knowledge of C++ or CUDA.

<figure><img src="https://1726855934-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FtDk18PoauZh6y1imIvHj%2Fuploads%2FeDnVMIZij3xf9AboBXm4%2FTRT_LLM_v0-5-0_H100vA100_1st.png?alt=media&#x26;token=fced4b39-6463-4656-9088-2a9717d162d7" alt="" width="563"><figcaption><p>H100 FP8 increases max throughput, decreases 1st token latency, and reduces memory consumption. At peak, TensorRT-LLM on H100 can achieve >10K token/s or &#x3C;10ms to first token.</p></figcaption></figure>

### <mark style="color:blue;">Python API</mark>

TensorRT-LLM offers a <mark style="color:yellow;">modular Python API</mark> that is easy to use and quick to customise. It enables you to define, optimise, and execute new language model architectures as they evolve.
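As a quick taste of the Python API, the sketch below follows the high-level `LLM` interface from recent TensorRT-LLM releases. It requires a supported NVIDIA GPU and the `tensorrt_llm` package, and exact names may differ in your version; treat it as an untested illustration rather than a definitive recipe.

```python
# Sketch of TensorRT-LLM's high-level LLM API (assumes a supported GPU and
# a recent tensorrt_llm release; names may vary across versions).
from tensorrt_llm import LLM, SamplingParams

# Builds or loads a TensorRT engine for the given Hugging Face model.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```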

### <mark style="color:blue;">Features and Optimisations</mark>

* <mark style="color:blue;">**Streaming of Tokens:**</mark> Returns output tokens to the client as they are generated, rather than waiting for the full completion.
* <mark style="color:blue;">**In-flight Batching:**</mark> Optimised scheduling that evicts finished sequences and admits queued requests mid-batch to manage dynamic loads.
* <mark style="color:blue;">**Paged attention:**</mark> Manages the attention key-value cache in fixed-size blocks, avoiding memory over-allocation in large models.
* <mark style="color:blue;">**Quantization:**</mark> Supports reduced-precision inference for higher throughput and lower memory use.
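The paged-attention idea can be illustrated with a toy sketch (hypothetical names, not TensorRT-LLM's actual data structures): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than reserved up front for the maximum sequence length.

```python
# Toy sketch of a paged KV cache: a pool of fixed-size physical blocks
# plus a per-sequence block table. Illustrative only.
BLOCK_SIZE = 4  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block if needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:       # current block full (or none yet)
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id, pos):
        """Map a logical token position to (physical block id, offset)."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]
```

A sequence of 6 tokens consumes only two 4-token blocks, and frees them the moment it finishes, so many sequences of varying lengths can share one pool.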

### <mark style="color:blue;">Performance Improvements</mark>

TensorRT-LLM, when used with NVIDIA Hopper architecture, significantly accelerates LLM inference.

For example, it can increase throughput by 8x compared to the A100 GPU, and it delivers a 4.6x speedup on Meta's Llama 2 model.

### <mark style="color:blue;">TCO and Energy Efficiency</mark>

The software not only improves computational efficiency but also substantially reduces the total cost of ownership (TCO) and energy consumption.

An 8x performance speedup results in a 5.3x reduction in TCO and a 5.6x reduction in energy costs compared to the A100 baseline.

### <mark style="color:blue;">Advanced Scheduling Technique: In-flight Batching</mark>

TensorRT-LLM includes an optimised scheduling feature called "in-flight batching," which allows the <mark style="color:yellow;">**runtime to begin executing new requests before the previous batch has completed**</mark>. This enables better utilisation of GPU resources.
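The idea can be sketched as a toy step-level scheduler (illustrative only; TensorRT-LLM's actual batch manager is far more sophisticated): at every generation step, finished sequences leave the batch and queued requests are admitted immediately, instead of the batch draining completely before new work starts.

```python
from collections import deque

def inflight_batching(requests, max_batch=2):
    """Toy in-flight batching: requests are (id, num_decode_steps) pairs.
    Returns the step at which each request finishes."""
    queue = deque(requests)
    active = {}        # request id -> remaining decode steps
    finished_at = {}
    step = 0
    while queue or active:
        # Admit new requests as soon as a batch slot is free.
        while queue and len(active) < max_batch:
            rid, steps = queue.popleft()
            active[rid] = steps
        step += 1
        for rid in list(active):   # one decode step for the whole batch
            active[rid] -= 1
            if active[rid] == 0:   # finished: evict immediately
                finished_at[rid] = step
                del active[rid]
    return finished_at

# Short requests are not held hostage by long ones sharing their batch:
print(inflight_batching([("long", 5), ("short", 1), ("mid", 2)]))
```

Here "short" finishes at step 1 and its slot goes straight to "mid"; with static batching, both would wait for "long" before the next batch could start.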

### <mark style="color:blue;">Quantization and FP8 Support</mark>

NVIDIA H100 GPUs with TensorRT-LLM support a new 8-bit floating-point format (FP8) that allows for more efficient memory usage during inference without sacrificing accuracy.

This is enabled by NVIDIA's Hopper Transformer Engine technology.
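A rough feel for the number format (here the E4M3 variant of FP8: 4 exponent bits, 3 mantissa bits) can be had with a small round-trip quantizer in pure Python. This is a sketch of the format only, covering normal numbers and saturation; it is not the Transformer Engine's actual scaling machinery, which also handles subnormals and per-tensor scale factors.

```python
import math

def quantize_e4m3(x):
    """Round x to the nearest FP8 E4M3 value (sketch: normal numbers only,
    saturating at the format's maximum of 448)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)              # E4M3 max representable value
    exp = math.floor(math.log2(mag))
    exp = max(exp, -6)                    # smallest normal exponent (bias 7)
    mantissa = mag / 2**exp               # in [1, 2) for normal numbers
    mantissa = round(mantissa * 8) / 8    # keep 3 mantissa bits
    return sign * mantissa * 2**exp

# Precision is relative: small values keep fine steps, large ones coarse.
print([quantize_e4m3(v) for v in [0.1, 1.0, 3.3, 1000.0]])
```

With only 8 bits per value, weights and activations take a quarter of the memory of FP32, which is where the efficiency gain in the section above comes from.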

### <mark style="color:blue;">Conclusion and Future Implications</mark>

The growing ecosystem of LLMs requires efficient solutions for deployment and scaling, and TensorRT-LLM aims to meet this need.

The software provides a robust, scalable, and cost-effective solution for businesses looking to deploy large language models.

In summary, TensorRT-LLM is a significant leap forward for anyone working with large language models, offering a range of features and optimisations to streamline deployment, improve performance, and reduce costs.
