
# Performance

This document highlights the <mark style="color:yellow;">performance benchmarks of TensorRT-LLM on NVIDIA GPUs across different models</mark>, with a focus on throughput and latency for inference tasks.

### <mark style="color:blue;">**Methodology**</mark>

The performance data was gathered by following the benchmarks outlined in the respective folder, which gives a standardised approach to measuring and validating the performance of TensorRT-LLM.

<mark style="color:green;">**High Throughput**</mark>

Performance measurements at large batch sizes were taken to represent high-throughput scenarios. In these cases, the throughput is measured in output tokens per second.
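As a rough illustration of the metric, the sketch below computes output tokens per second for one batched run. The function name and the sample numbers are hypothetical and are not taken from the benchmark data; it also assumes every request in the batch generates the same number of output tokens.

```python
def output_tokens_per_second(batch_size: int, output_len: int, elapsed_s: float) -> float:
    """Throughput = tokens generated across the whole batch / wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return batch_size * output_len / elapsed_s


# Hypothetical run: 64 requests, 128 output tokens each, completed in 2.0 s.
print(output_tokens_per_second(batch_size=64, output_len=128, elapsed_s=2.0))  # 4096.0
```

Input (prompt) tokens are deliberately left out of the numerator, since the metric described here counts output tokens only.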

### <mark style="color:blue;">**Performance on Different GPUs**</mark>

<mark style="color:green;">**H100 GPUs (FP8 Precision)**</mark>

H100 GPUs deliver the highest throughput among the tested GPUs, with significant gains for both short and long input and output lengths.

<mark style="color:green;">**L40S GPUs (FP8 Precision)**</mark>

The L40S GPUs demonstrate moderate throughput, suitable for applications where a balance between cost and performance is needed.

<mark style="color:green;">**A100 GPUs (FP16 Precision)**</mark>

A100 GPUs show competitive throughput rates at half precision (FP16). Because the A100 has no native FP8 support, these figures use a higher precision than the H100 and L40S results and are not a like-for-like comparison.

<mark style="color:blue;">**Tensor Parallelism (TP)**</mark>

The data incorporates tensor parallelism, in which a model's weights are split across multiple GPUs that work on the same request in parallel, increasing throughput and reducing computation time.
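The idea can be sketched in plain Python. This is a toy illustration of column-wise weight sharding, one common form of tensor parallelism, and not TensorRT-LLM's implementation: each "rank" below is just a list slice, and the helper names (`split_columns`, `column_parallel_matmul`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply, used as the single-GPU reference."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split_columns(w: Matrix, tp_size: int) -> List[Matrix]:
    """Give each of the tp_size ranks an equal, contiguous slice of the weight columns."""
    n_cols = len(w[0])
    if n_cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    shard = n_cols // tp_size
    return [[row[r * shard:(r + 1) * shard] for row in w] for r in range(tp_size)]


def column_parallel_matmul(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Each rank multiplies the full input by its shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    # Stand-in for an all-gather: join each row's partial outputs in rank order.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
print(column_parallel_matmul(x, w, tp_size=2) == matmul(x, w))  # True
```

On real hardware each shard lives on a different GPU, so the per-rank multiplies run concurrently and the concatenation becomes a communication step between devices.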

<mark style="color:blue;">**Low Latency**</mark>

For low-latency scenarios, such as online streaming tasks where the latency perceived by the end user is critical, batch size 1 is used and the time taken to return the first token is measured.
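A minimal sketch of how first-token latency can be timed for a single streamed request is shown below. `fake_stream` is a hypothetical stand-in for a streaming generation call, and the injectable `clock` parameter is an assumption added to make the helper easy to test; neither comes from the TensorRT-LLM benchmark code.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float]:
    """Return the first token and the seconds elapsed before it arrived."""
    start = clock()
    for token in stream:
        return token, clock() - start
    raise ValueError("stream produced no tokens")


def fake_stream() -> Iterator[str]:
    """Hypothetical single-request (batch size 1) token stream."""
    time.sleep(0.01)  # simulated work before the first token is emitted
    yield "Hello"
    yield " world"


token, ttft = time_to_first_token(fake_stream())
print(f"first token {token!r} after {ttft * 1000:.1f} ms")
```

The timer starts before the stream is first consumed, so the measurement covers everything up to the first token and ignores the time spent generating the rest of the output.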

The H100 GPUs demonstrate the lowest latency times, indicating their suitability for applications requiring quick, real-time responses.

### <mark style="color:blue;">**Latency Observations**</mark>

<mark style="color:green;">**H100 GPUs (FP8 Precision)**</mark>

H100 GPUs display low latency across different models, making them well suited to tasks requiring immediate feedback.

<mark style="color:green;">**L40S GPUs (FP8 Precision)**</mark>

L40S GPUs show higher latency than H100 GPUs, but still within a reasonable range for many real-time applications.

<mark style="color:green;">**A100 GPUs (FP16 Precision)**</mark>

The latency on A100 GPUs is higher, particularly for extended input lengths, suggesting a trade-off between throughput performance and response time.

The provided data serves as a reference and should not be construed as the maximum achievable performance.

It demonstrates that TensorRT-LLM delivers considerable improvements in speed and efficiency across various GPUs and can be optimized further depending on the specific requirements of the task at hand.

