TensorRT-LLM Performance
This document summarizes performance benchmarks of TensorRT-LLM on NVIDIA GPUs across a range of models, focusing on inference throughput and latency.
Methodology
The performance data was gathered following the benchmarking procedure outlined in the respective folder, ensuring a standardized approach to measuring and validating TensorRT-LLM's performance.
High Throughput
Performance measurements at large batch sizes were taken to represent high-throughput scenarios. In these cases, the throughput is measured in output tokens per second.
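At this scale the metric reduces to simple arithmetic: total generated tokens divided by wall-clock time. The Python sketch below illustrates the calculation with hypothetical per-request results; it is not TensorRT-LLM's benchmarking code, just the formula.

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    # Hypothetical per-request record from a batched benchmark run.
    num_output_tokens: int

def output_tokens_per_second(results: list[RequestResult], wall_time_s: float) -> float:
    """Aggregate throughput: total generated tokens over wall-clock time."""
    total_tokens = sum(r.num_output_tokens for r in results)
    return total_tokens / wall_time_s

# Example: 64 requests that each generated 128 tokens over a 10-second run
batch = [RequestResult(num_output_tokens=128) for _ in range(64)]
print(f"{output_tokens_per_second(batch, wall_time_s=10.0):.1f} tokens/s")  # 819.2 tokens/s
```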
Performance on Different GPUs
H100 GPUs (FP8 Precision)
Performance on H100 GPUs shows the highest throughput among the tested GPUs, with significant gains for both short and long input and output lengths.
L40S GPUs (FP8 Precision)
The L40S GPUs demonstrate moderate throughput, suitable for applications where a balance between cost and performance is needed.
A100 GPUs (FP16 Precision)
A100 GPUs show competitive throughput rates, demonstrating their capability to handle demanding tasks at FP16 precision.
Tensor Parallelism (TP)
The data incorporates tensor parallelism, in which a model's weight tensors are sharded across multiple GPUs so that each device computes a portion of every layer, increasing throughput and reducing computation time.
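The toy NumPy sketch below illustrates the idea behind tensor parallelism: a weight matrix is split column-wise across ranks and the partial results are gathered. It is a conceptual illustration of the arithmetic, not TensorRT-LLM's internal implementation.

```python
import numpy as np

# Toy illustration of tensor parallelism: a weight matrix is split
# column-wise across tp_size "GPUs", each shard computes a partial
# result, and the shards are concatenated (an all-gather).
tp_size = 2
x = np.random.randn(4, 8)   # activations: [batch, hidden]
w = np.random.randn(8, 16)  # weights: [hidden, out]

shards = np.split(w, tp_size, axis=1)          # each rank holds out/tp_size columns
partials = [x @ shard for shard in shards]     # computed in parallel, one per GPU
y_parallel = np.concatenate(partials, axis=1)  # gather results across ranks

assert np.allclose(y_parallel, x @ w)  # matches the single-GPU computation
```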
Low Latency
For low-latency scenarios, such as online streaming tasks where end-user perceived latency is critical, measurements use batch size 1 and report the time to first token (TTFT).
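A minimal sketch of how TTFT can be measured against any streaming generation API is shown below; `generate_stream` is a hypothetical stand-in, not a TensorRT-LLM function.

```python
import time
from typing import Iterable, Iterator

def time_to_first_token(stream: Iterable[str]) -> float:
    """Measure latency from request start until the first token arrives."""
    start = time.perf_counter()
    next(iter(stream))  # block until the first token is produced
    return time.perf_counter() - start

def generate_stream(prompt: str) -> Iterator[str]:
    # Hypothetical streaming API that yields tokens as they are produced.
    time.sleep(0.05)  # simulated prefill / first-token delay
    yield "Hello"
    yield " world"

print(f"TTFT: {time_to_first_token(generate_stream('Hi')) * 1000:.1f} ms")
```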
The H100 GPUs demonstrate the lowest latency times, indicating their suitability for applications requiring quick, real-time responses.
Latency Observations
H100 GPUs (FP8 Precision)
Display low latency across different models, making them ideal for tasks requiring immediate feedback.
L40S GPUs (FP8 Precision)
Show higher latency compared to H100 GPUs, but still within a reasonable range for many real-time applications.
A100 GPUs (FP16 Precision)
The latency on A100 GPUs is higher, particularly for longer input lengths, suggesting a trade-off between throughput and response time.
The provided data serves as a reference and should not be construed as the maximum achievable performance.
It demonstrates that TensorRT-LLM delivers considerable improvements in speed and efficiency across various GPUs and can be optimized further depending on the specific requirements of the task at hand.