# NVIDIA Nsight Systems: A Comprehensive Guide for TensorRT-LLM and Triton Inference Server

### <mark style="color:blue;">Introduction</mark>

NVIDIA Nsight Systems is a powerful system-wide performance analysis tool designed to help developers optimise and debug their GPU-accelerated applications.

It provides a unified timeline view of the entire system, including CPU and GPU activity, enabling users to identify performance bottlenecks and optimization opportunities.

In this guide, we will explore the technical details and features of Nsight Systems and discuss how to apply it in the context of using TensorRT-LLM and Triton Inference Server to serve large language models (LLMs).

### <mark style="color:blue;">Technical Details</mark>

Nsight Systems is a profiling and tracing tool that collects and visualises performance data from various sources, including CPUs, GPUs, and system resources.

It supports a wide range of computing and graphics APIs, such as CUDA, Vulkan, OpenGL, and DirectX. Nsight Systems combines low-overhead API tracing with statistical sampling to gather performance metrics and events without significantly impacting the application's runtime behavior.

#### <mark style="color:green;">The core components of Nsight Systems include:</mark>

1. <mark style="color:purple;">**Profiling Target:**</mark> The system or device on which the application is being profiled. Nsight Systems supports local and remote profiling targets, including Linux and Windows workstations, servers, and embedded devices like the NVIDIA Jetson and DRIVE platforms.
2. <mark style="color:purple;">**Tracing:**</mark> Nsight Systems traces the execution of the target application and collects performance data, such as API calls, memory transfers, kernel launches, and system events. The tracing options can be customized to focus on specific APIs, metrics, or time ranges.
3. <mark style="color:purple;">**Timeline View:**</mark> The main visualisation interface of Nsight Systems is the timeline view, which displays the collected performance data in chronological order. The timeline is divided into rows representing CPU threads, GPU streams, and other system resources. Each event or metric is represented as a color-coded block on the timeline, allowing users to identify patterns, correlations, and potential bottlenecks.
4. <mark style="color:purple;">**Metrics and Events**</mark><mark style="color:purple;">:</mark> Nsight Systems captures a wide range of performance metrics and events, including CPU and GPU utilisation, memory usage, API calls, kernel execution times, and data transfers. These metrics and events provide valuable insights into the application's behavior and help identify areas for optimisation.
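The tracing component described above is usually driven from the `nsys` command-line tool. The following is a minimal sketch, not a definitive recipe, of assembling an `nsys profile` invocation in Python. The script name `run_inference.py` and the output name `llm_profile` are hypothetical placeholders, and the helper only builds the command rather than running it.

```python
import shlex


def build_nsys_profile_cmd(app_argv, output="llm_profile",
                           traces=("cuda", "nvtx", "osrt")):
    """Return an argv list that runs `app_argv` under `nsys profile`.

    `output` and the default trace selection are illustrative assumptions.
    """
    return [
        "nsys", "profile",
        f"--trace={','.join(traces)}",   # APIs to trace
        f"--output={output}",            # report file name (without extension)
        "--force-overwrite=true",        # replace an existing report of the same name
        *app_argv,                       # the application to launch and profile
    ]


# `run_inference.py` is a hypothetical script name used only for illustration.
cmd = build_nsys_profile_cmd(["python3", "run_inference.py"])
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` on a machine with Nsight Systems installed would launch the capture; the resulting report can then be opened in the timeline view.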

### <mark style="color:blue;">Features</mark>

Nsight Systems offers several key features that make it a powerful tool for performance analysis and optimization:

1. <mark style="color:purple;">**System-Wide Profiling**</mark><mark style="color:purple;">:</mark> Nsight Systems provides a holistic view of the entire system, including CPUs, GPUs, and memory. It captures the interactions between different components and allows users to correlate performance data across the system.
2. <mark style="color:purple;">**API Support**</mark><mark style="color:purple;">:</mark> Nsight Systems supports a wide range of computing and graphics APIs, making it suitable for various GPU-accelerated applications. It can trace CUDA, Vulkan, OpenGL, DirectX, and other APIs, providing insights into the performance characteristics of each API.
3. <mark style="color:purple;">**Customizable Tracing**</mark><mark style="color:purple;">:</mark> Users can customise the tracing options to focus on specific APIs, metrics, or time ranges. This flexibility allows developers to target specific areas of interest and minimise the overhead of tracing.
4. <mark style="color:purple;">**Visual Timeline**</mark><mark style="color:purple;">:</mark> The timeline view in Nsight Systems provides an intuitive and visually appealing representation of the performance data. It allows users to zoom in and out, navigate through the timeline, and inspect individual events and metrics in detail.
5. <mark style="color:purple;">**Correlation and Analysis**</mark><mark style="color:purple;">:</mark> Nsight Systems enables users to correlate performance data across different components and identify relationships between events. It provides tools for analysis, such as filtering, searching, and aggregating data, to help users pinpoint performance bottlenecks and optimization opportunities.
6. <mark style="color:purple;">**Remote Profiling**</mark><mark style="color:purple;">:</mark> Nsight Systems supports remote profiling, allowing users to profile applications running on remote systems or embedded devices. This feature is particularly useful when working with Triton Inference Server, as it enables profiling the server's performance on remote machines.
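For the filtering and aggregation mentioned above, one option is to export a captured report to SQLite and query it with ordinary tools. The sketch below assembles an `nsys export` command. The report file name is a hypothetical placeholder, and the helper only constructs the argument list.

```python
from pathlib import Path


def build_nsys_export_cmd(report, fmt="sqlite"):
    """Return an argv list that exports an Nsight Systems report to `fmt`.

    The output file sits next to the report and shares its base name.
    """
    report = Path(report)
    out = report.with_suffix(f".{fmt}")
    return [
        "nsys", "export",
        f"--type={fmt}",
        f"--output={out}",
        "--force-overwrite=true",
        str(report),
    ]


# `llm_profile.nsys-rep` is a hypothetical report name.
print(" ".join(build_nsys_export_cmd("llm_profile.nsys-rep")))
```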

### <mark style="color:blue;">Applying Nsight Systems with TensorRT-LLM and Triton Inference Server</mark>

When using TensorRT-LLM and Triton Inference Server to serve large language models, Nsight Systems can be a valuable tool for optimizing performance and identifying bottlenecks.

Here are some key areas where Nsight Systems can be applied:

<mark style="color:green;">**Profiling TensorRT Engines**</mark>

Nsight Systems can be used to profile the execution of TensorRT engines, including the `rank0.engine` file generated from the LLaMA model.

By tracing the engine's execution, users can analyze the performance of individual kernels and memory transfers, and of individual layers where the engine emits NVTX ranges for them. This information can help identify opportunities for optimisation, such as adjusting batch sizes, using different precisions, or optimizing kernel configurations.
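As a rough illustration of per-kernel analysis, the sketch below ranks kernels by total GPU time from a report exported to SQLite. The table and column names (`CUPTI_ACTIVITY_KIND_KERNEL`, `StringIds`, `shortName`) reflect the export schema as commonly seen, but they are an assumption here and should be checked against the Nsight Systems version in use. The demo database and its kernel names are synthetic.

```python
import sqlite3

# Assumed schema: kernel rows carry start/end timestamps in nanoseconds and a
# `shortName` id that resolves to text through the StringIds table.
KERNEL_SUMMARY_SQL = """
SELECT s.value AS kernel,
       COUNT(*) AS calls,
       SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY total_ns DESC
LIMIT ?
"""


def top_kernels(conn, limit=10):
    """Return (kernel, calls, total_ns) rows, most expensive kernel first."""
    return conn.execute(KERNEL_SUMMARY_SQL, (limit,)).fetchall()


def make_demo_db():
    """Build a tiny in-memory database with synthetic kernel records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
            start INTEGER, "end" INTEGER, shortName INTEGER);
        INSERT INTO StringIds VALUES (1, 'gemm_kernel'), (2, 'attention_kernel');
        INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES
            (0, 500, 1), (600, 900, 2), (1000, 1700, 1);
    """)
    return conn


for row in top_kernels(make_demo_db()):
    print(row)
```

Against a real export, `sqlite3.connect("llm_profile.sqlite")` (a hypothetical file name) would replace `make_demo_db()`.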

<mark style="color:green;">**Analysing Triton Inference Server**</mark>

Nsight Systems can be used to profile the performance of Triton Inference Server when serving LLMs.

By tracing the server's execution, users can monitor the CPU and GPU utilisation, request handling, and data transfers. This analysis can help identify bottlenecks in the server's configuration, such as insufficient GPU resources, suboptimal model placement, or inefficient request handling.
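A server profile is often easier to read when model loading is excluded from the capture. The sketch below builds a command that starts `tritonserver` under `nsys profile`, waits before collecting, and collects for a fixed window. The delay, duration, and model repository path are illustrative assumptions, and `--kill=none` is included on the assumption that the server should keep running once collection stops.

```python
def build_triton_profile_cmd(model_repo, delay_s=60, duration_s=30,
                             output="triton_profile"):
    """Return an argv list that profiles a window of Triton's steady state.

    All default values are placeholders to adjust for the deployment.
    """
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx,osrt",
        f"--delay={delay_s}",        # skip start-up and model loading
        f"--duration={duration_s}",  # length of the collection window, in seconds
        "--kill=none",               # assumption: leave the server running afterwards
        f"--output={output}",
        "--force-overwrite=true",
        "tritonserver", f"--model-repository={model_repo}",
    ]


# `/models` is a hypothetical model repository path.
print(" ".join(build_triton_profile_cmd("/models")))
```

Inference requests need to arrive during the collection window for the timeline to show request handling, so a load generator would typically be started once the server reports that it is ready.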

<mark style="color:green;">**Optimizing Data Transfers**</mark>

Nsight Systems can help identify inefficiencies in data transfers between the CPU and GPU.

By analysing the timeline view, users can pinpoint slow or unnecessary data transfers and optimise them accordingly. This may involve techniques like batching requests, using pinned memory, or overlapping data transfers with computation.
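One way to quantify how well transfers are hidden behind computation is to measure the share of transfer time that coincides with a running kernel. The sketch below does this for lists of `(start, end)` intervals, such as might be read from an exported report. The sample intervals are synthetic and the function names are illustrative.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlapped_time(transfers, kernels):
    """Total transfer time during which at least one kernel was running."""
    total = 0
    for t_start, t_end in merge_intervals(transfers):
        for k_start, k_end in merge_intervals(kernels):
            total += max(0, min(t_end, k_end) - max(t_start, k_start))
    return total


def overlap_fraction(transfers, kernels):
    """Fraction of transfer wall time hidden behind kernel execution."""
    transfer_time = sum(end - start for start, end in merge_intervals(transfers))
    if transfer_time == 0:
        return 0.0
    return overlapped_time(transfers, kernels) / transfer_time


# Synthetic timestamps in nanoseconds.
transfers = [(0, 100), (200, 300)]
kernels = [(50, 250)]
print(overlap_fraction(transfers, kernels))
```

A fraction near zero suggests that copies and kernels run back to back, which is where pinned memory and separate CUDA streams are most likely to help.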

<mark style="color:green;">**Correlating Performance Issues**</mark>

Nsight Systems allows users to correlate performance issues across different components of the system. For example, delays in the CPU processing of requests can leave the GPU idle and reduce the overall throughput of the inference server. By analyzing the timeline view and correlating events across the CPU and GPU, users can identify the root cause of performance issues and address them accordingly.
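A simple starting point for this kind of correlation is to list the periods when no kernel was running, then inspect what the CPU threads were doing at those times in the timeline. The sketch below finds such idle gaps from kernel `(start, end)` intervals. The sample data and the gap threshold are illustrative assumptions.

```python
def gpu_idle_gaps(kernels, min_gap_ns=0):
    """Return (gap_start, gap_end) periods with no kernel running.

    Only gaps longer than `min_gap_ns` are reported.
    """
    gaps = []
    busy_until = None
    for start, end in sorted(kernels):
        if busy_until is not None and start - busy_until > min_gap_ns:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Synthetic kernel intervals in nanoseconds.
kernels = [(0, 100), (50, 120), (400, 500), (510, 600)]
print(gpu_idle_gaps(kernels, min_gap_ns=50))
```

Long gaps that line up with request parsing, tokenisation, or scheduling work on the CPU rows point towards a host-side bottleneck rather than a GPU one.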

<mark style="color:green;">**Customising Tracing**</mark>

Nsight Systems provides flexibility in customizing the tracing options to focus on specific areas of interest.

When using TensorRT-LLM and Triton Inference Server, users can enable tracing for relevant APIs, such as CUDA and NVTX (the annotation API through which libraries such as TensorRT mark their work), while disabling unnecessary tracing options. This helps minimise the overhead of tracing and ensures that the collected performance data is relevant to the specific use case.
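As one possible narrow configuration, the sketch below builds an `nsys profile` command that traces only CUDA and NVTX, turns off CPU sampling and context-switch tracing, and collects only between `cudaProfilerStart` and `cudaProfilerStop` calls. It assumes the profiled application makes those calls around the region of interest. The script name is hypothetical.

```python
def build_focused_trace_cmd(app_argv, output="focused_profile"):
    """Return an argv list for a low-overhead, range-limited capture."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx",                # only the APIs of interest
        "--sample=none",                    # no CPU instruction-pointer sampling
        "--cpuctxsw=none",                  # no CPU context-switch tracing
        "--capture-range=cudaProfilerApi",  # start at cudaProfilerStart()
        "--capture-range-end=stop",         # end collection at cudaProfilerStop()
        f"--output={output}",
        "--force-overwrite=true",
        *app_argv,
    ]


# `run_inference.py` is a hypothetical script that brackets its hot loop
# with cudaProfilerStart/cudaProfilerStop.
print(" ".join(build_focused_trace_cmd(["python3", "run_inference.py"])))
```

If the application does not call the CUDA profiler API, the capture-range options can be dropped in favour of a fixed delay and duration.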

### <mark style="color:blue;">Conclusion</mark>

NVIDIA Nsight Systems is a powerful tool for performance analysis and optimisation of GPU-accelerated applications.

Its system-wide profiling capabilities, visual timeline view, and extensive API support make it an invaluable asset for developers working with TensorRT-LLM and Triton Inference Server.

By applying Nsight Systems to profile and analyze the performance of LLM inference, users can identify bottlenecks, optimise data transfers, and fine-tune the configuration of the inference server.

With its intuitive interface and rich feature set, Nsight Systems empowers developers to unlock the full potential of their GPU-accelerated applications and deliver high-performance LLM inference services.

