NVIDIA Nsight Systems: A Comprehensive Guide for TensorRT-LLM and Triton Inference Server

Introduction

NVIDIA Nsight Systems is a powerful system-wide performance analysis tool designed to help developers optimise and debug their GPU-accelerated applications.

It provides a unified timeline view of the entire system, including CPU and GPU activity, enabling users to identify performance bottlenecks and optimisation opportunities.

In this guide, we will explore the technical details and features of Nsight Systems and discuss how to apply it in the context of using TensorRT-LLM and Triton Inference Server to serve large language models (LLMs).

Technical Details

Nsight Systems is a profiling and tracing tool that collects and visualises performance data from various sources, including CPUs, GPUs, and system resources.

It supports a wide range of computing and graphics APIs, such as CUDA, Vulkan, OpenGL, and DirectX. Nsight Systems uses low-overhead sampling techniques to gather performance metrics and events without significantly impacting the application's runtime behaviour.

The core components of Nsight Systems include:

  1. Profiling Target: The system or device on which the application is being profiled. Nsight Systems supports local and remote profiling targets, including Linux and Windows workstations, servers, and embedded devices like NVIDIA Jetson and Drive platforms.

  2. Tracing: Nsight Systems traces the execution of the target application and collects performance data, such as API calls, memory transfers, kernel launches, and system events. The tracing options can be customised to focus on specific APIs, metrics, or time ranges (a minimal launch sketch follows this list).

  3. Timeline View: The main visualisation interface of Nsight Systems is the timeline view, which displays the collected performance data in chronological order. The timeline is divided into rows representing CPU threads, GPU streams, and other system resources. Each event or metric is represented as a colour-coded block on the timeline, allowing users to identify patterns, correlations, and potential bottlenecks.

  4. Metrics and Events: Nsight Systems captures a wide range of performance metrics and events, including CPU and GPU utilisation, memory usage, API calls, kernel execution times, and data transfers. These metrics and events provide valuable insights into the application's behaviour and help identify areas for optimisation.
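
To make the tracing and timeline components concrete, here is a minimal sketch of launching an application under Nsight Systems from Python. The target script (my_app.py) is a placeholder and the chosen trace domains are assumptions; substitute any GPU-accelerated command.

```python
import subprocess

# Launch a target application under Nsight Systems (sketch; adjust paths and flags).
cmd = [
    "nsys", "profile",
    "--trace=cuda,nvtx,osrt",      # trace CUDA, NVTX ranges, and OS runtime calls
    "--output=llm_profile",        # writes llm_profile.nsys-rep for the timeline view
    "--force-overwrite=true",
    "python", "my_app.py",         # placeholder workload; replace with your own
]
subprocess.run(cmd, check=True)
```

The resulting .nsys-rep report can be opened in the Nsight Systems GUI, where each traced domain appears as a row on the timeline.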

Features

Nsight Systems offers several key features that make it a powerful tool for performance analysis and optimisation:

  1. System-Wide Profiling: Nsight Systems provides a holistic view of the entire system, including CPUs, GPUs, and memory. It captures the interactions between different components and allows users to correlate performance data across the system.

  2. API Support: Nsight Systems supports a wide range of computing and graphics APIs, making it suitable for various GPU-accelerated applications. It can trace CUDA, Vulkan, OpenGL, DirectX, and other APIs, providing insights into the performance characteristics of each API.

  3. Customizable Tracing: Users can customise the tracing options to focus on specific APIs, metrics, or time ranges. This flexibility allows developers to target specific areas of interest and minimise the overhead of tracing.

  4. Visual Timeline: The timeline view in Nsight Systems provides an intuitive and visually appealing representation of the performance data. It allows users to zoom in and out, navigate through the timeline, and inspect individual events and metrics in detail.

  5. Correlation and Analysis: Nsight Systems enables users to correlate performance data across different components and identify relationships between events. It provides tools for analysis, such as filtering, searching, and aggregating data, to help users pinpoint performance bottlenecks and optimisation opportunities.

  6. Remote Profiling: Nsight Systems supports remote profiling, allowing users to profile applications running on remote systems or embedded devices. This feature is particularly useful when working with Triton Inference Server, as it enables profiling the server's performance on remote machines.
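
A simple remote-profiling workflow is to run the nsys CLI on the remote machine and copy the report back for local inspection. The sketch below assumes SSH access; the host name, output path, and serve.py workload are all hypothetical.

```python
import subprocess

remote = "user@gpu-server"   # assumed remote host with nsys installed

# Profile the workload on the remote machine, then pull the report back.
subprocess.run(
    ["ssh", remote,
     "nsys profile --trace=cuda,nvtx --output=/tmp/remote_profile python serve.py"],
    check=True,
)
subprocess.run(["scp", f"{remote}:/tmp/remote_profile.nsys-rep", "."], check=True)
```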

Applying Nsight Systems with TensorRT-LLM and Triton Inference Server

When using TensorRT-LLM and Triton Inference Server to serve large language models, Nsight Systems can be a valuable tool for optimising performance and identifying bottlenecks.

Here are some key areas where Nsight Systems can be applied:

Profiling TensorRT Engines

Nsight Systems can be used to profile the execution of TensorRT engines, including the rank0.engine file generated from the LLaMA model.

By tracing the engine's execution, users can analyse the performance of individual layers, kernels, and memory transfers. This information can help identify opportunities for optimisation, such as adjusting batch sizes, using different precisions, or optimising kernel configurations.
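
As a sketch, the TensorRT-LLM example runner can be launched under nsys so that engine execution (kernel launches, memory transfers) appears on the timeline. The engine and tokenizer paths below are placeholders; adjust them to your build.

```python
import subprocess

# Profile a single inference run against a built TensorRT-LLM engine (sketch).
cmd = [
    "nsys", "profile",
    "--trace=cuda,nvtx",
    "--output=llama_engine_profile",
    "python", "examples/run.py",               # TensorRT-LLM example runner
    "--engine_dir", "/path/to/engine_dir",     # directory containing rank0.engine (placeholder)
    "--tokenizer_dir", "/path/to/tokenizer",   # placeholder tokenizer path
    "--max_output_len", "64",
]
subprocess.run(cmd, check=True)
```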

Analysing Triton Inference Server

Nsight Systems can be used to profile the performance of Triton Inference Server when serving LLMs.

By tracing the server's execution, users can monitor the CPU and GPU utilisation, request handling, and data transfers. This analysis can help identify bottlenecks in the server's configuration, such as insufficient GPU resources, suboptimal model placement, or inefficient request handling.
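
One way to capture this data is to start tritonserver itself under Nsight Systems and bound the capture window. The model repository path and capture duration below are assumptions for illustration.

```python
import subprocess

# Run Triton Inference Server under Nsight Systems for a fixed capture window (sketch).
cmd = [
    "nsys", "profile",
    "--trace=cuda,nvtx,osrt",
    "--duration=120",                  # capture roughly two minutes of serving traffic
    "--output=triton_profile",
    "tritonserver",
    "--model-repository=/models",      # assumed model repository path
]
subprocess.run(cmd, check=True)
```

While the capture is running, drive the server with representative client load so that request handling and GPU activity show up in the report.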

Optimising Data Transfers

Nsight Systems can help identify inefficiencies in data transfers between the CPU and GPU.

By analysing the timeline view, users can pinpoint slow or unnecessary data transfers and optimise them accordingly. This may involve techniques like batching requests, using pinned memory, or overlapping data transfers with computation.
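
A minimal PyTorch sketch of the last two techniques, pinned host memory and moving a host-to-device copy onto its own CUDA stream so it can overlap with other work (tensor shapes are arbitrary):

```python
import torch

copy_stream = torch.cuda.Stream()

# Page-locked (pinned) host memory enables truly asynchronous host-to-device copies.
host_batch = torch.randn(8, 3, 224, 224, pin_memory=True)

with torch.cuda.stream(copy_stream):
    # non_blocking=True returns immediately; the copy proceeds on copy_stream
    device_batch = host_batch.to("cuda", non_blocking=True)

# Order the default stream after the copy before consuming the data.
torch.cuda.current_stream().wait_stream(copy_stream)
result = device_batch * 2.0   # stand-in for real inference work
```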

Correlating Performance Issues

Nsight Systems allows users to correlate performance issues across different components of the system. For example, delays in CPU-side request processing can limit the overall throughput of the inference server even when the GPU has spare capacity.

By analysing the timeline view and correlating events across the CPU and GPU, users can identify the root cause of such issues and address them accordingly.
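
NVTX ranges make this correlation easier: annotating the CPU-side phases of request handling means they appear as named blocks on the timeline, directly above the GPU work they trigger (provided nvtx is included in the nsys trace domains). A minimal sketch, where the tokeniser and model calls are trivial stand-ins rather than real TensorRT-LLM code:

```python
import torch
import torch.cuda.nvtx as nvtx

def handle_request(prompt: str) -> torch.Tensor:
    nvtx.range_push("preprocess")                        # CPU-side tokenisation / batching
    batch = torch.randint(0, 32000, (1, len(prompt)))    # stand-in for a real tokenizer
    nvtx.range_pop()

    nvtx.range_push("inference")                         # GPU work launched within this range
    logits = batch.to("cuda", non_blocking=True).float().softmax(dim=-1)  # stand-in for the model
    nvtx.range_pop()
    return logits
```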

Customising Tracing

Nsight Systems provides flexibility in customising the tracing options to focus on specific areas of interest.

When using TensorRT-LLM and Triton Inference Server, users can enable tracing for the relevant APIs, such as CUDA and NVTX, while disabling options they do not need. This helps minimise the overhead of tracing and ensures that the collected performance data is relevant to the specific use case.
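
Beyond choosing which APIs to trace, the capture window itself can be narrowed. A common pattern, sketched below, is to launch the workload with nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop and bracket only the iterations of interest with the CUDA profiler start/stop calls. The run_inference function and iteration counts here are stand-ins, not TensorRT-LLM code.

```python
import torch

def run_inference(x: torch.Tensor) -> torch.Tensor:
    return x @ x                  # stand-in for a real model / engine call

batch = torch.randn(1024, 1024, device="cuda")

# Warm-up iterations fall outside the capture window.
for _ in range(10):
    run_inference(batch)

torch.cuda.profiler.start()       # cudaProfilerStart: nsys begins recording here
for _ in range(5):
    run_inference(batch)
torch.cuda.profiler.stop()        # cudaProfilerStop: nsys stops recording here
```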

Conclusion

NVIDIA Nsight Systems is a powerful tool for performance analysis and optimisation of GPU-accelerated applications.

Its system-wide profiling capabilities, visual timeline view, and extensive API support make it an invaluable asset for developers working with TensorRT-LLM and Triton Inference Server.

By applying Nsight Systems to profile and analyse the performance of LLM inference, users can identify bottlenecks, optimise data transfers, and fine-tune the configuration of the inference server.

With its intuitive interface and rich feature set, Nsight Systems empowers developers to unlock the full potential of their GPU-accelerated applications and deliver high-performance LLM inference services.
