Tasks

Conversion APIs

  • Study the TopModelMixin class and its from_hugging_face() method to understand how the conversion interface is defined.

  • Investigate the implementation of the from_hugging_face() method in the LLaMAForCausalLM class to see how weights are converted from Hugging Face checkpoints into the format TensorRT-LLM expects.

  • Explore other conversion methods like from_meta_ckpt() in the LLaMAForCausalLM class to learn how different checkpoint formats are handled.

  • Look into the convert_checkpoint.py script to see how the conversion process is simplified using the conversion APIs; a minimal usage sketch follows this list.
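
The sketch below ties the conversion bullets together: it loads a Hugging Face LLaMA checkpoint through the high-level API and serializes it as a TensorRT-LLM checkpoint. It is a minimal sketch; the directory paths are placeholders and keyword argument names may differ between TensorRT-LLM releases.

```python
# Minimal conversion sketch (single GPU, FP16). Paths are placeholders and
# keyword names may vary between TensorRT-LLM releases.
from tensorrt_llm.models import LLaMAForCausalLM

# Convert a Hugging Face checkpoint into TensorRT-LLM's in-memory model format.
llama = LLaMAForCausalLM.from_hugging_face("./Llama-2-7b-hf", dtype="float16")

# Write the converted weights and config out as a TensorRT-LLM checkpoint.
llama.save_checkpoint("./tllm_checkpoint_llama2_7b")
```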

Quantization APIs

  • Study the PretrainedModel class and its quantize() method to understand the default implementation for AMMO-supported quantization (AMMO, NVIDIA's AlgorithMic Model Optimization toolkit, has since been renamed TensorRT Model Optimizer).

  • Investigate the LLaMAForCausalLM class and its overridden quantize() method to see how model-specific quantization is handled.

  • Explore the QuantConfig class to learn about the different quantization configurations available.

  • Look into the usage of the quantize() API in an MPI program to understand how quantization is performed in a distributed setting; a single-process sketch follows this list.
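
As a single-process illustration of the quantization bullets, the sketch below quantizes a LLaMA checkpoint with a QuantConfig. The import path for QuantConfig and the exact quantize() keyword names are assumptions and should be checked against the installed release.

```python
# Minimal quantization sketch; import locations and keyword names are
# assumptions that may differ between TensorRT-LLM releases.
from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.models.modeling_utils import QuantConfig  # location varies by release
from tensorrt_llm.quantization import QuantAlgo

# Choose a quantization scheme; W4A16 AWQ is one of the AMMO-backed options.
quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

# Calibrate, quantize, and write a quantized TensorRT-LLM checkpoint in one call.
LLaMAForCausalLM.quantize("./Llama-2-7b-hf",
                          "./tllm_checkpoint_llama2_7b_awq",
                          quant_config=quant_config)
```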

Build APIs

  • Study the tensorrt_llm.build API to understand how TensorRT-LLM models are built into TensorRT-LLM engines.

  • Investigate the BuildConfig class to learn about the different build configurations available.

  • Explore the from_checkpoint() method in the PretrainedModel class to see how checkpoints are deserialized into model objects; the sketch after this list ties these build steps together.
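
The sketch below chains the three build bullets: deserialize a checkpoint with from_checkpoint(), describe the engine limits with BuildConfig, and compile with tensorrt_llm.build(). BuildConfig field names (for example, the maximum output/sequence length) differ between releases, so treat the values here as placeholders.

```python
# Minimal build sketch; BuildConfig fields and their names vary by release.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

# Deserialize a previously converted (or quantized) checkpoint into a model object.
model = LLaMAForCausalLM.from_checkpoint("./tllm_checkpoint_llama2_7b")

# Declare the runtime limits the engine must support.
build_config = BuildConfig(max_batch_size=8, max_input_len=1024)

# Compile to a TensorRT-LLM engine and persist it for the runtime to load.
engine = build(model, build_config)
engine.save("./engine_llama2_7b")
```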

CLI Tools

  • Investigate the model-specific convert_checkpoint.py scripts in the examples/<model xxx>/ folders to understand how to convert checkpoints using the command line.

  • Explore the examples/quantization/quantize.py script to learn how to perform quantization using the CLI tool.

  • Study the trtllm-build CLI tool to understand how to build TensorRT-LLM engines from checkpoints using the command line; a scripted example follows this list.
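
For a scripted view of the CLI workflow, the hypothetical driver below chains the model-specific convert_checkpoint.py script and trtllm-build with subprocess. The flag names follow the LLaMA example but should be verified against --help for the installed release.

```python
# Hypothetical CLI driver; flag names follow the LLaMA example scripts and
# should be checked against `--help` for your TensorRT-LLM version.
import subprocess

# Step 1: convert the Hugging Face checkpoint with the model-specific script.
subprocess.run([
    "python", "examples/llama/convert_checkpoint.py",
    "--model_dir", "./Llama-2-7b-hf",
    "--output_dir", "./tllm_checkpoint_llama2_7b",
    "--dtype", "float16",
], check=True)

# Step 2: compile the converted checkpoint into a TensorRT-LLM engine.
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "./tllm_checkpoint_llama2_7b",
    "--output_dir", "./engine_llama2_7b",
    "--gemm_plugin", "float16",
], check=True)
```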

To dive deeper into each module and command, you can:

  1. Read the source code of the relevant classes and methods to understand their implementation details.

  2. Explore the documentation and comments within the code to gain insights into the purpose and usage of each module and API.

  3. Experiment with the CLI tools and scripts by running them with different arguments and configurations to see how they behave.

  4. Consult the TensorRT-LLM documentation and tutorials for more detailed explanations and examples of each module and command.

  5. Engage with the TensorRT-LLM community through forums or chat channels to ask questions and learn from experienced users and developers.

By investigating these modules and commands, you can build a comprehensive understanding of the TensorRT-LLM build workflow and use the conversion, quantization, and build APIs effectively to optimize and deploy models with TensorRT-LLM.
