Build arguments

In this script, the following build arguments can be configured:

| Configuration | Explanation | Suggested Values |
| --- | --- | --- |
| max_input_len | The maximum length of the input sequence. | 512 - 1024 (depending on the model architecture and available GPU memory) |
| max_output_len | The maximum length of the output sequence. | 256 - 512 (depending on the desired output length and available GPU memory) |
| max_batch_size | The maximum batch size for the engine. | 1 - 32 (depending on the available GPU memory and desired throughput) |
| max_beam_width | The maximum beam width for beam search during generation. | 1 - 8 (higher values can improve output quality but increase computational cost) |
| max_num_tokens | The maximum number of batched input tokens the engine processes in a single forward pass (after padding removal). | Depends on max_batch_size and typical sequence lengths; larger values increase activation memory |
| opt_num_tokens | The number of batched tokens the builder optimizes the engine for. | Usually left at the default, which is derived from max_batch_size and max_beam_width |
| max_prompt_embedding_table_size | The maximum size of the prompt embedding table (for prompt tuning). | 0 - 10000 (depending on the number of prompt templates used for prompt tuning) |
| gather_context_logits | Whether to gather context logits during generation. | False (set to True for debugging or analysis purposes) |
| gather_generation_logits | Whether to gather generation logits during generation. | False (set to True for debugging or analysis purposes) |
| strongly_typed | Whether to build a strongly typed TensorRT network. | True (enables additional optimizations and error checking) |
| builder_opt | The optimization level for the TensorRT builder. | 3 (the default; higher values may lengthen build times but potentially improve performance) |
| profiling_verbosity | The verbosity level for TensorRT profiling. | "layer_names_only" (a good balance between profiling information and readability) |
| enable_debug_output | Whether to enable debug output for the TensorRT network. | False (set to True for debugging purposes) |
| max_draft_len | The maximum length of the draft sequence (for Medusa models). | 0 - 200 (depending on the desired draft length for Medusa models) |
| use_refit | Whether to use the refit feature for multi-GPU building. | False (set to True for multi-GPU builds to reduce build time) |
| input_timing_cache | The path to the input timing cache file. | "timing_cache.bin" (a previously generated timing cache speeds up the build process) |
| output_timing_cache | The path to the output timing cache file. | "output_cache.bin" (stores the generated timing cache for future builds) |
| lora_config | A LoraBuildConfig object specifying the LoRA configuration for the model. | Depends on the specific LoRA adaptation requirements and available pre-trained LoRA weights |
| auto_parallel_config | An AutoParallelConfig object specifying the auto-parallel configuration for the model. | Depends on the number of available GPUs and the desired trade-off between build time and inference performance |
| weight_sparsity | Whether to enable weight sparsity for the engine. | False (set to True if the model weights are sparse and you want to optimize for storage and computation) |
| plugin_config | A PluginConfig object specifying the plugin configuration for the engine. | Depends on the specific requirements and available custom TensorRT plugins |
| max_encoder_input_len | The maximum length of the encoder input sequence (for encoder-decoder models). | 512 - 1024 (depending on the encoder architecture and available GPU memory) |
| use_fused_mlp | Whether to use fused MLP layers for optimization. | True (can improve performance by reducing memory accesses and kernel launches) |
| dry_run | Whether to perform a dry run without actually building the engine. | False (set to True to validate the build configuration without spending time on the actual build) |
| visualize_network | Whether to visualize the TensorRT network as a DOT graph. | False (set to True for debugging or understanding the network structure) |
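To make these options concrete, below is a minimal sketch of how a subset of them might be passed to the Python builder. It assumes a recent tensorrt_llm release in which BuildConfig, tensorrt_llm.build() and LLaMAForCausalLM.from_checkpoint() are available; exact names and defaults can differ between versions, and the checkpoint and output paths are placeholders.

```python
# Minimal sketch (not a drop-in script): build a Llama 2 engine from a
# converted checkpoint using a handful of the arguments described above.
# API names follow recent tensorrt_llm releases and may vary by version.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

build_config = BuildConfig(
    max_input_len=1024,    # longest prompt the engine will accept
    max_output_len=512,    # longest generated continuation per request
    max_batch_size=8,      # upper bound on concurrent sequences
    max_beam_width=1,      # >1 enables beam search at extra compute cost
    strongly_typed=True,   # stricter typing, extra optimizations and checks
    builder_opt=3,         # TensorRT builder optimization level
)

# Load weights produced by convert_checkpoint.py, then build and save the engine.
model = LLaMAForCausalLM.from_checkpoint("./llama2_tllm_checkpoint")  # placeholder path
engine = build(model, build_config)
engine.save("./llama2_engine")  # placeholder output directory
```

The same options map onto the trtllm-build command line (for example --max_input_len and --max_batch_size), which the later sections on the trtllm-build CLI configurations cover in more detail.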
