
run.py for inference

The run.py file is a script for running inference with a pre-built TensorRT engine.

It takes various command-line arguments to configure the inference process and generates output based on the provided input.

Let's analyze the key arguments passed to the script and their significance:

Key arguments:

  1. --max_output_len: The maximum length of the generated output sequence.

  2. --max_attention_window_size: The attention window size that controls the sliding window attention or cyclic KV cache behavior.

  3. --sink_token_length: The number of sink tokens retained at the start of the KV cache (used for StreamingLLM-style attention sinks).

  4. --log_level: The logging level for the script.

  5. --engine_dir: The directory containing the pre-built TensorRT engine.

  6. --use_py_session: Use the Python runtime session instead of the default C++ session.

  7. --input_text: The input text to be used for generation.

  8. --input_file: An alternative to --input_text, allowing input to be read from a CSV or Numpy file.

  9. --max_input_length: The maximum length of the input sequence.

  10. --output_csv, --output_npy: Files to store the tokenized output in CSV or Numpy format.

  11. --output_logits_npy: File to store the generation logits in Numpy format (only when num_beams==1).

  12. --output_log_probs_npy, --output_cum_log_probs_npy: Files to store the log probabilities and cumulative log probabilities in Numpy format.

  13. --tokenizer_dir, --tokenizer_type, --vocab_file: Configuration for the tokenizer.

  14. --num_beams: The number of beams to use for beam search (use num_beams > 1 for beam search).

  15. --temperature, --top_k, --top_p, --length_penalty, --repetition_penalty, --presence_penalty, --frequency_penalty: Parameters for controlling the generation process.

  16. --early_stopping: Whether to use early stopping during beam search.

  17. --debug_mode: Whether to turn on debug mode.

  18. --streaming, --streaming_interval: Configuration for streaming mode.

  19. --prompt_table_path, --prompt_tasks: Configuration for prompt tuning.

  20. --lora_dir, --lora_task_uids, --lora_ckpt_source: Configuration for LoRA (Low-Rank Adaptation).

  21. --num_prepend_vtokens: The number of default virtual tokens to prepend to each sentence.

  22. --run_profiling: Whether to run profiling iterations to measure inference latencies.

  23. --medusa_choices: Configuration for Medusa decoding.
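
To make these arguments concrete, a minimal invocation might look like the following sketch. The engine and tokenizer directories are placeholders (assumed paths, not ones produced earlier in this guide), and the prompt and output length are arbitrary; the flags themselves are the ones documented above.

```bash
# Minimal run.py invocation against a pre-built LLaMA engine
# (paths, prompt and lengths are illustrative placeholders)
python3 run.py \
    --engine_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
    --tokenizer_dir ./llama-2-7b-hf \
    --input_text "What is the capital of France?" \
    --max_output_len 64
```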

Relationship to the LLaMA engine:

  • The run.py script is designed to perform inference using a pre-built TensorRT engine, such as the rank0.engine file generated from the LLaMA model.

  • The --engine_dir argument specifies the directory containing the TensorRT engine file, which is loaded by the script for inference (an example directory layout is shown after this list).

  • The config.json file contains the configuration details of the LLaMA model used to build the TensorRT engine. It includes information such as the model architecture, data types, vocabulary size, and other hyperparameters.

  • The run.py script uses the configuration from config.json to set up the appropriate tokenizer, input parsing, and output processing based on the LLaMA model's specifications.

  • The script also supports additional features such as LoRA, prompt tuning, and Medusa decoding, which can be configured through the respective command-line arguments, provided the engine was built with support for them.
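
Assuming a single-GPU build, the directory passed to --engine_dir would typically contain just the two files discussed above; the path shown here is a placeholder.

```bash
# Illustrative contents of the directory passed to --engine_dir
# (single-GPU build; the path itself is a placeholder)
ls ./tmp/llama/7B/trt_engines/fp16/1-gpu
# config.json   <- model/build configuration read by run.py
# rank0.engine  <- serialized TensorRT engine for rank 0
```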

In summary, the run.py script is a generic inference script that can be used with any pre-built TensorRT-LLM engine, including the LLaMA engine. It reads the model configuration from config.json, loads the serialized engine from rank0.engine, and generates output based on the provided input and the specified arguments.
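
As a further sketch, beam search can be combined with tokenized output capture. --num_beams, --output_csv and --max_output_len are the flags described earlier; the paths, prompt and values are again placeholders.

```bash
# Beam search with tokenized outputs written to a CSV file
# (paths and prompt are illustrative placeholders)
python3 run.py \
    --engine_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
    --tokenizer_dir ./llama-2-7b-hf \
    --input_text "Summarise the plot of Hamlet in one sentence." \
    --max_output_len 128 \
    --num_beams 4 \
    --output_csv beam_outputs.csv
```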
