summarize.py script in Llama folder

The summarize_long.py file is a Python script that demonstrates how to use the TensorRT-LLM framework to perform text summarization using the LLaMA model.

It compares the performance of the TensorRT-LLM implementation with the Hugging Face (HF) implementation of the LLaMA model.

To execute the summarize_long.py script and use your serialized LLaMA model, follow these steps:

  1. Ensure that you have the necessary dependencies installed in your TensorRT-LLM Docker container. The script requires the datasets and transformers libraries, which should be listed in the requirements.txt file.

  2. Make sure the serialized LLaMA model files are in the appropriate directory. In this walkthrough, the serialized model is located in the tllm_checkpoint_1gpu_bf16 directory.

  3. Open the summarize_long.py script and locate the parse_args() function. This function defines the command-line arguments that the script accepts. You may need to modify some of the default values to match your setup. For example:

    • Set the hf_model_location argument to the path of your Hugging Face LLaMA model checkpoint, if you want to compare with the HF implementation.

    • Set the dataset_path argument to the path where you want to cache the dataset used for testing.

    • Set the engine_dir argument to the directory containing your serialized TensorRT-LLM model (the built engine), which in this example is tllm_checkpoint_1gpu_bf16.

    • Adjust other arguments such as max_attention_window_size, max_input_len, batch_size, num_beams, output_len, etc., according to your requirements.

  4. Run the summarize_long.py script using the following command:

python summarize_long.py --test_trt_llm --engine_dir ./tllm_checkpoint_1gpu_bf16

This command will execute the script and use the TensorRT-LLM implementation with your serialized LLaMA model checkpoint located in the tllm_checkpoint_1gpu_bf16 directory.

You can also add additional arguments to control the behavior of the script. For example, you can add --test_hf to compare with the Hugging Face implementation, or --check_accuracy to check the accuracy of the TensorRT-LLM implementation against a specified threshold.
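
Putting these together, a combined invocation might look like the following (the Hugging Face model path is a placeholder for your own checkpoint):

python summarize_long.py --test_trt_llm --test_hf --check_accuracy --engine_dir ./tllm_checkpoint_1gpu_bf16 --hf_model_location ./llama-2-7b-hf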

Now, let's analyze the summarize_long.py script in more detail:

  • The script starts by importing the necessary libraries and modules, including TensorRT-LLM-specific modules and profiling tools.

  • The parse_args() function defines the command-line arguments that the script accepts, allowing you to configure various aspects of the summarization task, such as model locations, dataset paths, hyperparameters, etc.

  • The TRTLLaMA() function sets up the TensorRT-LLM model based on the provided configuration files and loads the serialized model checkpoint.

  • The get_long_texts() function retrieves long text samples from the OpenWebText-10k dataset that fall within a specified token length range.

  • The prepare_prompt() function prepares the input prompt for summarization by cleaning and formatting the text.

  • The summarize_hf() function performs text summarization using the Hugging Face implementation of the LLaMA model.

  • The summarize_tensorrt_llm() function performs text summarization using the TensorRT-LLM implementation of the LLaMA model.

  • The main() function is the entry point of the script. It loads the tokenizer, retrieves the test data, and performs summarization using either the TensorRT-LLM implementation, the Hugging Face implementation, or both, depending on the provided command-line arguments (a simplified sketch of this flow follows the list).

  • If both TensorRT-LLM and Hugging Face implementations are tested, the script compares the generated summaries using the ROUGE metric and logs the results.
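
To make this flow concrete, below is a minimal, simplified sketch of the TensorRT-LLM path through main(). It is not the actual script: the dataset identifier, token-length window, prompt format and generation parameters are assumptions, and the generic ModelRunner API from tensorrt_llm.runtime stands in for the script's own wrapper around the engine.

import argparse

from datasets import load_dataset
from transformers import AutoTokenizer

from tensorrt_llm.runtime import ModelRunner


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--engine_dir', default='./tllm_checkpoint_1gpu_bf16')
    parser.add_argument('--hf_model_location', default='./llama-2-7b-hf')  # placeholder path
    parser.add_argument('--max_input_len', type=int, default=6400)
    parser.add_argument('--output_len', type=int, default=128)
    parser.add_argument('--num_beams', type=int, default=1)
    return parser.parse_args()


def get_long_texts(dataset, tokenizer, min_tokens, max_tokens):
    # Yield only articles whose token count falls inside the target window.
    for sample in dataset:
        if min_tokens <= len(tokenizer.encode(sample['text'])) <= max_tokens:
            yield sample['text']


def main():
    args = parse_args()
    tokenizer = AutoTokenizer.from_pretrained(args.hf_model_location)

    # 'stas/openwebtext-10k' is an assumed identifier for the OpenWebText-10k dataset.
    dataset = load_dataset('stas/openwebtext-10k', split='train')
    texts = get_long_texts(dataset, tokenizer, 4000, args.max_input_len)

    # Load the serialized TensorRT-LLM engine from engine_dir.
    runner = ModelRunner.from_dir(engine_dir=args.engine_dir)

    for text in texts:
        # Assumed prompt format; the real script has its own prepare_prompt() logic.
        prompt = f"Summarize the following article:\n{text}\nSummary:"
        input_ids = tokenizer(prompt, return_tensors='pt').input_ids.int()

        outputs = runner.generate([input_ids[0]],
                                  max_new_tokens=args.output_len,
                                  num_beams=args.num_beams,
                                  end_id=tokenizer.eos_token_id,
                                  pad_id=tokenizer.eos_token_id)

        # The output tensor includes the prompt tokens; keep only the generated continuation.
        summary = tokenizer.decode(outputs[0][0][input_ids.shape[1]:],
                                   skip_special_tokens=True)
        print(summary)

        # When --test_hf is also given, the real script generates a Hugging Face summary too
        # and compares the two with ROUGE (e.g. via the Hugging Face `evaluate` library).


if __name__ == '__main__':
    main()

The real summarize_long.py wraps the engine in its own helper and exposes many more arguments (dataset path, attention window size, batch size, accuracy thresholds), but the overall control flow matches the outline above.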

The script provides a comprehensive example of how to use the TensorRT-LLM framework for text summarization tasks, demonstrating the integration with popular NLP libraries and datasets, as well as the comparison with the Hugging Face implementation.

Keep in mind that the script assumes the availability of certain files and directories, such as the serialized model checkpoint, configuration files, and dataset cache. Make sure to set up your environment accordingly before running the script.
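
As an optional sanity check before launching the script, you can confirm that the expected paths exist. The directory names below are the same assumptions used throughout this page, and config.json is the configuration file normally written alongside the serialized model:

from pathlib import Path

# Assumed locations; adjust to your own setup.
engine_dir = Path('./tllm_checkpoint_1gpu_bf16')   # serialized TensorRT-LLM model
hf_model_dir = Path('./llama-2-7b-hf')             # placeholder Hugging Face checkpoint

for path in (engine_dir, engine_dir / 'config.json', hf_model_dir):
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")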
