examples/run.py
The `run.py` file is a Python script that serves as a command-line interface for running inference with a pre-trained language model using the TensorRT-LLM framework.
Let's break down the script and analyze its components:
Imports
The script starts by importing the necessary modules: `argparse` for parsing command-line arguments, `csv` for handling CSV files, `numpy` for numerical operations, `torch` for PyTorch functionality, and `tensorrt_llm` for the TensorRT-LLM framework. It also imports utility functions from a separate `utils` module.
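Reconstructed as code, the import block might look roughly like this (a sketch; the exact module paths, especially for `ModelRunner` and `ModelRunnerCpp`, can vary between TensorRT-LLM versions):

```python
import argparse
import csv

import numpy as np
import torch

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner, ModelRunnerCpp

# Utility helpers shipped alongside the example script.
from utils import load_tokenizer, read_model_name, throttle_generator
```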
Argument Parsing
The `parse_arguments` function parses command-line arguments using the `argparse` module. It defines arguments that control the behavior of the script, such as `max_output_len`, `max_attention_window_size`, `log_level`, `engine_dir`, `input_text`, `input_file`, `output_csv`, `output_npy`, `tokenizer_dir`, `num_beams`, `temperature`, `top_k`, `top_p`, and many others. These arguments let the user customize the inference process: the maximum output length, attention window size, input text or file, output file formats, tokenizer settings, and generation parameters.
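A condensed sketch of such a parser is shown below; the defaults and help strings are illustrative assumptions, not necessarily the script's actual values:

```python
def parse_arguments():
    parser = argparse.ArgumentParser(
        description='Run inference with a TensorRT-LLM engine')
    # Generation controls.
    parser.add_argument('--max_output_len', type=int, required=True,
                        help='Maximum number of new tokens to generate')
    parser.add_argument('--max_attention_window_size', type=int, default=None)
    parser.add_argument('--num_beams', type=int, default=1)
    parser.add_argument('--temperature', type=float, default=1.0)
    parser.add_argument('--top_k', type=int, default=1)
    parser.add_argument('--top_p', type=float, default=0.0)
    # Input and output locations.
    parser.add_argument('--engine_dir', type=str, default='engine_outputs')
    parser.add_argument('--tokenizer_dir', type=str, default=None)
    parser.add_argument('--input_text', type=str, nargs='+',
                        default=['Hello, world'])
    parser.add_argument('--input_file', type=str, default=None,
                        help='CSV or .npy file of pre-tokenized input ids')
    parser.add_argument('--output_csv', type=str, default=None)
    parser.add_argument('--output_npy', type=str, default=None)
    parser.add_argument('--log_level', type=str, default='info')
    return parser.parse_args()
```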
Main Function
The `main` function is the entry point of the script. It starts by parsing the command-line arguments with the `parse_arguments` function and sets up logging based on the specified log level.
It initializes the tokenizer using the `load_tokenizer` function from the `utils` module, based on the provided tokenizer directory and type. It prepares the input data by either reading from the specified input file (CSV or NumPy) or using the provided input text.
It tokenizes the input text using the loaded tokenizer and creates a batch of input IDs.
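Putting those two steps together, the input preparation might look like the following sketch (the helper name `prepare_inputs` is hypothetical; the real script's handling of padding and dtypes is more involved):

```python
def prepare_inputs(args, tokenizer):
    """Build a list of per-prompt input-id tensors (hypothetical helper)."""
    if args.input_file is None:
        # Tokenize the raw text prompts given on the command line.
        return [torch.tensor(tokenizer.encode(text), dtype=torch.int32)
                for text in args.input_text]
    if args.input_file.endswith('.csv'):
        # Each CSV row is one pre-tokenized sequence of token ids.
        with open(args.input_file, newline='') as f:
            return [torch.tensor([int(t) for t in row], dtype=torch.int32)
                    for row in csv.reader(f) if row]
    if args.input_file.endswith('.npy'):
        arr = np.load(args.input_file)  # expected shape: [batch, seq_len]
        return [torch.tensor(row, dtype=torch.int32) for row in arr]
    raise ValueError(f'Unsupported input file: {args.input_file}')
```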
It sets up the model runner based on the specified engine directory and runtime session type (Python or C++).
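Selecting and constructing the runner might look like this (a sketch; `from_dir` accepts many more options in practice, and the C++ runner is only available when the bindings are built; treat the `--use_py_session` flag name as an assumption here):

```python
# Pick the Python or C++ runtime session.
runner_cls = ModelRunner if args.use_py_session else ModelRunnerCpp

# Load the pre-built TensorRT engine from disk.
runner = runner_cls.from_dir(engine_dir=args.engine_dir,
                             rank=tensorrt_llm.mpi_rank())
```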
It prepares additional generation parameters, such as end token ID, pad token ID, stop words list, bad words list, and prompt tuning settings.
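For a Hugging Face-style tokenizer, the special token ids could be derived along these lines (the real script gets them back from `load_tokenizer`):

```python
# Fall back to the end-of-sequence id when no pad token is defined.
end_id = tokenizer.eos_token_id
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else end_id
```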
It generates the output text using the `generate` method of the model runner, passing the input batch and the various generation parameters. It then decodes the generated output tokens back into text using the tokenizer.
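The generate-and-decode step might look roughly like this (a sketch; the keyword arguments mirror the flags described above, and the output layout `[batch, beams, seq_len]` is an assumption about the runner's `return_dict` output):

```python
with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids,
        max_new_tokens=args.max_output_len,
        end_id=end_id,
        pad_id=pad_id,
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        num_beams=args.num_beams,
        output_sequence_lengths=True,
        return_dict=True)
    torch.cuda.synchronize()

output_ids = outputs['output_ids']         # [batch, beams, total_seq_len]
seq_lengths = outputs['sequence_lengths']  # [batch, beams]
for b, input_ids in enumerate(batch_input_ids):
    # Skip the echoed prompt; decode only the newly generated tokens.
    generated = output_ids[b][0][input_ids.size(0):seq_lengths[b][0]]
    print(tokenizer.decode(generated, skip_special_tokens=True))
```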
It saves the generated output to the specified output file formats (CSV, NumPy) if provided.
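Saving the first beam of each output could be sketched as follows (shapes and beam handling are simplified):

```python
if args.output_csv is not None:
    # One row of token ids per generated sequence.
    with open(args.output_csv, 'w', newline='') as f:
        csv.writer(f).writerows(output_ids[:, 0, :].tolist())

if args.output_npy is not None:
    np.save(args.output_npy, output_ids[:, 0, :].cpu().numpy())
```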
If profiling is enabled, it runs multiple iterations of the generation process to measure the average latency.
Utility Functions
The script relies on utility functions defined in the `utils` module, such as `load_tokenizer`, `read_model_name`, and `throttle_generator`. These utility functions handle tasks like loading the tokenizer, reading the model name from the engine's configuration file, and throttling the generation loop for streaming output.
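The throttling helper, for instance, can be as small as a generator that forwards only every Nth intermediate result, plus the final one, so the console is not flooded during streaming. This is a sketch of the idea rather than the module's exact code:

```python
def throttle_generator(generator, stream_interval):
    """Yield every `stream_interval`-th item, and always the last one."""
    i = -1
    for i, out in enumerate(generator):
        if i % stream_interval == 0:
            yield out
    # Emit the final state if the loop did not just yield it.
    if i >= 0 and i % stream_interval != 0:
        yield out
```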
TensorRT-LLM Integration
The script utilizes the TensorRT-LLM framework for efficient inference with pre-trained language models.
It imports the necessary components from `tensorrt_llm`, such as `ModelRunner`, `ModelRunnerCpp`, and `profiler`. It uses the `ModelRunner` or `ModelRunnerCpp` class to load the pre-built TensorRT engine and run inference, and it leverages the runner's `generate` method to produce output text from the provided input and generation parameters.
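When streaming is enabled, the same `generate` call can be consumed incrementally; a sketch, assuming the runner supports `streaming=True` and that a `--streaming_interval` flag controls the print cadence:

```python
if args.streaming:
    stream = runner.generate(
        batch_input_ids,
        max_new_tokens=args.max_output_len,
        end_id=end_id,
        pad_id=pad_id,
        streaming=True,
        output_sequence_lengths=True,
        return_dict=True)
    for curr in throttle_generator(stream, args.streaming_interval):
        ids = curr['output_ids']
        lens = curr['sequence_lengths']
        # Decode the tokens generated so far for the first prompt/beam.
        so_far = ids[0][0][batch_input_ids[0].size(0):lens[0][0]]
        print(tokenizer.decode(so_far, skip_special_tokens=True))
```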
Profiling
The script includes an optional profiling feature that measures the average latency of multiple iterations of the generation process.
If profiling is enabled, it runs a warmup phase followed by the actual profiling phase, using the `tensorrt_llm.profiler` module to measure the elapsed time, and it reports the average latency over the specified number of iterations.
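The profiling loop might look like the following sketch; the `profiler.start`/`stop`/`elapsed_time_in_sec` names reflect common usage of the module, and the iteration counts are illustrative constants rather than the script's flags:

```python
from tensorrt_llm import profiler

NUM_WARMUP, NUM_ITERS = 2, 10  # illustrative constants

# Warmup: untimed runs so lazy initialization does not skew the timing.
for _ in range(NUM_WARMUP):
    runner.generate(batch_input_ids, max_new_tokens=args.max_output_len,
                    end_id=end_id, pad_id=pad_id)
torch.cuda.synchronize()

profiler.start('generate')
for _ in range(NUM_ITERS):
    runner.generate(batch_input_ids, max_new_tokens=args.max_output_len,
                    end_id=end_id, pad_id=pad_id)
    torch.cuda.synchronize()
profiler.stop('generate')

avg_s = profiler.elapsed_time_in_sec('generate') / NUM_ITERS
print(f'batch latency (avg over {NUM_ITERS} runs): {avg_s * 1000:.2f} ms')
```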
Overall, the `run.py` script provides a comprehensive command-line interface for running inference with pre-trained language models using the TensorRT-LLM framework.
It offers flexibility in specifying input data, output formats, tokenizer settings, and generation parameters.
It integrates with the TensorRT-LLM framework to leverage optimized inference with TensorRT engines and supports both Python and C++ runtime sessions. Additionally, it includes profiling capabilities to measure the inference latency.