examples/run.py
The `run.py` file is a Python script that serves as a command-line interface for running inference with a pre-trained language model using the TensorRT-LLM framework.
Let's break down the script and analyze its components:
Imports
The script starts by importing the necessary modules: `argparse` for parsing command-line arguments, `csv` for handling CSV files, `numpy` for numerical operations, `torch` for PyTorch functionality, and `tensorrt_llm` for the TensorRT-LLM framework. It also imports utility functions from a separate `utils` module.
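Reconstructed as code, the import block might look roughly like this (a sketch; the exact module paths, especially for `ModelRunner` and `ModelRunnerCpp`, can vary between TensorRT-LLM versions):

```python
import argparse
import csv

import numpy as np
import torch

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner, ModelRunnerCpp

# Utility helpers shipped alongside the example script.
from utils import load_tokenizer, read_model_name, throttle_generator
```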
Argument Parsing
The `parse_arguments` function parses command-line arguments using the `argparse` module. It defines arguments that control the behavior of the script, such as `max_output_len`, `max_attention_window_size`, `log_level`, `engine_dir`, `input_text`, `input_file`, `output_csv`, `output_npy`, `tokenizer_dir`, `num_beams`, `temperature`, `top_k`, `top_p`, and many others. These arguments let the user customize the inference process: the maximum output length, attention window size, input text or file, output file formats, tokenizer settings, and generation parameters.
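A condensed sketch of such a parser is shown below; the defaults and help strings are illustrative assumptions, not necessarily the script's actual values:

```python
def parse_arguments():
    parser = argparse.ArgumentParser(
        description='Run inference with a TensorRT-LLM engine')
    # Generation controls.
    parser.add_argument('--max_output_len', type=int, required=True,
                        help='Maximum number of new tokens to generate')
    parser.add_argument('--max_attention_window_size', type=int, default=None)
    parser.add_argument('--num_beams', type=int, default=1)
    parser.add_argument('--temperature', type=float, default=1.0)
    parser.add_argument('--top_k', type=int, default=1)
    parser.add_argument('--top_p', type=float, default=0.0)
    # Input and output locations.
    parser.add_argument('--engine_dir', type=str, default='engine_outputs')
    parser.add_argument('--tokenizer_dir', type=str, default=None)
    parser.add_argument('--input_text', type=str, nargs='+',
                        default=['Hello, world'])
    parser.add_argument('--input_file', type=str, default=None,
                        help='CSV or .npy file of pre-tokenized input ids')
    parser.add_argument('--output_csv', type=str, default=None)
    parser.add_argument('--output_npy', type=str, default=None)
    parser.add_argument('--log_level', type=str, default='info')
    return parser.parse_args()
```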
Main Function
The `main` function is the entry point of the script. It starts by parsing the command-line arguments with the `parse_arguments` function and sets up logging based on the specified log level.
It initializes the tokenizer using the `load_tokenizer` function from the `utils` module, based on the provided tokenizer directory and type. It prepares the input data by either reading from the specified input file (CSV or NumPy) or using the provided input text.
It tokenizes the input text using the loaded tokenizer and creates a batch of input IDs.
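Putting those two steps together, the input preparation might look like the following sketch (the helper name `prepare_inputs` is hypothetical; the real script's handling of padding and dtypes is more involved):

```python
def prepare_inputs(args, tokenizer):
    """Build a list of per-prompt input-id tensors (hypothetical helper)."""
    if args.input_file is None:
        # Tokenize the raw text prompts given on the command line.
        return [torch.tensor(tokenizer.encode(text), dtype=torch.int32)
                for text in args.input_text]
    if args.input_file.endswith('.csv'):
        # Each CSV row is one pre-tokenized sequence of token ids.
        with open(args.input_file, newline='') as f:
            return [torch.tensor([int(t) for t in row], dtype=torch.int32)
                    for row in csv.reader(f) if row]
    if args.input_file.endswith('.npy'):
        arr = np.load(args.input_file)  # expected shape: [batch, seq_len]
        return [torch.tensor(row, dtype=torch.int32) for row in arr]
    raise ValueError(f'Unsupported input file: {args.input_file}')
```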
It sets up the model runner based on the specified engine directory and runtime session type (Python or C++).
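Selecting and constructing the runner might look like this (a sketch; `from_dir` accepts many more options in practice, and the C++ runner is only available when the bindings are built; treat the `--use_py_session` flag name as an assumption here):

```python
# Pick the Python or C++ runtime session.
runner_cls = ModelRunner if args.use_py_session else ModelRunnerCpp

# Load the pre-built TensorRT engine from disk.
runner = runner_cls.from_dir(engine_dir=args.engine_dir,
                             rank=tensorrt_llm.mpi_rank())
```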
It prepares additional generation parameters, such as end token ID, pad token ID, stop words list, bad words list, and prompt tuning settings.
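For a Hugging Face-style tokenizer, the special token ids could be derived along these lines (the real script gets them back from `load_tokenizer`):

```python
# Fall back to the end-of-sequence id when no pad token is defined.
end_id = tokenizer.eos_token_id
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else end_id
```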
It generates the output text using the `generate` method of the model runner, passing the input batch and the various generation parameters. It then decodes the generated output tokens back into text using the tokenizer.
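The generate-and-decode step might look roughly like this (a sketch; the keyword arguments mirror the flags described above, and the output layout `[batch, beams, seq_len]` is an assumption about the runner's `return_dict` output):

```python
with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids,
        max_new_tokens=args.max_output_len,
        end_id=end_id,
        pad_id=pad_id,
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        num_beams=args.num_beams,
        output_sequence_lengths=True,
        return_dict=True)
    torch.cuda.synchronize()

output_ids = outputs['output_ids']         # [batch, beams, total_seq_len]
seq_lengths = outputs['sequence_lengths']  # [batch, beams]
for b, input_ids in enumerate(batch_input_ids):
    # Skip the echoed prompt; decode only the newly generated tokens.
    generated = output_ids[b][0][input_ids.size(0):seq_lengths[b][0]]
    print(tokenizer.decode(generated, skip_special_tokens=True))
```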
It saves the generated output to the specified output file formats (CSV, NumPy) if provided.
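Saving the first beam of each output could be sketched as follows (shapes and beam handling are simplified):

```python
if args.output_csv is not None:
    # One row of token ids per generated sequence.
    with open(args.output_csv, 'w', newline='') as f:
        csv.writer(f).writerows(output_ids[:, 0, :].tolist())

if args.output_npy is not None:
    np.save(args.output_npy, output_ids[:, 0, :].cpu().numpy())
```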
If profiling is enabled, it runs multiple iterations of the generation process to measure the average latency.
Utility Functions
The script relies on utility functions defined in the `utils` module, such as `load_tokenizer`, `read_model_name`, and `throttle_generator`. These utility functions handle tasks like loading the tokenizer, reading the model name from the engine's configuration file, and throttling the generation loop for streaming output.
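The throttling helper, for instance, can be as small as a generator that forwards only every Nth intermediate result, plus the final one, so the console is not flooded during streaming. This is a sketch of the idea rather than the module's exact code:

```python
def throttle_generator(generator, stream_interval):
    """Yield every `stream_interval`-th item, and always the last one."""
    i = -1
    for i, out in enumerate(generator):
        if i % stream_interval == 0:
            yield out
    # Emit the final state if the loop did not just yield it.
    if i >= 0 and i % stream_interval != 0:
        yield out
```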
TensorRT-LLM Integration
The script utilizes the TensorRT-LLM framework for efficient inference with pre-trained language models.
It imports the necessary components from `tensorrt_llm`, such as `ModelRunner`, `ModelRunnerCpp`, and `profiler`. It uses the `ModelRunner` or `ModelRunnerCpp` class to load the pre-built TensorRT engine and run inference, and it leverages the runner's `generate` method to produce output text from the provided input and generation parameters.
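When streaming is enabled, the same `generate` call can be consumed incrementally; a sketch, assuming the runner supports `streaming=True` and that a `--streaming_interval` flag controls the print cadence:

```python
if args.streaming:
    stream = runner.generate(
        batch_input_ids,
        max_new_tokens=args.max_output_len,
        end_id=end_id,
        pad_id=pad_id,
        streaming=True,
        output_sequence_lengths=True,
        return_dict=True)
    for curr in throttle_generator(stream, args.streaming_interval):
        ids = curr['output_ids']
        lens = curr['sequence_lengths']
        # Decode the tokens generated so far for the first prompt/beam.
        so_far = ids[0][0][batch_input_ids[0].size(0):lens[0][0]]
        print(tokenizer.decode(so_far, skip_special_tokens=True))
```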
Profiling
The script includes an optional profiling feature that measures the average latency of multiple iterations of the generation process.
If profiling is enabled, it runs a warmup phase followed by the actual profiling phase, using the `tensorrt_llm.profiler` module to measure the elapsed time, and it reports the average latency over the specified number of iterations.
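The profiling loop might look like the following sketch; the `profiler.start`/`stop`/`elapsed_time_in_sec` names reflect common usage of the module, and the iteration counts are illustrative constants rather than the script's flags:

```python
from tensorrt_llm import profiler

NUM_WARMUP, NUM_ITERS = 2, 10  # illustrative constants

# Warmup: untimed runs so lazy initialization does not skew the timing.
for _ in range(NUM_WARMUP):
    runner.generate(batch_input_ids, max_new_tokens=args.max_output_len,
                    end_id=end_id, pad_id=pad_id)
torch.cuda.synchronize()

profiler.start('generate')
for _ in range(NUM_ITERS):
    runner.generate(batch_input_ids, max_new_tokens=args.max_output_len,
                    end_id=end_id, pad_id=pad_id)
    torch.cuda.synchronize()
profiler.stop('generate')

avg_s = profiler.elapsed_time_in_sec('generate') / NUM_ITERS
print(f'batch latency (avg over {NUM_ITERS} runs): {avg_s * 1000:.2f} ms')
```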
Overall, the `run.py` script provides a comprehensive command-line interface for running inference with pre-trained language models using the TensorRT-LLM framework.
It offers flexibility in specifying input data, output formats, tokenizer settings, and generation parameters.
It integrates with the TensorRT-LLM framework to leverage optimized inference with TensorRT engines and supports both Python and C++ runtime sessions. Additionally, it includes profiling capabilities to measure the inference latency.