examples/summarize.py
The summarize.py file is a Python script that performs evaluation tasks using the TensorRT-LLM framework and compares the results with the Hugging Face (HF) library. Let's analyze the script in detail:
Imports
The script starts by importing the necessary modules, including argparse for parsing command-line arguments, evaluate for evaluation metrics, numpy for numerical operations, torch for PyTorch functionality, and datasets for loading datasets. It also imports utility functions from a separate utils module, along with the TensorRT-LLM library and its associated modules.
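Based on this description, the import block plausibly looks like the following. This is a sketch rather than a verbatim excerpt; the names pulled from utils are the helpers mentioned later on this page:

```python
import argparse

import evaluate
import numpy as np
import torch
from datasets import load_dataset

import tensorrt_llm
# Helper names as described on this page; the actual script may differ.
from utils import DEFAULT_HF_MODEL_DIRS, load_tokenizer, read_model_name
```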
Main Function
The main function is the entry point of the script and takes command-line arguments as input. It performs the following setup steps (a sketch of the first steps follows this list):
- Initializes the runtime rank using tensorrt_llm.mpi_rank() and sets the log level based on the log_level argument.
- Determines whether to run the Hugging Face model test based on the test_hf argument and the runtime rank.
- Reads the model name and version using the read_model_name function from the utils module.
- If the hf_model_dir argument is not provided, tries to infer it from the model name using the DEFAULT_HF_MODEL_DIRS dictionary.
- Loads the tokenizer using the load_tokenizer function from the utils module and measures how long the tokenizer takes to load.
- Sets up the appropriate dataset based on the eval_task argument, including the dataset name, revision, input key, output key, and split.
- Loads the dataset using the load_dataset function from the datasets library.
- Initializes the evaluation metrics based on the eval_task argument: ROUGE for summarization tasks, or exact match and pass@k for code completion tasks.
- If test_hf is enabled, loads the Hugging Face model based on the model name and device map.
- If test_trt_llm is enabled, initializes the TensorRT-LLM model runner based on the engine directory and the runtime session type (Python or C++).
- Sets up the output directory for saving the generated summaries, if specified.
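A minimal sketch of the first two setup steps, assuming TensorRT-LLM is installed. The tensorrt_llm.mpi_rank() call comes from the description above, while the logger import and set_level call are assumptions based on how other TensorRT-LLM example scripts set the log level:

```python
import tensorrt_llm
from tensorrt_llm.logger import logger  # assumed import path

# Query this process's MPI rank to determine the runtime rank.
runtime_rank = tensorrt_llm.mpi_rank()

# Set the log level (normally taken from the --log_level argument).
logger.set_level("info")

# The Hugging Face comparison runs in a single process, so it is gated
# on rank 0 as well as on the --test_hf flag (a literal True stands in
# for args.test_hf here).
test_hf = True and runtime_rank == 0
```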
It then iterates over the dataset in batches and performs the following steps for each batch (see the sketch after this list):
- Tokenizes the input data using the loaded tokenizer.
- Generates summaries using the TensorRT-LLM model runner and measures the time taken.
- If test_hf is enabled, generates summaries using the Hugging Face model and measures the time taken.
- Computes evaluation metrics for both the TensorRT-LLM and Hugging Face summaries.
- Saves the generated summaries to the output directory, if specified.
- Logs the input, reference, and generated summaries for debugging purposes.
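A hedged sketch of this per-batch flow, using a Hugging Face tokenizer and the evaluate library's ROUGE metric. Here, generate_trt_llm is a hypothetical stand-in for the TensorRT-LLM runner's generate call, and gpt2 is just a placeholder tokenizer:

```python
import time

import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token
metric = evaluate.load("rouge")

def generate_trt_llm(input_ids):
    # Hypothetical stand-in for the model runner; echoes the input here.
    return tokenizer.batch_decode(input_ids, skip_special_tokens=True)

batch = ["Article one to summarize.", "Article two to summarize."]
references = ["Summary one.", "Summary two."]

# Tokenize the batch, then generate and time the generation step.
inputs = tokenizer(batch, return_tensors="pt", padding=True)
start = time.time()
predictions = generate_trt_llm(inputs["input_ids"])
elapsed = time.time() - start

# Accumulate metric statistics batch by batch, as the script does.
metric.add_batch(predictions=predictions, references=references)
print(f"batch generated in {elapsed:.3f}s")
```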
After processing all batches, it computes and logs the final evaluation metrics for both TensorRT-LLM and Hugging Face models.
It checks the accuracy of the TensorRT-LLM model against the specified thresholds for the ROUGE-1 score and, if enabled, perplexity (sketched below).
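A small sketch of that final accuracy gate. The fraction-to-percent scaling assumes evaluate's rouge metric returns scores in [0, 1], and the function name is hypothetical:

```python
def check_accuracy(computed_rouge: dict, rouge1_threshold: float) -> None:
    """Fail if the accumulated ROUGE-1 score falls below the threshold."""
    rouge1_pct = computed_rouge["rouge1"] * 100  # assumed fractional score
    if rouge1_pct < rouge1_threshold:
        raise AssertionError(
            f"ROUGE-1 {rouge1_pct:.2f} below threshold {rouge1_threshold}")

# Example: a score of 0.21 (21%) passes a threshold of 15.
check_accuracy({"rouge1": 0.21}, rouge1_threshold=15.0)
```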
Command-line Arguments
The script uses the argparse module to define and parse various command-line arguments, including the following (a declaration sketch follows the list):
- hf_model_dir: The directory of the Hugging Face model.
- tokenizer_dir: The directory of the tokenizer (defaults to hf_model_dir if not specified).
- vocab_file: The vocabulary file for the tokenizer.
- test_hf: Flag to enable testing with the Hugging Face model.
- test_trt_llm: Flag to enable testing with the TensorRT-LLM model.
- data_type: The data type to use (e.g., fp16, bf16).
- engine_dir: The directory containing the TensorRT engine files.
- use_py_session: Flag to use the Python runtime session for TensorRT-LLM.
- eval_task: The evaluation task to perform (e.g., summarize, code_completion).
- check_accuracy: Flag to enable accuracy checking.
- tensorrt_llm_rouge1_threshold: The threshold for the TensorRT-LLM ROUGE-1 score.
- eval_ppl: Flag to enable perplexity evaluation.
- tensorrt_llm_ppl_threshold: The threshold for TensorRT-LLM perplexity.
- dataset_path: The path to the dataset.
- log_level: The log level for the script.
- batch_size: The batch size for processing.
- max_ite: The maximum number of iterations.
- output_len: The maximum output length for generated summaries.
- max_input_length: The maximum input length for tokenization.
- max_attention_window_size: The attention window size for sliding window attention or the cyclic KV cache.
- sink_token_length: The sink token length.
- num_beams: The number of beams for beam search.
- temperature: The temperature for sampling.
- top_k: The top-k value for sampling.
- top_p: The top-p value for sampling.
- length_penalty: The length penalty for beam search.
- repetition_penalty: The repetition penalty for sampling.
- presence_penalty: The presence penalty for sampling.
- frequency_penalty: The frequency penalty for sampling.
- early_stopping: Flag to enable early stopping for beam search.
- debug_mode: Flag to enable debug mode.
- add_special_tokens: Flag to add special tokens during tokenization.
- hf_device_map_auto: Flag to use the 'auto' device map when loading Hugging Face models.
- output_dir: The directory to save the generated summaries.
- medusa_choices: The Medusa choices for decoding (experimental feature).
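As a brief illustration, a handful of these flags could be declared as follows. The defaults and choices shown here are illustrative assumptions rather than the script's actual values:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Evaluate TensorRT-LLM vs. Hugging Face models")
parser.add_argument("--hf_model_dir", type=str, default=None,
                    help="Directory of the Hugging Face model")
parser.add_argument("--engine_dir", type=str, default="engine_outputs",
                    help="Directory containing the TensorRT engine files")
parser.add_argument("--test_hf", action="store_true",
                    help="Also run the Hugging Face model for comparison")
parser.add_argument("--test_trt_llm", action="store_true",
                    help="Run the TensorRT-LLM model")
parser.add_argument("--eval_task", type=str, default="summarize",
                    choices=["summarize", "code_completion"],
                    help="Evaluation task to perform")
parser.add_argument("--batch_size", type=int, default=1)  # illustrative default
parser.add_argument("--max_ite", type=int, default=20)    # illustrative default

args = parser.parse_args()
```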
Evaluation Metrics
The script uses the evaluate library to compute evaluation metrics based on the eval_task argument. For summarization tasks, it uses the ROUGE metric to evaluate the generated summaries against the reference summaries.
For code completion tasks, it uses the exact match and pass@k metrics to evaluate the generated code against the reference code.
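For illustration, here is how these metrics can be loaded and computed with the evaluate library. The metric IDs are the library's standard ones; the script's exact wiring may differ:

```python
import evaluate

# Summarization: ROUGE between generated and reference summaries.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=["the cat sat"],
                    references=["the cat sat down"]))

# Code completion: exact string match between generated and reference code.
exact_match = evaluate.load("exact_match")
print(exact_match.compute(predictions=["return x + 1"],
                          references=["return x + 1"]))

# pass@k is typically computed with evaluate.load("code_eval"), which
# requires explicitly enabling code execution via HF_ALLOW_CODE_EVAL=1.
```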
Overall, the summarize.py script provides a comprehensive evaluation framework for comparing the performance of TensorRT-LLM and Hugging Face models on summarization and code completion tasks.
It allows for flexible configuration through command-line arguments and provides detailed logging and accuracy checking. The script demonstrates the integration of TensorRT-LLM with popular NLP libraries and evaluation metrics, making it a valuable tool for benchmarking and analyzing the performance of language models.