examples/summarize.py
The summarize.py file is a Python script that performs evaluation tasks using the TensorRT-LLM framework and compares the results with the Hugging Face (HF) library. Let's analyze the script in detail:
Imports
The script starts by importing the necessary modules, including argparse for parsing command-line arguments, evaluate for evaluation metrics, numpy for numerical operations, torch for PyTorch functionality, and datasets for loading datasets. It also imports utility functions from a separate utils module, along with the TensorRT-LLM library and its associated modules.
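Based on this description, the import block plausibly looks like the following. This is a sketch rather than a verbatim excerpt; the names pulled from utils are the helpers mentioned later on this page:

```python
import argparse

import evaluate
import numpy as np
import torch
from datasets import load_dataset

import tensorrt_llm
# Helper names as described on this page; the actual script may differ.
from utils import DEFAULT_HF_MODEL_DIRS, load_tokenizer, read_model_name
```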
Main Function
The main function is the entry point of the script and takes command-line arguments as input. It performs the following setup steps (a sketch of the first steps follows this list):
- Initializes the runtime rank using tensorrt_llm.mpi_rank() and sets the log level based on the log_level argument.
- Determines whether to run the Hugging Face model test based on the test_hf argument and the runtime rank.
- Reads the model name and version using the read_model_name function from the utils module.
- If the hf_model_dir argument is not provided, tries to infer it from the model name using the DEFAULT_HF_MODEL_DIRS dictionary.
- Loads the tokenizer using the load_tokenizer function from the utils module and measures how long the tokenizer takes to load.
- Sets up the appropriate dataset based on the eval_task argument, including the dataset name, revision, input key, output key, and split.
- Loads the dataset using the load_dataset function from the datasets library.
- Initializes the evaluation metrics based on the eval_task argument: ROUGE for summarization tasks, or exact match and pass@k for code completion tasks.
- If test_hf is enabled, loads the Hugging Face model based on the model name and device map.
- If test_trt_llm is enabled, initializes the TensorRT-LLM model runner based on the engine directory and the runtime session type (Python or C++).
- Sets up the output directory for saving the generated summaries, if specified.
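A minimal sketch of the first two setup steps, assuming TensorRT-LLM is installed. The tensorrt_llm.mpi_rank() call comes from the description above, while the logger import and set_level call are assumptions based on how other TensorRT-LLM example scripts set the log level:

```python
import tensorrt_llm
from tensorrt_llm.logger import logger  # assumed import path

# Query this process's MPI rank to determine the runtime rank.
runtime_rank = tensorrt_llm.mpi_rank()

# Set the log level (normally taken from the --log_level argument).
logger.set_level("info")

# The Hugging Face comparison runs in a single process, so it is gated
# on rank 0 as well as on the --test_hf flag (a literal True stands in
# for args.test_hf here).
test_hf = True and runtime_rank == 0
```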
It then iterates over the dataset in batches and performs the following steps for each batch (see the sketch after this list):
- Tokenizes the input data using the loaded tokenizer.
- Generates summaries using the TensorRT-LLM model runner and measures the time taken.
- If test_hf is enabled, generates summaries using the Hugging Face model and measures the time taken.
- Computes evaluation metrics for both the TensorRT-LLM and Hugging Face summaries.
- Saves the generated summaries to the output directory, if specified.
- Logs the input, reference, and generated summaries for debugging purposes.
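A hedged sketch of this per-batch flow, using a Hugging Face tokenizer and the evaluate library's ROUGE metric. Here, generate_trt_llm is a hypothetical stand-in for the TensorRT-LLM runner's generate call, and gpt2 is just a placeholder tokenizer:

```python
import time

import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token
metric = evaluate.load("rouge")

def generate_trt_llm(input_ids):
    # Hypothetical stand-in for the model runner; echoes the input here.
    return tokenizer.batch_decode(input_ids, skip_special_tokens=True)

batch = ["Article one to summarize.", "Article two to summarize."]
references = ["Summary one.", "Summary two."]

# Tokenize the batch, then generate and time the generation step.
inputs = tokenizer(batch, return_tensors="pt", padding=True)
start = time.time()
predictions = generate_trt_llm(inputs["input_ids"])
elapsed = time.time() - start

# Accumulate metric statistics batch by batch, as the script does.
metric.add_batch(predictions=predictions, references=references)
print(f"batch generated in {elapsed:.3f}s")
```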
After processing all batches, it computes and logs the final evaluation metrics for both TensorRT-LLM and Hugging Face models.
It checks the accuracy of the TensorRT-LLM model against the specified thresholds for the ROUGE-1 score and, if enabled, perplexity (sketched below).
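A small sketch of that final accuracy gate. The fraction-to-percent scaling assumes evaluate's rouge metric returns scores in [0, 1], and the function name is hypothetical:

```python
def check_accuracy(computed_rouge: dict, rouge1_threshold: float) -> None:
    """Fail if the accumulated ROUGE-1 score falls below the threshold."""
    rouge1_pct = computed_rouge["rouge1"] * 100  # assumed fractional score
    if rouge1_pct < rouge1_threshold:
        raise AssertionError(
            f"ROUGE-1 {rouge1_pct:.2f} below threshold {rouge1_threshold}")

# Example: a score of 0.21 (21%) passes a threshold of 15.
check_accuracy({"rouge1": 0.21}, rouge1_threshold=15.0)
```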
Command-line Arguments
The script uses the argparse module to define and parse various command-line arguments, including the following (a declaration sketch follows the list):
- hf_model_dir: The directory of the Hugging Face model.
- tokenizer_dir: The directory of the tokenizer (defaults to hf_model_dir if not specified).
- vocab_file: The vocabulary file for the tokenizer.
- test_hf: Flag to enable testing with the Hugging Face model.
- test_trt_llm: Flag to enable testing with the TensorRT-LLM model.
- data_type: The data type to use (e.g., fp16, bf16).
- engine_dir: The directory containing the TensorRT engine files.
- use_py_session: Flag to use the Python runtime session for TensorRT-LLM.
- eval_task: The evaluation task to perform (e.g., summarize, code_completion).
- check_accuracy: Flag to enable accuracy checking.
- tensorrt_llm_rouge1_threshold: The threshold for the TensorRT-LLM ROUGE-1 score.
- eval_ppl: Flag to enable perplexity evaluation.
- tensorrt_llm_ppl_threshold: The threshold for TensorRT-LLM perplexity.
- dataset_path: The path to the dataset.
- log_level: The log level for the script.
- batch_size: The batch size for processing.
- max_ite: The maximum number of iterations.
- output_len: The maximum output length for generated summaries.
- max_input_length: The maximum input length for tokenization.
- max_attention_window_size: The attention window size for sliding window attention or the cyclic KV cache.
- sink_token_length: The sink token length.
- num_beams: The number of beams for beam search.
- temperature: The temperature for sampling.
- top_k: The top-k value for sampling.
- top_p: The top-p value for sampling.
- length_penalty: The length penalty for beam search.
- repetition_penalty: The repetition penalty for sampling.
- presence_penalty: The presence penalty for sampling.
- frequency_penalty: The frequency penalty for sampling.
- early_stopping: Flag to enable early stopping for beam search.
- debug_mode: Flag to enable debug mode.
- add_special_tokens: Flag to add special tokens during tokenization.
- hf_device_map_auto: Flag to use the 'auto' device map when loading Hugging Face models.
- output_dir: The directory to save the generated summaries.
- medusa_choices: The Medusa choices for decoding (experimental feature).
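As a brief illustration, a handful of these flags could be declared as follows. The defaults and choices shown here are illustrative assumptions rather than the script's actual values:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Evaluate TensorRT-LLM vs. Hugging Face models")
parser.add_argument("--hf_model_dir", type=str, default=None,
                    help="Directory of the Hugging Face model")
parser.add_argument("--engine_dir", type=str, default="engine_outputs",
                    help="Directory containing the TensorRT engine files")
parser.add_argument("--test_hf", action="store_true",
                    help="Also run the Hugging Face model for comparison")
parser.add_argument("--test_trt_llm", action="store_true",
                    help="Run the TensorRT-LLM model")
parser.add_argument("--eval_task", type=str, default="summarize",
                    choices=["summarize", "code_completion"],
                    help="Evaluation task to perform")
parser.add_argument("--batch_size", type=int, default=1)  # illustrative default
parser.add_argument("--max_ite", type=int, default=20)    # illustrative default

args = parser.parse_args()
```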
Evaluation Metrics
The script uses the evaluate library to compute evaluation metrics based on the eval_task argument. For summarization tasks, it uses the ROUGE metric to evaluate the generated summaries against the reference summaries.
For code completion tasks, it uses the exact match and pass@k metrics to evaluate the generated code against the reference code.
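For illustration, here is how these metrics can be loaded and computed with the evaluate library. The metric IDs are the library's standard ones; the script's exact wiring may differ:

```python
import evaluate

# Summarization: ROUGE between generated and reference summaries.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=["the cat sat"],
                    references=["the cat sat down"]))

# Code completion: exact string match between generated and reference code.
exact_match = evaluate.load("exact_match")
print(exact_match.compute(predictions=["return x + 1"],
                          references=["return x + 1"]))

# pass@k is typically computed with evaluate.load("code_eval"), which
# requires explicitly enabling code execution via HF_ALLOW_CODE_EVAL=1.
```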
Overall, the summarize.py script provides a comprehensive evaluation framework for comparing the performance of TensorRT-LLM and Hugging Face models on summarization and code completion tasks.
It allows for flexible configuration through command-line arguments and provides detailed logging and accuracy checking. The script demonstrates the integration of TensorRT-LLM with popular NLP libraries and evaluation metrics, making it a valuable tool for benchmarking and analyzing the performance of language models.