examples/utils.py
The utils.py script contains utility functions and constants used by the run.py script for running inference with pre-trained language models using the TensorRT-LLM framework. Let's analyse the script in detail:
Constants
The script defines two constants: DEFAULT_HF_MODEL_DIRS and DEFAULT_PROMPT_TEMPLATES.
DEFAULT_HF_MODEL_DIRS is a dictionary that maps model architectures to their default Hugging Face model directories. It includes mappings for various models such as BaichuanForCausalLM, BloomForCausalLM, ChatGLMForCausalLM, FalconForCausalLM, GPTForCausalLM, GPTJForCausalLM, GPTNeoXForCausalLM, InternLMForCausalLM, LlamaForCausalLM, MPTForCausalLM, PhiForCausalLM, OPTForCausalLM, and QWenForCausalLM.
DEFAULT_PROMPT_TEMPLATES is a dictionary that defines default prompt templates for specific model architectures. It includes templates for InternLMForCausalLM and QWenForCausalLM.
read_model_name function
This function reads the model name and version from the config.json file located in the specified engine_dir. It first retrieves the engine version using the get_engine_version function from the tensorrt_llm.builder module. It then reads the config.json file and extracts the model architecture and version from it. If the engine version is not available, it returns the model name from the builder_config section of the configuration. If the model architecture is 'ChatGLMForCausalLM', it also retrieves the chatglm_version from the configuration. It returns the model architecture and version (if applicable) as a tuple.
throttle_generator function
This function throttles the output of a generator by yielding output at a specified interval. It takes a generator and a stream_interval as input. It iterates over the generator and yields the output every stream_interval iterations. If there is a remaining output after the last interval, it yields the last output as well.
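The behaviour described above can be sketched as follows; this is an approximation of the helper, not a verbatim copy of utils.py:

```python
def throttle_generator(generator, stream_interval):
    # Yield every stream_interval-th item; if the final item fell between
    # intervals, yield it as well so the caller always sees the last output.
    last = None
    emitted_last = False
    for i, out in enumerate(generator):
        if i % stream_interval == 0:
            yield out
            emitted_last = True
        else:
            emitted_last = False
        last = out
    if not emitted_last and last is not None:
        yield last
```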
load_tokenizer function
This function loads the tokenizer based on the provided tokenizer directory, vocabulary file, model name, model version, and tokenizer type.
If no vocabulary file is provided, it uses AutoTokenizer from the Hugging Face Transformers library to load the tokenizer from the specified directory. It sets padding_side and truncation_side to 'left' and handles the legacy and fast tokenizer options based on the tokenizer type. If a vocabulary file is provided and the model name is 'GemmaForCausalLM', it uses GemmaTokenizer from the Transformers library to load the tokenizer; if a vocabulary file is provided but the model name is not 'GemmaForCausalLM', it uses T5Tokenizer instead. It handles special cases for the 'QWenForCausalLM' and 'ChatGLMForCausalLM' models, reading additional configuration files to determine the padding and end token IDs.
For other models, it sets the padding token ID to the EOS token ID if it is not already set, and retrieves the padding and end token IDs from the tokenizer.
It returns the loaded tokenizer, padding token ID, and end token ID as a tuple.
Overall, the utils.py script provides utility functions and constants to support the run.py script in loading model configurations, throttling generator output, and loading tokenizers for different pre-trained language models. These functions abstract away the details of handling different model architectures, tokenizers, and configurations, making it easier to work with various models in a consistent manner.