examples/utils.py
The utils.py script contains utility functions and constants used by the run.py script for running inference with pre-trained language models using the TensorRT-LLM framework. Let's analyse the script in detail:
Constants
The script defines two constants: DEFAULT_HF_MODEL_DIRS and DEFAULT_PROMPT_TEMPLATES.
DEFAULT_HF_MODEL_DIRS is a dictionary that maps model architectures to their default Hugging Face model directories. It includes mappings for various models such as BaichuanForCausalLM, BloomForCausalLM, ChatGLMForCausalLM, FalconForCausalLM, GPTForCausalLM, GPTJForCausalLM, GPTNeoXForCausalLM, InternLMForCausalLM, LlamaForCausalLM, MPTForCausalLM, PhiForCausalLM, OPTForCausalLM, and QWenForCausalLM.
DEFAULT_PROMPT_TEMPLATES is a dictionary that defines default prompt templates for specific model architectures. It includes templates for InternLMForCausalLM and QWenForCausalLM.
read_model_name function
This function reads the model name and version from the config.json file located in the specified engine_dir. It first retrieves the engine version using the get_engine_version function from the tensorrt_llm.builder module. It then reads the config.json file and extracts the model architecture and version from it. If the engine version is not available, it returns the model name from the builder_config section of the configuration. If the model architecture is 'ChatGLMForCausalLM', it also retrieves the chatglm_version from the configuration. It returns the model architecture and version (if applicable) as a tuple.
throttle_generator function
This function throttles the output of a generator by yielding output at a specified interval. It takes a generator and a stream_interval as input. It iterates over the generator and yields the output every stream_interval iterations. If there is a remaining output after the last interval, it yields the last output as well.
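The behaviour described above can be sketched as follows; this is an approximation of the helper, not a verbatim copy of utils.py:

```python
def throttle_generator(generator, stream_interval):
    # Yield every stream_interval-th item; if the final item fell between
    # intervals, yield it as well so the caller always sees the last output.
    last = None
    emitted_last = False
    for i, out in enumerate(generator):
        if i % stream_interval == 0:
            yield out
            emitted_last = True
        else:
            emitted_last = False
        last = out
    if not emitted_last and last is not None:
        yield last
```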
load_tokenizer function
This function loads the tokenizer based on the provided tokenizer directory, vocabulary file, model name, model version, and tokenizer type.
If no vocabulary file is provided, it uses AutoTokenizer from the Hugging Face Transformers library to load the tokenizer from the specified directory. It sets padding_side and truncation_side to 'left' and handles the legacy and fast tokenizer options based on the tokenizer type. If a vocabulary file is provided and the model name is 'GemmaForCausalLM', it uses GemmaTokenizer from the Transformers library to load the tokenizer; if a vocabulary file is provided but the model name is not 'GemmaForCausalLM', it uses T5Tokenizer instead. It handles special cases for the 'QWenForCausalLM' and 'ChatGLMForCausalLM' models, reading additional configuration files to determine the padding and end token IDs.
For other models, it sets the padding token ID to the EOS token ID if it is not already set, and retrieves the padding and end token IDs from the tokenizer.
It returns the loaded tokenizer, padding token ID, and end token ID as a tuple.
Overall, the utils.py script provides utility functions and constants to support the run.py script in loading model configurations, throttling generator output, and loading tokenizers for different pre-trained language models. These functions abstract away the details of handling different model architectures, tokenizers, and configurations, making it easier to work with various models in a consistent manner.