Analysis of the output from build.py
Based on the output of the trtllm build process, it appears that the build was successful, and the TensorRT engine was generated without any major errors.
However, there are a few warnings and observations worth noting:
Data type mismatch warnings: The output contains multiple warnings about data type mismatches between the input tensors of certain layers (Half vs. Float). While the build still completed, it's worth investigating whether these mismatches have any impact on the model's accuracy or performance.
Layernorm precision warnings: The output warns that running layernorm in FP16 precision after self-attention may cause overflow. To mitigate this, consider exporting the model to a newer ONNX opset version or forcing the layernorm layers to run in FP32 precision, as suggested in the warning message.
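As a rough sketch of the FP32 fallback, the layernorm layers can be pinned to FP32 on the TensorRT network before the engine is built. The helper below is hypothetical, and matching layers by the substring "layernorm" in their name is an assumption that depends on how the network's layers are actually named:

```python
# Hypothetical helper: pin layernorm layers to FP32 on a TensorRT network.
# The name-matching heuristic below is an assumption about layer naming.
def force_layernorm_fp32(network, config):
    import tensorrt as trt  # imported lazily so the sketch stands alone

    # Ask the builder to honor the per-layer precision set below.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if "layernorm" in layer.name.lower():
            layer.precision = trt.float32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float32)
```

The builder flag matters: without `OBEY_PRECISION_CONSTRAINTS`, TensorRT treats per-layer precision as a hint and may still fuse the layer back into FP16.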
Unused input tensor: The output notes that the input tensor "position_ids" is unused, or used only at compile time, and is not being removed. This warning indicates that the "position_ids" tensor is not utilized during inference. While it doesn't affect the build process, it's worth reviewing the model architecture to ensure that all intended inputs are being used correctly.
Memory usage: The output provides information about the memory usage during the build process. It shows the peak memory usage for CPU and GPU allocators, as well as the total activation memory, weights memory, and scratch memory.
These memory usage statistics can be helpful for understanding the resource requirements of the model and ensuring that the system has sufficient memory to accommodate the build process and inference.
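For illustration, the reported components can be combined into a rough peak-memory estimate. The figures below are placeholders, not values from this build log; substitute the numbers your own log reports:

```python
# Placeholder figures in MiB -- substitute the values from your build log.
weights_mib = 12853      # weights memory
activation_mib = 4098    # total activation memory
scratch_mib = 512        # scratch memory

# Peak GPU usage is roughly the sum of these components.
total_gpu_mib = weights_mib + activation_mib + scratch_mib
print(f"Estimated peak GPU memory: {total_gpu_mib / 1024:.1f} GiB")
```

Comparing this sum against the GPU's total memory gives a quick feasibility check before deploying the engine.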
Build time: The output indicates that the total time for building the TensorRT engine was approximately 5 minutes and 22 seconds. This build time will vary with the system's specifications and the complexity of the model.
Overall, the build process completed successfully, and the TensorRT engine was generated.
However, it's recommended to review the warnings related to data type mismatches and layernorm precision to ensure optimal model performance and accuracy. Additionally, monitoring the memory usage and build time can help in understanding the resource requirements and potential optimizations for the build process.
The config.json output
The provided config.json file contains the configuration details for the TensorRT-LLM build process. Let's analyze the different sections and their significance:
- version: Specifies the version of TensorRT-LLM used for the build process.
- pretrained_config: Contains the configuration details for the pretrained model.
  - architecture: Indicates the model architecture, which is "LlamaForCausalLM" in this case.
  - dtype and logits_dtype: Specify the data type used for the model and logits, which is "float16" here.
  - vocab_size, max_position_embeddings, hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, head_size, hidden_act, intermediate_size, norm_epsilon: These parameters define the model architecture and dimensions.
  - position_embedding_type, use_parallel_embedding, embedding_sharding_dim, share_embedding_table: Settings related to position embeddings and embedding sharding.
  - mapping: Specifies the parallelism configuration, including the world size, tensor parallelism (tp) size, and pipeline parallelism (pp) size.
  - quantization: Contains quantization-related settings, such as the quantization algorithm, group size, and excluded modules.
  - Other parameters, such as kv_dtype, rotary_scaling, moe_normalization_mode, rotary_base, moe_num_experts, moe_top_k, moe_tp_mode, attn_bias, disable_weight_only_quant_plugin, and mlp_bias, are specific to the model architecture and optimization settings.
- build_config: Contains the build configuration options.
  - max_input_len, max_output_len, max_batch_size, max_beam_width, max_num_tokens, opt_num_tokens: These parameters define the maximum input and output lengths, batch size, beam width, and token-related settings.
  - max_prompt_embedding_table_size, gather_context_logits, gather_generation_logits, strongly_typed, builder_opt, profiling_verbosity, enable_debug_output, max_draft_len, use_refit: Additional options related to prompt embeddings, logits gathering, typing, profiling, debugging, and refitting.
  - input_timing_cache and output_timing_cache: Specify the paths for the input and output timing caches.
  - lora_config: Contains settings for Low-Rank Adaptation (LoRA), including the LoRA directory, checkpoint source, maximum rank, target modules, and module mappings.
  - auto_parallel_config: Specifies the automatic parallelization configuration, including world size, GPUs per node, cluster key, sharding and communication cost models, parallelism settings, and various optimization options.
  - weight_sparsity, max_encoder_input_len, use_fused_mlp: Additional options related to weight sparsity, encoder input length, and fused MLP usage.
  - plugin_config: Contains plugin-specific settings for attention, GEMM, quantization, communication, and other optimization plugins.
The config.json file serves as a comprehensive record of the build configuration used for the TensorRT-LLM model. It captures the model architecture, parallelism settings, quantization options, build parameters, and various optimization settings.
This configuration file is valuable for reproducibility and understanding the specific settings used during the build process. It can be used to rebuild the model with the same configuration or to compare different configurations and their impact on model performance and resource utilization.
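As a quick sketch, the file can also be inspected programmatically. The inline JSON excerpt below is a hypothetical stand-in for the real file (the values are assumptions), but the key layout follows the fields described above; in practice you would `json.load` the config.json next to the engine:

```python
import json

# Hypothetical excerpt of a config.json; the real file holds many more fields
# and would be loaded with json.load(open(".../config.json")).
config = json.loads("""
{
  "pretrained_config": {
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "mapping": {"world_size": 1, "tp_size": 1, "pp_size": 1}
  },
  "build_config": {"max_input_len": 1024, "max_batch_size": 8}
}
""")

mapping = config["pretrained_config"]["mapping"]
# Consistency check: world_size should equal tp_size * pp_size.
assert mapping["world_size"] == mapping["tp_size"] * mapping["pp_size"]
print(config["pretrained_config"]["architecture"])
```

A check like the `mapping` assertion above is a cheap way to catch mismatched parallelism settings before launching a multi-GPU run.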
The engine file
The file /app/tensorrt_llm/examples/llama/llama-2-7b-chat-engine/rank0.engine is the serialized TensorRT engine built during the TensorRT-LLM build process. It represents the optimized, compiled version of the LLaMA-2-7B model for inference.
Here are some key points about the TensorRT engine file:
Serialization: The TensorRT engine is serialized to disk after the build process is complete. This allows the engine to be loaded and used for inference without rebuilding it every time.
Optimization: The TensorRT engine incorporates various optimizations, such as layer fusion, kernel auto-tuning, precision calibration, and memory optimizations, to improve the performance and efficiency of the model during inference.
Size: The size of the serialized engine file (12.6 GB in this case) depends on the model architecture, size, and the build configuration used. Larger models and higher precision settings generally result in larger engine files.
Rank: The rank0 in the file name indicates that this engine is built for a specific rank in a multi-GPU setup; in this case, the first GPU (rank 0) in a distributed or parallel inference scenario.
Inference: To perform inference with the LLaMA-2-7B model, you load this serialized engine file into TensorRT and execute inference requests through the TensorRT runtime API. The engine encapsulates the optimized computational graph and provides efficient execution of the model.
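A minimal sketch of deserializing the engine with TensorRT's Python runtime API might look like the following; the path comes from the build output, the installed tensorrt package must match the version used for the build, and in practice TensorRT-LLM ships its own higher-level runtime wrappers on top of this:

```python
# Sketch: load a serialized TensorRT engine from disk. Assumes the tensorrt
# Python package is installed and matches the version that built the engine.
ENGINE_PATH = "/app/tensorrt_llm/examples/llama/llama-2-7b-chat-engine/rank0.engine"

def load_engine(engine_path=ENGINE_PATH):
    import tensorrt as trt  # imported lazily so the sketch stands alone

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, "rb") as f:
        # Deserialize the engine bytes into an ICudaEngine for inference.
        return runtime.deserialize_cuda_engine(f.read())
```

Once deserialized, the engine is used to create an execution context that runs the actual inference requests.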
It's important to note that the serialized engine file is specific to the hardware and software environment in which it was built.
It may not be directly portable to different systems or TensorRT versions without rebuilding the engine.
The presence of the rank0.engine file indicates that the TensorRT-LLM build process completed successfully, and the optimized engine is ready for inference.
You can use this engine file to deploy the LLaMA-2-7B model in your application or inference pipeline, leveraging the performance benefits provided by TensorRT.