Analysis of the output from build.py

Based on the output of the TensorRT-LLM (trtllm) build process, the build completed successfully and the TensorRT engine was generated without any major errors.

However, there are a few warnings and observations worth noting:

Data type mismatch warnings: The output contains multiple warnings about data type mismatches between input tensors of certain layers. For example:

[05/06/2024-13:31:32] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.

These warnings indicate that some layers receive inputs with different data types (Half and Float); because this network is not strongly typed, TensorRT reconciles the mismatch by inserting implicit casts. The build still completes, but it is worth checking whether these conversions have any impact on the model's accuracy or performance.

Layernorm precision warnings: The output includes warnings about running layernorm in FP16 precision after self-attention, which may cause overflow:

[05/06/2024-13:31:32] [TRT] [W] Detected layernorm nodes in FP16.
[05/06/2024-13:31:32] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.

To mitigate this issue, consider exporting the model to a newer ONNX opset version or forcing the layernorm layers to run in FP32 precision, as suggested in the warning message.
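If you drive the build yourself and can reach the underlying TensorRT network definition, one way to apply the second suggestion is to pin the layernorm layers to FP32 with the plain TensorRT Python API. The sketch below is illustrative only: it assumes you have the INetworkDefinition and IBuilderConfig in hand before building, and that layernorm layers can be identified by the string "layernorm" in their generated names.

```python
import tensorrt as trt

def force_layernorm_fp32(network: trt.INetworkDefinition,
                         config: trt.IBuilderConfig) -> None:
    """Pin layers whose names contain 'layernorm' to FP32.

    The name-based match is an assumption about how the layers are named
    in the generated network.
    """
    # Tell the builder to honour per-layer precision settings.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if "layernorm" in layer.name.lower():
            # Run the layer and its outputs in FP32 to avoid FP16 overflow.
            layer.precision = trt.float32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float32)
```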

Unused input tensor: The output mentions that the input tensor "position_ids" is unused or used only at compile-time and is not being removed:

[05/06/2024-13:31:32] [TRT] [W] Unused Input: position_ids
[05/06/2024-13:31:33] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.

This warning indicates that the position_ids tensor is not consumed at inference time. That is usually benign for this build: with the GPT attention plugin enabled, rotary position embeddings are computed inside the plugin, so a separate position_ids input can end up unused. It does not affect the build itself, but it is worth reviewing the model definition to confirm that all intended inputs are wired up correctly.
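To confirm which inputs the final engine actually exposes, you can list its I/O tensors with the TensorRT runtime API. This is a minimal sketch: it assumes TensorRT 8.5+ (for the named-tensor API) and imports tensorrt_llm first so the custom plugins compiled into this engine can be found during deserialization.

```python
import tensorrt_llm  # registers the TensorRT-LLM plugins baked into the engine
import tensorrt as trt

ENGINE_PATH = "/app/tensorrt_llm/examples/llama/llama-2-7b-chat-engine/rank0.engine"

logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print every I/O tensor and whether it is an input or an output.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    mode = engine.get_tensor_mode(name)  # trt.TensorIOMode.INPUT or .OUTPUT
    print(f"{mode.name:<6} {name}")
```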

Memory usage: The output provides information about the memory usage during the build process. It shows the peak memory usage for CPU and GPU allocators, as well as the total activation memory, weights memory, and scratch memory.

[05/06/2024-13:36:30] [TRT] [I] Total Activation Memory: 286265856
[05/06/2024-13:36:30] [TRT] [I] Total Weights Memory: 13478928640
[05/06/2024-13:36:37] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 32469 MiB

These memory usage statistics can be helpful for understanding the resource requirements of the model and ensuring that the system has sufficient memory to accommodate the build process and inference.
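To put the raw byte counts into more familiar units, a quick conversion of the values reported above gives roughly 273 MiB of activation memory and 12.55 GiB of weights:

```python
# Byte counts taken from the build log above.
total_activation_memory = 286_265_856
total_weights_memory = 13_478_928_640

print(f"Activation memory: {total_activation_memory / 2**20:,.1f} MiB")  # ~273.0 MiB
print(f"Weights memory:    {total_weights_memory / 2**30:,.2f} GiB")     # ~12.55 GiB
```

The weights figure lines up with the size of the serialized engine file discussed later (about 12.6 GB), which is what you would expect for a roughly 7-billion-parameter model stored in float16.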

Build time: The output indicates that the total time for building the TensorRT engine was approximately 5 minutes and 22 seconds:

[05/06/2024-13:36:46] [TRT-LLM] [I] Total time of building all engines: 00:05:22

This build time may vary depending on the system specifications and the complexity of the model.

Overall, the build process completed successfully, and the TensorRT engine was generated.

However, it's recommended to review the warnings related to data type mismatches and layernorm precision to ensure optimal model performance and accuracy. Additionally, monitoring the memory usage and build time can help in understanding the resource requirements and potential optimizations for the build process.

The config.json output

{
    "version": "0.10.0.dev2024041600",
    "pretrained_config": {
        "architecture": "LlamaForCausalLM",
        "dtype": "float16",
        "logits_dtype": "float16",
        "vocab_size": 32000,
        "max_position_embeddings": 4096,
        "hidden_size": 4096,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 32,
        "head_size": 128,
        "hidden_act": "silu",
        "intermediate_size": 11008,
        "norm_epsilon": 1e-05,
        "position_embedding_type": "rope_gpt_neox",
        "use_parallel_embedding": false,
        "embedding_sharding_dim": 0,
        "share_embedding_table": false,
        "mapping": {
            "world_size": 1,
            "tp_size": 1,
            "pp_size": 1
        },
        "quantization": {
            "quant_algo": null,
            "kv_cache_quant_algo": null,
            "group_size": 128,
            "smoothquant_val": null,
            "has_zero_point": false,
            "pre_quant_scale": false,
            "exclude_modules": [
                "lm_head"
            ]
        },
        "kv_dtype": "float16",
        "rotary_scaling": null,
        "moe_normalization_mode": null,
        "rotary_base": 10000.0,
        "moe_num_experts": 0,
        "moe_top_k": 0,
        "moe_tp_mode": 0,
        "attn_bias": false,
        "disable_weight_only_quant_plugin": false,
        "mlp_bias": false
    },
    "build_config": {
        "max_input_len": 256,
        "max_output_len": 256,
        "max_batch_size": 8,
        "max_beam_width": 1,
        "max_num_tokens": 2048,
        "opt_num_tokens": 8,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "enable_debug_output": false,
        "max_draft_len": 0,
        "use_refit": false,
        "input_timing_cache": null,
        "output_timing_cache": "model.cache",
        "lora_config": {
            "lora_dir": [],
            "lora_ckpt_source": "hf",
            "max_lora_rank": 64,
            "lora_target_modules": [],
            "trtllm_modules_to_hf_modules": {}
        },
        "auto_parallel_config": {
            "world_size": 1,
            "gpus_per_node": 8,
            "cluster_key": "A100-PCIe-80GB",
            "cluster_info": null,
            "sharding_cost_model": "alpha_beta",
            "comm_cost_model": "alpha_beta",
            "enable_pipeline_parallelism": false,
            "enable_shard_unbalanced_shape": false,
            "enable_shard_dynamic_shape": false,
            "enable_reduce_scatter": true,
            "builder_flags": null,
            "debug_mode": false,
            "infer_shape": true,
            "validation_mode": false,
            "same_buffer_io": {
                "past_key_value_(\\d+)": "present_key_value_\\1"
            },
            "same_spec_io": {},
            "sharded_io_allowlist": [
                "past_key_value_\\d+",
                "present_key_value_\\d*"
            ],
            "fast_reduce": true,
            "fill_weights": false,
            "parallel_config_cache": null,
            "profile_cache": null,
            "dump_path": null,
            "debug_outputs": []
        },
        "weight_sparsity": false,
        "max_encoder_input_len": 1024,
        "use_fused_mlp": false,
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": null,
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": null,
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "moe_plugin": "float16",
            "mamba_conv1d_plugin": "float16",
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 128,
            "use_paged_context_fmha": false,
            "use_fp8_context_fmha": false,
            "use_context_fmha_for_generation": false,
            "multiple_profiles": false,
            "paged_state": true,
            "streamingllm": false
        }
    }
}

The provided config.json file contains the configuration details for the TensorRT-LLM build process.

Let's analyze the different sections and their significance:

  1. version: Specifies the version of TensorRT-LLM used for the build process.

  2. pretrained_config: Contains the configuration details for the pretrained model.

    • architecture: Indicates the model architecture, which is "LlamaForCausalLM" in this case.

    • dtype and logits_dtype: Specify the data type used for the model and logits, which is "float16" here.

    • vocab_size, max_position_embeddings, hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, head_size, hidden_act, intermediate_size, norm_epsilon: These parameters define the model architecture and dimensions.

    • position_embedding_type, use_parallel_embedding, embedding_sharding_dim, share_embedding_table: Settings related to position embeddings and embedding sharding.

    • mapping: Specifies the parallelism configuration, including the world size, tensor parallelism (tp) size, and pipeline parallelism (pp) size.

    • quantization: Contains quantization-related settings, such as the quantization algorithm, group size, and excluded modules.

    • Other parameters like kv_dtype, rotary_scaling, moe_normalization_mode, rotary_base, moe_num_experts, moe_top_k, moe_tp_mode, attn_bias, disable_weight_only_quant_plugin, and mlp_bias are specific to the model architecture and optimization settings.

  3. build_config: Contains the build configuration options.

    • max_input_len, max_output_len, max_batch_size, max_beam_width, max_num_tokens, opt_num_tokens: These parameters define the maximum input and output lengths, batch size, beam width, and token-related settings.

    • max_prompt_embedding_table_size, gather_context_logits, gather_generation_logits, strongly_typed, builder_opt, profiling_verbosity, enable_debug_output, max_draft_len, use_refit: Additional build configuration options related to prompts, logits gathering, typing, profiling, debugging, and refitting.

    • input_timing_cache and output_timing_cache: Specify the paths for input and output timing caches.

    • lora_config: Contains settings for Low-Rank Adaptation (LoRA), including the LoRA directory, checkpoint source, maximum rank, target modules, and module mappings.

    • auto_parallel_config: Specifies the automatic parallelization configuration, including world size, GPUs per node, cluster key, sharding and communication cost models, parallelism settings, and various optimization options.

    • weight_sparsity, max_encoder_input_len, use_fused_mlp: Additional build configuration options related to weight sparsity, encoder input length, and fused MLP usage.

    • plugin_config: Contains plugin-specific settings for attention, GEMM, quantization, communication, and other optimization plugins.

The config.json file serves as a comprehensive record of the build configuration used for the TensorRT-LLM model. It captures the model architecture, parallelism settings, quantization options, build parameters, and various optimization settings.

This configuration file is valuable for reproducibility and understanding the specific settings used during the build process. It can be used to rebuild the model with the same configuration or to compare different configurations and their impact on model performance and resource utilization.
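As a quick sanity check, the architecture fields in pretrained_config are enough to estimate the parameter count and FP16 weight footprint. The sketch below is a rough back-of-the-envelope calculation (it ignores small terms such as the RMSNorm weights), and the config path is an assumption based on where this build writes its output:

```python
import json

# Load the build configuration written next to the engine.
with open("/app/tensorrt_llm/examples/llama/llama-2-7b-chat-engine/config.json") as f:
    cfg = json.load(f)["pretrained_config"]

vocab_size = cfg["vocab_size"]                # 32000
hidden_size = cfg["hidden_size"]              # 4096
num_layers = cfg["num_hidden_layers"]         # 32
intermediate_size = cfg["intermediate_size"]  # 11008

embed_params = 2 * vocab_size * hidden_size          # embedding + lm_head
attn_params = 4 * hidden_size * hidden_size          # Q, K, V, O projections
mlp_params = 3 * hidden_size * intermediate_size     # gate, up, down projections
total_params = embed_params + num_layers * (attn_params + mlp_params)

print(f"~{total_params / 1e9:.2f} B parameters")           # ~6.74 B
print(f"~{total_params * 2 / 2**30:.1f} GiB in float16")    # ~12.6 GiB
```

The result matches the Total Weights Memory reported in the build log (13,478,928,640 bytes, about 12.55 GiB) and the size of the serialized engine file.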

The engine file

The file /app/tensorrt_llm/examples/llama/llama-2-7b-chat-engine/rank0.engine is the serialized TensorRT engine that was built during the TensorRT-LLM build process.

It represents the optimized and compiled version of the LLaMA-2 7B chat model for inference.

Here are some key points about the TensorRT engine file:

  1. Serialization: The TensorRT engine is serialized to disk after the build process is complete. This allows the engine to be loaded and used for inference without rebuilding it every time.

  2. Optimization: The TensorRT engine incorporates various optimizations, such as layer fusion, kernel auto-tuning, precision calibration, and memory optimizations, to improve the performance and efficiency of the model during inference.

  3. Size: The size of the serialized engine file (12.6 GB in this case) depends on the model architecture, size, and the build configuration used. Larger models and higher precision settings generally result in larger engine files.

  4. Rank: The rank0 in the file name identifies the rank this engine was built for in a multi-GPU setup. Because this build uses world_size 1 (no tensor or pipeline parallelism), rank 0 is the only rank, and rank0.engine is the single engine file produced.

  5. Inference: To perform inference with the LLaMA-2 7B model, you load this serialized engine file into TensorRT and execute requests through the TensorRT-LLM runtime API. The engine encapsulates the optimized computational graph and provides efficient execution of the model (a minimal runtime sketch follows this list).
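A simple way to exercise the engine from Python is the TensorRT-LLM runtime's ModelRunner. The sketch below is illustrative only: the engine directory and tokenizer name are assumptions based on this build, and the exact ModelRunner arguments may differ slightly between TensorRT-LLM versions.

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Assumed paths for this build; adjust to your environment.
ENGINE_DIR = "/app/tensorrt_llm/examples/llama/llama-2-7b-chat-engine"
TOKENIZER = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
runner = ModelRunner.from_dir(engine_dir=ENGINE_DIR)

# Tokenize a prompt and run generation on the compiled engine.
input_ids = tokenizer("What is TensorRT?", return_tensors="pt").input_ids
output_ids = runner.generate(
    batch_input_ids=[input_ids[0]],
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)

# output_ids has shape [batch, beams, seq_len]; decode the first beam.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```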

Note that the serialized engine is specific to the hardware and software environment in which it was built; it is generally not portable to a different GPU architecture or TensorRT version without rebuilding the engine.

The presence of the rank0.engine file indicates that the TensorRT-LLM build process completed successfully, and the optimized engine is ready for inference.

You can use this engine file to deploy the LLaMA-2 7B model in your application or inference pipeline, leveraging the performance benefits provided by TensorRT.
