
convert_checkpoint examples

Here are some commands for your model setup, covering a range of configurations: a minimal single-GPU build of LLaMA 7B from the llama-2-7b-chat-hf directory, followed by LLaMA 70B-style builds with different quantization schemes and varying degrees of tensor parallelism:

To compile the LLaMA 7B model in the simplest way, keep the setup minimal: target a single GPU and skip tensor and pipeline parallelism entirely.

This avoids the potential issues that come with distributed, multi-GPU environments.

Here's how you could do it with minimal configuration:

Convert the LLaMA 7B model to TensorRT-LLM checkpoint format using a single GPU:

python3 convert_checkpoint.py --model_dir llama-2-7b-chat-hf \
                              --output_dir ./llama-2-7b-chat-hf-output \
                              --dtype float16

Build the TensorRT engine(s) for the LLaMA 7B model using a single GPU:

trtllm-build --checkpoint_dir ./llama-2-7b-chat-hf-output \
             --output_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16

This setup leaves --tp_size at its default value of 1, which means the model is compiled for a single GPU, avoiding the additional complexity of managing multiple GPUs or engaging in tensor or pipeline parallelism.
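
If you prefer to make the parallelism settings explicit, the same conversion can be written with both defaults spelled out. This is just a sketch: it assumes your copy of convert_checkpoint.py exposes the usual --tp_size and --pp_size flags (both default to 1 in the standard LLaMA example).

python3 convert_checkpoint.py --model_dir llama-2-7b-chat-hf \
                              --output_dir ./llama-2-7b-chat-hf-output \
                              --dtype float16 \
                              --tp_size 1 \
                              --pp_size 1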

It assumes the model directory llama-2-7b-chat-hf contains the Hugging Face checkpoint for the LLaMA 7B chat model.

Remember, while this approach simplifies the compilation process, it may not fully leverage the computational capabilities of a multi-GPU setup, which could be beneficial for very large models like LLaMA 70B.

However, it serves as a good starting point for initial testing or environments where only a single GPU is available.
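
Once the engine is built, a quick smoke test is to generate a short completion with the run.py script from the TensorRT-LLM examples directory. Treat the snippet below as a sketch: the relative script path and the exact flag names can vary between releases.

python3 ../run.py --engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
                  --tokenizer_dir llama-2-7b-chat-hf \
                  --max_output_len 64 \
                  --input_text "Explain tensor parallelism in one sentence."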

Use summarize_long.py

With --test_trt_llm and --test_hf both set, the script evaluates the TensorRT-LLM engine and the original Hugging Face model on the same summarization task, so you can compare their outputs:

python3 ../summarize_long.py --test_trt_llm \
                       --hf_model_dir ./llama-models/llama-7b-hf \
                       --data_type fp16 \
                       --engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
                       --test_hf \
                       --tokenizer_dir ./llama-models/llama-7b-hf \
                       --output_dir ./results/llama/7B-chat \
                       --eval_task summarize

More advanced techniques for fun

Build LLaMA 70B with INT8 Quantization and 8-way Tensor Parallelism

This command sets up the model with INT8 weight-only quantization for improved performance on hardware that supports INT8 operations. It uses 8-way tensor parallelism to distribute the model across 8 GPUs.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_int8 \
                            --dtype float16 \
                            --tp_size 8 \
                            --use_weight_only \
                            --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_int8 \
            --output_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
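
An 8-way engine needs one process per GPU, so it is normally launched through MPI. The command below is a sketch that assumes an OpenMPI installation and the standard run.py example script; adjust the paths to your layout.

mpirun -n 8 --allow-run-as-root \
    python3 ../run.py --engine_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
                      --tokenizer_dir ./llama-2-7b-chat-hf/ \
                      --max_output_len 64 \
                      --input_text "What does INT8 weight-only quantization change?"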

Build LLaMA 70B with FP8 Precision and 4-way Tensor Parallelism

This setup converts the LLaMA 70B model to use FP8 precision, aiming to achieve a balance between performance and precision. It utilizes 4-way tensor parallelism.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp4_fp8 \
                            --dtype float16 \
                            --tp_size 4 \
                            --enable_fp8

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp4_fp8 \
            --output_dir ./tmp/llama/70B/trt_engines/fp8/4-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
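
As with the 8-way build, a 4-way engine has to be launched with one rank per GPU. Here is a hedged sketch that uses the standard summarize.py example script to benchmark the engine under MPI:

mpirun -n 4 --allow-run-as-root \
    python3 ../summarize.py --test_trt_llm \
                            --hf_model_dir ./llama-2-7b-chat-hf/ \
                            --data_type fp16 \
                            --engine_dir ./tmp/llama/70B/trt_engines/fp8/4-gpu/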

Build LLaMA 70B with 16-way Tensor Parallelism for Maximum GPU Utilization

This command is designed for setups with a high number of GPUs, utilizing 16-way tensor parallelism to maximize GPU utilization across a large cluster.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp16 \
                            --dtype float16 \
                            --tp_size 16

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp16 \
            --output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
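
Sixteen-way tensor parallelism usually spans more than one machine (for example, two 8-GPU nodes), so the launcher has to list the hosts. The sketch below uses OpenMPI's -H host:slots syntax with hypothetical hostnames node1 and node2:

mpirun -np 16 -H node1:8,node2:8 \
    python3 ../run.py --engine_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu/ \
                      --tokenizer_dir ./llama-2-7b-chat-hf/ \
                      --max_output_len 64 \
                      --input_text "Describe tensor parallelism."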

Using SmoothQuant for INT8 Quantization with Minimal Accuracy Loss

This setup applies SmoothQuant with an alpha of 0.5, which shifts quantization difficulty from activations to weights so the model can run in INT8 with minimal accuracy loss. It uses 8-way tensor parallelism.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_sq \
                            --dtype float16 \
                            --tp_size 8 \
                            --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_sq \
            --output_dir ./tmp/llama/70B/trt_engines/sq/8-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
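
SmoothQuant can also be combined with finer-grained scaling. The --per_token and --per_channel flags below exist in the standard LLaMA convert_checkpoint.py, but treat this as a sketch and check the options your release exposes; the output directory name is only illustrative.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                             --output_dir ./tllm_checkpoint_70B_tp8_sq_fine \
                             --dtype float16 \
                             --tp_size 8 \
                             --smoothquant 0.5 \
                             --per_token \
                             --per_channel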

Each command specifies a unique combination of precision, quantization, and parallelism settings to suit different hardware capabilities and performance goals.
