convert_checkpoint examples
Here are some example commands for converting checkpoints and building engines, covering different configurations and varying levels of tensor and pipeline parallelism. All of them use the llama-2-7b-chat-hf directory as the example model; for the LLaMA 70B builds further down, point --model_dir at your LLaMA 70B Hugging Face checkpoint instead.
To compile the LLaMA 7B model in the simplest way, use a single GPU and skip tensor and pipeline parallelism. This keeps the setup simple and avoids the potential issues that come with distributed computing environments.
Here's how to do it with minimal configuration:
Convert the LLaMA 7B model to TensorRT-LLM checkpoint format using a single GPU:
python3 convert_checkpoint.py --model_dir llama-2-7b-chat-hf \
--output_dir ./llama-2-7b-chat-hf-output \
--dtype float16
Build the TensorRT engine(s) for the LLaMA 7B model using a single GPU:
trtllm-build --checkpoint_dir ./llama-2-7b-chat-hf-output \
--output_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
--gpt_attention_plugin float16 \
--gemm_plugin float16
This setup leaves --tp_size at its default value of 1, so the model is compiled for a single GPU and you avoid the additional complexity of managing multiple GPUs or engaging in tensor or pipeline parallelism.
It assumes the llama-2-7b-chat-hf directory contains the Hugging Face checkpoint for the LLaMA 7B chat model.
Remember, while this approach simplifies the compilation process, it may not fully leverage the computational capabilities of a multi-GPU setup, which could be beneficial for very large models like LLaMA 70B.
However, it serves as a good starting point for initial testing or environments where only a single GPU is available.
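Once the engine is built, a quick generation test is a good way to confirm everything works end to end. The sketch below assumes the run.py script that ships in the TensorRT-LLM examples directory (one level above the LLaMA example, hence the ../ prefix); the prompt and paths are illustrative, so adjust them to your layout:
# Quick smoke test of the single-GPU engine (script path and prompt are illustrative)
python3 ../run.py --engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
--tokenizer_dir llama-2-7b-chat-hf \
--max_output_len 64 \
--input_text "What is TensorRT-LLM?"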
To sanity-check the engine more thoroughly, use summarize_long.py, which can evaluate both the TensorRT-LLM engine and the Hugging Face model on a summarization task:
python3 ../summarize_long.py --test_trt_llm \
--hf_model_dir ./llama-2-7b-chat-hf \
--data_type fp16 \
--engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
--test_hf \
--tokenizer_dir ./llama-2-7b-chat-hf \
--output_dir ./results/llama/7B-chat \
--eval_task summarize
More advanced techniques for fun
Build LLaMA 70B with INT8 Quantization and 8-way Tensor Parallelism
This command sets up the model with INT8 weight-only quantization for improved performance on hardware that supports INT8 operations. It uses 8-way tensor parallelism to distribute the model across 8 GPUs.
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
--output_dir ./tllm_checkpoint_70B_tp8_int8 \
--dtype float16 \
--tp_size 8 \
--use_weight_only \
--weight_only_precision int8
trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_int8 \
--output_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16
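An engine built with --tp_size 8 must be launched with eight ranks, one per GPU. A minimal sketch, assuming an MPI launcher and the same run.py script from the TensorRT-LLM examples used above (launcher options and paths will vary with your cluster):
# Launch one rank per GPU for the 8-way tensor-parallel engine
mpirun -n 8 --allow-run-as-root \
python3 ../run.py --engine_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
--tokenizer_dir ./llama-2-7b-chat-hf/ \
--max_output_len 64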
Build LLaMA 70B with FP8 Precision and 4-way Tensor Parallelism
This setup converts the LLaMA 70B model to FP8, aiming for a balance between performance and accuracy. It uses 4-way tensor parallelism.
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
--output_dir ./tllm_checkpoint_70B_tp4_fp8 \
--dtype float16 \
--tp_size 4 \
--enable_fp8
trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp4_fp8 \
--output_dir ./tmp/llama/70B/trt_engines/fp8/4-gpu/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16
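Note that FP8 requires per-tensor scaling factors obtained from calibration. Depending on your TensorRT-LLM version, the documented FP8 path may go through the quantization example rather than a convert_checkpoint.py flag; the following is a hedged sketch of that alternative, with the script location and flags to be verified against your release:
# Alternative FP8 workflow via the quantization example (verify flags for your version)
python3 ../quantization/quantize.py --model_dir ./llama-2-7b-chat-hf/ \
--dtype float16 \
--qformat fp8 \
--calib_size 512 \
--tp_size 4 \
--output_dir ./tllm_checkpoint_70B_tp4_fp8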
Build LLaMA 70B with 16-way Tensor Parallelism for Maximum GPU Utilization
This command targets setups with a high number of GPUs, using 16-way tensor parallelism to shard the model across 16 GPUs in a larger cluster.
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
--output_dir ./tllm_checkpoint_70B_tp16 \
--dtype float16 \
--tp_size 16
trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp16 \
--output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16
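The introduction mentions pipeline parallelism, which none of the commands above use. As a sketch, assuming convert_checkpoint.py's --pp_size flag (present in recent TensorRT-LLM releases) and with illustrative output paths, an 8-way tensor-parallel, 2-way pipeline-parallel build spanning 16 GPUs could look like this:
# Combine tensor and pipeline parallelism: 8 x 2 = 16 GPUs total (output paths are illustrative)
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
--output_dir ./tllm_checkpoint_70B_tp8_pp2 \
--dtype float16 \
--tp_size 8 \
--pp_size 2
trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_pp2 \
--output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu-tp8-pp2/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16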
Using SmoothQuant for Enhanced Model Precision
This setup applies SmoothQuant with alpha = 0.5, quantizing weights and activations to INT8 to improve inference performance while limiting the loss in accuracy. It uses 8-way tensor parallelism.
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
--output_dir ./tllm_checkpoint_70B_tp8_sq \
--dtype float16 \
--tp_size 8 \
--smoothquant 0.5
trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_sq \
--output_dir ./tmp/llama/70B/trt_engines/sq/8-gpu/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16
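SmoothQuant accuracy can often be improved with finer-grained scaling. Here is a sketch of the same conversion with per-channel and per-token scaling enabled, assuming the --per_channel and --per_token flags of convert_checkpoint.py (verify them against your TensorRT-LLM version):
# SmoothQuant with finer-grained scaling (flags to be verified for your release)
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
--output_dir ./tllm_checkpoint_70B_tp8_sq \
--dtype float16 \
--tp_size 8 \
--smoothquant 0.5 \
--per_channel \
--per_token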
Each command specifies a unique combination of precision, quantization, and parallelism settings to suit different hardware capabilities and performance goals.