convert_checkpoint examples
Here are some example commands for your model setup, covering different configurations for the LLaMA 7B and 70B models with varying levels of tensor and pipeline parallelism:
To compile the LLaMA 7B model in the simplest way, minimize the complexity of the setup by using a single GPU without tensor or pipeline parallelism. This avoids potential issues related to distributed computing environments.
Here's how you could do it with minimal configuration:
Convert the LLaMA 7B model to TensorRT-LLM checkpoint format using a single GPU:
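A minimal sketch of that conversion step, assuming the LLaMA example's convert_checkpoint.py script and a Hugging Face checkpoint under ./tmp/llama/7B/hf/; the output directory name is just a placeholder:

```bash
# Convert the Hugging Face LLaMA 7B checkpoint to TensorRT-LLM checkpoint format
# (single rank; tp_size defaults to 1, stated explicitly here for clarity)
python convert_checkpoint.py --model_dir ./tmp/llama/7B/hf/ \
                             --output_dir ./tllm_checkpoint_1gpu_fp16 \
                             --dtype float16 \
                             --tp_size 1
```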
Build the TensorRT engine(s) for the LLaMA 7B model using a single GPU:
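A corresponding trtllm-build invocation, reusing the checkpoint directory from the previous step; the engine output path is a placeholder:

```bash
# Build the TensorRT engine from the converted checkpoint on a single GPU
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
             --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
             --gemm_plugin float16
```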
This setup uses the --tp_size 1 parameter to indicate that you're compiling the model for a single GPU, avoiding the additional complexity of managing multiple GPUs or engaging in tensor or pipeline parallelism. It assumes the model directory ./tmp/llama/7B/hf/ contains the Hugging Face checkpoint for the LLaMA 7B model.
Remember, while this approach simplifies the compilation process, it may not fully leverage the computational capabilities of a multi-GPU setup, which could be beneficial for very large models like LLaMA 70B.
However, it serves as a good starting point for initial testing or environments where only a single GPU is available.
Use summarize_long.py to test the built engine on a long-context summarization task.
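One way to run it, assuming summarize_long.py from the TensorRT-LLM examples and the paths used above; the exact flag names can vary between releases, so treat this as illustrative:

```bash
# Summarization smoke test of the built engine against the original HF checkpoint (flags are illustrative)
python summarize_long.py --test_trt_llm \
                         --hf_model_dir ./tmp/llama/7B/hf/ \
                         --engine_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu
```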
More advanced techniques for fun
Build LLaMA 70B with INT8 Quantization and 8-way Tensor Parallelism
This command sets up the model with INT8 weight-only quantization for improved performance on hardware that supports INT8 operations. It uses 8-way tensor parallelism to distribute the model across 8 GPUs.
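A sketch of that configuration, assuming a 70B Hugging Face checkpoint under ./tmp/llama/70B/hf/ and placeholder output directories:

```bash
# Convert LLaMA 70B with INT8 weight-only quantization and 8-way tensor parallelism
python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
                             --output_dir ./tllm_checkpoint_8gpu_tp8_int8 \
                             --dtype float16 \
                             --use_weight_only \
                             --weight_only_precision int8 \
                             --tp_size 8

# Build one engine per rank; --workers parallelizes the per-rank builds
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8_int8 \
             --output_dir ./tmp/llama/70B/trt_engines/int8_wo/8-gpu \
             --gemm_plugin float16 \
             --workers 8
```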
Build LLaMA 70B with FP8 Precision and 4-way Tensor Parallelism
This setup converts the LLaMA 70B model to use FP8 precision, aiming to achieve a balance between performance and precision. It utilizes 4-way tensor parallelism.
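FP8 calibration normally goes through the quantization example script rather than convert_checkpoint.py; a sketch assuming examples/quantization/quantize.py, Hopper-class GPUs, and the same 70B checkpoint directory:

```bash
# Quantize LLaMA 70B to FP8 with 4-way tensor parallelism (calibration size is an example value)
python ../quantization/quantize.py --model_dir ./tmp/llama/70B/hf/ \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --calib_size 512 \
                                   --output_dir ./tllm_checkpoint_4gpu_tp4_fp8 \
                                   --tp_size 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_tp4_fp8 \
             --output_dir ./tmp/llama/70B/trt_engines/fp8/4-gpu \
             --gemm_plugin float16 \
             --workers 4
```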
Build LLaMA 70B with 16-way Tensor Parallelism for Maximum GPU Utilization
This command is designed for setups with a high number of GPUs, utilizing 16-way tensor parallelism to maximize GPU utilization across a large cluster.
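A sketch for a 16-GPU layout; with 8 GPUs per node this implies a two-node run at engine execution time, and all paths are placeholders:

```bash
# Convert LLaMA 70B with 16-way tensor parallelism (e.g. two 8-GPU nodes)
python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
                             --output_dir ./tllm_checkpoint_16gpu_tp16 \
                             --dtype float16 \
                             --tp_size 16

trtllm-build --checkpoint_dir ./tllm_checkpoint_16gpu_tp16 \
             --output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu \
             --gemm_plugin float16 \
             --workers 8
```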
Using SmoothQuant for INT8 Quantization with Minimal Accuracy Loss
This setup applies SmoothQuant with a specific alpha parameter, enabling INT8 quantization of both weights and activations while keeping accuracy loss small. It uses 8-way tensor parallelism.
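A sketch of the SmoothQuant variant; the alpha value of 0.5 and the per-token/per-channel scaling flags are example choices, not prescriptions:

```bash
# Convert LLaMA 70B with SmoothQuant (alpha = 0.5) and 8-way tensor parallelism
python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
                             --output_dir ./tllm_checkpoint_8gpu_tp8_sq \
                             --dtype float16 \
                             --smoothquant 0.5 \
                             --per_token \
                             --per_channel \
                             --tp_size 8

trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8_sq \
             --output_dir ./tmp/llama/70B/trt_engines/sq/8-gpu \
             --gemm_plugin float16 \
             --workers 8
```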
Each command specifies a unique combination of precision, quantization, and parallelism settings to suit different hardware capabilities and performance goals.