
# convert\_checkpoint examples

The commands below convert a Hugging Face checkpoint from the <mark style="color:yellow;">**`llama-2-7b-chat-hf`**</mark> directory and build TensorRT engines with different precision, quantization, and tensor-parallel settings. Note that the LLaMA 70B examples further down also pass the `llama-2-7b-chat-hf` directory as `--model_dir`; substitute the directory of the checkpoint you actually want to build.

The simplest way to compile the LLaMA 7B model is on a single GPU, without tensor or pipeline parallelism.

This keeps the setup small and avoids the issues that distributed computing environments can introduce.

Here's how you could do it with minimal configuration:

#### <mark style="color:green;">Convert the LLaMA 7B model to TensorRT-LLM checkpoint format using a single GPU</mark>

```bash
python3 convert_checkpoint.py --model_dir llama-2-7b-chat-hf \
                              --output_dir ./llama-2-7b-chat-hf-output \
                              --dtype float16
```

#### <mark style="color:green;">Build the TensorRT engine(s) for the LLaMA 7B model using a single GPU</mark>

```bash
trtllm-build --checkpoint_dir ./llama-2-7b-chat-hf-output \
             --output_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16
```

Neither command passes <mark style="color:yellow;">**`--tp_size`**</mark>, so the conversion uses the default tensor-parallel size of 1 and the resulting engine runs on a single GPU, with no tensor or pipeline parallelism to manage.

It assumes the model directory <mark style="color:yellow;">**`llama-2-7b-chat-hf`**</mark> contains the Hugging Face checkpoint for the LLaMA 7B model.
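
Before converting, it can help to confirm that the model directory really holds a Hugging Face checkpoint. The sketch below assumes the checkpoint ships a `config.json`, as Hugging Face model directories normally do; `check_hf_dir` is a hypothetical helper and not part of TensorRT-LLM.

```bash
# Hypothetical helper: succeed only if the directory looks like a Hugging Face checkpoint.
check_hf_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # Assumption: a Hugging Face checkpoint directory contains config.json.
    if [ ! -f "$dir/config.json" ]; then
        echo "no config.json in $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

# Usage: guard the conversion step so it only runs when the check passes.
# check_hf_dir llama-2-7b-chat-hf && python3 convert_checkpoint.py --model_dir llama-2-7b-chat-hf --output_dir ./llama-2-7b-chat-hf-output --dtype float16
```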

Remember, while this approach simplifies the compilation process, it cannot use more than one GPU. Very large models like LLaMA 70B need a multi-GPU setup: at 2 bytes per parameter, their FP16 weights alone exceed the memory of a single current GPU.

However, it serves as a good starting point for initial testing or environments where only a single GPU is available.

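To test the engine, run `summarize_long.py`. With both `--test_trt_llm` and `--test_hf` set, it runs the summarization task on the TensorRT-LLM engine and on the Hugging Face model so the results can be compared. Point `--hf_model_dir` and `--tokenizer_dir` at the Hugging Face checkpoint the engine was converted from: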

```bash
python3 ../summarize_long.py --test_trt_llm \
                       --hf_model_dir ./llama-models/llama-7b-hf \
                       --data_type fp16 \
                       --engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
                       --test_hf \
                       --tokenizer_dir ./llama-models/llama-7b-hf \
                       --output_dir ./results/llama/7B-chat \
                       --eval_task summarize
```

### More advanced techniques for fun

### <mark style="color:blue;">Build LLaMA 70B with INT8 Quantization and 8-way Tensor Parallelism</mark>

This command applies INT8 weight-only quantization: the weights are stored in INT8 while activations stay in FP16, which roughly halves the memory needed for the weights. It uses 8-way tensor parallelism to distribute the model across 8 GPUs.

```bash
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_int8 \
                            --dtype float16 \
                            --tp_size 8 \
                            --use_weight_only \
                            --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_int8 \
            --output_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
```
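
A rough estimate shows why weight-only INT8 matters at this scale. The sketch below assumes 70 billion parameters, 2 bytes per parameter for FP16 and 1 byte for INT8, and an even split across tensor-parallel ranks. It ignores activations, the KV cache, and any layers left unquantized, so treat the numbers as a lower bound. `weights_gb_per_gpu` is a hypothetical helper.

```bash
# Hypothetical helper: approximate weight memory per GPU in GB.
# Arguments: parameters in billions, bytes per parameter, tensor-parallel size.
weights_gb_per_gpu() {
    awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.2f\n", p * b / tp }'
}

echo "FP16, tp_size 8: $(weights_gb_per_gpu 70 2 8) GB per GPU"   # 17.50
echo "INT8, tp_size 8: $(weights_gb_per_gpu 70 1 8) GB per GPU"   # 8.75
```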

### <mark style="color:blue;">Build LLaMA 70B with FP8 Precision and 4-way Tensor Parallelism</mark>

This setup enables FP8 in the converted checkpoint, trading some numerical precision for lower memory use and higher throughput. FP8 requires GPUs with FP8 support, such as the Hopper or Ada Lovelace architectures. It uses 4-way tensor parallelism.

```bash
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp4_fp8 \
                            --dtype float16 \
                            --tp_size 4 \
                            --enable_fp8

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp4_fp8 \
            --output_dir ./tmp/llama/70B/trt_engines/fp8/4-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
```
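
Since FP8 depends on the hardware, a quick check of the GPU's compute capability can save a failed build. The sketch below assumes FP8 tensor cores start at compute capability 8.9 (Ada Lovelace) and 9.0 (Hopper); `supports_fp8` is a hypothetical helper that takes the capability as a string.

```bash
# Hypothetical helper: return 0 if a compute capability such as "9.0" is at least 8.9.
supports_fp8() {
    local major="${1%%.*}"   # text before the dot
    local minor="${1#*.}"    # text after the dot
    [ "$major" -gt 8 ] || { [ "$major" -eq 8 ] && [ "$minor" -ge 9 ]; }
}

# 8.0 is an A100, 8.9 an Ada Lovelace GPU, 9.0 an H100.
for cap in 8.0 8.9 9.0; do
    if supports_fp8 "$cap"; then
        echo "compute capability $cap: FP8 capable"
    else
        echo "compute capability $cap: no FP8 tensor cores"
    fi
done
```

On recent drivers the capability can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` and passed to the helper.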

### <mark style="color:blue;">Build LLaMA 70B with 16-way Tensor Parallelism for Maximum GPU Utilization</mark>

This command is designed for setups with a high number of GPUs. 16-way tensor parallelism shards each layer's weights across 16 GPUs, which lowers the memory needed per GPU but adds inter-GPU communication in every layer.

```bash
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp16 \
                            --dtype float16 \
                            --tp_size 16

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp16 \
            --output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
```
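
Tensor parallelism splits the attention heads across GPUs, so a `tp_size` that divides the head count evenly is the usual requirement. The sketch below assumes 64 attention heads, the value Llama 2 70B reports as `num_attention_heads` in its `config.json`; `check_tp_size` is a hypothetical helper.

```bash
# Hypothetical helper: check that tp_size divides the attention head count.
# Arguments: number of attention heads, tensor-parallel size.
check_tp_size() {
    local heads="$1" tp="$2"
    if [ $((heads % tp)) -eq 0 ]; then
        echo "tp_size $tp: $((heads / tp)) heads per GPU"
    else
        echo "tp_size $tp does not divide $heads heads" >&2
        return 1
    fi
}

check_tp_size 64 8    # tp_size 8: 8 heads per GPU
check_tp_size 64 16   # tp_size 16: 4 heads per GPU
```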

### <mark style="color:blue;">Using SmoothQuant to Preserve Accuracy Under INT8 Quantization</mark>

This setup applies SmoothQuant with an alpha of 0.5. SmoothQuant shifts part of the quantization difficulty from the activations to the weights so that both can be quantized to INT8 with little loss of accuracy; alpha controls how much is shifted. It uses 8-way tensor parallelism.

```bash
python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_sq \
                            --dtype float16 \
                            --tp_size 8 \
                            --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_sq \
            --output_dir ./tmp/llama/70B/trt_engines/sq/8-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
```
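
To make the alpha parameter concrete: SmoothQuant, as described in the original paper, scales each input channel by `s = max|X|^alpha / max|W|^(1 - alpha)`, dividing activations by `s` and multiplying weights by it. The sketch below evaluates that formula for one channel with made-up magnitudes; `smooth_scale` is a hypothetical helper, not something TensorRT-LLM exposes.

```bash
# Hypothetical helper: per-channel smoothing scale s = max|X|^alpha / max|W|^(1 - alpha).
# Arguments: max |activation|, max |weight|, alpha.
smooth_scale() {
    awk -v x="$1" -v w="$2" -v a="$3" 'BEGIN { printf "%.3f\n", (x ^ a) / (w ^ (1 - a)) }'
}

# Illustrative values only: activation peak 16, weight peak 0.25, alpha 0.5.
smooth_scale 16 0.25 0.5   # 8.000
```

With alpha at 0.5 the activation and weight ranges are balanced: here the activation peak drops from 16 to 2 and the weight peak rises from 0.25 to 2.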

Each command specifies a unique combination of precision, quantization, and parallelism settings to suit different hardware capabilities and performance goals.
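
Engines built with a tensor-parallel size above 1 are typically run with one MPI rank per GPU, so the rank count equals `tp_size` times `pp_size`. A minimal sketch, assuming `mpirun` is the launcher in your environment; `launch_cmd` is a hypothetical helper that only prints the command, and `your_script.py` is a placeholder.

```bash
# Hypothetical helper: print (not run) the launch command for a given parallel layout.
# Arguments: tp_size, pp_size, then the command to launch.
launch_cmd() {
    local world_size=$(( $1 * $2 ))   # one rank per GPU
    shift 2
    if [ "$world_size" -gt 1 ]; then
        echo "mpirun -n $world_size $*"
    else
        echo "$*"
    fi
}

launch_cmd 1 1 python3 your_script.py   # python3 your_script.py
launch_cmd 8 1 python3 your_script.py   # mpirun -n 8 python3 your_script.py
```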

