Tutorial 2 - get inference going

I put this together after installing Llama 2, serialising it, and so on.

Make sure the necessary dependencies are installed by running:

pip install -r requirements.txt
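If you want to confirm that the install actually took, a minimal sketch along these lines can list any packages from requirements.txt that are still missing. The helper names parse_requirement_names and missing_packages are made up for illustration, and the parsing only handles simple "name" or "name==version" style lines (it skips option lines starting with "-" and does not understand URL requirements).

```python
import re
from importlib import metadata


def parse_requirement_names(text):
    """Extract bare package names from requirements.txt-style text.

    Only simple lines are handled: comments and option lines (those
    starting with "-") are skipped, and everything after the name
    (version pins, extras) is ignored.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that importlib.metadata cannot find installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

To use it, read requirements.txt into a string, pass it to parse_requirement_names, and hand the result to missing_packages. An empty list means every listed package was found. Note that this checks presence only, not whether the installed version satisfies the pin.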

Convert the LLaMA-2-7B-chat model checkpoint to the TensorRT-LLM format:

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf \
                             --output_dir ./tllm_checkpoint_1gpu_bf16 \
                             --dtype bfloat16

This command reads the original weights from ./llama-2-7b-chat-hf and writes the converted checkpoint to ./tllm_checkpoint_1gpu_bf16 in bfloat16 precision.
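As a quick sanity check after conversion, the sketch below reads the checkpoint's config.json and returns the dtype recorded there. It assumes the output directory contains a config.json with a top-level "dtype" key, which this tutorial does not state. checkpoint_dtype is a hypothetical helper name, not part of TensorRT-LLM.

```python
import json
from pathlib import Path


def checkpoint_dtype(checkpoint_dir):
    """Return the "dtype" recorded in <checkpoint_dir>/config.json, or None.

    Assumption: the converted checkpoint contains a config.json with a
    top-level "dtype" key. Returns None if the file or key is missing.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("dtype")
```

If the layout assumption holds, checkpoint_dtype("./tllm_checkpoint_1gpu_bf16") should come back as "bfloat16" for the command above. A return value of None suggests the conversion did not finish or wrote somewhere else.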

Build the TensorRT engine using the converted checkpoint:

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir ./tmp/llama/7B-chat/trt_engines/bf16/1-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

This command writes the engine to ./tmp/llama/7B-chat/trt_engines/bf16/1-gpu, with the GPT attention and GEMM plugins both set to bfloat16 to match the precision of the converted checkpoint.
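Engine builds can take a while and fail late, so it can help to check that something was actually written before moving on. The sketch below lists engine files and their sizes. It assumes the build step writes files with a .engine extension into the output directory, which is an assumption on my part. list_engine_files is a hypothetical helper name.

```python
from pathlib import Path


def list_engine_files(engine_dir):
    """Return sorted (name, size_in_bytes) pairs for *.engine files.

    Assumption: the build step writes files with a .engine extension.
    An empty list means nothing matching was found, or the directory
    does not exist.
    """
    directory = Path(engine_dir)
    if not directory.is_dir():
        return []
    return sorted(
        (path.name, path.stat().st_size) for path in directory.glob("*.engine")
    )
```

Calling list_engine_files("./tmp/llama/7B-chat/trt_engines/bf16/1-gpu") and getting an empty list back is a sign that the build did not complete or that --output_dir pointed elsewhere.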

Run inference using the built TensorRT engine:

python ../run.py --max_output_len 50 \
                 --engine_dir ./tmp/llama/7B-chat/trt_engines/bf16/1-gpu \
                 --tokenizer_dir ./llama-2-7b-chat-hf \
                 --input_text "Hello, how are you?"

This command generates at most 50 output tokens and uses the tokenizer from the original LLaMA-2-7B-chat model directory. Replace "Hello, how are you?" with your own input text. Note that the script is invoked as ../run.py, so it is expected to sit one directory above the one you run the command from.
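If you want to try several prompts without retyping the command, one option is a small Python wrapper around the same call. The sketch below uses only the flags shown in the command above. build_run_command and run_prompts are hypothetical helper names, and the default paths are simply the ones used in this tutorial.

```python
import subprocess
import sys


def build_run_command(
    prompt,
    engine_dir="./tmp/llama/7B-chat/trt_engines/bf16/1-gpu",
    tokenizer_dir="./llama-2-7b-chat-hf",
    max_output_len=50,
    script="../run.py",
):
    """Build the argument list for the run.py call used in this tutorial."""
    return [
        sys.executable, script,
        "--max_output_len", str(max_output_len),
        "--engine_dir", engine_dir,
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", prompt,
    ]


def run_prompts(prompts, **kwargs):
    """Run run.py once per prompt and return each process's stdout."""
    outputs = []
    for prompt in prompts:
        result = subprocess.run(
            build_run_command(prompt, **kwargs),
            capture_output=True, text=True, check=True,
        )
        outputs.append(result.stdout)
    return outputs
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes or spaces. Each call starts a fresh process and reloads the engine, so this is convenient for a handful of prompts rather than fast.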

Alternatively, you can use the summarize_long.py script to run summarization on long articles:

python summarize_long.py --test_trt_llm \
                         --hf_model_dir ./llama-2-7b-chat-hf \
                         --data_type bf16 \
                         --engine_dir ./tmp/llama/7B-chat/trt_engines/bf16/1-gpu

This command runs the summarization script against the engine built earlier, with --data_type bf16 matching the precision used for the checkpoint and the engine.

Note: Adjust the paths and options to match your own directory structure and requirements.
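Since the same three directories show up in every command, one way to keep them consistent when you relocate things is to derive them all from a single root. This is only a sketch: TutorialPaths is a made-up class, and the directory names are just the ones used in this tutorial.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TutorialPaths:
    """Directories used in this tutorial, all derived from one root."""

    root: Path = Path(".")

    @property
    def model_dir(self):
        # Original Hugging Face model and tokenizer
        return self.root / "llama-2-7b-chat-hf"

    @property
    def checkpoint_dir(self):
        # Output of convert_checkpoint.py
        return self.root / "tllm_checkpoint_1gpu_bf16"

    @property
    def engine_dir(self):
        # Output of trtllm-build
        return self.root / "tmp" / "llama" / "7B-chat" / "trt_engines" / "bf16" / "1-gpu"
```

Changing root in one place then updates every path you pass to the commands, for example TutorialPaths(Path("/data/models")).engine_dir.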

By following these steps, you should be able to run inference on the serialized LLaMA-2-7B-chat model using TensorRT-LLM. The run.py script generates text from a given input, while the summarize_long.py script demonstrates summarization of long articles.
