Tutorial 2 - get inference going
I put this together after installing LLaMA 2, serialising it, and so on.
Ensure that you have the necessary dependencies installed by running the command below.
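The exact command depends on where you cloned TensorRT-LLM; as a rough sketch, run it from the LLaMA example directory of the repo (the path is an assumption about your layout):

```bash
# Install the Python dependencies for the LLaMA example
# (assumes you are inside the examples/llama directory of your TensorRT-LLM checkout)
pip install -r requirements.txt
```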
Convert the LLaMA-2-7B-chat model checkpoint to the TensorRT-LLM format:
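A minimal sketch, assuming the Hugging Face weights live in ./llama-2-7b-chat-hf and you run convert_checkpoint.py from the TensorRT-LLM examples/llama directory; adjust the paths to your setup:

```bash
# Convert the Hugging Face checkpoint to the TensorRT-LLM format in bfloat16
# (directory names here are placeholders; swap in your own paths)
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-chat-hf \
    --output_dir ./tllm_checkpoint_1gpu_bf16 \
    --dtype bfloat16
```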
This command converts the model checkpoint to the TensorRT-LLM format using bfloat16 precision.
Build the TensorRT engine using the converted checkpoint:
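A sketch of the build step, assuming the checkpoint directory produced above; flag names follow the trtllm-build CLI but can differ slightly between releases:

```bash
# Build a TensorRT engine from the converted checkpoint,
# enabling the GPT attention and GEMM plugins in bfloat16
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
    --output_dir ./llama-2-7b-chat-engine \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16
```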
This command builds the TensorRT engine using the converted checkpoint, specifying bfloat16 precision for the GPT attention and GEMM plugins.
Run inference using the built TensorRT engine:
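Something along these lines, assuming run.py sits one directory above the LLaMA example and reusing the engine and tokenizer paths from earlier (verify both against your checkout):

```bash
# Generate up to 50 output tokens from the prompt,
# using the tokenizer from the original Hugging Face model
python ../run.py \
    --engine_dir ./llama-2-7b-chat-engine \
    --tokenizer_dir ./llama-2-7b-chat-hf \
    --max_output_len 50 \
    --input_text "Hello, how are you?"
```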
This command runs inference with the built TensorRT engine, generating a maximum of 50 output tokens and using the tokenizer from the original LLaMA-2-7B-chat model. Replace "Hello, how are you?" with your desired input text.
Alternatively, you can use the summarize_long.py script to run summarization on long articles:
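A rough sketch; the exact flags accepted by summarize_long.py vary between TensorRT-LLM versions, so check python ../summarize_long.py --help before running:

```bash
# Summarize long articles with the built TensorRT engine
# (flag names and paths are assumptions; confirm with --help for your version)
python ../summarize_long.py \
    --test_trt_llm \
    --engine_dir ./llama-2-7b-chat-engine \
    --hf_model_dir ./llama-2-7b-chat-hf
```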
This command runs the summarization script using the built TensorRT engine.
Note: Make sure to adjust the paths and options according to your specific directory structure and requirements.
By following these steps, you should be able to run inference on the serialized LLaMA-2-7B-chat model using TensorRT. The run.py script lets you generate text from a given input, while the summarize_long.py script demonstrates how to perform summarization on long articles.