BLOOM

This document explains how to build and run a BLOOM model using TensorRT-LLM.

Overview

  • The TensorRT-LLM BLOOM implementation is located in tensorrt_llm/models/bloom/model.py.

  • The example code for running BLOOM with TensorRT-LLM is in the examples/bloom folder.

  • The main files in the example folder are:

    • build.py: Builds the TensorRT engine(s) needed to run the BLOOM model.

    • run.py: Runs the inference on an input text.

    • summarize.py: Summarizes articles in the cnn_dailymail dataset using the model.

Support Matrix

TensorRT-LLM supports the following features for BLOOM:

  • FP16 (half-precision floating-point)

  • INT8 & INT4 Weight-Only quantization

  • INT8 KV cache (key-value cache)

  • SmoothQuant (smooth quantization)

  • Tensor Parallel (Tensor parallelism for multi-GPU)
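To make the trade-offs between these features more concrete, the sketch below estimates how much GPU memory the weights alone would occupy under each weight precision and tensor-parallel degree. It is back-of-the-envelope arithmetic, not a TensorRT-LLM API: the parameter counts are taken nominally from the model names, the function name is hypothetical, and the estimate assumes every weight is sharded evenly while ignoring activations, the KV cache, and quantization scales.

```python
# Rough per-GPU weight footprint. Assumptions: weights shard evenly across
# GPUs; activations, KV cache, and quantization scales are ignored.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib_per_gpu(num_params: float, precision: str, tp_size: int = 1) -> float:
    """Approximate GiB of weights held by each GPU (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    total_bytes = num_params * BYTES_PER_PARAM[precision]
    return total_bytes / tp_size / 2**30


if __name__ == "__main__":
    # Parameter counts are nominal values read off the model names.
    for name, params, tp in [("BLOOM-560M", 560e6, 1), ("BLOOM-176B", 176e9, 8)]:
        for precision in BYTES_PER_PARAM:
            gib = weight_gib_per_gpu(params, precision, tp)
            print(f"{name} {precision} tp={tp}: {gib:.2f} GiB per GPU")
```

The point of the estimate is only the scaling: halving the bytes per weight or doubling the tensor-parallel degree roughly halves the weight memory each GPU must hold.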

Usage

The example code takes Hugging Face (HF) weights as input and builds the corresponding TensorRT engines.

The number of engines depends on the number of GPUs used for inference.
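The reason the engine count follows the GPU count can be illustrated with a small tensor-parallel sketch. It assumes one engine per GPU (rank), each holding only its own slice of a weight matrix. The code is plain Python for illustration and does not use TensorRT-LLM; all function names are hypothetical, and a column-wise split is only one of the ways a layer can be sharded.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, world_size):
    """Give each rank a contiguous block of the columns of w."""
    n = len(w[0])
    if n % world_size != 0:
        raise ValueError("columns must divide evenly across ranks")
    step = n // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


def tensor_parallel_matmul(x, w, world_size):
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    shards = split_columns(w, world_size)
    # In a real deployment each shard would live in its own engine on its own GPU.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Concatenation plays the role of the all-gather between GPUs.
    return [value for part in partial_outputs for value in part]


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(matmul(x, w))
    print(tensor_parallel_matmul(x, w, world_size=2))
```

Both calls print the same vector, which is the property that lets each GPU run an engine built for only its share of the weights.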

Prepare the HF BLOOM Checkpoint

Follow the guides at https://huggingface.co/docs/transformers/main/en/model_doc/bloom to prepare the HF BLOOM checkpoint.

Example

To download the BLOOM-560M checkpoint:

    git lfs install
    rm -rf ./bloom/560M
    mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M
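A common failure after cloning is that Git LFS was not active, leaving small text pointer files where the weight files should be. The sketch below is one way to check for that before building. It relies only on the fact that LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`; the helper names are hypothetical and not part of TensorRT-LLM or Hugging Face tooling.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is an LFS pointer rather than real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched_files(checkpoint_dir: str) -> list[str]:
    """List files under checkpoint_dir that are still LFS pointers."""
    root = Path(checkpoint_dir)
    unfetched = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip Git's own metadata directory.
        if path.is_file() and ".git" not in relative.parts and is_lfs_pointer(path):
            unfetched.append(relative.as_posix())
    return sorted(unfetched)
```

Calling `find_unfetched_files("./bloom/560M")` after the clone should return an empty list; any file names it reports would need to be fetched, for example by running `git lfs pull` inside the checkpoint directory.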

Build TensorRT Engine(s)

build.py builds TensorRT engine(s) from the HF checkpoint.

  • If no checkpoint directory is specified, build.py builds the engine(s) with dummy weights.

  • Parallel building can be enabled with the --parallel_build argument to speed up engine building on multiple GPUs (single node only).
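The idea behind parallel building can be sketched as follows: with several ranks, the per-rank engine builds are independent, so they can run concurrently instead of one after another. This is a conceptual illustration only, not how build.py is implemented. `build_rank` is a hypothetical stand-in for the real per-rank build, and threads are used here purely to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def build_rank(rank: int, world_size: int) -> str:
    """Hypothetical stand-in for building the engine of a single rank."""
    return f"engine for rank {rank} of {world_size}"


def build_all(world_size: int, parallel: bool = False) -> list[str]:
    """Build one engine per rank, either serially or concurrently."""
    ranks = range(world_size)
    if not parallel:
        return [build_rank(rank, world_size) for rank in ranks]
    # Each rank's build is independent, so the builds can overlap.
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        # pool.map returns results in rank order regardless of finish order.
        return list(pool.map(lambda rank: build_rank(rank, world_size), ranks))


if __name__ == "__main__":
    for line in build_all(world_size=2, parallel=True):
        print(line)
```

Either mode produces the same set of engines; the parallel mode only changes how long the whole build takes when more than one GPU is available on the node.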

Run

The examples cover the following configurations:

  • Single GPU (FP16)

  • Single GPU (INT8 weight-only)

  • 2-way tensor parallelism

  • 8-way tensor parallelism
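The difference between launching a single-GPU run and a tensor-parallel run can be sketched as below. The helper assembles a command line without executing it. It assumes that multi-GPU runs are started with one process per GPU through `mpirun -n <world_size>`. The `--engine_dir` flag and the engine paths are assumptions for illustration and should be checked against the script's own `--help` output.

```python
import shlex


def launch_command(script: str, script_args: list[str], world_size: int = 1) -> list[str]:
    """Assemble (but do not run) a launch command for the given GPU count."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    command = ["python", script, *script_args]
    if world_size > 1:
        # Assumption: one MPI process per GPU for tensor-parallel runs.
        command = ["mpirun", "-n", str(world_size), *command]
    return command


if __name__ == "__main__":
    # The flag name and engine directories below are hypothetical placeholders.
    single = launch_command("run.py", ["--engine_dir", "./engines/1-gpu"])
    two_way = launch_command("run.py", ["--engine_dir", "./engines/2-gpu"], world_size=2)
    print(shlex.join(single))
    print(shlex.join(two_way))
```

Keeping the script arguments separate from the launcher prefix makes it easy to reuse the same arguments when moving from one GPU to several.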

In summary, this guide covers building and running BLOOM models with TensorRT-LLM in FP16, with INT8 weight-only quantization, INT8 KV cache, SmoothQuant, and tensor parallelism across multiple GPUs.
