# Bloom

This document explains how to build and run a BLOOM model using TensorRT-LLM.

### <mark style="color:blue;">Overview</mark>

* The TensorRT-LLM BLOOM implementation is located in `tensorrt_llm/models/bloom/model.py`.
* The example code for running BLOOM with TensorRT-LLM is in the `examples/bloom` folder.
* The main files in the example folder are:
  * `build.py`: Builds the TensorRT engine(s) needed to run the BLOOM model.
  * `run.py`: Runs the inference on an input text.
  * `summarize.py`: Summarizes articles in the cnn\_dailymail dataset using the model.

#### <mark style="color:green;">Support Matrix</mark>

BLOOM in TensorRT-LLM supports the following features:

* FP16 (half-precision floating-point)
* INT8 & INT4 Weight-Only quantization
* INT8 KV cache (key-value cache)
* SmoothQuant (smooth quantization)
* Tensor Parallel (Tensor parallelism for multi-GPU)
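To make the weight-only quantization entry more concrete, here is a minimal sketch of symmetric INT8 quantization of a list of weights. It only illustrates the general idea (store integers plus a scale, dequantize at compute time). The function names are hypothetical, and the scheme TensorRT-LLM actually uses may differ, for example in how finely scales are assigned.

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization with one scale for the whole list.

    Returns (quantized_ints, scale). Illustrative only; not TensorRT-LLM code.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point weights from the integers and scale."""
    return [q * scale for q in quantized]
```

The round trip loses at most about half a quantization step per weight, which is the trade-off made in exchange for smaller stored weights.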

### <mark style="color:blue;">Usage</mark>

The example code takes Hugging Face (HF) weights as input and builds the corresponding TensorRT engines.

The number of engines depends on the number of GPUs used for inference.
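As a rough illustration of that relationship, the sketch below counts the engine files in an output directory and compares the count with the intended number of GPUs. It assumes one serialized engine per GPU rank, stored as a file whose name ends in `.engine`. That suffix and the helper names are assumptions made for illustration, not something this page specifies.

```python
from pathlib import Path


def count_engines(engine_dir, suffix=".engine"):
    """Count files in engine_dir whose names end with the assumed engine suffix."""
    directory = Path(engine_dir)
    if not directory.is_dir():
        return 0
    return sum(
        1 for p in directory.iterdir() if p.is_file() and p.name.endswith(suffix)
    )


def matches_world_size(engine_dir, world_size):
    """True when the directory holds exactly one engine file per GPU rank."""
    return count_engines(engine_dir) == world_size
```

A check like `matches_world_size("./bloom/560M/trt_engines/fp16/2-gpu/", 2)` could be run before launching inference to catch a mismatch early.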

#### <mark style="color:green;">Prepare the HF BLOOM Checkpoint</mark>

Follow the guides at <https://huggingface.co/docs/transformers/main/en/model_doc/bloom> to prepare the HF BLOOM checkpoint.

#### <mark style="color:green;">Example</mark>

To download the BLOOM-560M checkpoint:

```bash
git lfs install
rm -rf ./bloom/560M
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M
```
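After cloning, it can be worth confirming that the large weight files were actually fetched. The sketch below is a hypothetical helper, not part of the example folder. It checks that `config.json` exists and flags any top-level file that is still a Git LFS pointer, which is a small text file beginning with `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is still an LFS pointer rather than real content."""
    with Path(path).open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def check_checkpoint(model_dir):
    """Return a list of problems found in a cloned checkpoint directory."""
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Only top-level files are inspected; a missing directory yields no files.
    files = sorted(root.iterdir()) if root.is_dir() else []
    for path in files:
        if path.is_file() and is_lfs_pointer(path):
            problems.append(path.name + " is an unresolved Git LFS pointer")
    return problems
```

An empty list from `check_checkpoint("./bloom/560M")` suggests the clone completed. Unresolved pointers usually mean Git LFS was not installed before cloning.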

#### <mark style="color:green;">Build TensorRT Engine(s)</mark>

`build.py` builds TensorRT engine(s) from the HF checkpoint. If no checkpoint directory is specified, it builds the engine(s) with dummy weights.

Parallel building can be enabled with the `--parallel_build` argument to speed up engine building on multiple GPUs (single node only).

Examples:

* Single GPU on BLOOM 560M (FP16):

  ```bash
  python build.py --model_dir ./bloom/560M/ \
                  --dtype float16 \
                  --use_gemm_plugin float16 \
                  --use_gpt_attention_plugin float16 \
                  --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
  ```

* Single GPU on BLOOM 560M (INT8 weight-only):

  ```bash
  python build.py --model_dir ./bloom/560M/ \
                  --dtype float16 \
                  --use_gemm_plugin float16 \
                  --use_gpt_attention_plugin float16 \
                  --use_weight_only \
                  --output_dir ./bloom/560M/trt_engines/int8_weight_only/1-gpu/
  ```

* 2-way tensor parallelism on BLOOM 560M:

  ```bash
  python build.py --model_dir ./bloom/560M/ \
                  --dtype float16 \
                  --use_gemm_plugin float16 \
                  --use_gpt_attention_plugin float16 \
                  --output_dir ./bloom/560M/trt_engines/fp16/2-gpu/ \
                  --world_size 2
  ```

* 8-way tensor parallelism on BLOOM 176B (sharding embedding table in vocab dimension):

  ```bash
  python build.py --model_dir ./bloom/176B/ \
                  --dtype float16 \
                  --use_gemm_plugin float16 \
                  --use_gpt_attention_plugin float16 \
                  --output_dir ./bloom/176B/trt_engines/fp16/8-gpu/ \
                  --world_size 8 \
                  --use_parallel_embedding \
                  --embedding_sharding_dim 0 \
                  --use_lookup_plugin float16
  ```

* 8-way tensor parallelism on BLOOM 176B (sharding embedding table in hidden dimension):

  ```bash
  python build.py --model_dir ./bloom/176B/ \
                  --dtype float16 \
                  --use_gemm_plugin float16 \
                  --use_gpt_attention_plugin float16 \
                  --output_dir ./bloom/176B/trt_engines/fp16/8-gpu/ \
                  --world_size 8 \
                  --use_parallel_embedding \
                  --embedding_sharding_dim 1
  ```

* 8-way tensor parallelism on BLOOM 176B (share embedding table between embedding() and lm_head() layers):

  ```bash
  python build.py --model_dir ./bloom/176B/ \
                  --dtype float16 \
                  --use_gemm_plugin float16 \
                  --use_gpt_attention_plugin float16 \
                  --output_dir ./bloom/176B/trt_engines/fp16/8-gpu/ \
                  --world_size 8 \
                  --use_parallel_embedding \
                  --embedding_sharding_dim 0 \
                  --use_lookup_plugin float16 \
                  --use_embedding_sharing
  ```
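The two `--embedding_sharding_dim` settings differ in which axis of the embedding table is split across GPUs. The sketch below computes the per-rank shape for a table of shape `(vocab_size, hidden_size)`. It assumes the sharded dimension divides evenly by the world size, which the real implementation may handle differently (for example with padding), and the helper name is hypothetical.

```python
def embedding_shard_shape(vocab_size, hidden_size, world_size, sharding_dim):
    """Per-rank shape of a (vocab_size, hidden_size) embedding table.

    sharding_dim 0 splits the vocab dimension; 1 splits the hidden dimension.
    Assumes even divisibility; this is an illustrative simplification.
    """
    if sharding_dim not in (0, 1):
        raise ValueError("sharding_dim must be 0 (vocab) or 1 (hidden)")
    shape = [vocab_size, hidden_size]
    if shape[sharding_dim] % world_size != 0:
        raise ValueError("dimension is not evenly divisible by world_size")
    shape[sharding_dim] //= world_size
    return tuple(shape)
```

Taking a vocabulary of 250880 and a hidden size of 14336 as illustrative values for BLOOM 176B, `embedding_shard_shape(250880, 14336, 8, 0)` gives `(31360, 14336)` per GPU, while `sharding_dim=1` gives `(250880, 1792)`.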

#### <mark style="color:green;">Run</mark>

Examples of running the model:

Single GPU (FP16):

```bash
python summarize.py --test_trt_llm \
                    --hf_model_location ./bloom/560M/ \
                    --data_type fp16 \
                    --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```

Single GPU (INT8 weight-only):

```bash
python summarize.py --test_trt_llm \
                    --hf_model_location ./bloom/560M/ \
                    --data_type fp16 \
                    --engine_dir ./bloom/560M/trt_engines/int8_weight_only/1-gpu/
```

2-way tensor parallelism:

```bash
mpirun -n 2 --allow-run-as-root \
    python summarize.py --test_trt_llm \
                        --hf_model_location ./bloom/560M/ \
                        --data_type fp16 \
                        --engine_dir ./bloom/560M/trt_engines/fp16/2-gpu/
```

8-way tensor parallelism:

```bash
mpirun -n 8 --allow-run-as-root \
    python summarize.py --test_trt_llm \
                        --hf_model_location ./bloom/176B/ \
                        --data_type fp16 \
                        --engine_dir ./bloom/176B/trt_engines/fp16/8-gpu/
```
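The run commands above follow one pattern: a single-GPU engine is run with `python summarize.py` directly, while a multi-GPU engine is launched through `mpirun` with `-n` equal to the number of GPUs the engine was built for. A minimal sketch of that pattern, using only the flags shown on this page (the helper name `summarize_command` is hypothetical):

```python
def summarize_command(hf_model_location, engine_dir, world_size=1):
    """Build the argument list for summarize.py, adding mpirun for multi-GPU runs."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    cmd = [
        "python", "summarize.py", "--test_trt_llm",
        "--hf_model_location", hf_model_location,
        "--data_type", "fp16",
        "--engine_dir", engine_dir,
    ]
    if world_size > 1:
        # One MPI process per GPU, matching the world size used at build time.
        cmd = ["mpirun", "-n", str(world_size), "--allow-run-as-root"] + cmd
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the `examples/bloom` folder.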

Together, the build and run examples above cover FP16, INT8 weight-only quantization, and tensor parallelism across multiple GPUs. INT8 KV cache and SmoothQuant are also supported, as listed in the support matrix.

