Bloom

This document provides instructions on how to build and run a BLOOM model using TensorRT-LLM.

Overview

  • The TensorRT-LLM BLOOM implementation is located in tensorrt_llm/models/bloom/model.py.

  • The example code for running BLOOM with TensorRT-LLM is in the examples/bloom folder.

  • The main files in the example folder are:

    • build.py: Builds the TensorRT engine(s) needed to run the BLOOM model.

    • run.py: Runs the inference on an input text.

    • summarize.py: Summarizes articles in the cnn_dailymail dataset using the model.

Support Matrix

TensorRT-LLM supports the following features for BLOOM (the sketch after this list shows how some of them map to build.py flags):

  • FP16 (half-precision floating point)

  • INT8 and INT4 weight-only quantization

  • INT8 KV cache (quantized key-value cache)

  • SmoothQuant

  • Tensor parallelism (multi-GPU)
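
The quantization features above are enabled through build.py flags. A minimal sketch, assuming the BLOOM build script follows the flag conventions of other TensorRT-LLM example builders; --weight_only_precision, --int8_kv_cache and --smoothquant are assumptions here, and only --use_weight_only appears verbatim in the examples below:

```bash
# INT4 weight-only quantization (assumed flag: --weight_only_precision)
python build.py --model_dir ./bloom/560M/ --dtype float16 \
                --use_weight_only --weight_only_precision int4 \
                --output_dir ./bloom/560M/trt_engines/int4_weight_only/1-gpu/

# INT8 KV cache combined with SmoothQuant (assumed flags: --int8_kv_cache, --smoothquant)
python build.py --model_dir ./bloom/560M/ --dtype float16 \
                --int8_kv_cache --smoothquant 0.5 \
                --output_dir ./bloom/560M/trt_engines/sq_int8_kv/1-gpu/
```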

Usage

The example code takes Hugging Face (HF) weights as input and builds the corresponding TensorRT engines.

The number of engines depends on the number of GPUs used for inference.
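
Each build therefore produces one engine file per GPU rank, plus a configuration file, in the chosen --output_dir. An illustrative listing after a 2-GPU build (the engine file names are assumptions; the exact naming scheme depends on the TensorRT-LLM version):

```bash
ls ./bloom/560M/trt_engines/fp16/2-gpu/
# bloom_float16_tp2_rank0.engine   <- engine for GPU rank 0 (illustrative name)
# bloom_float16_tp2_rank1.engine   <- engine for GPU rank 1 (illustrative name)
# config.json                      <- build configuration read at runtime
```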

Prepare the HF BLOOM Checkpoint

Follow the guides at https://huggingface.co/docs/transformers/main/en/model_doc/bloom to prepare the HF BLOOM checkpoint.

Example: to download BLOOM-560M

```bash
git lfs install
rm -rf ./bloom/560M
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M
```
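
The 176B build examples further down assume the full BLOOM checkpoint is available under ./bloom/176B. A sketch of the equivalent download, assuming the bigscience/bloom repository on Hugging Face hosts the 176B model:

```bash
git lfs install
rm -rf ./bloom/176B
mkdir -p ./bloom/176B && git clone https://huggingface.co/bigscience/bloom ./bloom/176B
```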

Build TensorRT Engine(s)

build.py builds the TensorRT engine(s) from the HF checkpoint.

If no checkpoint directory is specified, it builds the engine(s) with dummy weights. Parallel building can be enabled with the --parallel_build argument to speed up engine building across multiple GPUs (single node only).
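
For instance, a 2-GPU FP16 build with parallel building enabled might look like this; a sketch that simply adds --parallel_build to the 2-way tensor-parallelism command shown below:

```bash
python build.py --model_dir ./bloom/560M/ \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --world_size 2 \
                --parallel_build \
                --output_dir ./bloom/560M/trt_engines/fp16/2-gpu/
```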

Examples:
 - Single GPU on BLOOM 560M (FP16):
   ```bash
   python build.py --model_dir ./bloom/560M/ \
                   --dtype float16 \
                   --use_gemm_plugin float16 \
                   --use_gpt_attention_plugin float16 \
                   --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
   ```
 - Single GPU on BLOOM 560M (INT8 weight-only):
   ```bash
   python build.py --model_dir ./bloom/560M/ \
                   --dtype float16 \
                   --use_gemm_plugin float16 \
                   --use_gpt_attention_plugin float16 \
                   --use_weight_only \
                   --output_dir ./bloom/560M/trt_engines/int8_weight_only/1-gpu/
   ```
 - 2-way tensor parallelism on BLOOM 560M:
   ```bash
   python build.py --model_dir ./bloom/560M/ \
                   --dtype float16 \
                   --use_gemm_plugin float16 \
                   --use_gpt_attention_plugin float16 \
                   --output_dir ./bloom/560M/trt_engines/fp16/2-gpu/ \
                   --world_size 2
   ```
 - 8-way tensor parallelism on BLOOM 176B (sharding embedding table in vocab dimension):
   ```bash
   python build.py --model_dir ./bloom/176B/ \
                   --dtype float16 \
                   --use_gemm_plugin float16 \
                   --use_gpt_attention_plugin float16 \
                   --output_dir ./bloom/176B/trt_engines/fp16/8-gpu/ \
                   --world_size 8 \
                   --use_parallel_embedding \
                   --embedding_sharding_dim 0 \
                   --use_lookup_plugin float16
   ```
 - 8-way tensor parallelism on BLOOM 176B (sharding embedding table in hidden dimension):
   ```bash
   python build.py --model_dir ./bloom/176B/ \
                   --dtype float16 \
                   --use_gemm_plugin float16 \
                   --use_gpt_attention_plugin float16 \
                   --output_dir ./bloom/176B/trt_engines/fp16/8-gpu/ \
                   --world_size 8 \
                   --use_parallel_embedding \
                   --embedding_sharding_dim 1
   ```
 - 8-way tensor parallelism on BLOOM 176B (share embedding table between embedding() and lm_head() layers):
   ```bash
   python build.py --model_dir ./bloom/176B/ \
                   --dtype float16 \
                   --use_gemm_plugin float16 \
                   --use_gpt_attention_plugin float16 \
                   --output_dir ./bloom/176B/trt_engines/fp16/8-gpu/ \
                   --world_size 8 \
                   --use_parallel_embedding \
                   --embedding_sharding_dim 0 \
                   --use_lookup_plugin float16 \
                   --use_embedding_sharing
   ```

Run

Examples of running the model:

Single GPU (FP16):

```bash
python summarize.py --test_trt_llm \
                    --hf_model_location ./bloom/560M/ \
                    --data_type fp16 \
                    --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```

Single GPU (INT8 weight-only):

```bash
python summarize.py --test_trt_llm \
                    --hf_model_location ./bloom/560M/ \
                    --data_type fp16 \
                    --engine_dir ./bloom/560M/trt_engines/int8_weight_only/1-gpu/
```

2-way tensor parallelism:

```bash
mpirun -n 2 --allow-run-as-root \
    python summarize.py --test_trt_llm \
                        --hf_model_location ./bloom/560M/ \
                        --data_type fp16 \
                        --engine_dir ./bloom/560M/trt_engines/fp16/2-gpu/
```

8-way tensor parallelism:

```bash
mpirun -n 8 --allow-run-as-root \
    python summarize.py --test_trt_llm \
                        --hf_model_location ./bloom/176B/ \
                        --data_type fp16 \
                        --engine_dir ./bloom/176B/trt_engines/fp16/8-gpu/
```
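
The run.py script listed in the Overview can be used for quick single-prompt inference against the same engines. A hypothetical invocation, assuming run.py follows the argument conventions of the other TensorRT-LLM example scripts (flag names such as --input_text and --max_output_len may differ between releases):

```bash
python run.py --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/ \
              --tokenizer_dir ./bloom/560M/ \
              --input_text "The capital of France is" \
              --max_output_len 32
```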

This documentation provides a comprehensive guide on how to build and run BLOOM models using TensorRT-LLM, covering various configurations and optimization techniques such as FP16, INT8 weight-only quantization, INT8 KV cache, SmoothQuant, and tensor parallelism for multi-GPU setups.
