Huggingface Bloom Documentation

The Huggingface Bloom documentation provides an overview of the BLOOM model and its variants, along with instructions for using the model on different tasks with the Huggingface Transformers library.

Overview

  • BLOOM is a large language model proposed by the BigScience Workshop, inspired by open science initiatives.

  • The architecture is similar to GPT-3: a decoder-only, auto-regressive model trained for next-token prediction.

  • BLOOM has been trained on 46 natural languages and 13 programming languages.

  • It comes in several versions with different parameter counts: bloom-560m, bloom-1b1, bloom-1b7, bloom-3b, bloom-7b1, and bloom (176B parameters).

Resources

  • The documentation lists official Huggingface and community resources for getting started with BLOOM.

  • It includes links to example scripts and notebooks for various tasks like text generation, text classification, token classification, and question answering.

  • There are also blog posts on optimizing BLOOM inference, training, and using it with DeepSpeed and Accelerate.

BloomConfig

  • This is the configuration class for the BLOOM model, used to store hyperparameters and other settings.

  • It inherits from PretrainedConfig and allows controlling the model outputs.

  • The config class takes various arguments like vocab_size, hidden_size, n_layer, n_head, etc.

  • Instantiating the config with default values yields a configuration similar to the BLOOM architecture.
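
As a rough sketch, a configuration can be instantiated with defaults or with overridden hyperparameters and then used to build a randomly initialised model. The tiny sizes below are illustrative only and do not correspond to any published BLOOM checkpoint:

```python
from transformers import BloomConfig, BloomModel

# Configuration with default values (mirrors the BLOOM architecture family)
config = BloomConfig()

# A deliberately tiny, hypothetical configuration for experimentation
tiny_config = BloomConfig(
    vocab_size=250880,
    hidden_size=64,
    n_layer=2,
    n_head=8,
)

# Build a randomly initialised model from the configuration
model = BloomModel(tiny_config)
print(model.config.hidden_size)  # 64
```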

BloomTokenizerFast

  • This is a "fast" tokenizer for BLOOM, based on the Huggingface Tokenizers library and using byte-level Byte-Pair-Encoding (BPE).

  • The tokenizer treats spaces as part of tokens, so a word will be encoded differently depending on whether it's at the beginning of a sentence or not.

  • The tokenizer inherits from PreTrainedTokenizerFast, which provides various methods for encoding and decoding text.
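
A minimal sketch of the space-sensitivity described above; the bigscience/bloom-560m checkpoint is used purely as an example, and the exact token ids depend on the vocabulary:

```python
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")

# Byte-level BPE keeps the leading space inside the token, so the same word
# gets different ids depending on whether it follows a space
print(tokenizer("hello")["input_ids"])
print(tokenizer(" hello")["input_ids"])  # different ids because of the space
```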

BloomModel

  • This is the bare BLOOM model transformer that outputs raw hidden states without any specific head on top.

  • It inherits from PreTrainedModel and is also a PyTorch nn.Module subclass.

  • The forward() method takes input_ids, attention_mask, past_key_values, and other arguments, and returns a BaseModelOutputWithPastAndCrossAttentions object or a tuple.

  • The model can be used for tasks that don't require a specific head on top of the hidden states.
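
A minimal usage sketch, assuming the bigscience/bloom-560m checkpoint, showing the raw hidden states returned by forward():

```python
import torch
from transformers import BloomTokenizerFast, BloomModel

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomModel.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("BLOOM is a multilingual language model", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Raw hidden states for every input token, with no task-specific head applied
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```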

BloomForCausalLM

  • This is the BLOOM model with a language modeling head on top (linear layer with weights tied to the input embeddings).

  • It's used for causal language modeling tasks, like next token prediction.

  • The forward() method is similar to BloomModel but also takes labels as input and can return a CausalLMOutputWithCrossAttentions object.
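
A short sketch of both uses: computing the LM loss by passing labels, and generating a continuation. The checkpoint and prompt are illustrative:

```python
import torch
from transformers import BloomTokenizerFast, BloomForCausalLM

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("The capital of France is", return_tensors="pt")

# Passing labels makes forward() return the causal LM loss alongside the logits
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss, outputs.logits.shape)

# The same model can also generate a continuation
generated = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```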

BloomForSequenceClassification

  • This is the BLOOM model with a sequence classification head on top (linear layer).

  • It's used for sequence classification tasks, like sentiment analysis or text classification.

  • The forward() method takes input_ids, attention_mask, labels, and other arguments, and returns a SequenceClassifierOutputWithPast object.

  • The documentation provides examples of single-label and multi-label classification.
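
An illustrative single-label sketch; num_labels=2 is an assumption for a hypothetical two-class task, and the classification head is randomly initialised until the model is fine-tuned:

```python
import torch
from transformers import BloomTokenizerFast, BloomForSequenceClassification

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")

# num_labels=2 is a hypothetical two-class setup; the head is untrained here
model = BloomForSequenceClassification.from_pretrained(
    "bigscience/bloom-560m", num_labels=2
)

inputs = tokenizer("This film was a pleasant surprise.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(dim=-1).item()
print(predicted_class)
```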

BloomForTokenClassification

  • This is the BLOOM model with a token classification head on top (linear layer).

  • It's used for token-level classification tasks, like named entity recognition (NER).

  • The forward() method is similar to BloomForSequenceClassification but returns a TokenClassifierOutput object.
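
A sketch of token-level predictions; num_labels=3 assumes a hypothetical three-label tagging scheme, and the token-classification head is randomly initialised until fine-tuned:

```python
import torch
from transformers import BloomTokenizerFast, BloomForTokenClassification

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")

# Hypothetical three-label scheme; the head is untrained here
model = BloomForTokenClassification.from_pretrained(
    "bigscience/bloom-560m", num_labels=3
)

inputs = tokenizer("Ada Lovelace lived in London", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One label prediction per input token
print(outputs.logits.shape)           # (batch_size, sequence_length, num_labels)
print(outputs.logits.argmax(dim=-1))  # predicted label id for each token
```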

BloomForQuestionAnswering

  • This is the BLOOM model with a span classification head on top for extractive question answering tasks like SQuAD.

  • It has a linear layer on top of the hidden states to compute start and end logits for the answer span.

  • The forward() method takes input_ids, attention_mask, start_positions, end_positions, and other arguments, and returns a QuestionAnsweringModelOutput object.
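
An illustrative extractive-QA sketch; the span-classification head on a base checkpoint is randomly initialised, so the decoded span is only meaningful after fine-tuning:

```python
import torch
from transformers import BloomTokenizerFast, BloomForQuestionAnswering

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomForQuestionAnswering.from_pretrained("bigscience/bloom-560m")

question = "Where is the Eiffel Tower?"
context = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end positions define the predicted answer span
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```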

FlaxBloomModel and FlaxBloomForCausalLM

  • These are the Flax/JAX implementations of the BLOOM model and the causal language modeling variant.

  • They inherit from FlaxPreTrainedModel and are also Flax nn.Module subclasses.

  • The __call__() method is similar to the PyTorch forward() but uses JAX arrays and supports JIT compilation, automatic differentiation, vectorization, and parallelization, as sketched below.
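
A minimal Flax sketch; from_pt=True is included as an assumption in case the checkpoint does not ship native Flax weights:

```python
from transformers import BloomTokenizerFast, FlaxBloomModel

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")

# from_pt=True converts the PyTorch weights if no Flax checkpoint is published
model = FlaxBloomModel.from_pretrained("bigscience/bloom-560m", from_pt=True)

# Flax models consume and return JAX/NumPy arrays rather than PyTorch tensors
inputs = tokenizer("BLOOM in Flax", return_tensors="np")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```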

Relation to TensorRT

  • The BLOOM model and its variants can be optimized for inference using TensorRT, which is a library for high-performance deep learning inference on NVIDIA GPUs.

  • TensorRT can optimize the model by fusing layers, using reduced precision (e.g., FP16 or INT8), and taking advantage of tensor cores on modern GPUs.

  • The optimized model can be deployed in various environments, like embedded systems or servers, for efficient inference.
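
As a hedged illustration of one common path, the sketch below exports a small BLOOM checkpoint to ONNX with torch.onnx.export, which a classic TensorRT workflow can then consume. The output path, input shapes and opset choice are assumptions made for this sketch:

```python
import torch
from transformers import BloomTokenizerFast, BloomForCausalLM

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m")
model.eval()

# Return a plain tuple of tensors and disable the KV cache so the exported
# graph has a single logits output (simplifications for this sketch)
model.config.return_dict = False
model.config.use_cache = False

inputs = tokenizer("Export me", return_tensors="pt")

torch.onnx.export(
    model,
    (inputs["input_ids"],),   # attention_mask omitted for simplicity
    "bloom-560m.onnx",        # hypothetical output path
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```

The resulting ONNX graph can then be built into a TensorRT engine; for LLM-specific engines, the TensorRT-LLM checkpoint conversion and trtllm-build flow described elsewhere in these notes is the more direct route.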
