
Layers

The TensorRT-LLM Python API provides a set of layers that can be used to build and customize Transformer-based models.

Attention Layer

The Attention layer is a crucial component in Transformer models, responsible for capturing dependencies between input tokens.

Here are a few ways you can use the Attention layer:

  • Adjust the hidden_size and num_attention_heads parameters to control the capacity and parallelism of the attention mechanism. Increasing the hidden_size allows the model to learn more complex representations, while increasing num_attention_heads enables the model to attend to different aspects of the input simultaneously.

  • Experiment with different attention_mask_type values to control the attention pattern. For example, using AttentionMaskType.causal enables causal attention, which is commonly used in autoregressive language models like GPT.

  • Set the cross_attention parameter to True to perform cross-attention between the input and an additional encoder output. This is useful for tasks like sequence-to-sequence modelling or when incorporating external context.

  • Utilize the relative_attention parameter to enable relative position embeddings, which can improve the model's understanding of positional relationships between tokens.
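
To make this concrete, the sketch below constructs a single causal self-attention layer using the parameters discussed above. It is a minimal, illustrative sketch only: the concrete sizes are assumptions for a GPT-style model, the exact constructor signature of tensorrt_llm.layers.Attention varies between TensorRT-LLM releases (some releases also require a layer-index argument), and the layer must be created inside an open network definition (for example under the net_guard context from tensorrt_llm.network) before it can be called on tensors.

from tensorrt_llm.functional import AttentionMaskType
from tensorrt_llm.layers import Attention

# Illustrative GPT-style sizes; tune these for your model and hardware.
hidden_size = 4096
num_attention_heads = 32   # 4096 / 32 = 128-dimensional heads

attention = Attention(
    hidden_size=hidden_size,
    num_attention_heads=num_attention_heads,
    max_position_embeddings=2048,
    attention_mask_type=AttentionMaskType.causal,  # autoregressive (GPT-style) masking
    bias=True,
    dtype='float16',
    # cross_attention=True would attend over an encoder's output instead;
    # relative_attention=True would switch on relative position biases.
)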

Linear Layer

The Linear layer is a fundamental building block for performing linear transformations. Here are some ways to use the Linear layer:

  • Adjust the in_features and out_features parameters to control the input and output dimensions of the linear transformation. This allows you to reshape the hidden states and adapt them to the desired size.

  • Experiment with different dtype values to control the numerical precision of the linear operation. Using lower precision data types like float16 can improve memory efficiency and computational speed, while sacrificing some numerical precision.

  • Use the tp_group and tp_size parameters to enable tensor parallelism, which can distribute the linear operation across multiple devices for improved performance and memory efficiency.
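
The sketch below shows a Linear layer sharded across two GPUs with tensor parallelism. The sizes and the two-way mapping are illustrative assumptions (in practice the rank would come from the MPI launcher), and argument names can differ slightly between TensorRT-LLM versions.

import tensorrt_llm
from tensorrt_llm.layers import Linear

# Describe a 2-GPU tensor-parallel layout; rank 0 is hard-coded here for illustration.
mapping = tensorrt_llm.Mapping(world_size=2, rank=0, tp_size=2)

projection = Linear(
    in_features=4096,           # incoming hidden size
    out_features=11008,         # projected output size
    bias=False,
    dtype='float16',            # lower precision: less memory, faster, small accuracy cost
    tp_group=mapping.tp_group,  # the weight matrix is split across the 2 GPUs
    tp_size=mapping.tp_size,
)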

MLP Layer

The MLP (Multi-Layer Perceptron) layer is commonly used as the feed-forward network in Transformer models.

Here are some ideas for using the MLP layer:

  • Adjust the hidden_size and ffn_hidden_size parameters to control the capacity and expressiveness of the MLP. Increasing the ffn_hidden_size allows the model to learn more complex non-linear transformations.

  • Experiment with different activation functions by setting the hidden_act parameter. Popular choices include relu, gelu, and silu, each with its own characteristics and impact on the model's learning dynamics.

  • Consider using the FusedGatedMLP variant, which combines the gating mechanism and activation function into a single operation for improved computational efficiency.
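
As a sketch of the points above, the snippet below builds a plain MLP block and its fused gated variant. The sizes and activation choices are illustrative assumptions; check the constructor signatures in your installed tensorrt_llm version before relying on them.

from tensorrt_llm.layers import MLP, FusedGatedMLP

# Standard feed-forward block: hidden_size -> ffn_hidden_size -> hidden_size.
mlp = MLP(
    hidden_size=4096,
    ffn_hidden_size=16384,   # 4x expansion is a common default
    hidden_act='gelu',       # 'relu', 'gelu' and 'silu' are typical choices
    bias=True,
    dtype='float16',
)

# Gated (SwiGLU-style) variant with the gate and activation fused for efficiency.
gated_mlp = FusedGatedMLP(
    hidden_size=4096,
    ffn_hidden_size=11008,
    hidden_act='silu',
    bias=False,
    dtype='float16',
)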

Normalization Layers

Normalization layers help stabilise the training process and improve the model's convergence. The TensorRT-LLM API provides several normalization layers, such as LayerNorm, GroupNorm, and RmsNorm.

Here are some ideas for using these layers:

  • Experiment with different normalization techniques to see which one works best for your specific task and model architecture. LayerNorm is commonly used in Transformer models, while GroupNorm can be effective when dealing with smaller batch sizes.

  • Adjust the eps parameter to control the numerical stability of the normalization operation, especially when dealing with very small or very large values.
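
A short sketch of the three normalization layers follows. The hidden size and eps values are illustrative assumptions; default values and signatures may differ slightly between versions.

from tensorrt_llm.layers import GroupNorm, LayerNorm, RmsNorm

# Classic Transformer normalization over the hidden dimension.
layer_norm = LayerNorm(normalized_shape=4096, eps=1e-5, dtype='float16')

# RMSNorm, as used by the Llama family; skips mean-centering, so it is cheaper.
rms_norm = RmsNorm(normalized_shape=4096, eps=1e-6, dtype='float16')

# GroupNorm normalizes over groups of channels rather than across the batch.
group_norm = GroupNorm(num_groups=32, num_channels=4096, eps=1e-5, dtype='float16')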

Embedding Layer

The Embedding layer is used to map discrete input tokens to dense vector representations.

Here are some ways to utilize the Embedding layer:

  • Adjust the num_embeddings and embedding_dim parameters to control the size of the embedding table and the dimensionality of the embeddings. Increasing the embedding_dim allows the model to learn richer representations of the input tokens.

  • Experiment with different dtype values to control the numerical precision of the embeddings, balancing memory efficiency and representation quality.

  • Consider using the PromptTuningEmbedding variant for prompt-tuning scenarios, where additional task-specific embeddings are incorporated into the model.
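
To illustrate, the sketch below creates a token-embedding table and its prompt-tuning variant. The vocabulary size and embedding width are assumptions, and the PromptTuningEmbedding arguments shown (in particular vocab_size) should be checked against your installed tensorrt_llm version.

from tensorrt_llm.layers import Embedding, PromptTuningEmbedding

# Token embedding table: one 4096-dimensional vector per vocabulary entry.
vocab_embedding = Embedding(
    num_embeddings=32000,   # vocabulary size
    embedding_dim=4096,     # larger values give richer token representations
    dtype='float16',
)

# Prompt-tuning variant: token ids below vocab_size use the normal table,
# higher ids are looked up in a task-specific prompt table supplied at runtime.
prompt_embedding = PromptTuningEmbedding(
    num_embeddings=32000,
    embedding_dim=4096,
    vocab_size=32000,
    dtype='float16',
)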

These are just a few examples of how you can use the TensorRT-LLM Python API layers to influence the model architecture and shape the computation.

The flexibility and modularity of the API allow you to experiment with different configurations and create custom Transformer-based models tailored to your specific needs.

Remember to consider the trade-offs between model capacity, computational efficiency, and memory consumption when adjusting the layer parameters.

It's also essential to validate the impact of your modifications on the model's performance and ensure that the chosen configurations align with your task requirements and available resources.

Determining Parameters

  • Understand the Task and Data: The choice of layers and their parameters should be driven by the specific characteristics of your data and the task at hand.

  • Experimentation: Often, finding the right configuration involves empirical testing. Use validation datasets to gauge the performance of different configurations.

  • Resource Constraints: Be mindful of the computational cost. More complex models require more memory and processing power.

  • Model Complexity and Overfitting: More parameters can lead to a more powerful model, but also increase the risk of overfitting. Balancing model complexity with the amount of available training data is crucial.

  • Research and Literature: Look at existing literature and research papers. Often, you can find insights and recommended configurations for similar tasks and data types.

  • Software and Hardware Compatibility: Ensure that your chosen layers and parameters are compatible with the hardware you plan to use, especially when leveraging specialised hardware like NVIDIA GPUs.
