Functionals

The functionals in TensorRT-LLM, such as slice, softmax, softplus, split and sqrt, provide high-level, efficient and specialised operations for processing tensors within the TensorRT-LLM framework.

These functionals encapsulate complex tensor operations, making them accessible through a simplified and standardised interface for model developers.
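
As a rough illustration, the sketch below shows how functionals are typically composed inside a network definition opened with net_guard. It is not lifted from the official documentation: the input shape is arbitrary, and exact signatures can vary between TensorRT-LLM releases.

```python
import tensorrt as trt
from tensorrt_llm import Builder, net_guard
import tensorrt_llm.functional as F

builder = Builder()
network = builder.create_network()

with net_guard(network):
    # Declare a symbolic input tensor; the shape here is arbitrary.
    x = F.Tensor(name='x', dtype=trt.float32, shape=[1, 8, 64])

    # Each functional call adds the corresponding TensorRT layers to the
    # network currently in scope, so calls compose like ordinary Python.
    y = F.softmax(F.sqrt(x), dim=-1)

    # Mark the result as a network output so it survives the engine build.
    y.mark_output('y', 'float32')
```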

High-Level Abstraction

Ease of Use: Functionals abstract away the low-level details of tensor operations, allowing developers to focus on higher-level model architecture without delving into the intricacies of each operation.

Standardised Operations: They provide a set of commonly used operations in deep learning models, ensuring consistency and predictability across different implementations.

Efficient Tensor Processing

Optimisation: Each functional is optimised for performance on NVIDIA GPUs, ensuring efficient execution of tensor operations critical for Large Language Models.

Hardware Acceleration: Leveraging TensorRT optimisations, these functionals are designed to maximise the computational capabilities of the underlying hardware, particularly for high-throughput and low-latency inference.

Flexibility and Customisation

Configurable Parameters: Functionals come with various parameters that can be tuned to the specific needs of the model, offering flexibility in how operations are applied to tensors (see the sketch below).

Adaptability: They can be integrated into different parts of a neural network architecture, catering to a wide range of applications, from simple feed-forward networks to complex transformer models.

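For example, split and slice expose the axis, chunk size, start offsets and slice sizes as explicit arguments. The sketch below is illustrative only: the fused QKV tensor, its shape and hidden_size are assumptions, and parameter names may differ between releases.

```python
import tensorrt as trt
from tensorrt_llm import Builder, net_guard
import tensorrt_llm.functional as F

hidden_size = 64
builder = Builder()
network = builder.create_network()

with net_guard(network):
    # Hypothetical fused QKV activation of shape [batch, seq, 3 * hidden_size].
    qkv = F.Tensor(name='qkv', dtype=trt.float32, shape=[1, 8, 3 * hidden_size])

    # dim selects the axis to split along; the chunk size is a tunable argument.
    q, k, v = F.split(qkv, hidden_size, dim=-1)

    # slice takes explicit per-dimension start offsets and sizes.
    first_position = F.slice(q, [0, 0, 0], [1, 1, hidden_size])
```
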
Simplified Model Development

Rapid Prototyping: By using these high-level operations, developers can quickly prototype and experiment with different model architectures.

Readability and Maintenance: The use of functionals leads to more readable and maintainable code, as complex tensor operations are encapsulated in simple, descriptive function calls.
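
As a hedged example of that readability gain, a single rms_norm call can replace the square/mean/rsqrt/scale chain one would otherwise spell out by hand. The tensor x and the gamma weights below are hypothetical, and the exact keyword names may vary by release.

```python
import numpy as np
import tensorrt as trt
from tensorrt_llm import Builder, net_guard
import tensorrt_llm.functional as F

hidden_size = 64
builder = Builder()
network = builder.create_network()

with net_guard(network):
    x = F.Tensor(name='x', dtype=trt.float32, shape=[1, 8, hidden_size])
    # RMSNorm scale weights as a network constant, for illustration only.
    gamma = F.constant(np.ones(hidden_size, dtype=np.float32))

    # One descriptive call instead of a hand-rolled normalisation chain.
    normed = F.rms_norm(x, normalized_shape=hidden_size, weight=gamma, eps=1e-6)
```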

Consistency with Established Frameworks

Familiarity: Many of these functionals mirror operations found in popular deep learning frameworks like PyTorch and TensorFlow, making it easier for developers to transition or integrate models with TensorRT-LLM.

Support for Advanced Features

Advanced Tensor Operations

Beyond basic operations, functionals such as gpt_attention provide fused, attention-specific capabilities, for example multi-head attention with KV caching, tailored to transformer-based language models.
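
Because gpt_attention takes a long, version-dependent list of arguments (fused QKV input, KV-cache tensors, rotary-embedding settings and so on), a reliable way to see exactly what your installed release expects is to inspect it directly rather than rely on a hard-coded call:

```python
import inspect
import tensorrt_llm.functional as F

# Print the parameter list and docstring of the installed gpt_attention.
print(inspect.signature(F.gpt_attention))
print(inspect.getdoc(F.gpt_attention))
```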

In summary, the functionals in TensorRT-LLM are a collection of high-level, optimised, and flexible operations that simplify and accelerate the development of large language models on NVIDIA GPUs.

They are instrumental in transforming complex tensor manipulations into accessible, efficient, and standardised building blocks for model development.
