ONNX

ONNX stands for Open Neural Network Exchange.

It is an open-source project that provides a common specification for representing machine learning models, enabling interoperability between frameworks and tools.

The project is backed by several major companies, including Microsoft, Facebook, and Amazon.

The goal of ONNX is to provide a model exchange format that enables AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

Instead of being tied to a single framework or ecosystem, developers can choose the right tools for their project and not worry about compatibility.

For example, you might train a deep learning model using a framework like PyTorch, then use ONNX to convert the model to a format that can be run on a different system or framework, such as Microsoft's ONNX Runtime, for inference.

This could be particularly useful in situations where the model needs to be deployed in a different environment than the one it was trained in.

In terms of its format, ONNX models are represented as graphs.

Each node in the graph corresponds to an operation, such as an addition, multiplication, or convolution, and the edges represent the tensors flowing between nodes.

It's worth noting that while ONNX aims to support a broad range of machine learning tasks, not all models can be represented in ONNX, and not all operations are supported by all ONNX runtimes. It's always important to check the current capabilities of the ONNX specification and the specific runtime you are using.

The official ONNX documentation provides a comprehensive overview of the format. Here is a detailed summary:

Concept and Functionality

ONNX as a Specialised Language: ONNX is akin to a programming language focused on mathematical functions, particularly those necessary for machine learning model inference. It defines operations required to implement the inference function of a machine learning model.

Model Representation: Models in ONNX are often referred to as ONNX graphs. They express computations, such as a linear regression, in a way that resembles Python code but is written entirely in ONNX operators.
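
To make this concrete, here is a minimal sketch, using the official onnx Python package, that expresses a linear regression y = X·A + B as an ONNX graph built from MatMul and Add nodes. The tensor names and shapes are chosen purely for illustration.

```python
import numpy as np
import onnx
from onnx import helper, numpy_helper, TensorProto

# Graph inputs/outputs: X is a runtime input, Y is the graph output.
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [None, 3])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [None, 1])

# Constant weights are embedded in the graph as initializers.
A = numpy_helper.from_array(np.random.randn(3, 1).astype(np.float32), name="A")
B = numpy_helper.from_array(np.zeros((1,), dtype=np.float32), name="B")

# Two nodes: XA = MatMul(X, A); Y = Add(XA, B).
matmul = helper.make_node("MatMul", inputs=["X", "A"], outputs=["XA"])
add = helper.make_node("Add", inputs=["XA", "B"], outputs=["Y"])

graph = helper.make_graph([matmul, add], "linear_regression", [X], [Y], initializer=[A, B])
model = helper.make_model(graph)
onnx.checker.check_model(model)  # validate the graph against the ONNX spec
```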

Structure of ONNX Models

Graphs and Nodes: An ONNX graph is built using ONNX Operators. It consists of nodes (operations like MatMul and Add) and connections representing data flow between nodes. Each node has a type (an operator) and inputs/outputs.

Inputs, Outputs, and Initializers: Models have inputs and outputs defined in a specific format. Constants or unchanging inputs can be encoded directly into the graph as 'initializers'.

Attributes: Fixed parameters of operators, such as alpha or beta in the Gemm operator, which are set when the graph is built and cannot change at runtime.
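
The sketch below walks an existing model and prints its nodes, inputs, outputs, initializers, and attributes, which makes the structure described above visible. The file name model.onnx is an assumption; point it at any ONNX model you have.

```python
import onnx

model = onnx.load("model.onnx")  # assumed path, for illustration only
graph = model.graph

# Runtime inputs and outputs of the graph.
print("inputs:", [i.name for i in graph.input])
print("outputs:", [o.name for o in graph.output])

# Constants baked into the graph as initializers.
print("initializers:", [init.name for init in graph.initializer])

# Nodes: operator type, tensor connections, and fixed attributes.
for node in graph.node:
    attrs = {a.name: onnx.helper.get_attribute_value(a) for a in node.attribute}
    print(node.op_type, list(node.input), "->", list(node.output), attrs)
```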

Serialization and Portability

Protobuf for Serialization: ONNX uses protobuf to serialize graphs into a single block, enhancing model portability and reducing size.

Additional Features

Metadata Storage: ONNX allows embedding metadata such as model version, author, training information, etc., directly into the model.
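
A brief sketch of embedding metadata into a ModelProto; the field values below are placeholders, and the empty ModelProto stands in for a model you have built or converted.

```python
import onnx
from onnx import helper

model = onnx.ModelProto()  # stand-in; in practice this is your built or converted model

# Standard ModelProto fields for provenance information.
model.producer_name = "my-training-pipeline"
model.model_version = 3
model.doc_string = "Linear regression trained on the 2023 dataset."

# Arbitrary key/value metadata is stored in metadata_props.
helper.set_model_props(model, {"author": "example-author", "license": "Apache-2.0"})

onnx.save(model, "model_with_metadata.onnx")
```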

Operators and Domains: ONNX has a comprehensive list of operators covering standard matrix operations, image transformations, neural network layers, etc. It defines domains like ai.onnx and ai.onnx.ml, each containing a specific set of operators.
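
The registered operators and their domains can be inspected programmatically. A small sketch follows; the exact operator counts depend on the onnx version installed.

```python
from collections import Counter
import onnx.defs

# Every operator schema known to this onnx installation.
schemas = onnx.defs.get_all_schemas()

# The default domain "" corresponds to ai.onnx; classical-ML operators live in ai.onnx.ml.
by_domain = Counter(s.domain or "ai.onnx" for s in schemas)
print(by_domain)

# Look up a single operator's schema, e.g. Gemm from the default domain.
gemm = onnx.defs.get_schema("Gemm")
print(gemm.name, gemm.since_version, list(gemm.attributes))
```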

Supported Data Types

Primary Focus on Tensors: ONNX mainly supports numerical computations with tensors (multi-dimensional arrays). Tensors are characterized by type, shape, and a contiguous array of values.

Element Types: ONNX supports various data types, including different float and integer types. The list includes FLOAT, INT8, INT16, INT32, and others.
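
The mapping between NumPy dtypes and ONNX element types can be seen by converting small arrays to TensorProto and inspecting the result; a quick sketch:

```python
import numpy as np
from onnx import numpy_helper, TensorProto

# Convert NumPy arrays to TensorProto and inspect the resulting element type.
for arr in (np.ones((2, 2), dtype=np.float32),
            np.ones((2, 2), dtype=np.int8),
            np.ones((2, 2), dtype=np.int32)):
    t = numpy_helper.from_array(arr, name="t")
    # data_type is an enum value such as TensorProto.FLOAT or TensorProto.INT8
    print(arr.dtype, "->", TensorProto.DataType.Name(t.data_type), list(t.dims))
```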

Sparse Tensors: ONNX also supports sparse tensors, primarily useful for arrays with many null coefficients.

Other Types: Besides tensors, ONNX handles sequences of tensors, maps of tensors, and sequences of maps of tensors.

Serialization in ONNX can be explained in simpler terms as follows:

What is Serialization?

Serialization is the process of converting an ONNX machine learning model (or any other data structure in ONNX) into a format that can be easily saved, transferred, and later reloaded. This process turns the complex model into a simpler, more compact form that can be stored in a single file.

Saving a Model (Serialization)

  1. How It's Done: You take your ONNX model and convert it to a string of bytes (a simple format).

  2. Example: Let's say you have a model called onnx_model. You would use the command onnx_model.SerializeToString() to convert this model into a string format.

  3. Saving to a File: After converting the model into a string, you can then save this string to a file (like "model.onnx") on your computer.
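
A minimal sketch of the two steps above; the empty ModelProto stands in for a model you have already built or converted.

```python
import onnx

onnx_model = onnx.ModelProto()  # stand-in; in practice this is your built or converted model

# Step 1: serialize the in-memory model to a string of bytes.
model_bytes = onnx_model.SerializeToString()

# Step 2: write the bytes to a file on disk.
with open("model.onnx", "wb") as f:
    f.write(model_bytes)

# onnx.save performs both steps in a single call.
onnx.save(onnx_model, "model.onnx")
```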

Loading a Model (Deserialization)

  1. How It's Done: When you want to use the model again, you need to convert the string of bytes back into the original ONNX model format.

  2. Example: Using the command onnx.load("model.onnx"), you can turn the string in the file "model.onnx" back into an ONNX model that you can work with.
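
And the reverse direction, a sketch of loading the same file; ParseFromString, mentioned in the key points below, is the lower-level protobuf equivalent.

```python
import onnx

# High-level: read the file and parse it back into a ModelProto.
model = onnx.load("model.onnx")

# Lower-level equivalent using the protobuf API directly.
with open("model.onnx", "rb") as f:
    model2 = onnx.ModelProto()
    model2.ParseFromString(f.read())

print(type(model), len(model.graph.node))
```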

Working with Different Data Structures

  • NodeProto: This is another ONNX data structure; it describes a single node (operation) rather than a whole model, but it is serialized and loaded in the same way as a model.

  • TensorProto: This is a specific type of data structure for storing tensor data. There's a special command onnx.load_tensor_from_string() to load tensor data from a string.
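
A sketch of the same round trip for tensor data, using numpy_helper to move between NumPy arrays and TensorProto:

```python
import numpy as np
import onnx
from onnx import numpy_helper

# NumPy array -> TensorProto -> byte string.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
tensor = numpy_helper.from_array(arr, name="example")
tensor_bytes = tensor.SerializeToString()

# Byte string -> TensorProto -> NumPy array.
restored = onnx.load_tensor_from_string(tensor_bytes)
assert np.array_equal(numpy_helper.to_array(restored), arr)
```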

Key Points

  • SerializeToString: A method used to convert a model or data into a string of bytes for saving.

  • ParseFromString: A method used to convert the saved string of bytes back into the original data structure.

  • File Handling: Saving and loading involves reading from and writing to files, which requires handling these files correctly in your code.

In summary, serialization in ONNX is about converting complex data structures like models and tensors into a simpler, string-based format for easy storage and retrieval. This process is essential for saving models and later using them in different environments or applications.
