
Activation Functions

Activation Functions in Deep Learning: A Comprehensive Survey and Benchmark


Activation functions (AFs) are functions applied in a neural network, typically after an affine transformation that combines the layer's weights with its input features.

They are typically non-linear functions.

The rectified linear unit, or ReLU, has been the most popular choice over the past decade, although the choice is architecture-dependent and many alternatives have emerged in recent years. In this section you will find a continually updated list of activation functions.
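
As a minimal sketch of where the activation sits (my own illustration, using PyTorch with arbitrary layer sizes), the linear layer performs the affine transformation and the activation supplies the non-linearity:

import torch
import torch.nn as nn

# Affine transformation (weights and bias) followed by a non-linear activation.
# Sizes and inputs are arbitrary, chosen only for illustration.
layer = nn.Linear(in_features=16, out_features=8)
activation = nn.ReLU()            # f(z) = max(0, z)

x = torch.randn(4, 16)            # a batch of 4 input vectors
z = layer(x)                      # z = Wx + b   (affine transformation)
y = activation(z)                 # y = f(z)     (non-linearity)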

The June 2022 survey, Activation Functions in Deep Learning: A Comprehensive Survey and Benchmark, begins by highlighting the importance of AFs in introducing non-linearity into neural networks, which is crucial for learning complex patterns and representations from data.

The authors then provide a detailed classification of AFs based on their characteristics and types, including Logistic Sigmoid/Tanh, Rectified Unit, Exponential Unit, Adaptive Unit, and Miscellaneous AFs.
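
To make these class names concrete, the sketch below (my own illustration, not taken from the survey) gives one representative closed form per class; the miscellaneous class is omitted, and the adaptive class is covered by the PReLU example further down the page.

import torch

z = torch.linspace(-3.0, 3.0, 7)

# Logistic Sigmoid / Tanh class: smooth, bounded, saturating at both ends
sigmoid = 1.0 / (1.0 + torch.exp(-z))
tanh = torch.tanh(z)

# Rectified Unit class: piecewise linear, unbounded above
relu = torch.clamp(z, min=0.0)
leaky_relu = torch.where(z > 0, z, 0.01 * z)

# Exponential Unit class: smooth negative branch saturating towards -alpha
alpha = 1.0
elu = torch.where(z > 0, z, alpha * (torch.exp(z) - 1.0))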

One of the strengths of this survey is its in-depth coverage of each class of AFs.

The authors provide a thorough analysis of the properties and limitations of each AF, along with a discussion of their variants and improvements proposed in the literature. This information is particularly useful for practitioners looking to select an appropriate AF for their specific task and data type.

The paper also presents a clear and concise summary of the advantages and disadvantages of primary AFs, such as Logistic Sigmoid, Tanh, ReLU, and ELU, in terms of key factors like diminishing gradients, limited non-linearity, optimization difficulty, and computational efficiency.
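
The diminishing-gradient point is easy to verify numerically. In this small check (my own illustration), the gradient the logistic sigmoid passes back never exceeds 0.25 and collapses for large |z|, while ReLU passes back a gradient of exactly 1 for any positive input:

import torch

z = torch.tensor([-6.0, -2.0, 0.0, 2.0, 6.0], requires_grad=True)

# Sigmoid gradient: sigma(z) * (1 - sigma(z)), peaking at 0.25 around z = 0
torch.sigmoid(z).sum().backward()
print(z.grad)                     # approx. [0.0025, 0.1050, 0.2500, 0.1050, 0.0025]

# ReLU gradient: 1 for positive inputs, 0 otherwise
z.grad = None
torch.relu(z).sum().backward()
print(z.grad)                     # [0., 0., 0., 1., 1.]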

Another valuable contribution of this survey is the performance comparison conducted on benchmark datasets of different modalities using 18 state-of-the-art AFs with various types of networks.

This empirical evaluation provides practical insights into the performance of different AFs in real-world scenarios, which can guide researchers and practitioners in their choice of AFs for specific applications.

The authors also compare their survey with existing surveys and performance analyses, highlighting the comprehensive nature of their work and its importance in the current landscape of deep learning research.

In summary, this comprehensive survey on activation functions (AFs) in deep learning provides a valuable resource for researchers and practitioners. The authors have thoroughly classified and analyzed a wide range of AFs, including Logistic Sigmoid and Tanh-based, ReLU-based, ELU-based, and learning-based adaptive AFs. The survey not only covers the theoretical aspects of AFs but also presents an extensive performance comparison on different types of data, such as image, text, and speech, using various state-of-the-art AFs and network architectures.

The authors highlight the strengths and limitations of each class of AFs, providing insights into their properties, such as output range, monotonicity, and smoothness. They also discuss the impact of weight initialization on the performance of AFs and the suitability of different AFs for various types of data and network architectures.

Key Contributions

One of the key contributions of this paper is the experimental performance analysis, which compares 18 state-of-the-art AFs on benchmark datasets using different CNN models.

The results provide valuable guidance for practitioners in selecting appropriate AFs for their specific tasks and data types. For instance, the authors find that Softplus, ELU, and CELU perform well with MobileNet, while ReLU, Mish, and PDELU exhibit good performance with VGG16, GoogLeNet, and DenseNet.

The convergence analysis of different AFs reveals that parametric AFs, such as PAU, PReLU, and PDELU, show better convergence as they can adapt to the data faster by learning parameters from the data. The authors also highlight the trade-off between accuracy and training time for different AFs, with ReLU, SELU, GELU, and Softplus striking a good balance.
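
PReLU is the simplest of these parametric AFs: the slope applied to negative inputs is itself a trainable parameter, updated by backpropagation alongside the weights. A minimal sketch (values are arbitrary; PyTorch initialises the slope at 0.25):

import torch
import torch.nn as nn

# f(z) = z for z > 0, f(z) = a * z for z <= 0, with the slope "a" learned from data
prelu = nn.PReLU()

z = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(prelu(z))                   # [-0.5000, -0.1250, 0.0000, 1.5000]
print(list(prelu.parameters()))   # the learnable slope, updated by the optimiser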

The survey also provides recommendations for selecting AFs based on the insights gained from the analysis.

The authors emphasize the importance of matching the complexity of the AF with the complexity of the model and dataset to avoid overfitting or under-convergence. They suggest avoiding Logistic Sigmoid and Tanh AFs for CNNs due to poor convergence and recommend exploring recently proposed AFs such as Swish, Mish, and PAU for different problems.
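
For reference, the two most widely used of these newer AFs have simple closed forms: Swish (in its beta = 1 form, equivalent to SiLU) is x * sigmoid(x), and Mish is x * tanh(softplus(x)). A short sketch (my own, using PyTorch built-ins):

import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, 7)

swish = x * torch.sigmoid(x)            # Swish / SiLU
mish = x * torch.tanh(F.softplus(x))    # Mish

# The same functions are available as built-ins
assert torch.allclose(swish, F.silu(x))
assert torch.allclose(mish, F.mish(x))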

In conclusion, this survey offers a thorough and systematic review of activation functions in deep learning, covering a wide range of theoretical and practical aspects. The authors have successfully organized and presented the vast literature on AFs, providing valuable insights and recommendations for the deep learning community. This paper will serve as a valuable reference for researchers and practitioners working on developing and applying deep learning models for various tasks and data types.

Reference: Activation Functions in Deep Learning: A Comprehensive Survey and Benchmark (arXiv.org)
Figure: Summary of the advantages and disadvantages of primary AFs
Figure: Classification of Activation Functions