Layers

The TensorRT-LLM Python API provides a set of layers that can be used to build and customize Transformer-based models.

Attention Layer

The Attention layer is a crucial component in Transformer models, responsible for capturing dependencies between input tokens.

Here are a few ways you can use the Attention layer (a construction sketch follows this list):

  • Adjust the hidden_size and num_attention_heads parameters to control the capacity and parallelism of the attention mechanism. Increasing the hidden_size allows the model to learn more complex representations, while increasing num_attention_heads enables the model to attend to different aspects of the input simultaneously.

  • Experiment with different attention_mask_type values to control the attention pattern. For example, using AttentionMaskType.causal enables causal attention, which is commonly used in autoregressive language models like GPT.

  • Set the cross_attention parameter to True to perform cross-attention between the input and an additional encoder output. This is useful for tasks like sequence-to-sequence modeling or when incorporating external context.

  • Utilize the relative_attention parameter to enable relative position embeddings, which can improve the model's understanding of positional relationships between tokens.
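
As a concrete reference, a minimal sketch of a causal self-attention layer might look like the following. The parameter names come from the list above; import paths and any additional required constructor arguments (for example, a layer index or maximum position embeddings in some versions) vary across TensorRT-LLM releases, so treat this as illustrative rather than a complete call.

```python
# Minimal sketch: causal self-attention (illustrative; exact
# signatures and import paths vary across TensorRT-LLM versions).
from tensorrt_llm.layers import Attention
from tensorrt_llm.functional import AttentionMaskType

attention = Attention(
    hidden_size=1024,        # width of the hidden states
    num_attention_heads=16,  # 16 heads, each 1024 // 16 = 64 wide
    attention_mask_type=AttentionMaskType.causal,  # autoregressive masking
)
```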

Linear Layer

The Linear layer is a fundamental building block for performing linear transformations. Here are some ways to use it (a sketch follows this list):

  • Adjust the in_features and out_features parameters to control the input and output dimensions of the linear transformation. This allows you to reshape the hidden states and adapt them to the desired size.

  • Experiment with different dtype values to control the numerical precision of the linear operation. Using lower precision data types like float16 can improve memory efficiency and computational speed, while sacrificing some numerical precision.

  • Use the tp_group and tp_size parameters to enable tensor parallelism, which can distribute the linear operation across multiple devices for improved performance and memory efficiency.
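
For instance, an up-projection from the hidden size to an intermediate size might be sketched as below. The tp_group=None and tp_size=1 values are the single-GPU defaults; a multi-GPU build would instead pass the tensor-parallel group from its parallelism setup. Exact signatures vary by version.

```python
# Minimal sketch: half-precision linear projection (illustrative).
from tensorrt_llm.layers import Linear

proj = Linear(
    in_features=1024,   # incoming hidden size
    out_features=4096,  # outgoing hidden size
    dtype='float16',    # lower precision: less memory, faster GEMMs
    tp_group=None,      # no tensor parallelism on a single GPU
    tp_size=1,          # number of tensor-parallel ranks
)
```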

MLP Layer

The MLP (Multi-Layer Perceptron) layer is commonly used as the feed-forward network in Transformer models.

Here are some ideas for using the MLP layer (a sketch follows this list):

  • Adjust the hidden_size and ffn_hidden_size parameters to control the capacity and expressiveness of the MLP. Increasing the ffn_hidden_size allows the model to learn more complex non-linear transformations.

  • Experiment with different activation functions by setting the hidden_act parameter. Popular choices include relu, gelu, and silu, each with its own characteristics and impact on the model's learning dynamics.

  • Consider using the FusedGatedMLP variant, which fuses the gate and up projections of a gated MLP into a single matrix multiplication before applying the activation and elementwise gating, improving computational efficiency.
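
A typical feed-forward block with a 4x expansion might be sketched as follows; bias and parallelism arguments are left at their defaults, and exact signatures vary by version.

```python
# Minimal sketch: GELU feed-forward block (illustrative).
from tensorrt_llm.layers import MLP

mlp = MLP(
    hidden_size=1024,      # input/output width
    ffn_hidden_size=4096,  # intermediate width, here 4x hidden_size
    hidden_act='gelu',     # also common: 'relu', 'silu'
)
```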

Normalization Layers

Normalization layers help stabilize the training process and improve the model's convergence. The TensorRT-LLM API provides several normalization layers, such as LayerNorm, GroupNorm, and RmsNorm.

Here are some ideas for using these layers (a sketch follows this list):

  • Experiment with different normalization techniques to see which one works best for your specific task and model architecture. LayerNorm is commonly used in Transformer models, while GroupNorm can be effective when dealing with smaller batch sizes.

  • Adjust the eps parameter to control the numerical stability of the normalization operation, especially when dealing with very small or very large values.
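
For reference, constructing two common options might look like the sketch below. The normalized_shape argument is assumed to mirror the PyTorch convention; only eps is taken from the list above, and argument names may differ across versions.

```python
# Minimal sketch: LayerNorm vs. RmsNorm over a 1024-wide hidden state
# (illustrative; argument names may differ across versions).
from tensorrt_llm.layers import LayerNorm, RmsNorm

layer_norm = LayerNorm(normalized_shape=1024, eps=1e-5)
rms_norm = RmsNorm(normalized_shape=1024, eps=1e-6)  # skips mean-centering
```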

Embedding Layer

The Embedding layer is used to map discrete input tokens to dense vector representations.

Here are some ways to utilize the Embedding layer (a sketch follows this list):

  • Adjust the num_embeddings and embedding_dim parameters to control the size of the embedding table and the dimensionality of the embeddings. Increasing the embedding_dim allows the model to learn richer representations of the input tokens.

  • Experiment with different dtype values to control the numerical precision of the embeddings, balancing memory efficiency and representation quality.

  • Consider using the PromptTuningEmbedding variant for prompt-tuning scenarios, where additional task-specific embeddings are incorporated into the model.
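
A token-embedding table for a 32k-entry vocabulary might be sketched as follows (illustrative; signatures vary by version).

```python
# Minimal sketch: token-embedding table (illustrative).
from tensorrt_llm.layers import Embedding

token_embedding = Embedding(
    num_embeddings=32000,  # vocabulary size
    embedding_dim=1024,    # per-token vector width
    dtype='float16',       # trade precision for memory
)
```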

These are just a few examples of how you can use the TensorRT-LLM Python API layers to influence the model architecture and shape the computation.

The flexibility and modularity of the API allow you to experiment with different configurations and create custom Transformer-based models tailored to your specific needs.

Remember to consider the trade-offs between model capacity, computational efficiency, and memory consumption when adjusting the layer parameters.

It's also essential to validate the impact of your modifications on the model's performance and ensure that the chosen configurations align with your task requirements and available resources.

Determining Parameters

  • Understand the Task and Data: The choice of layers and their parameters should be driven by the specific characteristics of your data and the task at hand.

  • Experimentation: Often, finding the right configuration involves empirical testing. Use validation datasets to gauge the performance of different configurations.

  • Resource Constraints: Be mindful of the computational cost. More complex models require more memory and processing power.

  • Model Complexity and Overfitting: More parameters can lead to a more powerful model, but also increase the risk of overfitting. Balancing model complexity with the amount of available training data is crucial.

  • Research and Literature: Look at existing literature and research papers. Often, you can find insights and recommended configurations for similar tasks and data types.

  • Software and Hardware Compatibility: Ensure that your chosen layers and parameters are compatible with the hardware you plan to use, especially when leveraging specialized hardware like NVIDIA GPUs.
