Reducing Activation Recomputation in Large Transformer Models
This May 2022 paper addresses the challenges of training large transformer models with up to a trillion parameters, focusing on reducing the memory required to store activations.
As transformer models scale up, model parallelism becomes necessary to distribute model parameters, activations, and optimizer state across devices.
However, tensor-level model parallelism has limitations in terms of communication requirements and performance, while pipeline parallelism cannot reduce activation memory while maintaining high device utilization.
The standard approach to alleviate memory pressure is to use activation recomputation (also known as gradient checkpointing), where activations are not stored but recomputed during the backward pass.
However, this method incurs a significant penalty in training efficiency, with the authors observing a 30-40% execution time overhead when full activation recomputation is used.
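The trade-off behind activation recomputation can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the scalar "layers" y = w * sin(x), the helper names forward and backward, and the keep_every parameter are all hypothetical choices made for illustration. It stores only every k-th layer input and replays the forward pass from the nearest checkpoint during the backward pass.

```python
import math

def forward(x, weights, keep_every=None):
    """Run y = w * sin(x) layer by layer and return (output, saved inputs).

    keep_every=None saves every layer input (no recomputation);
    keep_every=k saves only the input of every k-th layer (checkpoints).
    """
    saved = {}
    for i, w in enumerate(weights):
        if keep_every is None or i % keep_every == 0:
            saved[i] = x
        x = w * math.sin(x)
    return x, saved

def backward(weights, saved):
    """Return (d output / d input, number of layers replayed)."""
    inputs = dict(saved)
    grad, replayed = 1.0, 0
    for i in reversed(range(len(weights))):
        if i not in inputs:
            # Replay the forward pass from the nearest earlier checkpoint.
            # A real system would free these again; the toy keeps them.
            j = max(k for k in inputs if k < i)
            x = inputs[j]
            for m in range(j, i):
                x = weights[m] * math.sin(x)
                inputs[m + 1] = x
                replayed += 1
        # Local derivative of w * sin(x) with respect to x.
        grad *= weights[i] * math.cos(inputs[i])
    return grad, replayed

weights = [0.5, -1.2, 0.8, 1.1, -0.7, 0.9]
out_full, saved_full = forward(0.3, weights)
out_ckpt, saved_ckpt = forward(0.3, weights, keep_every=3)
grad_full, replay_full = backward(weights, saved_full)
grad_ckpt, replay_ckpt = backward(weights, saved_ckpt)
print("saved:", len(saved_full), "replayed:", replay_full)
print("saved:", len(saved_ckpt), "replayed:", replay_ckpt)
```

Both runs produce the same gradient; the checkpointed run stores fewer activations and pays for it with extra forward computation, which is the overhead the paper sets out to reduce.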
To address this issue, the authors present two novel techniques: sequence parallelism and selective activation recomputation. These techniques, when used in conjunction with tensor parallelism, can significantly reduce the need for activation recomputation.
Sequence parallelism is introduced alongside tensor parallelism to prevent redundant storage of activations in regions that are not conducive to standard tensor parallelism.
Unlike the approach in [9], which requires parameters and optimizer state to be replicated on all devices, the authors' technique mixes tensor and sequence parallelism without additional compute, communication, or memory overhead.
By selectively choosing which activations to save and which to recompute, the authors show that much of the cost of recomputation can be eliminated while using only a fraction of the memory compared to when no recomputation is used.
The authors evaluate their approach on language models up to one trillion parameters in scale, demonstrating that their method reduces activation memory by 5× while reducing the execution time overhead from activation recomputation by over 90%.
For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, they achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% achieved using full activation recomputation.
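The 29% figure follows from the ratio of the two utilization numbers, assuming both runs perform the same model FLOPs per step so that throughput is proportional to Model Flops Utilization. The variable names below are illustrative.

```python
mfu_new = 54.2   # percent, with the proposed techniques (from the text)
mfu_full = 42.1  # percent, with full activation recomputation (from the text)

# With identical model FLOPs per step, throughput scales with MFU.
speedup = mfu_new / mfu_full
percent_faster = (speedup - 1.0) * 100.0
print(f"speedup: {speedup:.3f}x, about {percent_faster:.0f}% faster")
```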
The paper also discusses related work in model parallelism, including tensor parallelism, pipeline parallelism, and alternative approaches based on data parallelism.
The authors highlight the limitations of these existing techniques and emphasize the advantages of their proposed method, which combines the benefits of tensor and sequence parallelism without the drawbacks of previous approaches.
In summary, this paper presents a significant contribution to the field of large-scale transformer model training by introducing novel techniques that drastically reduce the memory requirements for storing activations and minimize the need for activation recomputation.
The proposed method, which combines sequence parallelism and selective activation recomputation with tensor parallelism, enables more efficient training of trillion-parameter models without incurring additional compute, communication, or memory overhead.
The authors' implementation will be made available in both Megatron-LM and NeMo-Megatron, making it accessible to the wider research community.
In this section, the authors discuss the transformer architecture and derive an approximate formula for the memory required to store activations during the forward pass.
They then explore how different forms of model parallelism impact activation memory requirements and introduce a novel technique called sequence parallelism.
The authors consider a single stack transformer encoder or decoder with L layers.
Input tokens are fed into a word embedding table (size v × h) and combined with learned positional embeddings (size s × h), where s is the sequence length, h is the hidden dimension, and v is the vocabulary size.
The output of the embedding layer is a 3-D tensor of size s × b × h, where b is the micro batch size.
Each transformer layer consists of a self-attention block with a attention heads (a denotes the number of heads), followed by a two-layer MLP that expands the hidden size to 4h and then projects it back to h.
The output from the last transformer layer is projected back into the vocabulary dimension to calculate the cross-entropy loss.
The authors derive an approximate formula for the memory required to store activations in the forward pass, excluding the main model parameters and optimizer state.
They consider the main contributors to memory and ignore small buffers.
Activations are assumed to be stored in 16-bit floating-point format (2 bytes per element), except for dropout masks (1 byte per element).
The memory required to store activations for a single transformer layer is derived as sbh(34 + 5as/h) bytes, where a is the number of attention heads, s is the sequence length, b is the microbatch size, and h is the hidden dimension.
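To get a feel for the magnitudes, the formula can be evaluated directly. The sketch below splits it into the 34sbh term and the 5as²b term (which is sbh × 5as/h). The function name and the GPT-3-scale configuration (s = 2048, b = 1, h = 12288, a = 96) are assumptions chosen for illustration; they do not appear in the text above.

```python
def layer_activation_bytes(s, b, h, a):
    """Per-layer activation memory in bytes, sbh(34 + 5as/h), with no parallelism."""
    linear_term = 34 * s * b * h         # grows linearly with sequence length
    attention_term = 5 * a * s * s * b   # sbh * 5as/h; grows quadratically with s
    return linear_term, attention_term

# Hypothetical GPT-3-scale configuration, used only as an example.
s, b, h, a = 2048, 1, 12288, 96
linear_term, attention_term = layer_activation_bytes(s, b, h, a)
total = linear_term + attention_term
print(f"34sbh term : {linear_term / 2**20:.0f} MiB")
print(f"5as^2b term: {attention_term / 2**20:.0f} MiB")
print(f"per layer  : {total / 2**30:.2f} GiB")
```

For this configuration 5as/h equals 80, so the quadratic term is more than twice the size of the 34sbh term.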
The authors use tensor parallelism developed by Shoeybi et al. to parallelize the attention and MLP blocks.
Tensor parallelism parallelizes model parameters, optimizer states, and activations inside the attention and MLP blocks, but not the input activations to these blocks.
With t-way tensor parallelism, the per-layer activation memory reduces to: sbh(10 + 24/t + 5as/(ht)).
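The constant 10 in this formula is the part that tensor parallelism leaves replicated on every device. A small sketch makes the split explicit; the function name and the example configuration (the same hypothetical GPT-3-scale values, with t = 8) are assumptions for illustration.

```python
def tp_layer_activation_bytes(s, b, h, a, t):
    """Per-device, per-layer activation bytes with t-way tensor parallelism.

    Returns (replicated, sharded) so the two parts of
    sbh(10 + 24/t + 5as/(ht)) can be inspected separately.
    """
    replicated = 10 * s * b * h                          # not divided by t
    sharded = (24 * s * b * h + 5 * a * s * s * b) / t   # divided across the t devices
    return replicated, sharded

# Hypothetical configuration, used only as an example.
s, b, h, a, t = 2048, 1, 12288, 96, 8
replicated, sharded = tp_layer_activation_bytes(s, b, h, a, t)
total = replicated + sharded
print(f"replicated: {replicated / 2**20:.0f} MiB")
print(f"sharded   : {sharded / 2**20:.0f} MiB")
print(f"replicated share: {replicated / total:.1%}")
```

With these numbers the replicated part is a large share of what remains on each device, which is the portion sequence parallelism targets.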
The authors introduce sequence parallelism to partition the non-tensor parallel regions (layer-norms and dropouts) along the sequence dimension s.
Sequence parallelism reduces the memory required for activations in these regions without introducing additional communication overhead.
New operations g and ḡ are introduced as converters between sequence-parallel and tensor-parallel regions, replacing the f and f̄ operators from tensor parallelism.
The derivation of g and ḡ is detailed for the MLP block, showing that g is an all-gather in the forward pass and a reduce-scatter in the backward pass, while ḡ is its conjugate: a reduce-scatter in the forward pass and an all-gather in the backward pass.
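The forward-pass behaviour of the two converters can be simulated on plain Python lists, with one list per rank standing in for a tensor shard along the sequence dimension. This is a single-process sketch with hypothetical helper names, not a distributed implementation. It also checks that a reduce-scatter followed by an all-gather produces the same result as an all-reduce, which is consistent with the claim that no extra communication is introduced.

```python
def all_gather(shards):
    """Forward of g: every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def reduce_scatter(partials):
    """Forward of the conjugate operator: sum full-length tensors elementwise,
    then hand rank r the r-th slice of the sum."""
    t = len(partials)
    n = len(partials[0])
    chunk = n // t
    summed = [sum(p[i] for p in partials) for i in range(n)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(t)]

def all_reduce(partials):
    """Every rank ends up with the full elementwise sum."""
    n = len(partials[0])
    summed = [sum(p[i] for p in partials) for i in range(n)]
    return [list(summed) for _ in partials]

# Two ranks, sequence length 4: each rank holds half of the sequence.
seq_shards = [[1, 2], [3, 4]]
gathered = all_gather(seq_shards)          # entering the tensor-parallel region

# Inside the tensor-parallel region each rank produces a partial sum
# over the full sequence (values are made up for the example).
partials = [[10, 20, 30, 40], [1, 2, 3, 4]]
scattered = reduce_scatter(partials)       # back to the sequence-parallel region

print(gathered)
print(scattered)
print(all_gather(scattered) == all_reduce(partials))
```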
With tensor and sequence parallelism combined, the per-layer activation memory reduces to: sbh(34/t + 5as/(ht)), effectively distributing activations among the tensor parallel group.
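The effect of adding sequence parallelism can be compared against tensor parallelism alone by evaluating both formulas for several values of t. The function name, the sequence_parallel flag, and the example configuration are hypothetical and chosen only for illustration.

```python
def per_layer_bytes(s, b, h, a, t, sequence_parallel):
    """Per-device, per-layer activation bytes under t-way tensor parallelism,
    with or without sequence parallelism."""
    sbh = s * b * h
    attention = 5 * a * s * s * b / t    # sbh * 5as/(ht)
    if sequence_parallel:
        return 34 * sbh / t + attention  # sbh(34/t + 5as/(ht))
    return 10 * sbh + 24 * sbh / t + attention  # sbh(10 + 24/t + 5as/(ht))

# Hypothetical GPT-3-scale configuration, used only as an example.
s, b, h, a = 2048, 1, 12288, 96
for t in (1, 2, 4, 8):
    tp_only = per_layer_bytes(s, b, h, a, t, sequence_parallel=False)
    tp_sp = per_layer_bytes(s, b, h, a, t, sequence_parallel=True)
    print(f"t={t}: TP only {tp_only / 2**20:7.0f} MiB, TP+SP {tp_sp / 2**20:7.0f} MiB")
```

With sequence parallelism the per-device figure is exactly 1/t of the unsharded per-layer memory, whereas tensor parallelism alone leaves the 10sbh term undivided.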
Pipeline parallelism divides the L layers of the transformer into p groups of L/p layers each, where p is the pipeline parallel size.
However, pipeline parallelism does not uniformly divide the total activation memory by p, because the schedule overlaps several microbatches to reduce the pipeline bubble.
The first stage of the pipeline must store activations for p microbatches. Since that stage holds L/p layers, it stores p × L/p = L layers' worth of activations regardless of p, resulting in a total activation memory of: sbhL(34/t + 5as/(ht)).
The authors note that the majority of the required activation memory is captured by the equation for the first stage of the pipeline.
Additional activation memory is required for input embeddings, the last layer-norm, and the output layer, but these terms are negligible compared to the main equation.
As a result, the equation sbhL(34/t + 5as/(ht)) is a good approximation of the total required activation memory.
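The reason the pipeline parallel size drops out of this approximation can be checked numerically: the first stage holds L/p layers but keeps p microbatches in flight. The sketch below uses a hypothetical function name and a hypothetical configuration (L = 96, t = 8, with the same s, b, h, a as a GPT-3-scale model); none of these values come from the text above.

```python
def first_stage_activation_bytes(s, b, h, a, L, t, p):
    """Approximate activation bytes on the first pipeline stage,
    using sbh(34/t + 5as/(ht)) per layer and per microbatch."""
    assert L % p == 0, "this sketch assumes the layers divide evenly into p stages"
    per_layer = s * b * h * (34 / t + 5 * a * s / (h * t))
    layers_on_stage = L // p
    microbatches_in_flight = p
    return per_layer * layers_on_stage * microbatches_in_flight

# Hypothetical configuration, used only as an example.
s, b, h, a, L, t = 2048, 1, 12288, 96, 96, 8
for p in (1, 2, 4, 8):
    total = first_stage_activation_bytes(s, b, h, a, L, t, p)
    print(f"p={p}: {total / 2**30:.2f} GiB")
```

Every value of p gives the same result, matching sbhL(34/t + 5as/(ht)).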
In summary, this section provides a detailed analysis of the activation memory requirements in transformer models and introduces sequence parallelism as a novel technique to reduce memory consumption when combined with tensor parallelism. The authors derive approximate formulas for activation memory under different parallelism schemes, which serve as the foundation for their proposed optimizations.
NVIDIA A100 Tensor Core GPU: This reference directs to NVIDIA's official page for the A100 Tensor Core GPU, a powerful component designed for data centers that accelerates AI, data analytics, and high-performance computing. It provides specifications and benefits, emphasizing its capabilities in handling massive data workloads and AI tasks. NVIDIA A100
NVLink and NVSwitch: This entry covers NVIDIA's NVLink and NVSwitch technologies, which provide high-bandwidth links between GPUs and CPUs. These technologies are vital for scaling up the performance of servers and high-performance computing systems by enabling faster data transfer rates between the components. NVLink and NVSwitch
Selene: This entry pertains to the Selene supercomputer, listed on the TOP500 website. Selene is an NVIDIA-designed system that ranks among the most powerful supercomputers in the world, utilized for a variety of computational tasks including large-scale simulations and AI processing. Selene at TOP500
Language Models are Few-Shot Learners by Tom B. Brown et al.: This seminal paper discusses advancements in language models, particularly focusing on their ability to perform tasks from limited examples — a process known as few-shot learning. The study is notable for exploring the capabilities of GPT-3 and its efficiency in learning and generalization from minimal data. arXiv link for Language Models are Few-Shot Learners
Training Deep Nets with Sublinear Memory Cost by Tianqi Chen et al.: This research explores techniques to reduce the memory cost of training deep neural networks. It's crucial for handling large models and datasets efficiently, proposing methods that allow deeper and more complex networks to be trained on limited hardware resources. arXiv link for Training Deep Nets with Sublinear Memory Cost
PaLM by Aakanksha Chowdhery et al.: The PaLM model represents a significant step in scaling language modeling, providing insights into training methodologies and system designs to handle extremely large models, focusing on system optimizations and efficiencies. Google AI Blog on Pathways Language Model (PaLM)
GPipe by Yanping Huang et al.: GPipe is a library for efficiently training large deep neural networks. It uses pipeline parallelism to split a model across different GPUs, significantly speeding up the training process by improving hardware utilization and training large models effectively. GPipe at NeurIPS 2019
Amazon SageMaker Model Parallelism by Can Karakus et al.: This reference discusses the model parallelism capabilities of Amazon SageMaker, detailing how it supports the training of large models by distributing their components across multiple GPUs and machines. Amazon SageMaker Model Parallelism
Sequence Parallelism by Shenggui Li et al.: This paper approaches long-sequence training from a system perspective, looking at the efficiency gains from partitioning work along the sequence dimension. Sequence Parallelism
TeraPipe by Zhuohan Li et al.: Introduces TeraPipe, a token-level pipeline parallelism method for training large-scale language models, emphasizing its scalability and efficiency. TeraPipe at ICML 2021