CUDA and bandwidth
The article titled "Making Deep Learning Go Brrrr From First Principles" delves into optimising deep learning systems, emphasising the importance of understanding system bottlenecks and applying appropriate solutions.
The efficiency of deep learning models is broken down into three components:
Compute: Time spent on GPUs performing floating point operations (FLOPs).
Memory: Time involved in transferring tensors within a GPU.
Overhead: All other aspects of the process.
The author argues that understanding which of these components is the bottleneck in a given system is crucial for making informed decisions about optimisation strategies.
The article begins by discussing the importance of maximising the time spent in the compute-bound regime, as this is where the actual floating-point operations (FLOPs) are performed.
However, the growing gap between the rate at which compute power increases compared to memory bandwidth makes it increasingly difficult to fully utilise the available compute resources.
Next, the article explains the concept of memory bandwidth costs, which are incurred when moving data between different storage locations, such as from CPU to GPU or from CUDA global memory to CUDA shared memory.
The author uses an analogy of a factory (compute) and a warehouse (memory) to illustrate this concept.
Memory-bound operations, such as unary operations like torch.cos, spend most of their time moving data around rather than performing actual computations.
To mitigate memory bandwidth costs, the article introduces the concept of operator fusion, which involves performing several computations at once to reduce the number of global memory reads and writes.
This optimisation is particularly effective for pointwise operators, but it requires knowing ahead of time which operations come next, which is why it cannot be done in vanilla eager-mode execution.
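As a rough illustration (my own sketch, not code from the article), torch.compile in PyTorch 2.x captures the upcoming operations before running them, which is exactly the context a backend needs to fuse a pointwise chain into a single kernel; this assumes a CUDA device is available.

```python
import torch

def pointwise_chain(x):
    # Two pointwise ops; in eager mode each launches its own kernel and the
    # intermediate result makes a round trip through global memory.
    return x.cos().cos()

# torch.compile traces the function first, giving the backend enough context
# to fuse both cos ops into one kernel (exact fusion behaviour depends on the
# backend; this is illustrative).
fused_chain = torch.compile(pointwise_chain)

x = torch.randn(2**24, device="cuda")
eager_out = pointwise_chain(x)
fused_out = fused_chain(x)      # first call also pays a one-off compilation cost
assert torch.allclose(eager_out, fused_out)
```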
The article then discusses how to reason about memory-bandwidth costs with simple back-of-the-envelope arithmetic and provides an example of measuring the achieved FLOPS and memory bandwidth for a simple PyTorch function.
By increasing the compute intensity (the number of operations performed per memory access), the system transitions from being memory-bandwidth bound to compute-bound.
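A minimal sketch of that experiment (hypothetical function name and sizes, assuming a CUDA device and PyTorch 2.x): repeatedly multiplying a tensor inside one fused kernel raises the number of operations per element loaded, so achieved FLOPS should climb while achieved bandwidth stays roughly flat.

```python
import time
import torch

def measure(repeat, n=2**26):
    """Run `repeat` fused multiplies per element; report achieved FLOPS and bandwidth."""
    x = torch.randn(n, device="cuda")

    def f(t):
        for _ in range(repeat):
            t = t * 2
        return t

    f = torch.compile(f)            # fuse the multiplies so memory traffic stays fixed
    f(x)                            # warm-up (includes one-off compilation)
    torch.cuda.synchronize()

    start = time.perf_counter()
    f(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    flops = repeat * n / elapsed                # achieved FLOPs per second
    bytes_moved = 2 * n * x.element_size()      # one read plus one write of the tensor
    return flops, bytes_moved / elapsed         # achieved memory bandwidth

# As `repeat` grows, achieved FLOPS climbs while bandwidth stays roughly flat,
# until the kernel crosses from memory-bandwidth bound to compute-bound.
for r in (1, 4, 16, 64):
    print(r, measure(r))
```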
Next, the article addresses the issue of overhead, which refers to time spent on tasks other than transferring tensors or performing computations.
The primary sources of overhead include the Python interpreter, the PyTorch framework, and launching CUDA kernels. The author emphasises the significant speed difference between modern GPUs and the Python interpreter, highlighting the importance of minimising overhead.
To determine if a system is overhead-bound, the article suggests increasing the size of the data and observing if the runtime increases proportionally. If it does not, the system is likely overhead-bound.
The PyTorch profiler can also be used to visualise the relationship between CPU and GPU kernels, helping to identify overhead issues.
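Both diagnostics can be sketched in a few lines (hypothetical sizes, assuming a CUDA device): scale the input and compare runtimes, then look at the CPU/GPU breakdown with torch.profiler.

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

def avg_time(fn, arg, iters=100):
    """Average wall-clock time of a GPU op, synchronising around the loop."""
    fn(arg)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

small = torch.randn(2**10, device="cuda")
large = torch.randn(2**20, device="cuda")    # 1024x more elements

# If the two timings are nearly identical despite 1024x more work, the op is
# overhead-bound: the GPU finishes before the CPU can queue the next kernel.
print(avg_time(torch.cos, small), avg_time(torch.cos, large))

# The profiler shows the same thing as idle gaps between GPU kernels while the
# CPU is busy in the Python/PyTorch dispatch stack.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        torch.cos(small)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```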
The article concludes by summarising the appropriate optimisation strategies for each performance regime: tracing, operator fusion, or avoiding Python for overhead-bound systems; operator fusion for bandwidth-bound systems; and using tensor cores or upgrading hardware for compute-bound systems.
The author acknowledges that the need for users to consider these factors reflects a failure on the part of the framework and emphasises the importance of understanding basic systems principles for effective optimisation.
Identifying Optimisation Areas
Understanding which of these three areas (compute, memory bandwidth, overhead) is the bottleneck allows for more targeted and effective optimisations.
Memory-Bandwidth Bound vs. Compute-Bound Regimes
The article explains that if a model is limited by memory transfer times (memory-bandwidth bound), then increasing GPU FLOPs will not help. Conversely, if the model is compute-bound, focusing on reducing overhead might not be beneficial.
Maximising Compute Utilisation
The focus is on maximising compute time because it's difficult to reduce computation requirements without changing the operations. The challenge lies in keeping up with the increasing efficiency of compute compared to the slower growth of memory bandwidth.
Specialisation of Modern ML Accelerators
Modern machine learning accelerators are specialised for matrix multiplication.
Non-matrix operations, despite being significantly slower on these accelerators, constitute a very small percentage of the total operations and thus don't significantly impact overall performance.
The article suggests that the time taken for memory transfers (memory bandwidth) is often the limiting factor in model performance, more so than the actual computation time.
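To make the specialisation concrete, a rough benchmark (my own sketch, not the article's; assumes a CUDA device with half-precision tensor cores) compares the achieved throughput of a matrix multiply against a pointwise op:

```python
import time
import torch

def achieved_tflops(fn, flops_per_call, iters=10):
    """Measure achieved teraflops for a callable that runs one GPU op."""
    fn()
    torch.cuda.synchronize()                 # warm up, then flush the queue
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return flops_per_call * iters / (time.perf_counter() - start) / 1e12

n = 4096
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

# Matrix multiply: ~2*n^3 FLOPs, runs on the tensor cores.
print("matmul:", achieved_tflops(lambda: a @ b, 2 * n**3))

# Pointwise cos: ~1 FLOP per element, limited by memory bandwidth instead,
# so its achieved FLOPS comes out orders of magnitude lower.
print("cos   :", achieved_tflops(lambda: torch.cos(a), n * n))
```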
One way to think about compute is as a factory. We send instructions to our factory (overhead), send it materials (memory-bandwidth), all to keep our factory running efficiently (compute).
Definition
Bandwidth costs are the costs incurred when moving data from one place to another in a computing system.
Examples:
From CPU to GPU (timed in the sketch below).
Between nodes in a network.
Within a GPU, from CUDA global memory to CUDA shared memory.
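The first of these is easy to observe directly; a minimal sketch (hypothetical tensor size, assuming a CUDA device) times a host-to-device copy:

```python
import time
import torch

x = torch.randn(2**26)                 # ~256 MB of float32 held in CPU memory

start = time.perf_counter()
y = x.to("cuda")                       # host-to-device copy: a pure bandwidth cost
torch.cuda.synchronize()
print("CPU -> GPU copy:", time.perf_counter() - start, "seconds")
```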
To understand what the memory bandwidth cost is, let's head back to our factory analogy.
Although our factory is where we do the actual work, it's not suitable as a bulk storage unit. A large part of this is that, since actual work happens there, its storage is optimised to be fast to use (SRAM) rather than to hold a lot.
So, where do we store the actual results and materials? The typical approach is to have a warehouse, probably somewhere where land is cheap and we have a lot of space (DRAM).
Then, we can ship supplies to and from our factories (memory bandwidth).
The article uses a factory analogy to explain memory bandwidth costs. The factory represents the GPU's compute units, while the warehouse symbolises the GPU's DRAM.
The cost of transporting materials (data) between the factory and warehouse represents the memory bandwidth cost.
DRAM and SRAM
DRAM is likened to a warehouse where bulk storage happens, optimised for space rather than speed.
SRAM is compared to factory storage, optimised for speed and located where work (computation) happens.
Data Movement
Each GPU kernel execution requires moving data from GPU DRAM (warehouse) to the compute units (factory) and back.
Hey! This is a very stupid arrangement. Why are we sending the same data to global memory and then back to the compute units, over and over? We should just keep the data at the factory, perform all of our compute, and then send it back!
Operator fusion is a key optimisation in deep learning compilers, combining multiple operations into one to reduce memory access.
Efficiency
By performing several computations in one go, it reduces the need to write/read data to/from global memory multiple times.
Example:
Without fusion: x.cos().cos() requires four global memory accesses (read x, write the intermediate, read the intermediate, write the result).
With fusion: the same computation requires only two (read x, write the result), as counted in the sketch below.
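The saving is easy to count; a back-of-the-envelope sketch, assuming a hypothetical fp32 tensor:

```python
# Rough accounting for x.cos().cos() on a float32 tensor of N elements.
N = 2**26                    # number of elements (illustrative)
bytes_per_elem = 4           # float32

# Unfused: read x, write intermediate, read intermediate, write output.
unfused_traffic = 4 * N * bytes_per_elem

# Fused: read x, write output.
fused_traffic = 2 * N * bytes_per_elem

print(unfused_traffic / fused_traffic)   # -> 2.0, i.e. half the global memory traffic
```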
Implementation: Operator fusion is tricky as it requires predicting future operations and generating custom CUDA code.
Complexity: Not all operator fusions are straightforward, especially when combining different types of operations like reductions and matrix multiplications.
Practical Analysis: For simple operations, it's feasible to calculate the required memory bandwidth directly.
GPU Capability: An example is given where an A100 GPU's memory bandwidth and compute capacity are compared to demonstrate when an operation becomes memory-bandwidth bound.
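Using the A100's rough published numbers (about 1.5 TB/s of global memory bandwidth and 19.5 TFLOPS of FP32 compute), a small calculation shows where the crossover sits; treat the figures as approximate:

```python
# Approximate A100 figures (round numbers).
compute = 19.5e12            # FP32 FLOPs per second
bandwidth = 1.5e12           # bytes of global memory per second
bytes_per_elem = 4           # float32

# Elements the GPU can stream per second if each is read once and written once.
elems_per_sec = bandwidth / (2 * bytes_per_elem)

# FLOPs needed per element just to keep the compute units as busy as the memory bus.
print(compute / elems_per_sec)   # ~100 FLOPs per element before compute becomes the limit
```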
Identifying Bottlenecks: Understanding the bottleneck in a deep learning system is crucial for applying the right solution.
For Overhead-Bound: Tracing, Operator Fusion, avoiding Python, using a JIT compiler.
For Bandwidth-Bound: Operator Fusion.
For Compute-Bound: Leveraging Tensor Cores, optimising GPU usage.
The article emphasises the importance of understanding these fundamental principles to effectively optimise deep learning systems, rather than blindly trying various methods. Knowing whether a system is compute-bound, memory-bandwidth bound, or overhead-bound determines the most effective optimisation strategies.