Reducing Activation Recomputation in Large Transformer Models
