Multi-GPU Distributed Training
In the previous post, we used a single GPU to train a small, 117-million-parameter model (GPT-2-small) with a batch size of 32 and a subset of optimizations: asynchronous I/O, four gradient accumulation steps, and torch.compile with default parameters. In this post, we