Adaptive Mixed-Precision Training via Gradient Noise Estimation
arXiv preprint, 2024
N. Meters, L. Vasquez, A. Chen
We present an adaptive algorithm that dynamically selects per-layer numerical precision during training by estimating the signal-to-noise ratio of gradient updates. Layers that tolerate high gradient noise are cast to FP8, while sensitive layers retain BF16, with transitions governed by an exponential moving average of gradient variance. Applied to GPT-scale models, the method reduces training FLOPs by 28% with no measurable degradation in validation loss.
mixed-precision, training-efficiency, numerical-methods
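As a rough illustration of the mechanism sketched in the abstract, the snippet below keeps an exponential moving average of per-layer gradient statistics and chooses a precision from an SNR threshold. This is not the authors' implementation: the class name `PrecisionController`, the decay `beta`, the threshold `snr_threshold`, and the direction of the decision rule (low-SNR layers drop to FP8) are assumptions made for illustration only.

```python
import torch


class PrecisionController:
    """Illustrative sketch only: track an EMA of per-layer gradient mean and
    variance, and pick FP8 vs. BF16 from a signal-to-noise-ratio threshold.
    `beta` and `snr_threshold` are assumed hyperparameters, not values from
    the paper."""

    def __init__(self, beta: float = 0.99, snr_threshold: float = 10.0):
        self.beta = beta
        self.snr_threshold = snr_threshold
        self.mean_ema = {}  # EMA of gradient mean, keyed by layer name
        self.var_ema = {}   # EMA of gradient variance, keyed by layer name

    @torch.no_grad()
    def update(self, name: str, grad: torch.Tensor) -> None:
        """Fold this step's gradient statistics into the per-layer EMAs."""
        m, v = grad.mean(), grad.var()
        self.mean_ema[name] = self.beta * self.mean_ema.get(name, m) + (1 - self.beta) * m
        self.var_ema[name] = self.beta * self.var_ema.get(name, v) + (1 - self.beta) * v

    def choose_dtype(self, name: str) -> str:
        """Coarse SNR proxy: squared EMA mean over EMA variance. The rule
        assumed here is that layers whose gradients are already noise-dominated
        (low SNR) tolerate FP8, while clean-signal layers keep BF16."""
        snr = self.mean_ema[name].abs() ** 2 / (self.var_ema[name] + 1e-12)
        return "fp8" if snr < self.snr_threshold else "bf16"


# Hypothetical usage inside a training loop, after backward():
#   for name, p in model.named_parameters():
#       if p.grad is not None:
#           controller.update(name, p.grad)
#           layer_dtypes[name] = controller.choose_dtype(name)
```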