Scaling with Gradient Grouping. Illustration of SGG with online grouping and group-specific learning rate (LR) scaling applied on top of adaptive-LR optimizers.
Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation.
The key steps of SGG are as follows:
1. Collect per-parameter gradient statistics for each layer.
2. Cluster these statistics within each layer into groups (online grouping).
3. Compute a group-specific scaling factor per cluster and apply it to calibrate each parameter's adaptive learning rate, imposing collective group-wise constraints while retaining per-parameter adaptation.
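The steps above can be sketched in NumPy. Note that everything concrete here is an illustrative assumption rather than the paper's actual algorithm: the gradient statistic (log-magnitude of the gradient), the tiny 1-D k-means with `k=2`, and the scaling rule that damps clusters toward the layer-wide median are all stand-ins for SGG's real grouping and scaling procedures.

```python
import numpy as np

def kmeans_1d(x, k=2, iters=20, seed=0):
    """Tiny 1-D k-means used to cluster per-parameter gradient statistics."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(iters):
        # Assign each statistic to its nearest cluster center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels, centers

def group_scaled_lrs(grad, base_lr=1e-3, k=2):
    """Cluster |grad| within one layer, then give each cluster a shared LR
    scale, so parameters in the same group are constrained collectively."""
    stat = np.log10(np.abs(grad).ravel() + 1e-12)  # per-parameter statistic
    labels, centers = kmeans_1d(stat, k=k)
    # Assumed scaling rule: pull each cluster's LR toward the layer median,
    # shrinking the LR of large-gradient clusters and boosting small ones.
    ref = np.median(stat)
    scales = 10.0 ** (0.5 * (ref - centers))
    return base_lr * scales[labels].reshape(grad.shape)

# One toy "layer" containing two clearly different gradient regimes.
g = np.array([[1e-4, 2e-4], [5e-1, 8e-1]])
lrs = group_scaled_lrs(g)
print(lrs)  # large-|grad| params get smaller LRs, small-|grad| params larger ones
```

In an actual optimizer wrapper, the resulting per-parameter scale would multiply the update produced by the wrapped adaptive optimizer (Adam, CAME, APOLLO, etc.) rather than a raw learning rate.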
To evaluate the effectiveness and versatility of SGG, we conducted extensive experiments on various benchmarks and tasks, covering large language models (LLMs) and multimodal large language models (MLLMs). The experimental results demonstrate that SGG consistently improves performance, accelerates convergence, and exhibits robustness across different scenarios.
Method | Venue | 60M | 130M | 350M | 1B |
---|---|---|---|---|---|
Adam† | ICLR’15 | 34.06 | 25.08 | 18.80 | 15.56 |
NAdam | ICLR’18 | 35.86 | 28.88 | 19.24 | 15.78 |
RAdam | ICLR’20 | 30.43 | 25.17 | 19.13 | 15.65 |
LAMB | ICLR’20 | 33.04 | 24.37 | 18.26 | 15.84 |
Adan | TPAMI’23 | 32.01 | 23.14 | 17.32 | 14.70 |
Adam+SGG | Ours | 30.31 | 22.18 | 17.28 | 14.30 |
∆ Gains | | -3.75 | -2.90 | -1.52 | -1.26 |
Adam-mini† | ICLR’25 | 34.10 | 24.85 | 19.05 | 16.07 |
Adafactor† | ICML’18 | 32.57 | 23.98 | 17.74 | 15.19 |
Low-Rank† | arXiv’22 | 78.18 | 45.51 | 37.41 | 34.53 |
CAME | ACL’23 | 31.37 | 23.38 | 17.45 | 14.68 |
CAME+SGG | Ours | 30.15 | 22.91 | 17.09 | 14.35 |
∆ Gains | | -1.22 | -0.46 | -0.36 | -0.33 |
APOLLO† | MLSys’25 | 31.55 | 22.94 | 16.85 | 14.20 |
APOLLO+SGG | Ours | 30.18 | 22.52 | 16.54 | 13.95 |
∆ Gains | | -1.37 | -0.42 | -0.31 | -0.25 |
LoRA† | ICLR’22 | 34.99 | 33.92 | 25.58 | 19.21 |
ReLoRA† | ICLR’23 | 37.04 | 29.37 | 29.08 | 18.33 |
GaLore† | ICML’24 | 34.88 | 25.36 | 18.95 | 15.64 |
LoRA+SGG | Ours | 30.62 | 23.62 | 17.86 | 14.73 |
∆ Gains | | -4.37 | -10.30 | -7.72 | -4.48 |
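The ∆ Gains rows report the change in validation perplexity that SGG brings over the corresponding baseline (wrapped minus baseline; negative is better, since lower perplexity is better). A quick check against the 60M column:

```python
# Validation perplexity at 60M, read from the table above.
adam, adam_sgg = 34.06, 30.31
delta = round(adam_sgg - adam, 2)
print(delta)  # -3.75, matching the "∆ Gains" row under Adam+SGG
```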
Optimizer | Rank | CoLA | STS-B | MRPC | RTE | SST2 | MNLI | QNLI | QQP | Average |
---|---|---|---|---|---|---|---|---|---|---|
SGD | Full | 62.12 | 90.73 | 87.74 | 79.06 | 94.26 | 87.53 | 92.29 | 92.22 | 85.74 |
AdamW | Full | 62.24 | 90.92 | 91.30 | 79.42 | 94.57 | 87.18 | 92.33 | 92.28 | 86.24 |
LAMB | Full | 62.09 | 90.59 | 88.72 | 75.45 | 94.72 | 87.71 | 92.42 | 91.46 | 85.40 |
CAME | Full | 62.16 | 90.43 | 89.02 | 75.94 | 94.61 | 87.13 | 92.31 | 91.54 | 85.39 |
APOLLO | Full | 62.45 | 90.70 | 90.36 | 77.53 | 94.58 | 87.57 | 92.40 | 92.12 | 85.96 |
AdamW+SGG | Full | 63.36 | 91.22 | 92.65 | 80.87 | 95.58 | 88.32 | 92.88 | 93.32 | 87.28 |
LAMB+SGG | Full | 62.47 | 90.90 | 89.46 | 76.53 | 94.95 | 87.81 | 92.89 | 91.78 | 85.85 |
SGD (LoRA) | 4 | 60.32 | 90.31 | 87.75 | 79.06 | 94.27 | 87.39 | 92.16 | 91.89 | 85.39 |
AdamW (LoRA) | 4 | 61.38 | 90.57 | 91.07 | 78.70 | 92.89 | 86.82 | 92.18 | 91.29 | 85.61 |
LAMB (LoRA) | 4 | 61.51 | 90.33 | 89.46 | 74.73 | 94.27 | 87.51 | 92.48 | 91.57 | 85.23 |
DoRA | 4 | 60.38 | 90.50 | 88.24 | 74.73 | 93.69 | 92.59 | 92.68 | 92.68 | 85.89 |
GaLore (LoRA) | 4 | 60.35 | 90.73 | 92.25 | 79.42 | 94.04 | 87.00 | 92.24 | 91.06 | 85.89 |
AdamW+SGG | 4 | 62.36 | 91.10 | 92.12 | 80.51 | 95.06 | 88.18 | 92.62 | 93.06 | 86.88 |
LAMB+SGG | 4 | 62.47 | 90.90 | 89.46 | 75.53 | 94.95 | 87.73 | 92.92 | 91.78 | 85.72 |
SGD (LoRA) | 8 | 60.57 | 90.29 | 88.48 | 79.42 | 94.32 | 87.44 | 92.23 | 92.10 | 85.61 |
AdamW (LoRA) | 8 | 61.83 | 90.80 | 91.90 | 79.06 | 93.46 | 86.94 | 92.25 | 91.22 | 85.93 |
LAMB (LoRA) | 8 | 61.89 | 90.78 | 89.21 | 79.42 | 94.61 | 87.61 | 92.51 | 91.42 | 85.35 |
DoRA | 8 | 58.36 | 90.63 | 88.97 | 75.09 | 93.81 | 92.68 | 92.68 | 92.68 | 85.94 |
GaLore (LoRA) | 8 | 60.06 | 90.82 | 92.01 | 79.78 | 94.38 | 87.17 | 92.20 | 91.11 | 85.94 |
AdamW+SGG | 8 | 62.36 | 91.10 | 92.12 | 80.51 | 95.06 | 88.17 | 92.65 | 92.85 | 86.85 |
LAMB+SGG | 8 | 62.47 | 90.90 | 89.46 | 76.53 | 94.95 | 87.85 | 92.87 | 91.78 | 85.85 |
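The Average column is the simple mean of the eight GLUE task scores. For instance, recomputing it for the AdamW+SGG (rank 4) row:

```python
# GLUE task scores for the AdamW+SGG rank-4 row, from the table above.
scores = [62.36, 91.10, 92.12, 80.51, 95.06, 88.18, 92.62, 93.06]
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 86.88, matching the Average column
```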
Method | BoolQ | PIQA | SIQA | WG | Arc-E | OBQA | Avg. |
---|---|---|---|---|---|---|---|
Parallel | 67.9 | 76.4 | 78.8 | 78.9 | 73.7 | 75.2 | 72.2 |
LoRA | 68.9 | 80.7 | 77.4 | 78.8 | 77.8 | 74.8 | 74.7 |
DoRA | 69.7 | 83.4 | 78.6 | 81.0 | 81.9 | 79.2 | 78.4 |
GaLore | 69.5 | 82.0 | 75.1 | 18.0 | 80.7 | 78.0 | 62.7 |
Fira | 69.4 | 82.6 | 78.0 | 81.2 | 82.2 | 80.8 | 76.9 |
LoRA+SGG | 70.3 | 83.6 | 78.8 | 80.9 | 81.5 | 79.0 | 77.6 |
∆ Gains | +1.4 | +2.9 | +1.4 | +2.1 | +3.7 | +4.2 | +2.9 |
DoRA+SGG | 71.4 | 84.8 | 79.5 | 82.8 | 83.8 | 81.2 | 79.6 |
∆ Gains | +1.7 | +1.4 | +0.9 | +1.8 | +1.9 | +2.0 | +1.2 |
Image Question Answering Benchmarks:

Optimizer | GQA | VizWiz | SciVQAI | VQAT | MMB | MMBCN | POPE | Avg. |
---|---|---|---|---|---|---|---|---|
BLIP-2 | 41.0 | 19.6 | 61.0 | 42.5 | - | - | 85.3 | - | |
InstructBLIP | 49.2 | 34.5 | 60.5 | 50.1 | 36.0 | 23.7 | 79.8 | 47.7 | |
Qwen-VL | 59.3 | 35.2 | 67.1 | 63.8 | 38.2 | - | - | - | |
TinyLLaVA | 62.0 | - | 69.1 | 59.1 | 66.9 | - | 86.4 | - | |
MoE-LLaVA | 62.6 | - | 70.3 | 57.0 | 68.0 | - | 85.7 | - | |
LLaVA-Phi | - | - | 68.4 | 48.6 | 59.8 | - | 85.0 | - | |
LLaVA-NeXT | 64.2 | 57.6 | 70.1 | 64.9 | 67.4 | 60.6 | 86.5 | 67.3 | |
LLaVA-MOD | 58.7 | 39.2 | 68.0 | 58.5 | 66.3 | 61.9 | 87.0 | 62.8 | |
LLaVA-KD-2B | 62.3 | 44.7 | 64.7 | 53.4 | 64.0 | 63.7 | 86.3 | 62.7 | |
LLaVA-v1.5 | 62.0 | 50.0 | 66.8 | 58.2 | 64.3 | 58.3 | 85.9 | 63.6 | |
AdamW+SGG | 62.4 | 50.2 | 69.8 | 57.4 | 65.9 | 60.1 | 86.3 | 64.6 | |
∆ Gains | +0.4 | +0.2 | +3.0 | -0.8 | +1.6 | +1.8 | +0.4 | +1.0 | |
Adafactor+SGG | 62.8 | 50.6 | 71.6 | 57.3 | 66.3 | 60.8 | 86.0 | 65.1 | |
∆ Gains | +0.1 | +2.4 | +0.9 | +0.2 | +0.2 | +0.4 | +0.0 | +0.6 | |
LAMB+SGG | 44.0 | 53.3 | 61.8 | 43.5 | 43.3 | 41.9 | 81.3 | 52.7 | |
∆ Gains | +0.2 | +0.0 | +0.3 | +0.1 | +0.1 | +0.1 | +0.1 | +0.1 |
Image Question Answering Benchmarks:

Optimizer | GQA | VizWiz | SciVQAI | VQAT | MMB | MMBCN | POPE | Avg. |
---|---|---|---|---|---|---|---|---|
LLaVA-v1.5 | 63.0 | 47.8 | 68.4 | 58.2 | 66.1 | 58.9 | 86.4 | 64.1 | |
LoRA+SGG | 63.4 | 51.0 | 70.1 | 58.6 | 66.7 | 59.4 | 86.6 | 65.1 | |
∆ Gains | +0.4 | +2.2 | +1.5 | +0.4 | +0.6 | +0.5 | +0.2 | +1.0 |
Image Question Answering Benchmarks:

Optimizer | GQA | VizWiz | SciVQAI | VQAT | MMB | MMBCN | POPE | Avg. |
---|---|---|---|---|---|---|---|---|
LLaVA-v1.5 | 54.3 | 50.7 | 66.4 | 52.5 | 56.0 | 49.8 | 82.9 | 58.9 | |
Q-LoRA+SGG | 55.1 | 51.3 | 66.7 | 53.0 | 56.1 | 51.0 | 83.4 | 59.5 | |
∆ Gains | +0.8 | +0.6 | +0.3 | +0.5 | +0.1 | +0.2 | +0.5 | +0.6 |
@inproceedings{acl2025sgg,
  title={Taming LLMs with Gradient Grouping},
  author={Li, Siyuan and Tian, Juanxi and Wang, Zedong and Jin, Xin and Liu, Zicheng and Zhang, Wentao and Xu, Dan},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}