
ACL 2025 Main Conference

Taming LLMs by Scaling Learning Rates with Gradient Grouping

Siyuan Li*¹     Juanxi Tian*²     Zedong Wang*³     Xin Jin⁴
Zicheng Liu†¹     Wentao Zhang²     Dan Xu³
¹Zhejiang University     ²Peking University     ³The Hong Kong University of Science and Technology     ⁴Westlake University

Scaling with Gradient Grouping. Illustration of SGG with online grouping and group-specific learning rate (LR) scaling on top of adaptive-LR optimizers.

Abstract


Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation.

Method


SGG Framework Overview

The key steps of SGG are as follows (a minimal code sketch follows the list):

  1. Dynamic Gradient Grouping: SGG dynamically clusters the gradient statistics (specifically, momentum vectors) of each layer into K groups using an online clustering algorithm such as mini-batch K-means.
  2. Cluster-Specific Learning Rate Scaling: After grouping, SGG computes a scaling factor for each cluster based on the deviation of the cluster's statistics from the layer-wise and model-wide statistics. This factor modulates the learning rate of every parameter in the cluster.
  3. Parameter Update: The scaled learning rates are then used to update the model parameters, so each parameter's learning rate is calibrated by its group's characteristics while per-parameter adaptation is preserved.
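
To make these three steps concrete, below is a minimal PyTorch sketch of an Adam-style optimizer with SGG-style grouping and scaling. It is an illustrative sketch, not the released implementation: the class name AdamSGG, the max_scale clamp, and the exact scaling rule (pulling each cluster's effective LR toward the layer-mean momentum magnitude) are simplifying assumptions, and a full-batch 1-D K-means stands in for the online mini-batch K-means used by SGG.

import torch


def kmeans_1d(x, k, iters=10):
    """Tiny full-batch 1-D K-means returning per-element cluster ids.

    A stand-in for the online mini-batch K-means used by SGG; it assumes
    float32 input and is written for clarity, not efficiency.
    """
    # Initialize centroids from quantiles so clusters start ordered by magnitude.
    centroids = torch.quantile(x, torch.linspace(0, 1, k, device=x.device))
    for _ in range(iters):
        ids = (x[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for j in range(k):
            mask = ids == j
            if mask.any():
                centroids[j] = x[mask].mean()
    return ids


class AdamSGG(torch.optim.Optimizer):
    """Adam with a hypothetical SGG-style step: cluster per-parameter momentum
    magnitudes within each layer (here, each parameter tensor) and rescale the
    learning rate of each cluster by its deviation from the layer mean."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 k=3, max_scale=2.0):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps,
                                      k=k, max_scale=max_scale))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                m, v = state["exp_avg"], state["exp_avg_sq"]
                # Standard Adam moment updates with bias correction.
                m.mul_(b1).add_(p.grad, alpha=1 - b1)
                v.mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)
                m_hat = m / (1 - b1 ** state["step"])
                v_hat = v / (1 - b2 ** state["step"])
                update = m_hat / (v_hat.sqrt() + group["eps"])
                # Step 1: group |momentum| within this layer into k clusters.
                mag = m.abs().flatten()
                ids = kmeans_1d(mag, group["k"])
                # Step 2: per-cluster LR scaling. Assumed rule: pull each
                # cluster's effective LR toward the layer-mean magnitude,
                # clamped for stability.
                scale = torch.ones_like(mag)
                layer_mean = mag.mean() + 1e-12
                s = group["max_scale"]
                for j in range(group["k"]):
                    mask = ids == j
                    if mask.any():
                        ratio = layer_mean / (mag[mask].mean() + 1e-12)
                        scale[mask] = ratio.clamp(1.0 / s, s)
                # Step 3: parameter update with group-calibrated learning rates.
                p.add_(update * scale.view_as(p), alpha=-group["lr"])

In the actual method, cluster assignments are maintained online across steps and the scaling also accounts for model-wide global statistics; the clamp above merely keeps the sketch stable.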

Experiment

To evaluate the effectiveness and versatility of SGG, we conducted extensive experiments on various benchmarks and tasks, covering large language models (LLMs) and multimodal large language models (MLLMs). The experimental results demonstrate that SGG consistently improves performance, accelerates convergence, and exhibits robustness across different scenarios.
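
As a hypothetical illustration of how the "X+SGG" entries below are formed: wrapping a base optimizer amounts to swapping it for its SGG-augmented variant, and applying SGG under PEFT simply means passing only the trainable (e.g., LoRA) parameters. Using the AdamSGG sketch from the Method section, with model and dataloader assumed to be defined:

# Hypothetical usage of the AdamSGG sketch above; the released SGG
# wrapper and its API may differ.
trainable = [p for p in model.parameters() if p.requires_grad]  # e.g., only LoRA adapters
optimizer = AdamSGG(trainable, lr=1e-4, k=3)

for batch in dataloader:
    loss = model(**batch).loss  # assumes a Hugging Face-style forward that returns .loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()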

Comparison Results with LLMs

Pre-training on C4

Table 4: C4 pre-training with LLaMA models of various sizes (60M to 1B), comparing full-rank, memory-efficient, and low-rank optimizers. Validation perplexity (PPL↓; lower is better) is reported. Bold and green denote the best results and the gains↓ of SGG (blue background) over the related baselines (gray background). † denotes results borrowed from GaLore; all others were reimplemented in this work.
Method Venue 60M 130M 350M 1B
Adam† ICLR’15 34.06 25.08 18.80 15.56
NAdam ICLR’18 35.86 28.88 19.24 15.78
RAdam ICLR’20 30.43 25.17 19.13 15.65
LAMB ICLR’20 33.04 24.37 18.26 15.84
Adan TPAMI’23 32.01 23.14 17.32 14.70
Adam+SGG Ours 30.31 22.18 17.28 14.30
∆ Gains -3.75 -2.90 -1.52 -1.26
Adam-mini† ICLR’25 34.10 24.85 19.05 16.07
Adafactor† ICML’18 32.57 23.98 17.74 15.19
Low-Rank† arXiv’22 78.18 45.51 37.41 34.53
CAME ACL’23 31.37 23.38 17.45 14.68
CAME+SGG Ours 30.15 22.91 17.09 14.35
∆ Gains -1.22 -0.46 -0.36 -0.33
APOLLO† MLSys’25 31.55 22.94 16.85 14.20
APOLLO+SGG Ours 30.18 22.52 16.54 13.95
∆ Gains -1.37 -0.42 -0.31 -0.25
LoRA† ICLR’22 34.99 33.92 25.58 19.21
ReLoRA† ICLR’23 37.04 29.37 29.08 18.33
GaLore† ICML’24 34.88 25.36 18.95 15.64
LoRA+SGG Ours 30.62 23.62 17.86 14.73
∆ Gains -4.37 -10.30 -7.72 -4.48

SFT on GLUE

Table 5: GLUE benchmark results with RoBERTa-base. Top-1 accuracy (%↑; higher is better) is reported across both full-rank and low-rank (LoRA r = 4 and r = 8) settings. Bold and green denote the best results and the gains↑ of SGG (blue background) over the related baselines (gray background).
Optimizer Rank CoLA STS-B MRPC RTE SST2 MNLI QNLI QQP Average
SGD Full 62.12 90.73 87.74 79.06 94.26 87.53 92.29 92.22 85.74
AdamW Full 62.24 90.92 91.30 79.42 94.57 87.18 92.33 92.28 86.24
LAMB Full 62.09 90.59 88.72 75.45 94.72 87.71 92.42 91.46 85.40
CAME Full 62.16 90.43 89.02 75.94 94.61 87.13 92.31 91.54 85.39
APOLLO Full 62.45 90.70 90.36 77.53 94.58 87.57 92.40 92.12 85.96
AdamW+SGG Full 63.36 91.22 92.65 80.87 95.58 88.32 92.88 93.32 87.28
LAMB+SGG Full 62.47 90.90 89.46 76.53 94.95 87.81 92.89 91.78 85.85
SGD (LoRA) 4 60.32 90.31 87.75 79.06 94.27 87.39 92.16 91.89 85.39
AdamW (LoRA) 4 61.38 90.57 91.07 78.70 92.89 86.82 92.18 91.29 85.61
LAMB (LoRA) 4 61.51 90.33 89.46 74.73 94.27 87.51 92.48 91.57 85.23
DoRA 4 60.38 90.50 88.24 74.73 93.69 92.59 92.68 92.68 85.89
GaLore (LoRA) 4 60.35 90.73 92.25 79.42 94.04 87.00 92.24 91.06 85.89
AdamW+SGG 4 62.36 91.10 92.12 80.51 95.06 88.18 92.62 93.06 86.88
LAMB+SGG 4 62.47 90.90 89.46 75.53 94.95 87.73 92.92 91.78 85.72
SGD (LoRA) 8 60.57 90.29 88.48 79.42 94.32 87.44 92.23 92.10 85.61
AdamW (LoRA) 8 61.83 90.80 91.90 79.06 93.46 86.94 92.25 91.22 85.93
LAMB (LoRA) 8 61.89 90.78 89.21 79.42 94.61 87.61 92.51 91.42 85.35
DoRA 8 58.36 90.63 88.97 75.09 93.81 92.68 92.68 92.68 85.94
GaLore (LoRA) 8 60.06 90.82 92.01 79.78 94.38 87.17 92.20 91.11 85.94
AdamW+SGG 8 62.36 91.10 92.12 80.51 95.06 88.17 92.65 92.85 86.85
LAMB+SGG 8 62.47 90.90 89.46 76.53 94.95 87.85 92.87 91.78 85.85

PEFT on Commonsense Reasoning

Table 6: LLaMA-7B PEFT results on commonsense reasoning. Top-1 accuracy (%↑; higher is better) on selected tasks and the average over all tasks (Avg.) are reported. WG and OBQA denote WinoGrande and OpenBookQA. Bold and green denote the best results and the gains↑ of SGG (blue background) over the corresponding baselines, LoRA and DoRA (gray background).
Method BoolQ PIQA SIQA WG Arc-E OBQA Avg.
Parallel 67.9 76.4 78.8 78.9 73.7 75.2 72.2
LoRA 68.9 80.7 77.4 78.8 77.8 74.8 74.7
DoRA 69.7 83.4 78.6 81.0 81.9 79.2 78.4
GaLore 69.5 82.0 75.1 18.0 80.7 78.0 62.7
Fira 69.4 82.6 78.0 81.2 82.2 80.8 76.9
LoRA+SGG 70.3 83.6 78.8 80.9 81.5 79.0 77.6
∆ Gains +1.4 +2.9 +1.4 +2.1 +3.7 +4.2 +2.9
DoRA+SGG 71.4 84.8 79.5 82.8 83.8 81.2 79.6
∆ Gains +1.7 +1.4 +0.9 +1.8 +1.9 +2.0 +1.2

Comparison Results with MLLMs

Full-Rank SFT

Table 8: MLLM performance comparison on diverse benchmarks with LLaVA variants and different optimizers. Top-1 accuracy (%↑) on selected tasks and the average over all tasks (Avg.) are reported. MMB and MMB-CN denote MMBench and MMBench (Chinese). Bold and green denote the best results and the gains↑ of SGG (blue background) over the related baselines (gray background). Please view Table A6 for the full results.
Optimizer GQA VizWiz SciVQA-I VQA-T MMB MMB-CN POPE Avg.
BLIP-2 41.0 19.6 61.0 42.5 - - 85.3 -
InstructBLIP 49.2 34.5 60.5 50.1 36.0 23.7 79.8 47.7
Qwen-VL 59.3 35.2 67.1 63.8 38.2 - - -
TinyLLaVA 62.0 - 69.1 59.1 66.9 - 86.4 -
MoE-LLaVA 62.6 - 70.3 57.0 68.0 - 85.7 -
LLaVA-Phi - - 68.4 48.6 59.8 - 85.0 -
LLaVA-NeXT 64.2 57.6 70.1 64.9 67.4 60.6 86.5 67.3
LLaVA-MOD 58.7 39.2 68.0 58.5 66.3 61.9 87.0 62.8
LLaVA-KD-2B 62.3 44.7 64.7 53.4 64.0 63.7 86.3 62.7
LLaVA-v1.5 62.0 50.0 66.8 58.2 64.3 58.3 85.9 63.6
AdamW+SGG 62.4 50.2 69.8 57.4 65.9 60.1 86.3 64.6
∆ Gains +0.4 +0.2 +3.0 -0.8 +1.6 +1.8 +0.4 +1.0
Adafactor+SGG 62.8 50.6 71.6 57.3 66.3 60.8 86.0 65.1
∆ Gains +0.1 +2.4 +0.9 +0.2 +0.2 +0.4 +0.0 +0.6
LAMB+SGG 44.0 53.3 61.8 43.5 43.3 41.9 81.3 52.7
∆ Gains +0.2 +0.0 +0.3 +0.1 +0.1 +0.1 +0.1 +0.1

Low-Rank SFT (AdamW)

Table 8 (cont.)
Optimizer GQA VizWiz SciVQA-I VQA-T MMB MMB-CN POPE Avg.
LLaVA-v1.5 63.0 47.8 68.4 58.2 66.1 58.9 86.4 64.1
LoRA+SGG 63.4 51.0 70.1 58.6 66.7 59.4 86.6 65.1
∆ Gains +0.4 +3.2 +1.7 +0.4 +0.6 +0.5 +0.2 +1.0

8-bit Low-Rank SFT (AdamW)

Table 8 (cont.)
Optimizer GQA VizWiz SciVQA-I VQA-T MMB MMB-CN POPE Avg.
LLaVA-v1.5 54.3 50.7 66.4 52.5 56.0 49.8 82.9 58.9
Q-LoRA+SGG 55.1 51.3 66.7 53.0 56.1 51.0 83.4 59.5
∆ Gains +0.8 +0.6 +0.3 +0.5 +0.1 +1.2 +0.5 +0.6

BibTeX

@inproceedings{acl2025sgg,
  title={Taming LLMs by Scaling Learning Rates with Gradient Grouping},
  author={Li, Siyuan and Tian, Juanxi and Wang, Zedong and Jin, Xin and Liu, Zicheng and Zhang, Wentao and Xu, Dan},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}