
ACL 2025 Main Conference

Taming LLMs by Scaling Learning Rates with Gradient Grouping

Siyuan Li*¹     Juanxi Tian*²     Zedong Wang*³     Xin Jin⁴
Zicheng Liu†¹     Wentao Zhang²     Dan Xu³
¹Zhejiang University     ²Peking University     ³The Hong Kong University of Science and Technology     ⁴Westlake University

Scaling with Gradient Grouping. Illustration of SGG with online grouping and group-specific learning rate (LR) scaling on top of adaptive-LR optimizers.

Abstract


Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation.

Method


SGG Framework Overview

The key steps of SGG are as follows (a minimal code sketch follows the list):

  1. Dynamic Gradient Grouping: SGG dynamically clusters the gradient statistics (specifically, momentum vectors) of each layer into K groups using an online clustering algorithm such as mini-batch K-means.
  2. Cluster-Specific Learning Rate Scaling: After grouping, SGG computes a scaling factor for each cluster based on the deviation of the cluster's statistics from the layer-wise and model-wide statistics. This factor modulates the learning rate of every parameter in the cluster.
  3. Parameter Update: The scaled learning rates are then used to update the model parameters, so each parameter's learning rate is calibrated by its group's characteristics while per-parameter adaptation is preserved.
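
To make these three steps concrete, below is a minimal PyTorch sketch of an Adam-style optimizer with SGG-style grouping and scaling. It is an illustrative sketch, not the released implementation: the class name AdamSGG, the max_scale clamp, and the exact scaling rule (pulling each cluster's effective LR toward the layer-mean momentum magnitude) are simplifying assumptions, and a full-batch 1-D K-means stands in for the online mini-batch K-means used by SGG.

import torch


def kmeans_1d(x, k, iters=10):
    """Tiny full-batch 1-D K-means returning per-element cluster ids.

    A stand-in for the online mini-batch K-means used by SGG; it assumes
    float32 input and is written for clarity, not efficiency.
    """
    # Initialize centroids from quantiles so clusters start ordered by magnitude.
    centroids = torch.quantile(x, torch.linspace(0, 1, k, device=x.device))
    for _ in range(iters):
        ids = (x[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for j in range(k):
            mask = ids == j
            if mask.any():
                centroids[j] = x[mask].mean()
    return ids


class AdamSGG(torch.optim.Optimizer):
    """Adam with a hypothetical SGG-style step: cluster per-parameter momentum
    magnitudes within each layer (here, each parameter tensor) and rescale the
    learning rate of each cluster by its deviation from the layer mean."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 k=3, max_scale=2.0):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps,
                                      k=k, max_scale=max_scale))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                m, v = state["exp_avg"], state["exp_avg_sq"]
                # Standard Adam moment updates with bias correction.
                m.mul_(b1).add_(p.grad, alpha=1 - b1)
                v.mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)
                m_hat = m / (1 - b1 ** state["step"])
                v_hat = v / (1 - b2 ** state["step"])
                update = m_hat / (v_hat.sqrt() + group["eps"])
                # Step 1: group |momentum| within this layer into k clusters.
                mag = m.abs().flatten()
                ids = kmeans_1d(mag, group["k"])
                # Step 2: per-cluster LR scaling. Assumed rule: pull each
                # cluster's effective LR toward the layer-mean magnitude,
                # clamped for stability.
                scale = torch.ones_like(mag)
                layer_mean = mag.mean() + 1e-12
                s = group["max_scale"]
                for j in range(group["k"]):
                    mask = ids == j
                    if mask.any():
                        ratio = layer_mean / (mag[mask].mean() + 1e-12)
                        scale[mask] = ratio.clamp(1.0 / s, s)
                # Step 3: parameter update with group-calibrated learning rates.
                p.add_(update * scale.view_as(p), alpha=-group["lr"])

In the actual method, cluster assignments are maintained online across steps and the scaling also accounts for model-wide global statistics; the clamp above merely keeps the sketch stable.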

Experiment

To evaluate the effectiveness and versatility of SGG, we conducted extensive experiments on various benchmarks and tasks, covering large language models (LLMs) and multimodal large language models (MLLMs). The experimental results demonstrate that SGG consistently improves performance, accelerates convergence, and exhibits robustness across different scenarios.
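
As a hypothetical illustration of how the "X+SGG" entries below are formed: wrapping a base optimizer amounts to swapping it for its SGG-augmented variant, and applying SGG under PEFT simply means passing only the trainable (e.g., LoRA) parameters. Using the AdamSGG sketch from the Method section, with model and dataloader assumed to be defined:

# Hypothetical usage of the AdamSGG sketch above; the released SGG
# wrapper and its API may differ.
trainable = [p for p in model.parameters() if p.requires_grad]  # e.g., only LoRA adapters
optimizer = AdamSGG(trainable, lr=1e-4, k=3)

for batch in dataloader:
    loss = model(**batch).loss  # assumes a Hugging Face-style forward that returns .loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()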

Comparison Results with LLMs

Pre-training on C4

Table 4: C4 pre-training with LLaMA models of various sizes (60M to 1B), comparing full-rank, memory-efficient, and low-rank optimizers. Validation perplexity (PPL↓; lower is better) is reported. Bold and green denote the best results and the gains↓ of SGG (blue background) over the related baselines (gray background). † denotes results borrowed from GaLore; all others were reimplemented in this work.
Method Venue 60M 130M 350M 1B
Adam† ICLR’15 34.06 25.08 18.80 15.56
NAdam ICLR’18 35.86 28.88 19.24 15.78
RAdam ICLR’20 30.43 25.17 19.13 15.65
LAMB ICLR’20 33.04 24.37 18.26 15.84
Adan TPAMI’23 32.01 23.14 17.32 14.70
Adam+SGG Ours 30.31 22.18 17.28 14.30
∆ Gains -3.75 -2.90 -1.52 -1.26
Adam-mini† ICLR’25 34.10 24.85 19.05 16.07
Adafactor† ICML’18 32.57 23.98 17.74 15.19
Low-Rank† arXiv’22 78.18 45.51 37.41 34.53
CAME ACL’23 31.37 23.38 17.45 14.68
CAME+SGG Ours 30.15 22.91 17.09 14.35
∆ Gains -1.22 -0.46 -0.36 -0.33
APOLLO† MLSys’25 31.55 22.94 16.85 14.20
APOLLO+SGG Ours 30.18 22.52 16.54 13.95
∆ Gains -1.37 -0.42 -0.31 -0.25
LoRA† ICLR’22 34.99 33.92 25.58 19.21
ReLoRA† ICLR’23 37.04 29.37 29.08 18.33
GaLore† ICML’24 34.88 25.36 18.95 15.64
LoRA+SGG Ours 30.62 23.62 17.86 14.73
∆ Gains -4.37 -10.30 -7.72 -4.48

SFT on GLUE

Table 5: GLUE benchmark results with RoBERTa-base. Top-1 accuracy (%↑; higher is better) is reported across both full-rank and low-rank (LoRA r = 4 and r = 8) settings. Bold and green denote the best results and the gains↑ of SGG (blue background) over the related baselines (gray background).
Optimizer Rank CoLA STS-B MRPC RTE SST2 MNLI QNLI QQP Average
SGD Full 62.12 90.73 87.74 79.06 94.26 87.53 92.29 92.22 85.74
AdamW Full 62.24 90.92 91.30 79.42 94.57 87.18 92.33 92.28 86.24
LAMB Full 62.09 90.59 88.72 75.45 94.72 87.71 92.42 91.46 85.40
CAME Full 62.16 90.43 89.02 75.94 94.61 87.13 92.31 91.54 85.39
APOLLO Full 62.45 90.70 90.36 77.53 94.58 87.57 92.40 92.12 85.96
AdamW+SGG Full 63.36 91.22 92.65 80.87 95.58 88.32 92.88 93.32 87.28
LAMB+SGG Full 62.47 90.90 89.46 76.53 94.95 87.81 92.89 91.78 85.85
SGD (LoRA) 4 60.32 90.31 87.75 79.06 94.27 87.39 92.16 91.89 85.39
AdamW (LoRA) 4 61.38 90.57 91.07 78.70 92.89 86.82 92.18 91.29 85.61
LAMB (LoRA) 4 61.51 90.33 89.46 74.73 94.27 87.51 92.48 91.57 85.23
DoRA 4 60.38 90.50 88.24 74.73 93.69 92.59 92.68 92.68 85.89
GaLore (LoRA) 4 60.35 90.73 92.25 79.42 94.04 87.00 92.24 91.06 85.89
AdamW+SGG 4 62.36 91.10 92.12 80.51 95.06 88.18 92.62 93.06 86.88
LAMB+SGG 4 62.47 90.90 89.46 75.53 94.95 87.73 92.92 91.78 85.72
SGD (LoRA) 8 60.57 90.29 88.48 79.42 94.32 87.44 92.23 92.10 85.61
AdamW (LoRA) 8 61.83 90.80 91.90 79.06 93.46 86.94 92.25 91.22 85.93
LAMB (LoRA) 8 61.89 90.78 89.21 79.42 94.61 87.61 92.51 91.42 85.35
DoRA 8 58.36 90.63 88.97 75.09 93.81 92.68 92.68 92.68 85.94
GaLore (LoRA) 8 60.06 90.82 92.01 79.78 94.38 87.17 92.20 91.11 85.94
AdamW+SGG 8 62.36 91.10 92.12 80.51 95.06 88.17 92.65 92.85 86.85
LAMB+SGG 8 62.47 90.90 89.46 76.53 94.95 87.85 92.87 91.78 85.85

PEFT on Commonsense Reasoning

Table 6: LLaMA-7B PEFT results on commonsense reasoning. Top-1 accuracy (%↑; higher is better) on selected tasks and the average over all tasks (Avg.) are reported. WG and OBQA denote WinoGrande and OpenBookQA. Bold and green denote the best results and the gains↑ of SGG (blue background) over the corresponding baselines, LoRA and DoRA (gray background).
Method BoolQ PIQA SIQA WG Arc-E OBQA Avg.
Parallel 67.9 76.4 78.8 78.9 73.7 75.2 72.2
LoRA 68.9 80.7 77.4 78.8 77.8 74.8 74.7
DoRA 69.7 83.4 78.6 81.0 81.9 79.2 78.4
GaLore 69.5 82.0 75.1 18.0 80.7 78.0 62.7
Fira 69.4 82.6 78.0 81.2 82.2 80.8 76.9
LoRA+SGG 70.3 83.6 78.8 80.9 81.5 79.0 77.6
∆ Gains +1.4 +2.9 +1.4 +2.1 +3.7 +4.2 +2.9
DoRA+SGG 71.4 84.8 79.5 82.8 83.8 81.2 79.6
∆ Gains +1.7 +1.4 +0.9 +1.8 +1.9 +2.0 +1.2

Comparison Results with MLLMs

Full-Rank SFT

Table 8: MLLM performance comparison on diverse benchmarks with LLaVA variants and different optimizers. Top-1 accuracy (%↑) on selected tasks and the average over all tasks (Avg.) are reported. MMB and MMB-CN denote MMBench and MMBench (Chinese). Bold and green denote the best results and the gains↑ of SGG (blue background) over the related baselines (gray background). Please view Table A6 for the full results.
Optimizer GQA VizWiz SciVQA-I VQA-T MMB MMB-CN POPE Avg.
BLIP-2 41.0 19.6 61.0 42.5 - - 85.3 -
InstructBLIP 49.2 34.5 60.5 50.1 36.0 23.7 79.8 47.7
Qwen-VL 59.3 35.2 67.1 63.8 38.2 - - -
TinyLLaVA 62.0 - 69.1 59.1 66.9 - 86.4 -
MoE-LLaVA 62.6 - 70.3 57.0 68.0 - 85.7 -
LLaVA-Phi - - 68.4 48.6 59.8 - 85.0 -
LLaVA-NeXT 64.2 57.6 70.1 64.9 67.4 60.6 86.5 67.3
LLaVA-MOD 58.7 39.2 68.0 58.5 66.3 61.9 87.0 62.8
LLaVA-KD-2B 62.3 44.7 64.7 53.4 64.0 63.7 86.3 62.7
LLaVA-v1.5 62.0 50.0 66.8 58.2 64.3 58.3 85.9 63.6
AdamW+SGG 62.4 50.2 69.8 57.4 65.9 60.1 86.3 64.6
∆ Gains +0.4 +0.2 +3.0 -0.8 +1.6 +1.8 +0.4 +1.0
Adafactor+SGG 62.8 50.6 71.6 57.3 66.3 60.8 86.0 65.1
∆ Gains +0.1 +2.4 +0.9 +0.2 +0.2 +0.4 +0.0 +0.6
LAMB+SGG 44.0 53.3 61.8 43.5 43.3 41.9 81.3 52.7
∆ Gains +0.2 +0.0 +0.3 +0.1 +0.1 +0.1 +0.1 +0.1

Low-Rank SFT (AdamW)

Table 8 (cont.)
Optimizer GQA VizWiz SciVQA-I VQA-T MMB MMB-CN POPE Avg.
LLaVA-v1.5 63.0 47.8 68.4 58.2 66.1 58.9 86.4 64.1
LoRA+SGG 63.4 51.0 70.1 58.6 66.7 59.4 86.6 65.1
∆ Gains +0.4 +3.2 +1.7 +0.4 +0.6 +0.5 +0.2 +1.0

8-bit Low-Rank SFT (AdamW)

Table 8 (cont.)
Optimizer GQA VizWiz SciVQA-I VQA-T MMB MMB-CN POPE Avg.
LLaVA-v1.5 54.3 50.7 66.4 52.5 56.0 49.8 82.9 58.9
Q-LoRA+SGG 55.1 51.3 66.7 53.0 56.1 51.0 83.4 59.5
∆ Gains +0.8 +0.6 +0.3 +0.5 +0.1 +1.2 +0.5 +0.6

BibTeX

@inproceedings{acl2025sgg,
  title={Taming LLMs by Scaling Learning Rates with Gradient Grouping},
  author={Li, Siyuan and Tian, Juanxi and Wang, Zedong and Jin, Xin and Liu, Zicheng and Zhang, Wentao and Xu, Dan},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}