GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivity without overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
Jul 21, 2025
Limitation of LoRA
What is LoRA?

LoRA (Low-Rank Adaptation) is one of the most widely adopted strategies for parameter-efficient fine-tuning (PEFT). As shown in Figure 1(a), given a pre-trained weight matrix $W_0 \in \mathbb{R}^{C_{out} \times C_{in}}$ ($C_{in}$ and $C_{out}$ represent the input and output channel dimensions), LoRA keeps $W_0$ frozen and introduces trainable low-rank matrices $A \in \mathbb{R}^{r \times C_{in}}$ and $B \in \mathbb{R}^{C_{out} \times r}$ with $r \ll \min(C_{in}, C_{out})$. Then, for a given input $X \in \mathbb{R}^{C_{in} \times T}$, the output of the LoRA-adapted layer is:

$$Y = W_0 X + \frac{\alpha}{r} B A X$$

where $\alpha$ is the scaling factor and $T$ denotes the token length. In the following sections, the scaling factor $\frac{\alpha}{r}$ is assumed to be 1 for simplicity.
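To make the setup concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. This is illustrative rather than the reference implementation: the class name, initialization, and default hyperparameters are assumptions, following the common convention of zero-initializing $B$ so training starts from the pre-trained weights.

```python
# Minimal LoRA layer sketch (illustrative; not the reference implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, c_in: int, c_out: int, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(c_in, c_out, bias=False)
        self.base.weight.requires_grad_(False)                # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, c_in) * 0.01)    # r x C_in
        self.B = nn.Parameter(torch.zeros(c_out, r))          # C_out x r, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., C_in); the low-rank path adds scale * B A x to the frozen output
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(c_in=1024, c_out=1024, r=16)
y = layer(torch.randn(4, 1024))                               # (4, 1024)
```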
The Hidden Problem: Gradient Entanglement

While full fine-tuning (FFT) updates the entire weight matrix, LoRA only updates the decomposed low-rank matrices $A$ and $B$. The gradient of the loss with respect to LoRA's update $\Delta W = BA$ is:

$$\nabla_{\Delta W}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial Y} X^{\top}$$

From this, the gradients with respect to the LoRA parameters $A$ and $B$ are given by:

$$\nabla_{B}\mathcal{L} = \nabla_{\Delta W}\mathcal{L}\, A^{\top}, \qquad \nabla_{A}\mathcal{L} = B^{\top}\, \nabla_{\Delta W}\mathcal{L}$$

These result in the following update in the fused weight space:

$$\widetilde{\nabla}_{\Delta W}\mathcal{L} = \nabla_{B}\mathcal{L}\, A + B\, \nabla_{A}\mathcal{L} = \nabla_{\Delta W}\mathcal{L}\, A^{\top} A + B B^{\top}\, \nabla_{\Delta W}\mathcal{L}$$
As shown in Figure 2, this expression reveals how LoRA's structure introduces non-trivial interactions between the gradient and the input, particularly through the rank-$r$ mixing terms $A^{\top}A$ and $BB^{\top}$.
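These identities are easy to verify numerically. The snippet below is a small sanity check written with PyTorch autograd under the notation above; the dimensions are arbitrary toy values.

```python
# Check the LoRA gradient identities and form the fused-space update.
import torch

c_out, c_in, r, T = 32, 48, 8, 16
A = torch.randn(r, c_in, dtype=torch.float64, requires_grad=True)
B = torch.randn(c_out, r, dtype=torch.float64, requires_grad=True)
X = torch.randn(c_in, T, dtype=torch.float64)
G = torch.randn(c_out, T, dtype=torch.float64)   # stand-in for dL/dY

loss = (B @ A @ X * G).sum()                     # chosen so that dL/dY == G
loss.backward()

with torch.no_grad():
    g_dw = G @ X.T                               # dL/dDeltaW
    assert torch.allclose(B.grad, g_dw @ A.T)    # dL/dB = (dL/dDeltaW) A^T
    assert torch.allclose(A.grad, B.T @ g_dw)    # dL/dA = B^T (dL/dDeltaW)
    fused = g_dw @ A.T @ A + B @ B.T @ g_dw      # entangled fused-space update
    print(fused.shape)                           # (c_out, c_in)
```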
Why Does LoRA Struggle at Larger Ranks?
Counterintuitively, fine-tuning with large LoRA ranks (e.g., r > 64) often leads to reduced accuracy compared to moderate ranks. This happens because of LoRA's unique gradient dynamics, which differ significantly from those of FFT.

Specifically, LoRA's low-rank structure causes gradients to become globally sensitive to the entire input space, as explained in the previous section. As a result, channels with unusually high activations (outlier channels) disproportionately influence gradient updates, amplifying their impact across all parameters. Figures 3(a) and 4 demonstrate this problem clearly, showing severe channel-wise imbalances and highlighting how these outlier activations increasingly dominate gradients as the rank grows.

This phenomenon creates a fundamental mismatch between the gradient behaviors of LoRA and FFT. Unlike FFT, where gradient updates remain localized, LoRA's entangled gradients reduce the model’s ability to selectively learn meaningful features—particularly problematic under skewed input distributions.
Although similar issues with outliers have been explored in quantization-aware training contexts, their significant influence on LoRA has remained under-investigated until now.
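This entanglement is easy to reproduce in a toy setting. The sketch below (PyTorch, arbitrary toy dimensions) scales a single input channel and compares the FFT gradient, which stays confined to that channel's column of the weight matrix, against LoRA's fused-space update, where the $A^{\top}A$ term spreads the outlier's influence across columns that have nothing to do with it.

```python
# Toy check: an outlier input channel inflates LoRA's fused update even for
# unrelated weight columns, while the FFT gradient stays localized.
import torch

torch.manual_seed(0)
c_out, c_in, r, T, out_ch = 64, 64, 16, 128, 7
G = torch.randn(c_out, T)                           # stand-in for dL/dY
A, B = torch.randn(r, c_in), torch.randn(c_out, r)

X = torch.randn(c_in, T)
X_out = X.clone()
X_out[out_ch] *= 100.0                              # inject one outlier channel

def grads(Xm):
    g_fft = G @ Xm.T                                # FFT gradient w.r.t. W
    g_lora = g_fft @ A.T @ A + B @ B.T @ g_fft      # LoRA fused-space update
    return g_fft, g_lora

keep = [j for j in range(c_in) if j != out_ch]      # columns unrelated to the outlier
(f0, l0), (f1, l1) = grads(X), grads(X_out)
print((f1[:, keep].norm() / f0[:, keep].norm()).item())  # ~1.0: FFT unaffected
print((l1[:, keep].norm() / l0[:, keep].norm()).item())  # >> 1: LoRA update inflated
```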
GraLoRA: Granular Low-Rank Adaptation
GraLoRA Design
Motivated by the observations in the previous section, we propose GraLoRA, a fine-grained modular extension that overcomes LoRA's limitations. As illustrated in Figure 1, GraLoRA partitions the original weight matrix into a $k \times k$ grid of independent sub-blocks, each with its own dedicated local low-rank adapter of rank $r/k$. Here, $k$ is a hyperparameter controlling the number of splits along the input and output dimensions. Notably, GraLoRA reduces to standard LoRA when $k = 1$.
The updated weight matrix in GraLoRA is expressed as the concatenation of block-wise updates:

$$\Delta W = \begin{bmatrix} B_{1,1}A_{1,1} & \cdots & B_{1,k}A_{1,k} \\ \vdots & \ddots & \vdots \\ B_{k,1}A_{k,1} & \cdots & B_{k,k}A_{k,k} \end{bmatrix}, \qquad B_{i,j} \in \mathbb{R}^{\frac{C_{out}}{k} \times \frac{r}{k}}, \quad A_{i,j} \in \mathbb{R}^{\frac{r}{k} \times \frac{C_{in}}{k}}$$
This structure provides two significant benefits:
- Robustness to Outliers: updates are tied to spatially bounded sub-blocks, so an outlier affects only the few relevant blocks, not the whole adapter.
- Enhanced expressivity: with $k^2$ blocks, GraLoRA scales the total rank from $r$ to $k \times r$ without extra parameters.
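The block-partitioned design above can be sketched in a few lines of PyTorch. This is a minimal illustrative implementation under the notation of this post, not the authors' code: the class name and einsum layout are assumptions, and it requires $C_{in}$, $C_{out}$, and $r$ to be divisible by $k$.

```python
# GraLoRA-style layer sketch: a k x k grid of independent rank-(r/k) adapters.
# Total adapter parameters match plain LoRA at the same r. Illustrative only.
import torch
import torch.nn as nn

class GraLoRALinear(nn.Module):
    def __init__(self, c_in: int, c_out: int, r: int = 64, k: int = 2, alpha: float = 64.0):
        super().__init__()
        assert c_in % k == 0 and c_out % k == 0 and r % k == 0
        self.k, self.bi, self.bo, self.br = k, c_in // k, c_out // k, r // k
        self.base = nn.Linear(c_in, c_out, bias=False)
        self.base.weight.requires_grad_(False)                   # W0 stays frozen
        # A[i, j]: (r/k, C_in/k) and B[i, j]: (C_out/k, r/k) for sub-block (i, j)
        self.A = nn.Parameter(torch.randn(k, k, self.br, self.bi) * 0.01)
        self.B = nn.Parameter(torch.zeros(k, k, self.bo, self.br))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[0]                                            # x: (batch, C_in)
        xs = x.view(n, self.k, self.bi)                           # split input into k slices
        mid = torch.einsum('njc,ijrc->nijr', xs, self.A)          # A_{ij} x_j for every block
        delta = torch.einsum('nijr,ijor->nio', mid, self.B)       # B_{ij}(A_{ij} x_j), summed over j
        return self.base(x) + self.scale * delta.reshape(n, -1)

layer = GraLoRALinear(c_in=1024, c_out=1024, r=64, k=2)
y = layer(torch.randn(4, 1024))                                   # (4, 1024)
```

Because each block only touches a $C_{in}/k$ slice of the input and a $C_{out}/k$ slice of the output, the total FLOPs of the adapter path stay at the LoRA level, which is revisited in the tradeoff analysis below.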
Why GraLoRA is More Robust to Outliers

GraLoRA effectively isolates the influence of outlier channels. Because each block processes only a specific slice of the input channels, only the $k$ adapter blocks whose input slice contains an outlier channel experience amplified gradients. In contrast, the remaining adapters remain largely unaffected, preserving their gradient magnitudes near baseline levels. Figure 5 clearly verifies this localized gradient behavior, highlighting how GraLoRA closely mirrors FFT, where only parameters directly tied to active inputs undergo meaningful updates.
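The same toy experiment as before illustrates this locality. The sketch below (PyTorch, toy dimensions, analytic per-block fused gradients rather than autograd) injects an outlier into one input slice and prints the fused gradient norm of every $(i, j)$ block; only the block column that sees the outlier should stand out.

```python
# Per-block fused gradient norms under one outlier input channel:
# only the k blocks in the outlier's block column are amplified.
import torch

torch.manual_seed(0)
c_out, c_in, r, k, T = 64, 64, 32, 4, 128
bo, bi, br = c_out // k, c_in // k, r // k
G = torch.randn(c_out, T)                    # stand-in for dL/dY
X = torch.randn(c_in, T)
X[5] *= 100.0                                # outlier channel 5 -> input slice j = 0
A = torch.randn(k, k, br, bi)
B = torch.randn(k, k, bo, br)

norms = torch.zeros(k, k)
for i in range(k):                           # output block row
    for j in range(k):                       # input block column
        g = G[i*bo:(i+1)*bo] @ X[j*bi:(j+1)*bi].T              # dL/dDeltaW_{ij}
        fused = g @ A[i, j].T @ A[i, j] + B[i, j] @ B[i, j].T @ g
        norms[i, j] = fused.norm()

print(norms.round())                         # column j = 0 dominates; other columns stay small
```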
Enhanced Expressivity

While initially described as concatenated block-wise updates, the GraLoRA weight update can equivalently be expressed as a product of two matrices, similar to vanilla LoRA, as in Figure 6. Under the assumption of linear independence among the columns of each sub-block matrix set $\{B_{i,j}\}$ and $\{A_{i,j}\}$, the effective rank of $\Delta W$ becomes $k \times r$, precisely $k$ times greater than standard LoRA. Thus, GraLoRA significantly enhances the model's expressivity while maintaining the same total parameter count.
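A quick numerical check makes the rank claim tangible: with the same number of adapter parameters, randomly initialized blocks reach rank roughly $k \times r$, while plain LoRA is capped at $r$ (toy dimensions, PyTorch).

```python
# Rank of a random LoRA update vs. a GraLoRA-style block-wise update
# with the same parameter budget.
import torch

torch.manual_seed(0)
c_out, c_in, r, k = 256, 256, 32, 4
bo, bi, br = c_out // k, c_in // k, r // k

dW_lora = torch.randn(c_out, r) @ torch.randn(r, c_in)      # plain LoRA update

dW_gra = torch.zeros(c_out, c_in)                           # k x k block concatenation
for i in range(k):
    for j in range(k):
        dW_gra[i*bo:(i+1)*bo, j*bi:(j+1)*bi] = torch.randn(bo, br) @ torch.randn(br, bi)

print(torch.linalg.matrix_rank(dW_lora).item())             # r     (32)
print(torch.linalg.matrix_rank(dW_gra).item())              # k * r (128)
```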
Tradeoff Analysis
Computational Overhead
- Despite the increased granularity, GraLoRA maintains the same theoretical computational complexity (FLOPs) as LoRA by performing computations across $k^2$ smaller, independent parallel blocks.
Memory
- GraLoRA slightly increases intermediate projection memory by a factor of $k$. Nevertheless, since the rank $r$ is typically small, the additional memory usage remains modest, usually under 100MB when gradient checkpointing is applied during training.
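For intuition, here is a back-of-the-envelope estimate of that overhead for a single adapted projection. The numbers (bf16 activations, 4096 tokens, r = 128, k = 2) are illustrative assumptions, not measurements from the paper.

```python
# LoRA keeps an (r x T) intermediate activation; GraLoRA keeps (k * r x T) in total.
bytes_per_el = 2                       # bf16
r, k, tokens = 128, 2, 4096

lora_mid    = r * tokens * bytes_per_el
gralora_mid = k * r * tokens * bytes_per_el
extra_mib   = (gralora_mid - lora_mid) / 2**20
print(f"extra intermediate memory per adapted projection: {extra_mib:.2f} MiB")  # 1.00 MiB
```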
Experimental Results
Code Generation Task
Experiment Setup
- Model: LLaMA3.1-8B
- Training Dataset: Magicoder-Evol-Instruct-110k train dataset, a curated and decontaminated subset of WizardCoder
- Test Dataset: HumanEval+ test dataset following standard protocol via BigCode Evaluation Harness
- Device: 4 A100 80G GPUs
Result

As shown in Table 1, GraLoRA outperformed LoRA, MoRA, and RaSA across all tested ranks for Pass@1 accuracy. At rank 64, GraLoRA achieved an absolute improvement of +2.4% in Pass@1, +4.8% in Pass@5, and +4.1% in Pass@10 over LoRA. At rank 128, the gains were even more pronounced, with increases of +8.5% in Pass@1, +6.9% in Pass@5, and +5.1% in Pass@10. Notably, while other methods struggled to fully utilize the increasing rank capacity, often reaching performance plateaus at lower ranks, GraLoRA maintained a consistent upward trajectory, effectively overcoming the limitations of LoRA.
Commonsense Reasoning Task
Experiment Setup
- Model: Qwen2.5-1.5B, Qwen2.5-7B, LLaMA3.1-70B
- Training Dataset: Merged dataset composed of training sets from 8 commonsense tasks (BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-Challenge, ARC-Easy, and OpenBookQA)
- Test Dataset: 8 commonsense tasks
- Device: 2 H100 80G GPUs for the 1.5B and 7B models / 8 A100 80G GPUs for the 70B model
Result

As shown in Table 2, GraLoRA outperformed other methods across a wide range of models and tasks. Notably, GraLoRA demonstrated superior performance across models of varying scales, achieving a 1.1% improvement in average accuracy on both Qwen2.5-1.5B and LLaMA3.1-70B. It also delivered a 0.9% gain on the widely used mid-sized model, Qwen2.5-7B.
Furthermore, GraLoRA achieved the best results on 20 out of 24 tasks, consistently outperforming alternatives across benchmarks. These results support our analysis in the Why GraLoRA is More Robust to Outliers section, showing that GraLoRA's localized updates enhance alignment with FFT and promote robust generalization in multi-aspect reasoning tasks.
Conclusion
We introduced GraLoRA, a novel parameter-efficient fine-tuning (PEFT) method that enhances LoRA through granular block-wise decomposition. GraLoRA addresses the critical issue of gradient entanglement, where outlier activations distort global gradient updates, by partitioning the adapter into independently trained low-rank blocks. This localized design increases expressivity by a factor of $k$ without extra parameters or computational overhead, effectively mitigating the gradient distortion observed in standard LoRA.
Empirically, GraLoRA consistently surpasses LoRA and strong baselines such as RaSA across various tasks and model scales. It achieves up to +8.5% absolute gain on HumanEval+ (Pass@1) for code generation and delivers substantial improvements on commonsense reasoning benchmarks, especially tasks involving multi-hop and structured reasoning.
Future work may explore adaptive partitioning, dynamic rank allocation, and applications beyond NLP, such as vision and multimodal transformers, further expanding GraLoRA’s versatility and effectiveness.