GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost

Yeonjoon Jung
Jul 21, 2025
💡 TL;DR
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivity without overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
Dive Deeper

Limitations of LoRA


What is LoRA?

Figure 1: Illustration of the LoRA and GraLoRA architectures. GraLoRA consists of k^2 small adapter pairs, where each input and output dimension is k times smaller than in the original LoRA.
LoRA (Low-Rank Adaptation) is one of the most widely adopted strategies for parameter-efficient fine-tuning (PEFT). As shown in Figure 1(a), given a pre-trained weight matrix W_0 \in \mathbb{R}^{M \times N} (where M and N denote the output and input channel dimensions), LoRA keeps W_0 frozen and introduces trainable low-rank matrices A \in \mathbb{R}^{N \times r} and B \in \mathbb{R}^{M \times r}. Then, for a given input X \in \mathbb{R}^{N \times T}, the output of the LoRA-adapted layer is:
Y = W_0 X + \alpha B A^\top X
where \alpha is the scaling factor and T denotes the token length. In the following sections, \alpha is assumed to be 1 for simplicity.
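To make the shapes concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. This is our own illustration with arbitrary dimensions, not the implementation used in the experiments.

```python
import torch

# Shapes follow the notation above: W0 in R^{M x N}, A in R^{N x r}, B in R^{M x r}, X in R^{N x T}.
M, N, r, T = 64, 128, 8, 16
alpha = 1.0                      # scaling factor, assumed to be 1 as in the text

W0 = torch.randn(M, N)           # frozen pre-trained weight
A = torch.randn(N, r) * 0.01     # trainable down-projection
B = torch.zeros(M, r)            # trainable up-projection (zero-initialized, so the initial update is zero)
X = torch.randn(N, T)            # input activations over T tokens

# LoRA forward pass: Y = W0 X + alpha * B A^T X
Y = W0 @ X + alpha * (B @ (A.t() @ X))
print(Y.shape)                   # torch.Size([64, 16])
```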

The Hidden Problem: Gradient Entanglement

Figure 2: Gradient dynamics of FFT and LoRA in the presence of an outlier input channel. The red channel in input X denotes the outlier. While FFT localizes the gradient impact, LoRA’s entire gradient update becomes disproportionately influenced by the single outlier.
While full fine-tuning (FFT) updates the entire weight matrix, LoRA only updates the decomposed low-rank matrices A and B. The gradient of the loss L with respect to LoRA's weight update R = BA^\top is:
\frac{\partial L}{\partial R} = \frac{\partial L}{\partial Y} X^\top \in \mathbb{R}^{M \times N} \dots (1)
From this, the gradients with respect to the LoRA parameters B and A are given by:
\frac{\partial L}{\partial B} = \frac{\partial L}{\partial Y} X^\top A, \quad \frac{\partial L}{\partial A^\top} = B^\top \frac{\partial L}{\partial Y} X^\top \dots (2)
In the fused weight space, these gradients drive the following effective update:
\frac{\partial L}{\partial B} A^\top + B \frac{\partial L}{\partial A^\top} = \frac{\partial L}{\partial Y} X^\top A A^\top + B B^\top \frac{\partial L}{\partial Y} X^\top \dots (3)
As shown in Figure 2, Equation (3) reveals how LoRA's structure entangles the gradient with the entire input: the FFT gradient from Equation (1) is projected through AA^\top on the right and BB^\top on the left, so a large value in any single input channel leaks into the update of every other channel.
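This entanglement can be observed numerically. The toy sketch below (our own example with random matrices, not taken from the paper) compares how much of the weight-space gradient concentrates in an outlier input channel under FFT (Equation (1)) versus LoRA (Equation (3)).

```python
import torch

torch.manual_seed(0)
M, N, r, T = 64, 128, 16, 32
A = torch.randn(N, r) / N ** 0.5
B = torch.randn(M, r) / M ** 0.5
X = torch.randn(N, T)
X[5] *= 50.0                       # make input channel 5 an outlier
dL_dY = torch.randn(M, T)          # surrogate upstream gradient

g_fft = dL_dY @ X.t()                              # Eq. (1): FFT weight-space gradient
g_lora = g_fft @ A @ A.t() + B @ B.t() @ g_fft     # Eq. (3): effective LoRA update

def outlier_ratio(G):
    # ratio of the outlier column's norm to the median column norm
    norms = G.norm(dim=0)
    return (norms[5] / norms.median()).item()

print("FFT :", outlier_ratio(g_fft))    # close to the 50x outlier scale, other columns untouched
print("LoRA:", outlier_ratio(g_lora))   # much smaller: the outlier leaks into every column via A A^T,
                                        # inflating the whole update instead of staying localized
```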

Why Does LoRA Struggle at Larger Ranks?

Counterintuitively, fine-tuning with large LoRA ranks (e.g., r > 64) often leads to reduced accuracy compared to moderate ranks. This happens due to LoRA's distinct gradient dynamics, which differ significantly from those of FFT.
Figure 3: (a) Mean input channel values for the down-projection matrices across layers in LLaMA3.1-8B. Pronounced outliers exist in Layer 1, channels 198 and 2427. (b) Gradient deviation between LoRA and FFT increases with rank, showing LoRA's susceptibility to input outliers. (c) GraLoRA gradient results at rank 128. GraLoRA noticeably reduces the gradient deviation from FFT.
Specifically, LoRA's low-rank structure causes gradients to become globally sensitive to the entire input space, as explained in the previous section. As a result, channels with unusually high activations (outlier channels) disproportionately influence gradient updates, amplifying their impact across all parameters. Figures 3(a) and 4 demonstrate this problem clearly, showing severe channel-wise imbalances and highlighting how these outlier activations increasingly dominate gradients as the rank grows.
Figure 4: Gradient distribution in the Layer 1 down-projection matrix. LoRA gradients align poorly with FFT: the outlier channel inflates the overall gradient scale, while the gradient of the outlier channel itself is under-emphasized.
This phenomenon creates a fundamental mismatch between the gradient behaviors of LoRA and FFT. Unlike FFT, where gradient updates remain localized, LoRA's entangled gradients reduce the model’s ability to selectively learn meaningful features—particularly problematic under skewed input distributions.
Although similar issues with outliers have been explored in quantization-aware training contexts, their significant influence on LoRA has remained under-investigated until now.

GraLoRA: Granular Low-Rank Adaptation


GraLoRA Design

Motivated by the observations in the previous section, we propose GraLoRA, a fine-grained modular extension that overcomes LoRA's limitations. As illustrated in Figure 1, GraLoRA partitions the original weight matrix into a k × k grid of independent sub-blocks, each with its own dedicated local low-rank adapter. Here, k is a hyperparameter controlling the number of splits along the input and output dimensions. Notably, GraLoRA reduces to standard LoRA when k = 1.
The updated weight matrix in GraLoRA is expressed as the concatenation of block-wise updates:
R_{\text{GraLoRA}} = \begin{bmatrix} B_{1,1} A_{1,1}^\top & \cdots & B_{1,k} A_{1,k}^\top \\ \vdots & \ddots & \vdots \\ B_{k,1} A_{k,1}^\top & \cdots & B_{k,k} A_{k,k}^\top \\ \end{bmatrix}, \quad A_{i,j} \in \mathbb{R}^{\frac{N}{k} \times \frac{r}{k}}, \quad B_{i,j} \in \mathbb{R}^{\frac{M}{k} \times \frac{r}{k}} \dots (4)
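As a concrete reading of Equation (4), the sketch below (our own illustration, not the reference implementation) assembles the GraLoRA update from a k × k grid of block adapters and checks that the trainable parameter count matches a rank-r LoRA.

```python
import torch

def gralora_delta(A_blocks, B_blocks):
    """Assemble Eq. (4): a k x k grid of block-wise low-rank products B_ij A_ij^T."""
    k = len(A_blocks)
    rows = [torch.cat([B_blocks[i][j] @ A_blocks[i][j].t() for j in range(k)], dim=1)
            for i in range(k)]
    return torch.cat(rows, dim=0)

M, N, r, k = 64, 128, 16, 4
A_blocks = [[torch.randn(N // k, r // k) for _ in range(k)] for _ in range(k)]  # A_ij in R^{(N/k) x (r/k)}
B_blocks = [[torch.zeros(M // k, r // k) for _ in range(k)] for _ in range(k)]  # B_ij in R^{(M/k) x (r/k)}

delta_W = gralora_delta(A_blocks, B_blocks)
print(delta_W.shape)                              # torch.Size([64, 128]), same as the full weight

# Same trainable parameter count as rank-r LoRA: k^2 * (N/k + M/k) * (r/k) = (M + N) * r
gralora_params = k * k * ((N // k) + (M // k)) * (r // k)
print(gralora_params, (M + N) * r)                # 3072 3072
```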
This structure provides two significant benefits:
  • Robustness to Outliers: Updates are tied to spatially bounded sub-blocks, so an outlier affects only the few blocks that see it, not the whole adapter.
  • Enhanced Expressivity: With k^2 independent blocks, GraLoRA raises the effective rank of the update from r to kr without extra parameters.

Why GraLoRA is More Robust to Outliers

Figure 5: Comparison of gradient distributions under outlier activation. In GraLoRA, only the blocks interacting with the outlier exhibit elevated gradients, mitigating global distortion and aligning with FFT behavior.
GraLoRA effectively isolates the influence of outlier channels. Because each block processes only a specific slice of the input channels, only the k adapter pairs in the block column that receives the outlier channel experience amplified gradients. In contrast, the remaining k^2 - k adapters remain largely unaffected, preserving their gradient magnitudes near baseline levels. Figure 5 verifies this localized gradient behavior, highlighting how GraLoRA closely mirrors FFT, where only parameters directly tied to active inputs undergo meaningful updates.
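A small autograd experiment (again a toy setup of our own, not the paper's code) makes this locality tangible: with an outlier in a single input channel, only the block column fed by that channel shows inflated adapter gradients.

```python
import torch

torch.manual_seed(0)
M, N, r, k, T = 64, 128, 16, 4, 32
A = [[torch.randn(N // k, r // k, requires_grad=True) for _ in range(k)] for _ in range(k)]
B = [[torch.randn(M // k, r // k, requires_grad=True) for _ in range(k)] for _ in range(k)]

X = torch.randn(N, T)
X[5] *= 50.0                            # outlier in input channel 5, which lands in block column 0

X_blocks = X.split(N // k, dim=0)       # k slices of the input channels
# Block-wise forward: output block i sums contributions from all k input blocks j
Y = torch.cat([sum(B[i][j] @ (A[i][j].t() @ X_blocks[j]) for j in range(k))
               for i in range(k)], dim=0)
Y.sum().backward()

# Gradient norm of each A_ij: only column j = 0 (the blocks that see the outlier) is large
for i in range(k):
    print([round(A[i][j].grad.norm().item(), 1) for j in range(k)])
```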

Enhanced Expressivity

Figure 6: Reorganized form of GraLoRA as the product of two sparse matrices, B_{\text{GraLoRA}} and A_{\text{GraLoRA}}^\top.
While initially described as concatenated block-wise updates, the GraLoRA weight update can equivalently be expressed as the product of two sparse matrices, B_{\text{GraLoRA}} and A_{\text{GraLoRA}}^\top, similar to vanilla LoRA, as shown in Figure 6. Under the assumption of linear independence among the columns of the sub-block matrices, the effective rank of R_{\text{GraLoRA}} becomes kr, precisely k times greater than that of standard LoRA. Thus, GraLoRA significantly enhances the model's expressivity while maintaining the same total parameter count.
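A quick numerical rank check (our own sketch with random factors) illustrates this: a k × k grid of rank-r/k blocks generically reaches rank kr, whereas a single rank-r product cannot exceed r.

```python
import torch

torch.manual_seed(0)
M, N, r, k = 64, 128, 16, 4

lora_update = torch.randn(M, r) @ torch.randn(r, N)
print(torch.linalg.matrix_rank(lora_update))      # tensor(16) -> rank r

block_rows = [torch.cat([torch.randn(M // k, r // k) @ torch.randn(r // k, N // k)
                         for _ in range(k)], dim=1) for _ in range(k)]
gralora_update = torch.cat(block_rows, dim=0)
print(torch.linalg.matrix_rank(gralora_update))   # tensor(64) -> rank k * r
```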

Tradeoff Analysis

Computational Overhead
  • Despite the increased granularity, GraLoRA maintains the same theoretical computational complexity (FLOPs) as LoRA by performing computations across independent parallel blocks.
Memory
  • GraLoRA slightly increases the intermediate projection memory, by a factor of k. Nevertheless, since the rank is typically small, the additional memory usage remains modest, usually under 100MB when gradient checkpointing is applied during training. A quick numeric check of both points follows below.
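Both points can be sanity-checked with simple arithmetic. The snippet below uses hypothetical LLaMA-like dimensions chosen by us for illustration.

```python
# Hypothetical shapes: a down-projection of a LLaMA-like FFN, rank 128, k = 8, T tokens.
M, N, r, k, T = 4096, 14336, 128, 8, 4096

# Adapter FLOPs (multiply-accumulates x2): A^T X, then B (A^T X)
lora_flops = 2 * N * r * T + 2 * M * r * T
gralora_flops = k * k * (2 * (N // k) * (r // k) * T + 2 * (M // k) * (r // k) * T)
print(lora_flops == gralora_flops)                 # True: identical adapter FLOPs

# Intermediate projection size: r x T for LoRA vs k^2 blocks of (r/k) x T for GraLoRA
lora_intermediate = r * T
gralora_intermediate = k * k * (r // k) * T
print(gralora_intermediate // lora_intermediate)   # k -> factor-of-k activation growth
```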

Experimental Results


Code Generation Task

Experiment Setup

  • Model: LLaMA3.1-8B
  • Training Dataset: Magicoder-Evol-Instruct-110k train dataset, a curated and decontaminated subset of WizardCoder
  • Test Dataset: HumanEval+ test dataset, following the standard protocol via the BigCode Evaluation Harness
  • Device: 4 A100 80G GPUs

Result

Table 1: Pass@1, Pass@5, and Pass@10 results on LLaMA3.1–8B using LoRA, MoRA, RaSA, and GraLoRA across different ranks. Best results per group are in bold.
As shown in Table 1, GraLoRA outperformed LoRA, MoRA, and RaSA across all tested ranks for Pass@1 accuracy. At rank 64, GraLoRA achieved an absolute improvement of +2.4% in Pass@1, +4.8% in Pass@5, and +4.1% in Pass@10 over LoRA. At rank 128, the gains were even more pronounced, with increases of +8.5% in Pass@1, +6.9% in Pass@5, and +5.1% in Pass@10. Notably, while other methods struggled to fully utilize the increasing rank capacity—often reaching performance plateaus at lower ranks—GraLoRA maintained a consistent upward trajectory, effectively overcoming the limitations of LoRA.

Commonsense Reasoning Task

Experiment Setup

  • Model: Qwen2.5-1.5B, Qwen2.5-7B, LLaMA3.1-70B
  • Training Dataset: Merged dataset composed of training sets from 8 commonsense tasks (BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-Challenge, ARC-Easy, and OpenBookQA)
  • Test Dataset: 8 commonsense tasks
  • Device: 2 H100 80G GPUs for 1.5 & 7B models / 8 A100 80G GPUs for 70B model

Result

Table 2: Commonsense reasoning accuracy across models and tasks. All values are percentages; bold indicates the best performance per row. HS stands for HellaSwag and WG for WinoGrande.
As shown in Table 2, GraLoRA outperformed other methods across a wide range of models and tasks. Notably, GraLoRA demonstrated superior performance across models of varying scales, achieving a 1.1% improvement in average accuracy on both Qwen2.5-1.5B and LLaMA3.1-70B. It also delivered a 0.9% gain on the widely used mid-sized model, Qwen2.5-7B.
Furthermore, GraLoRA achieved the best results on 20 out of 24 tasks, consistently outperforming alternatives across benchmarks. These results support our analysis in the section "Why GraLoRA is More Robust to Outliers", showing that GraLoRA's localized updates enhance alignment with FFT and promote robust generalization in multi-aspect reasoning tasks.

Conclusion


We introduced GraLoRA, a novel parameter-efficient fine-tuning (PEFT) method that enhances LoRA through granular block-wise decomposition. GraLoRA addresses the critical issue of gradient entanglement, where outlier activations distort global gradient updates, by partitioning the adapter into independently trained low-rank blocks. This localized design increases expressivity by a factor of k without extra parameters or computational overhead, effectively mitigating the gradient distortion observed in standard LoRA.
Empirically, GraLoRA consistently surpasses LoRA and strong baselines such as RaSA across various tasks and model scales. It achieves up to +8.5% absolute gain on HumanEval+ (Pass@1) for code generation and delivers substantial improvements on commonsense reasoning benchmarks, especially tasks involving multi-hop and structured reasoning.
Future work may explore adaptive partitioning, dynamic rank allocation, and applications beyond NLP, such as vision and multimodal transformers, further expanding GraLoRA’s versatility and effectiveness.
 