GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost

Yeonjoon Jung
Jul 21, 2025
💡 TL;DR
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivity without overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
Dive Deeper

Limitations of LoRA


What is LoRA?

Figure 1: Illustration of the LoRA and GraLoRA architectures. GraLoRA consists of k^2 small adapter pairs, where each input and output dimension is k times smaller than in the original LoRA.
LoRA (Low-Rank Adaptation) is one of the most widely adopted strategies for parameter-efficient fine-tuning (PEFT). As shown in Figure 1(a), given a pre-trained weight matrix W_0 \in \mathbb{R}^{M \times N} (where M and N denote the output and input channel dimensions), LoRA keeps W_0 frozen and introduces trainable low-rank matrices A \in \mathbb{R}^{N \times r} and B \in \mathbb{R}^{M \times r}. Then, for a given input X \in \mathbb{R}^{N \times T}, the output of the LoRA-adapted layer is:
Y = W_0 X + \alpha B A^\top X
where \alpha is the scaling factor and T denotes the token length. In the following sections, \alpha is assumed to be 1 for simplicity.
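To make the shapes concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. This is our own illustration with arbitrary dimensions, not the implementation used in the experiments.

```python
import torch

# Shapes follow the notation above: W0 in R^{M x N}, A in R^{N x r}, B in R^{M x r}, X in R^{N x T}.
M, N, r, T = 64, 128, 8, 16
alpha = 1.0                      # scaling factor, assumed to be 1 as in the text

W0 = torch.randn(M, N)           # frozen pre-trained weight
A = torch.randn(N, r) * 0.01     # trainable down-projection
B = torch.zeros(M, r)            # trainable up-projection (zero-initialized, so the initial update is zero)
X = torch.randn(N, T)            # input activations over T tokens

# LoRA forward pass: Y = W0 X + alpha * B A^T X
Y = W0 @ X + alpha * (B @ (A.t() @ X))
print(Y.shape)                   # torch.Size([64, 16])
```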

The Hidden Problem: Gradient Entanglement

Figure 2: Gradient dynamics of FFT and LoRA in the presence of an outlier input channel. The red channel in input X denotes the outlier. While FFT localizes the gradient impact, LoRA’s entire gradient update becomes disproportionately influenced by the single outlier.
While full fine-tuning (FFT) updates the entire weight matrix, LoRA only updates the decomposed low-rank matrices A and B. The gradient of the loss L with respect to LoRA's weight update R = BA^\top is:
\frac{\partial L}{\partial R} = \frac{\partial L}{\partial Y} X^\top \in \mathbb{R}^{M \times N} \dots (1)
From this, the gradients with respect to the LoRA parameters B and A are given by:
\frac{\partial L}{\partial B} = \frac{\partial L}{\partial Y} X^\top A, \quad \frac{\partial L}{\partial A^\top} = B^\top \frac{\partial L}{\partial Y} X^\top \dots (2)
In the fused weight space, these gradients drive the following effective update:
\frac{\partial L}{\partial B} A^\top + B \frac{\partial L}{\partial A^\top} = \frac{\partial L}{\partial Y} X^\top A A^\top + B B^\top \frac{\partial L}{\partial Y} X^\top \dots (3)
As shown in Figure 2, Equation (3) reveals how LoRA's structure entangles the gradient with the entire input: the FFT gradient from Equation (1) is projected through AA^\top on the right and BB^\top on the left, so a large value in any single input channel leaks into the update of every other channel.
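This entanglement can be observed numerically. The toy sketch below (our own example with random matrices, not taken from the paper) compares how much of the weight-space gradient concentrates in an outlier input channel under FFT (Equation (1)) versus LoRA (Equation (3)).

```python
import torch

torch.manual_seed(0)
M, N, r, T = 64, 128, 16, 32
A = torch.randn(N, r) / N ** 0.5
B = torch.randn(M, r) / M ** 0.5
X = torch.randn(N, T)
X[5] *= 50.0                       # make input channel 5 an outlier
dL_dY = torch.randn(M, T)          # surrogate upstream gradient

g_fft = dL_dY @ X.t()                              # Eq. (1): FFT weight-space gradient
g_lora = g_fft @ A @ A.t() + B @ B.t() @ g_fft     # Eq. (3): effective LoRA update

def outlier_ratio(G):
    # ratio of the outlier column's norm to the median column norm
    norms = G.norm(dim=0)
    return (norms[5] / norms.median()).item()

print("FFT :", outlier_ratio(g_fft))    # close to the 50x outlier scale, other columns untouched
print("LoRA:", outlier_ratio(g_lora))   # much smaller: the outlier leaks into every column via A A^T,
                                        # inflating the whole update instead of staying localized
```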

Why Does LoRA Struggle at Larger Ranks?

Counterintuitively, fine-tuning with large LoRA ranks (e.g., r > 64) often leads to reduced accuracy compared to moderate ranks. This happens due to LoRA's distinct gradient dynamics, which differ significantly from those of FFT.
Figure 3: (a) Mean input channel values for the down-projection matrices across layers in LLaMA3.1-8B. Pronounced outliers exist in Layer 1, channels 198 and 2427. (b) Gradient deviation between LoRA and FFT increases with rank, showing LoRA's susceptibility to input outliers. (c) GraLoRA gradient results at rank 128. GraLoRA noticeably reduces the gradient deviation from FFT.
Specifically, LoRA's low-rank structure causes gradients to become globally sensitive to the entire input space, as explained in the previous section. As a result, channels with unusually high activations (outlier channels) disproportionately influence gradient updates, amplifying their impact across all parameters. Figures 3(a) and 4 demonstrate this problem clearly, showing severe channel-wise imbalances and highlighting how these outlier activations increasingly dominate gradients as the rank grows.
Figure 4: Gradient distribution in the Layer 1 down-projection matrix. LoRA gradients align poorly with FFT: the outlier channel inflates the overall gradient scale, while the gradient of the outlier channel itself is under-emphasized.
This phenomenon creates a fundamental mismatch between the gradient behaviors of LoRA and FFT. Unlike FFT, where gradient updates remain localized, LoRA's entangled gradients reduce the model’s ability to selectively learn meaningful features—particularly problematic under skewed input distributions.
Although similar issues with outliers have been explored in quantization-aware training contexts, their significant influence on LoRA has remained under-investigated until now.

GraLoRA: Granular Low-Rank Adaptation


GraLoRA Design

Motivated by the observations in the previous section, we propose GraLoRA, a fine-grained modular extension that overcomes LoRA's limitations. As illustrated in Figure 1, GraLoRA partitions the original weight matrix into a k × k grid of independent sub-blocks, each with its own dedicated local low-rank adapter. Here, k is a hyperparameter controlling the number of splits along the input and output dimensions. Notably, GraLoRA reduces to standard LoRA when k = 1.
The updated weight matrix in GraLoRA is expressed as the concatenation of block-wise updates:
R_{\text{GraLoRA}} = \begin{bmatrix} B_{1,1} A_{1,1}^\top & \cdots & B_{1,k} A_{1,k}^\top \\ \vdots & \ddots & \vdots \\ B_{k,1} A_{k,1}^\top & \cdots & B_{k,k} A_{k,k}^\top \\ \end{bmatrix}, \quad A_{i,j} \in \mathbb{R}^{\frac{N}{k} \times \frac{r}{k}}, \quad B_{i,j} \in \mathbb{R}^{\frac{M}{k} \times \frac{r}{k}} \dots (4)
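As a concrete reading of Equation (4), the sketch below (our own illustration, not the reference implementation) assembles the GraLoRA update from a k × k grid of block adapters and checks that the trainable parameter count matches a rank-r LoRA.

```python
import torch

def gralora_delta(A_blocks, B_blocks):
    """Assemble Eq. (4): a k x k grid of block-wise low-rank products B_ij A_ij^T."""
    k = len(A_blocks)
    rows = [torch.cat([B_blocks[i][j] @ A_blocks[i][j].t() for j in range(k)], dim=1)
            for i in range(k)]
    return torch.cat(rows, dim=0)

M, N, r, k = 64, 128, 16, 4
A_blocks = [[torch.randn(N // k, r // k) for _ in range(k)] for _ in range(k)]  # A_ij in R^{(N/k) x (r/k)}
B_blocks = [[torch.zeros(M // k, r // k) for _ in range(k)] for _ in range(k)]  # B_ij in R^{(M/k) x (r/k)}

delta_W = gralora_delta(A_blocks, B_blocks)
print(delta_W.shape)                              # torch.Size([64, 128]), same as the full weight

# Same trainable parameter count as rank-r LoRA: k^2 * (N/k + M/k) * (r/k) = (M + N) * r
gralora_params = k * k * ((N // k) + (M // k)) * (r // k)
print(gralora_params, (M + N) * r)                # 3072 3072
```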
This structure provides two significant benefits:
  • Robustness to Outliers: Updates are tied to spatially bounded sub-blocks, so an outlier affects only the few blocks that see it, not the whole adapter.
  • Enhanced Expressivity: With k^2 independent blocks, GraLoRA raises the effective rank of the update from r to kr without extra parameters.

Why GraLoRA is More Robust to Outliers

Figure 5: Comparison of gradient distributions under outlier activation. In GraLoRA, only the blocks interacting with the outlier exhibit elevated gradients, mitigating global distortion and aligning with FFT behavior.
GraLoRA effectively isolates the influence of outlier channels. Because each block processes only a specific slice of the input channels, only the k adapter pairs in the block column that receives the outlier channel experience amplified gradients. In contrast, the remaining k^2 - k adapters remain largely unaffected, preserving their gradient magnitudes near baseline levels. Figure 5 verifies this localized gradient behavior, highlighting how GraLoRA closely mirrors FFT, where only parameters directly tied to active inputs undergo meaningful updates.
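A small autograd experiment (again a toy setup of our own, not the paper's code) makes this locality tangible: with an outlier in a single input channel, only the block column fed by that channel shows inflated adapter gradients.

```python
import torch

torch.manual_seed(0)
M, N, r, k, T = 64, 128, 16, 4, 32
A = [[torch.randn(N // k, r // k, requires_grad=True) for _ in range(k)] for _ in range(k)]
B = [[torch.randn(M // k, r // k, requires_grad=True) for _ in range(k)] for _ in range(k)]

X = torch.randn(N, T)
X[5] *= 50.0                            # outlier in input channel 5, which lands in block column 0

X_blocks = X.split(N // k, dim=0)       # k slices of the input channels
# Block-wise forward: output block i sums contributions from all k input blocks j
Y = torch.cat([sum(B[i][j] @ (A[i][j].t() @ X_blocks[j]) for j in range(k))
               for i in range(k)], dim=0)
Y.sum().backward()

# Gradient norm of each A_ij: only column j = 0 (the blocks that see the outlier) is large
for i in range(k):
    print([round(A[i][j].grad.norm().item(), 1) for j in range(k)])
```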

Enhanced Expressivity

Figure 6: Reorganized form of GraLoRA as the product of two sparse matrices, B_{\text{GraLoRA}} and A_{\text{GraLoRA}}^\top.
While initially described as concatenated block-wise updates, the GraLoRA weight update can equivalently be expressed as the product of two sparse matrices, B_{\text{GraLoRA}} and A_{\text{GraLoRA}}^\top, similar to vanilla LoRA, as shown in Figure 6. Under the assumption of linear independence among the columns of the sub-block matrices, the effective rank of R_{\text{GraLoRA}} becomes kr, precisely k times greater than that of standard LoRA. Thus, GraLoRA significantly enhances the model's expressivity while maintaining the same total parameter count.
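A quick numerical rank check (our own sketch with random factors) illustrates this: a k × k grid of rank-r/k blocks generically reaches rank kr, whereas a single rank-r product cannot exceed r.

```python
import torch

torch.manual_seed(0)
M, N, r, k = 64, 128, 16, 4

lora_update = torch.randn(M, r) @ torch.randn(r, N)
print(torch.linalg.matrix_rank(lora_update))      # tensor(16) -> rank r

block_rows = [torch.cat([torch.randn(M // k, r // k) @ torch.randn(r // k, N // k)
                         for _ in range(k)], dim=1) for _ in range(k)]
gralora_update = torch.cat(block_rows, dim=0)
print(torch.linalg.matrix_rank(gralora_update))   # tensor(64) -> rank k * r
```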

Tradeoff Analysis

Computational Overhead
  • Despite the increased granularity, GraLoRA maintains the same theoretical computational complexity (FLOPs) as LoRA by performing computations across independent parallel blocks.
Memory
  • GraLoRA slightly increases the intermediate projection memory, by a factor of k. Nevertheless, since the rank is typically small, the additional memory usage remains modest, usually under 100MB when gradient checkpointing is applied during training. A quick numeric check of both points follows below.
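Both points can be sanity-checked with simple arithmetic. The snippet below uses hypothetical LLaMA-like dimensions chosen by us for illustration.

```python
# Hypothetical shapes: a down-projection of a LLaMA-like FFN, rank 128, k = 8, T tokens.
M, N, r, k, T = 4096, 14336, 128, 8, 4096

# Adapter FLOPs (multiply-accumulates x2): A^T X, then B (A^T X)
lora_flops = 2 * N * r * T + 2 * M * r * T
gralora_flops = k * k * (2 * (N // k) * (r // k) * T + 2 * (M // k) * (r // k) * T)
print(lora_flops == gralora_flops)                 # True: identical adapter FLOPs

# Intermediate projection size: r x T for LoRA vs k^2 blocks of (r/k) x T for GraLoRA
lora_intermediate = r * T
gralora_intermediate = k * k * (r // k) * T
print(gralora_intermediate // lora_intermediate)   # k -> factor-of-k activation growth
```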

Experimental Results


Code Generation Task

Experiment Setup

  • Model: LLaMA3.1-8B
  • Training Dataset: Magicoder-Evol-Instruct-110k train dataset, a curated and decontaminated subset of WizardCoder
  • Test Dataset: HumanEval+ test dataset, following the standard protocol via the BigCode Evaluation Harness
  • Device: 4 A100 80G GPUs

Result

Table 1: Pass@1, Pass@5, and Pass@10 results on LLaMA3.1–8B using LoRA, MoRA, RaSA, and GraLoRA across different ranks. Best results per group are in bold.
As shown in Table 1, GraLoRA outperformed LoRA, MoRA, and RaSA across all tested ranks for Pass@1 accuracy. At rank 64, GraLoRA achieved an absolute improvement of +2.4% in Pass@1, +4.8% in Pass@5, and +4.1% in Pass@10 over LoRA. At rank 128, the gains were even more pronounced, with increases of +8.5% in Pass@1, +6.9% in Pass@5, and +5.1% in Pass@10. Notably, while other methods struggled to fully utilize the increasing rank capacity—often reaching performance plateaus at lower ranks—GraLoRA maintained a consistent upward trajectory, effectively overcoming the limitations of LoRA.

Commonsense Reasoning Task

Experiment Setup

  • Model: Qwen2.5-1.5B, Qwen2.5-7B, LLaMA3.1-70B
  • Training Dataset: Merged dataset composed of training sets from 8 commonsense tasks (BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-Challenge, ARC-Easy, and OpenBookQA)
  • Test Dataset: 8 commonsense tasks
  • Device: 2 H100 80G GPUs for 1.5 & 7B models / 8 A100 80G GPUs for 70B model

Result

Table 2: Commonsense reasoning accuracy across models and tasks. All values are percentages; bold indicates the best performance per row. HS stands for HellaSwag and WG for WinoGrande.
As shown in Table 2, GraLoRA outperformed other methods across a wide range of models and tasks. Notably, GraLoRA demonstrated superior performance across models of varying scales, achieving a 1.1% improvement in average accuracy on both Qwen2.5-1.5B and LLaMA3.1-70B. It also delivered a 0.9% gain on the widely used mid-sized model, Qwen2.5-7B.
Furthermore, GraLoRA achieved the best results on 20 out of 24 tasks, consistently outperforming alternatives across benchmarks. These results support our analysis in the section "Why GraLoRA is More Robust to Outliers", showing that GraLoRA's localized updates enhance alignment with FFT and promote robust generalization in multi-aspect reasoning tasks.

Conclusion


We introduced GraLoRA, a novel parameter-efficient fine-tuning (PEFT) method that enhances LoRA through granular block-wise decomposition. GraLoRA addresses the critical issue of gradient entanglement, where outlier activations distort global gradient updates, by partitioning the adapter into independently trained low-rank blocks. This localized design increases expressivity by a factor of k without extra parameters or computational overhead, effectively mitigating the gradient distortion observed in standard LoRA.
Empirically, GraLoRA consistently surpasses LoRA and strong baselines such as RaSA across various tasks and model scales. It achieves up to +8.5% absolute gain on HumanEval+ (Pass@1) for code generation and delivers substantial improvements on commonsense reasoning benchmarks, especially tasks involving multi-hop and structured reasoning.
Future work may explore adaptive partitioning, dynamic rank allocation, and applications beyond NLP, such as vision and multimodal transformers, further expanding GraLoRA’s versatility and effectiveness.
 