LLM Research Highlights: March 1-15, 2025
Exploring Innovations in Performance, Instruction Tuning, Cache Management, Quantization, and Unlearning for Large Language Models
🔑 Key takeaways from today’s newsletter
Performance Boosts: Forgetting Transformer, Multi-Attempt RL, and R1-Searcher improve efficiency, math accuracy, and search with selective memory, feedback, and RL.
Simplified Design: Normalization-Free Transformers speed up training and inference using Dynamic Tanh in a streamlined architecture.
Data Optimization: RDS+ enhances instruction tuning, achieving top performance with only 6% of the data pool.
Memory Efficiency: Q-Filters and RSQ optimize long-context handling and quantization by compressing KV Cache and prioritizing key tokens.
Compression & Fairness: TinyR1-32B-Preview and Group-Robust Unlearning deliver high accuracy and equitable data removal via distillation and unlearning techniques.
Core research
Forgetting Transformer: Softmax Attention with a Forget Gate
Paper: https://arxiv.org/abs/2503.02130
Authors: Zhixuan Lin et al. (Mila & Université de Montréal)
Focus: Enhancing Transformer performance with a data-dependent forgetting mechanism
Code: https://github.com/zhixuan-lin/forgetting-transformer
The Forgetting Transformer (FoX) addresses a key limitation in standard Transformers: their lack of an explicit mechanism to selectively forget past information, a feature common in recurrent sequence models via forget gates. While Transformers excel at long-context tasks, they often retain irrelevant details, impacting efficiency and performance on both short and long sequences. FoX introduces a novel approach by integrating a forget gate into softmax attention.

Key contribution:
Forgetting Attention: A scalar forget gate, computed as f_t = σ(w_f^⊤ x_t + b_f), down-weights unnormalized attention scores in a data-dependent manner. It is applied through a decay factor F_ij = ∏_{l=j+1}^{i} f_l, allowing the model to prioritize relevant context dynamically (a minimal sketch follows the results below).
Pro Block Design: An enhanced architecture incorporating recurrent-inspired components like output gates and token shifts, boosting performance across tasks.
Results: Evaluated on LongCrawl64 with a 16K-token context, FoX outperforms the baseline Transformer on long-context language modeling (e.g., lower per-token loss), length extrapolation, and short-context downstream tasks (e.g., 50.85% avg. accuracy on LM-eval-harness vs. 50.79% for Transformer-Pro). It matches Transformer performance on long-context downstream tasks (LongBench) while requiring no positional embeddings and remaining compatible with FlashAttention. FoX’s ability to retain long-context retrieval (near-perfect needle-in-the-haystack scores) while improving efficiency marks a significant step forward for LLMs.
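As a rough, single-head illustration of Forgetting Attention (not the authors’ FlashAttention-compatible kernel), the cumulative forget factors can be folded into the attention logits as a log-space bias; the tensor shapes and the w_f, b_f parameters below are assumptions.
```python
import torch

def forgetting_attention(q, k, v, x, w_f, b_f):
    """Minimal single-head sketch: q, k, v are (T, d); x is the (T, d_model) layer input."""
    f = torch.sigmoid(x @ w_f + b_f)            # scalar forget gate f_t per timestep, (T,)
    cum = torch.cumsum(torch.log(f), dim=0)     # prefix sums of log f
    # log F_ij = sum_{l=j+1..i} log f_l = cum_i - cum_j (zero on the diagonal)
    bias = cum.unsqueeze(1) - cum.unsqueeze(0)  # (T, T)
    scores = q @ k.T / q.shape[-1] ** 0.5 + bias
    causal = torch.tril(torch.ones_like(scores, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# toy usage
T, d, d_model = 8, 16, 32
x = torch.randn(T, d_model)
q, k, v = (torch.randn(T, d) for _ in range(3))
out = forgetting_attention(q, k, v, x, torch.randn(d_model), torch.zeros(()))
```
Adding log F_ij to the logits is equivalent to multiplying the unnormalized attention scores by F_ij, which is what keeps the gate compatible with a standard softmax attention kernel.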
Learning from Failures in Multi-Attempt Reinforcement Learning
Paper: https://arxiv.org/abs/2503.04808
Authors: Stephen Chung et al. (DualityRL & Shanghai AI Lab)
Focus: Boosting LLM reasoning through multi-attempt training with feedback
This paper extends reinforcement learning (RL) for LLMs by shifting from single-turn question-answering to a multi-attempt framework, where models refine responses based on feedback after incorrect attempts. Inspired by DeepSeek R1, the authors argue that allowing multiple tries enhances reasoning by encouraging self-refinement, a capability often absent in single-turn trained models.

Key Innovations:
Multi-Attempt Task: The model gets N attempts (sampled from 1 to 5), with a transition function terminating the dialogue on a correct answer or exhausted attempts. Feedback prompts refinement after errors.
Reward Design: +1 for a correct answer, -0.5 for a wrong answer in the correct format, and -1 otherwise, incentivizing exploration and correction without penalizing attempt count (a minimal sketch follows the results below).
Results: Fine-tuning Qwen 2.5 Math 1.5B on 8K math questions, the multi-attempt LLM improves from 45.6% accuracy (1 attempt) to 52.5% (2 attempts) across five math benchmarks (e.g., AIME 2024, MATH 500). The single-turn baseline, in contrast, rises only from 42.3% to 43.2%. Even in single-attempt evaluations, the multi-attempt model edges out the baseline (45.4% vs. 43.5%), showcasing its superior adaptability and reasoning refinement—crucial for real-world LLM applications.
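The reward scheme and multi-attempt loop are simple enough to sketch directly; generate, is_correct, and is_well_formatted below are hypothetical placeholders for the model call and the verifiers, and the feedback string is illustrative only.
```python
def attempt_reward(correct: bool, well_formatted: bool) -> float:
    """+1 for a correct answer, -0.5 for a wrong but well-formatted one, -1 otherwise."""
    if correct:
        return 1.0
    return -0.5 if well_formatted else -1.0

def multi_attempt_episode(question, generate, is_correct, is_well_formatted, max_attempts=5):
    """Roll out up to max_attempts, stopping early once an answer is correct."""
    dialogue, rewards = [question], []
    for _ in range(max_attempts):
        answer = generate(dialogue)                      # hypothetical model call
        correct = is_correct(answer)
        rewards.append(attempt_reward(correct, is_well_formatted(answer)))
        dialogue.append(answer)
        if correct:
            break
        dialogue.append("Your answer is incorrect. Please reconsider and try again.")
    return dialogue, rewards
```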
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Paper: https://arxiv.org/abs/2503.05592
Authors: Huatong Song et al. (Renmin University of China)
Focus: Enhancing LLMs with autonomous external search capabilities
Code: https://github.com/RUCAIBox/R1-Searcher
R1-Searcher tackles a persistent LLM weakness: reliance on internal knowledge, which falters on time-sensitive or knowledge-intensive tasks, leading to hallucinations. By integrating retrieval-augmented generation (RAG) with a two-stage RL approach, it trains LLMs to invoke external search systems effectively, improving reasoning without supervised fine-tuning (SFT).
Contributions:
Two-Stage RL: Stage 1 uses a retrieval reward (0.5 if search is invoked) to teach correct query formatting; Stage 2 adds an answer reward (F1 score-based) to optimize problem-solving with retrieved data.
RAG-based Rollout: Special tags (e.g., <begin_of_query>) pause generation for retrieval, integrating results seamlessly into reasoning (a rollout sketch follows the results below).
Results: Trained on HotpotQA and 2WikiMultiHopQA, R1-Searcher (Qwen-2.5-7B-Base) outperforms GPT-4o-mini-based ReARTeR by 48.2% on HotpotQA and 21.7% on 2Wiki (LLM-as-Judge scores). On the unseen Bamboogle dataset with online search, it achieves an 11.4% gain over Search-o1 (32B). This pure RL approach enhances generalization and inference efficiency, making LLMs more robust for complex, real-world queries.
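A minimal sketch of the tag-driven rollout described above: generation pauses whenever the model opens a query tag, the external search system is called, and the documents are spliced back into the context. The closing and document tags, as well as generate_until and retrieve, are assumptions here.
```python
def rag_rollout(prompt, generate_until, retrieve, max_steps=8):
    """Alternate between generation and retrieval until a final answer is produced."""
    context = prompt
    for _ in range(max_steps):
        # generation pauses when the model closes a search query or finishes its answer
        chunk = generate_until(context, stop=["<end_of_query>", "<end_of_answer>"])
        context += chunk
        if "<begin_of_query>" in chunk:
            query = chunk.split("<begin_of_query>")[-1].strip()
            docs = retrieve(query)                       # call the external search system
            context += ("<end_of_query>\n<begin_of_documents>\n"
                        + "\n".join(docs) + "\n<end_of_documents>\n")
        else:
            break                                        # no retrieval requested; rollout done
    return context
```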
Transformers without Normalization
Paper: https://arxiv.org/abs/2503.10622
Authors: Jiachen Zhu et al. (Meta, NYU, MIT, Princeton)
Focus: Eliminating normalization layers in Transformers with minimal performance trade-offs
Code: https://jiachenzhu.github.io/DyT
Normalization layers like Layer Norm (LN) and RMSNorm are cornerstones of Transformer architectures, widely assumed to be critical for stable training and strong performance. "Transformers without Normalization" challenges this assumption, introducing a surprisingly simple alternative—Dynamic Tanh (DyT)—to replace normalization entirely. The authors observe that LN often produces tanh-like, S-shaped mappings, squashing extreme values while scaling inputs. DyT leverages this insight with an element-wise operation, DyT(x)=tanh(αx), where α is a learnable parameter, bypassing the need for statistical computations.
Contribution:

Dynamic Tanh (DyT): A lightweight replacement for normalization, DyT uses tanh(αx) to mimic LN’s squashing and scaling effects, with α dynamically adjusting to input ranges.
Drop-in Simplicity: DyT integrates seamlessly into existing Transformer designs, replacing LN or RMSNorm without altering other components or requiring extensive hyperparameter tweaks.
Results:
Tested across diverse domains—vision (ViT, ConvNeXt), language (LLaMA), speech (wav2vec 2.0), and DNA modeling (HyenaDNA)—DyT matches or exceeds the performance of normalized models. For LLaMA (7B to 70B), DyT achieves equivalent training loss and zero-shot accuracy on 15 tasks, while cutting training time by 8.2% and inference time by 7.8% on a 7B model. Its efficiency stems from avoiding mean/variance calculations, making it a leaner option. Unlike alternatives like Fixup or SkipInit, DyT maintains stability and performance without complex initialization tricks, using just a default α = 0.5 for most tasks (though tuned for LLMs). This approach redefines Transformer design, offering a faster, simpler alternative to normalization while preserving—or even enhancing—LLM capability, especially in resource-sensitive settings.
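A minimal PyTorch sketch of a DyT layer as described above, assuming the usual per-channel affine scale and shift that a normalization layer would otherwise provide:
```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Drop-in replacement for LayerNorm / RMSNorm: y = weight * tanh(alpha * x) + bias."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))   # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))           # per-channel scale (assumed)
        self.bias = nn.Parameter(torch.zeros(dim))            # per-channel shift (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# e.g., swap nn.LayerNorm(d_model) for DynamicTanh(d_model) inside a Transformer block
y = DynamicTanh(512)(torch.randn(4, 16, 512))
```
Because there are no per-token mean or variance statistics to compute, the layer is a pure element-wise operation, which is where the reported training and inference speedups come from.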
Discover more recent research papers enhancing the performance of LLMs, published in the latter half of February 2025.
Instruction Tuning
Large-Scale Data Selection for Instruction Tuning
Paper: https://arxiv.org/abs/2503.01807
Code: https://github.com/hamishivi/automated-instruction-selection
Authors: Hamish Ivison et al. (University of Washington, Allen Institute for AI, University of Southern California)
Focus: Optimizing data selection for instruction-tuning at scale
Instruction-tuning drives language model performance, but scaling data selection to millions of samples remains tricky. This paper evaluates nine automated selection methods on pools of up to 5.8 million samples and introduces RDS+, a simple embedding-based approach that uses weighted mean pooling of LM hidden states. Despite its simplicity, RDS+ consistently beats more complex methods such as LESS and IFD across single- and multi-task settings, thrives as the data pool grows, and outperforms human-curated mixtures while improving multi-task generalization (a minimal selection sketch follows the results below).
Results: On TULU 2, RDS+ selects 326,000 samples (6% of 5.8 million) to exceed the TULU 2 mixture and match full-pool performance, with 2-point gains over baselines and random selection. This efficient method shines at scale, enhancing large-scale instruction-tuning.
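To make the embed-and-select recipe concrete, here is a minimal sketch under stated assumptions: a small off-the-shelf encoder stands in for the LM hidden states, pooling is a plain attention-mask-weighted mean, and candidates are scored by cosine similarity to a few target-task examples. None of these specific choices are claimed to be the paper’s.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# stand-in encoder; the paper pools hidden states of the LM being tuned
tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state                  # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # weight = 1 on real tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)            # (weighted) mean pooling
    return F.normalize(pooled, dim=-1)

def select_top_k(pool_texts, target_texts, k):
    """Score each candidate by its max cosine similarity to the target examples."""
    pool, target = embed(pool_texts), embed(target_texts)
    scores = (pool @ target.T).max(dim=1).values
    return scores.topk(k).indices.tolist()
```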
Cache management
Q-Filters: Leveraging Query-Key Geometry for Efficient Key-Value Cache Compression
Paper: https://arxiv.org/abs/2503.02812
Code: https://github.com/NathanGodey/qfilters
Authors: Nathan Godey et al. (Sorbonne Université, Inria, Sapienza University of Rome, University of Edinburgh, Miniml.AI)
Focus: Compressing the KV Cache for LLMs using Query-Key geometry
The KV Cache’s memory demands hinder long-context LLMs. Q-Filters leverages Query-Key geometry to estimate attention scores and prune less critical KV pairs without attention weights, ensuring compatibility with FlashAttention for efficient compression.
Key contribution:
Geometric Insight: Projects Keys onto Query eigenvectors to assess KV importance (a per-head sketch follows the results below).
Training-Free & FlashAttention-Compatible: Computes filters once, integrating with memory-efficient attention.
Results: On Llama-3.1-8B, achieves 99% accuracy with 32x compression in needle-in-a-haystack tests and cuts perplexity drop by 65% over Streaming-LLM on the Pile with a 512-item cache.
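A rough per-head sketch of the geometric idea (not the released implementation): estimate the dominant query direction once from a calibration sample, then keep the cached keys that project most strongly onto it. The calibration size, single-direction projection, and sign convention are assumptions.
```python
import torch

def compute_q_filter(sample_queries):
    """Estimate the dominant query direction for one head from calibration queries (N, d)."""
    _, _, vh = torch.linalg.svd(sample_queries, full_matrices=False)
    u = vh[0]                                   # top right-singular vector
    if (sample_queries @ u).mean() < 0:         # fix the SVD sign so queries project positively
        u = -u
    return u

def prune_kv_cache(keys, values, q_filter, keep: int):
    """Keep the KV pairs whose keys align best with the query direction."""
    scores = keys @ q_filter                    # proxy for future attention mass, (T,)
    idx = scores.topk(keep).indices.sort().values   # preserve temporal order
    return keys[idx], values[idx]

# toy usage (per head): compress 1024 cached tokens to 128 with a 64-dim head
f = compute_q_filter(torch.randn(4096, 64))
K, V = torch.randn(1024, 64), torch.randn(1024, 64)
K_small, V_small = prune_kv_cache(K, V, f, keep=128)
```
Because the filter is computed once per head and scoring never touches the attention weights themselves, the pruning step stays compatible with FlashAttention-style kernels.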
Quantization
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Paper: https://arxiv.org/abs/2503.01820
Authors: Yi-Lin Sung et al. (University of North Carolina at Chapel Hill)
Focus: Enhancing post-training quantization (PTQ) of LLMs by prioritizing important tokens
Code: https://github.com/ylsung/rsq
RSQ introduces a layer-wise quantization method that boosts LLM compression by focusing on high-importance tokens (e.g., those with large attention scores). It uses a three-step process (Rotate, Scale, Quantize) and token importance strategies to maintain key information, cutting computational costs.
Key Innovations:

Three-Step Process: Rotate applies orthogonal transformations to mitigate weight outliers, reducing quantization error; Scale adjusts token features according to their importance via a modified objective ‖(WX - ŴX)R‖₂², where R scales tokens dynamically; Quantize runs the GPTQ framework with a scaled Hessian matrix H_RSQ = 2XR²Xᵀ (a small sketch follows the results below).
Token Importance: Heuristic (e.g., First-N) and dynamic (AttnCon) strategies improve accuracy.
Results: On LLaMA3-8B-Instruct, RSQ gains 1.6% accuracy at 3-bit precision, with up to 3.0% improvement on long-context tasks over QuaRot, shining in extreme compression.
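A small sketch of how the scaled Hessian from the Quantize step could be assembled from calibration activations and per-token importance weights; the weight normalization and the First-N placeholder values are assumptions, and the GPTQ solver itself is omitted.
```python
import torch

def first_n_importance(n_tokens: int, n_important: int = 256, high: float = 1.0, low: float = 0.1):
    """First-N style heuristic: weight the first n_important tokens more (placeholder values)."""
    w = torch.full((n_tokens,), low)
    w[:n_important] = high
    return w

def scaled_hessian(X, token_importance):
    """RSQ-style Hessian H_RSQ = 2 X R^2 X^T with R = diag of token importance weights.

    X: (d_in, n_tokens) calibration inputs to a linear layer.
    """
    r = token_importance / token_importance.mean()   # normalize overall scale (assumption)
    Xr = X * r                                       # equivalent to X @ diag(r)
    return 2.0 * (Xr @ Xr.T)

# toy usage: 4096-dim layer, 2048 calibration tokens
X = torch.randn(4096, 2048)
H = scaled_hessian(X, first_n_importance(2048))
```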
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
Paper: https://arxiv.org/abs/2503.04872
Authors: Lin Sun et al. (Qiyuan Tech, Peking University)
Focus: Compressing LLMs via distillation with enhanced accuracy
TinyR1-32B-Preview uses Branch-Merge distillation to compress a 671B model (DeepSeek-R1) into a 32B student, improving accuracy in math, coding, and science while enabling efficient, quantization-ready deployment. It combines models using KL divergence for parameter selection.
Results: Outperforms baselines by +5.5 (Math), +4.4 (Coding), +2.9 (Science), nearly matching the 671B model and surpassing a 70B variant.
Unlearning
Group-robust Machine Unlearning
Paper: https://arxiv.org/abs/2503.09330
Authors: Thomas De Min et al. (University of Trento)
Focus: Ensuring fairness in machine unlearning for LLMs
Code: https://github.com/tdemin16/group-robust_machine_unlearning
Standard unlearning can harm model fairness when the forget set over-represents certain groups, degrading performance unevenly. This paper proposes group-robust unlearning to maintain accuracy across groups in large language models (LLMs) handling diverse data.
Key Innovations:
REWEIGHT: Reweights sampling during retraining for fair exact unlearning (a minimal sketch follows the results below).
MIU: Approximate unlearning that reduces group-specific bias by minimizing mutual information, aligning with the original model.
Results: On CelebA (0.5 unlearning ratio), MIU with REWEIGHT hits 69.0% group accuracy, beating SCRUB’s 62.9%, and keeps equalized odds delta at 0.6 vs. SCRUB’s 3.2. This method ensures equitable unlearning, enhancing LLM fairness for data removal tasks.
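To illustrate the REWEIGHT idea, here is a minimal sketch of one plausible scheme: per-sample weights that restore the original group proportions when retraining on the retain set. The exact weighting in the paper may differ; this only captures the intuition.
```python
from collections import Counter

def reweight_sampling_weights(retain_groups, original_groups):
    """Per-sample weights so retain-set sampling matches the original group proportions."""
    n_retain, n_orig = len(retain_groups), len(original_groups)
    retain_frac = {g: c / n_retain for g, c in Counter(retain_groups).items()}
    orig_frac = {g: c / n_orig for g, c in Counter(original_groups).items()}
    # up-weight groups that unlearning left under-represented in the retain set
    return [orig_frac[g] / retain_frac[g] for g in retain_groups]

# toy example: group "b" dominates the forget set, so it is up-weighted during retraining
weights = reweight_sampling_weights(
    retain_groups=["a", "a", "a", "b"],
    original_groups=["a", "a", "a", "b", "b", "b"],
)
```
These weights could then drive a weighted sampler (e.g., torch.utils.data.WeightedRandomSampler) when retraining for exact unlearning.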