Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

By Admin

Jan 7, 2026

arXiv:2601.01887v2 Announce Type: cross
Abstract: Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which incur significant computational overhead during realignment and noticeably degrade model utility. Contrary to the assumption that realignment needs many samples, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
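
The abstract attributes the efficiency of one-shot patching to the low-rank structure of the safety gradient. The sketch below is not the paper's analysis. It is a toy illustration, using a single linear layer with a squared-error loss and random data (all assumptions made here for illustration), of how one might measure the rank of a gradient matrix from its singular values. The helper `effective_rank` and its `energy` threshold are hypothetical names introduced for this example. For a linear layer, the weight gradient produced by one example of `n_tokens` input vectors is a sum of `n_tokens` outer products, so its rank cannot exceed `n_tokens`.

```python
import numpy as np

def effective_rank(matrix, energy=0.99):
    """Smallest number of singular values whose squared sum
    covers the given fraction of the total squared sum."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
d_out, d_in, n_tokens = 64, 48, 5

# Toy linear layer y = W x; one "example" consists of n_tokens input vectors.
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n_tokens))
Y = rng.normal(size=(d_out, n_tokens))

# For L = 0.5 * ||W X - Y||^2, dL/dy is the residual and dL/dW = residual @ X^T.
residual = W @ X - Y
grad = residual @ X.T  # shape (d_out, d_in)

# The gradient is a sum of n_tokens outer products, so rank <= n_tokens.
rank = np.linalg.matrix_rank(grad)
print("numerical rank:", rank, "of at most", min(d_out, d_in))
print("effective rank (99% energy):", effective_rank(grad))
```

In a real model one would apply the same singular-value check to the gradient of a safety loss with respect to each weight matrix. Whether those gradients are low-rank in practice is the paper's empirical claim, which this toy example does not reproduce.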


THE AI TODAY