Residual Stream Analysis of Overfitting And Structural Disruptions

By Admin

Mar 17, 2026

THE AI TODAY

arXiv:2603.13318v1 Announce Type: cross
Abstract: Ensuring that large language models (LLMs) remain both helpful and harmless poses a significant challenge: fine-tuning on repetitive safety datasets, where unsafe prompts are paired with standard refusal templates, often leads to false refusals, in which benign queries are declined. We first quantify this effect, showing that safety data exhibits substantially lower token entropy and 2-gram diversity (0.048) compared to general instruction data. To uncover the root cause, we introduce FlowLens, a stable PCA-based tool for residual-stream geometry analysis, and reveal that higher proportions of safety examples concentrate variance along a few components, reducing representational smoothness and driving false refusals (false refusal rate rises from 63 percent to 84 percent as safety data increases from 0 percent to 40 percent). Guided by these insights, we propose Variance Concentration Loss (VCL), an auxiliary regularizer that penalizes excessive variance concentration in mid-layer residuals. Empirical results demonstrate that VCL reduces false refusals by over 35 percentage points while maintaining or improving performance on general benchmarks such as MMLU and GSM8K.
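The abstract names two measurable quantities, 2-gram diversity of the training text and variance concentration along a few principal components of the residual stream, without giving exact formulas. The sketch below shows one common way each could be computed; it is not the paper's FlowLens tool or its VCL loss. The function names `distinct_n` and `variance_concentration`, the "unique n-grams over total n-grams" definition, and the "top-k explained variance share" definition are all assumptions made for illustration.

```python
import numpy as np


def distinct_n(token_lists, n=2):
    """Unique n-grams divided by total n-grams (assumed 'distinct-n' definition)."""
    ngrams = []
    for tokens in token_lists:
        # Slide a window of length n over each tokenized example.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def variance_concentration(residuals, k=1):
    """Share of total variance captured by the top-k principal components.

    `residuals` is a hypothetical (num_samples, hidden_dim) array of
    residual-stream activations collected at one layer.
    """
    x = np.asarray(residuals, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)      # center before PCA
    s = np.linalg.svd(x, compute_uv=False)     # singular values, descending
    var = s ** 2                               # proportional to PCA variances
    total = var.sum()
    return float(var[:k].sum() / total) if total > 0 else 0.0


# Toy data: templated refusals repeat the same bigrams, varied text does not.
refusals = [["i", "cannot", "help", "with", "that"]] * 4
varied = [["the", "cat", "sat"], ["a", "dog", "ran"], ["we", "like", "tea"]]

# Toy activations: one dominant direction versus equal variance in every direction.
rank_one = np.outer(np.arange(1.0, 9.0), np.array([1.0, 2.0, 3.0, 4.0]))
isotropic = np.vstack([np.eye(4), -np.eye(4)])

print(distinct_n(refusals), distinct_n(varied))
print(variance_concentration(rank_one), variance_concentration(isotropic))
```

In this toy setup the repeated refusal template scores lower on bigram diversity than the varied text, and the rank-one activations put nearly all of their variance on a single component. That mirrors, in miniature, the pattern the abstract attributes to training with a high proportion of safety data.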
