Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
arXiv:2606.28153v1 Announce Type: cross Abstract: Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence…
