Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Writing
Paper available on arXiv: https://arxiv.org/abs/2503.06269
Presentation post: the post can also be read on LessWrong.
First failed experiments (Oct 24): This work is my capstone project for the AISF Alignment course. It adapts the Greedy Coordinate Gradient (GCG) attack to exploit the refusal subspace in large language models.
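The notebook has the full details; as a minimal sketch of the underlying idea only (not the paper's exact implementation), one can estimate a refusal direction by a difference of means between harmful and harmless prompts, then use the projection onto that direction as a GCG-style objective. The model choice, layer, and helper names here are illustrative assumptions.

```python
import torch
from transformer_lens import HookedTransformer

# Illustrative small model; the paper targets safety-tuned chat models.
model = HookedTransformer.from_pretrained("gpt2")

def refusal_direction(harmful_prompts, harmless_prompts, layer):
    """Difference-in-means estimate of the refusal direction at `layer`
    (one common way to locate the refusal subspace)."""
    def mean_last_token_act(prompts):
        acts = []
        for p in prompts:
            _, cache = model.run_with_cache(p)
            acts.append(cache["resid_post", layer][0, -1])  # last token
        return torch.stack(acts).mean(dim=0)
    d = mean_last_token_act(harmful_prompts) - mean_last_token_act(harmless_prompts)
    return d / d.norm()

def ssr_loss(prompt_with_suffix, direction, layer):
    """Objective for a GCG-style suffix search: the projection of the
    last-token residual stream onto the refusal direction. Minimizing it
    reroutes the activation out of the refusal subspace."""
    _, cache = model.run_with_cache(prompt_with_suffix)
    return cache["resid_post", layer][0, -1] @ direction
```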
Experiments
Component attribution: This notebook illustrates the patching attribution method for finding safety heads, the Direct Logit Attribution (DLA) method for finding important layers, and the same DLA method for finding safety heads. See the DLA sketch after this list.
Layer diff: This notebook shows how to reproduce the experiments of Appendix Section I of the paper: Cross-layer stability of subspaces. A similarity sketch follows below.
Multi-layers: This notebook shows how to reproduce the experiments of Appendix Section J of the paper: Multi-layer targeting effects on attack success. A multi-layer objective sketch follows below.
Steering out-of-distribution: This notebook shows how to reproduce the experiments of Appendix Section H of the paper: Out of distribution discussion. A steering sketch follows below.
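For the component-attribution notebook, here is a minimal DLA sketch, reusing `model` from the block above. The refusal/compliance token pair is an illustrative assumption (both must be single tokens in the model's vocabulary), not the notebook's exact setup.

```python
def dla_safety_heads(prompt, refusal_tok=" I", compliance_tok=" Sure"):
    """Direct Logit Attribution: how much each head's output pushes the
    final-position logits toward refusal rather than compliance."""
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    # Stack each head's contribution to the residual stream at the last position.
    head_results, labels = cache.stack_head_results(
        layer=-1, pos_slice=-1, return_labels=True
    )
    head_results = cache.apply_ln_to_stack(head_results, layer=-1, pos_slice=-1)
    # Direction in residual space separating the two logits.
    direction = (
        model.W_U[:, model.to_single_token(refusal_tok)]
        - model.W_U[:, model.to_single_token(compliance_tok)]
    )
    scores = head_results[:, 0] @ direction  # one score per head, e.g. "L5H3"
    return dict(zip(labels, scores.tolist()))
```

Heads with large positive scores are candidate safety heads; the notebook's patching attribution cross-checks them by intervention rather than by direct readout.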
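For the layer-diff notebook, a rough sketch of the cross-layer comparison, reusing `refusal_direction` from the first block: a block of high pairwise similarities suggests the directions found at different layers span a stable subspace.

```python
import torch

def cross_layer_similarity(harmful_prompts, harmless_prompts):
    """Pairwise cosine similarity of the difference-in-means directions
    computed independently at every layer."""
    dirs = torch.stack([
        refusal_direction(harmful_prompts, harmless_prompts, layer)
        for layer in range(model.cfg.n_layers)
    ])
    return dirs @ dirs.T  # directions are unit-norm, so this is cosine similarity
```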
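For the multi-layers notebook, a natural multi-layer variant of the single-layer objective is to sum the projections over several target layers, so the suffix search has to reroute the activation at each of them. The dict-of-directions interface is an illustrative assumption.

```python
def multi_layer_loss(prompt_with_suffix, directions_by_layer):
    """Sum of refusal projections over several target layers
    (directions_by_layer maps layer index to a unit direction)."""
    _, cache = model.run_with_cache(prompt_with_suffix)
    return sum(
        cache["resid_post", layer][0, -1] @ direction
        for layer, direction in directions_by_layer.items()
    )
```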
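For the steering notebook, a minimal activation-steering sketch using TransformerLens hooks. The coefficient value is arbitrary; large coefficients are exactly what can push activations out of distribution, which is the subject of that appendix.

```python
def steer(prompt, direction, layer, alpha=-5.0):
    """Add alpha * direction to the residual stream at one layer. A negative
    alpha pushes activations away from the refusal direction."""
    def hook(resid, hook):
        return resid + alpha * direction
    return model.run_with_hooks(
        model.to_tokens(prompt),
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", hook)],
    )
```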
Miscellaneous
Weird answers: A collection of interesting answers I encountered during experiments.
Run and eval: This notebook gives a quick tour of the general pipeline for crafting adversarial inputs with Probe SSR. A probe-training sketch appears after this list.
Using Lens: Introduction to the main library used in the paper, TransformerLens. A minimal usage example appears after this list.
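The run-and-eval notebook defines the actual Probe SSR pipeline; as a rough sketch of the probe-training step only (reusing `model` from the first block; the scikit-learn setup is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_refusal_probe(harmful_prompts, harmless_prompts, layer):
    """Fit a linear probe separating harmful from harmless activations at
    `layer`; its weight vector defines a subspace the attack can target."""
    X, y = [], []
    for label, prompts in ((1, harmful_prompts), (0, harmless_prompts)):
        for p in prompts:
            _, cache = model.run_with_cache(p)
            X.append(cache["resid_post", layer][0, -1].detach().cpu().numpy())
            y.append(label)
    return LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
```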
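To give a flavor of the library, here is the basic TransformerLens pattern the notebook builds on: load a model and cache every intermediate activation in a single call.

```python
from transformer_lens import HookedTransformer

# Load a small pretrained model and run it while caching all activations.
model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("Hello, world!")

print(model.cfg.n_layers, model.cfg.n_heads)  # architecture at a glance
print(cache["resid_post", 0].shape)           # [batch, pos, d_model]
```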