Subspace Rerouting: Using Mechanistic Interpretability to Craft Attacks against Large Language Models

Writing

Paper available on arXiv: https://arxiv.org/abs/2503.06269

Presentation post: it can also be read on LessWrong.

First failed experiments (Oct 24): This work is my capstone project for the AISF Alignment course. It adapts the Greedy Coordinate Gradient (GCG) attack to exploit the refusal subspace of large language models.
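To give a flavour of the idea, here is a minimal GCG-style step in PyTorch. This is a hedged sketch, not the project's actual code: `gcg_step` is a hypothetical helper, and `loss_fn` stands in for any differentiable objective over suffix embeddings, e.g. the projection of an intermediate activation onto the refusal direction instead of GCG's usual target-string loss.

```python
import torch
import torch.nn.functional as F

def gcg_step(embed_matrix, suffix_ids, loss_fn, k=64, n_trials=16):
    """One greedy-coordinate-gradient step (hypothetical helper).

    embed_matrix: (vocab, d_model) token embedding matrix
    suffix_ids:   (seq,) current adversarial suffix token ids
    loss_fn:      differentiable map from suffix embeddings (seq, d_model)
                  to a scalar loss
    """
    vocab = embed_matrix.shape[0]
    one_hot = F.one_hot(suffix_ids, vocab).float().requires_grad_(True)
    loss = loss_fn(one_hot @ embed_matrix)    # differentiable w.r.t. one_hot
    loss.backward()
    # Most promising single-token substitutions per position:
    # directions of steepest loss decrease in one-hot space
    candidates = (-one_hot.grad).topk(k, dim=-1).indices    # (seq, k)
    best_ids, best_loss = suffix_ids, loss.item()
    with torch.no_grad():
        for pos in range(suffix_ids.shape[0]):
            for tok in candidates[pos, torch.randperm(k)[:n_trials]]:
                trial = suffix_ids.clone()
                trial[pos] = tok
                trial_loss = loss_fn(
                    F.one_hot(trial, vocab).float() @ embed_matrix).item()
                if trial_loss < best_loss:
                    best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss
```

The real GCG evaluates candidate swaps in large batches over many iterations; this only shows the coordinate-descent skeleton.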

Experiments

Component attribution: This notebook illustrates the patching attribution method for finding safety heads, the Direct Logit Attribution (DLA) method for finding important layers, and the same DLA method for finding safety heads.
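For a sense of what head-level attribution looks like in TransformerLens (a sketch, not the notebook's code; the model and `refusal_dir` are placeholders), one can stack per-head outputs from the activation cache and project each onto a refusal direction:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # placeholder; the paper targets chat models
tokens = model.to_tokens("Some harmful prompt")
_, cache = model.run_with_cache(tokens)

# Placeholder direction; in practice, e.g. the normalized difference of mean
# activations on harmful vs. harmless prompts
refusal_dir = torch.randn(model.cfg.d_model)
refusal_dir /= refusal_dir.norm()

cache.compute_head_results()                        # per-head residual contributions
head_results, labels = cache.stack_head_results(return_labels=True)
# head_results: (n_layers * n_heads, batch, seq, d_model)
scores = head_results[:, 0, -1, :] @ refusal_dir    # projection at the last position
for label, score in sorted(zip(labels, scores.tolist()),
                           key=lambda x: -abs(x[1]))[:10]:
    print(f"{label}: {score:+.3f}")
```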

Layer diff: This notebook shows how to reproduce the experiments of Appendix Section I of the paper: Cross-layer stability of subspaces.
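The measurement itself is simple; a self-contained sketch, with random placeholder vectors standing in for the per-layer refusal directions extracted in the notebook:

```python
import torch

n_layers, d_model = 24, 1024                 # placeholder dimensions
dirs = torch.randn(n_layers, d_model)        # stand-in for per-layer refusal directions
dirs = dirs / dirs.norm(dim=-1, keepdim=True)

# Pairwise cosine similarity between layers: large off-diagonal values mean
# the refusal subspace stays stable across layers
cos_sim = dirs @ dirs.T
print(cos_sim)
```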

Multi-layers: This notebook shows how to reproduce the experiments of Appendix Section J of the paper: Multi-layer targeting effects on attack success.
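Schematically, the multi-layer variant amounts to optimizing a weighted sum of single-layer objectives (all names here are illustrative):

```python
import torch

def multi_layer_loss(losses_by_layer, layers, weights=None):
    """Combine per-layer attack losses (e.g. per-layer probe logits or
    projections onto per-layer refusal directions) into one objective."""
    weights = weights if weights is not None else [1.0] * len(layers)
    return sum(w * losses_by_layer[l] for l, w in zip(layers, weights))

losses_by_layer = {l: torch.rand(()) for l in range(24)}   # placeholder values
loss = multi_layer_loss(losses_by_layer, layers=[4, 8, 12])
```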

Steering out-of-distribution: This notebook shows how to reproduce the experiments of Appendix Section H of the paper: Out-of-distribution discussion.
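As a reminder of what steering looks like in TransformerLens (a sketch; the model, layer, and direction are placeholders), one adds a forward hook that removes or adds a direction in the residual stream:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")        # placeholder model
refusal_dir = torch.randn(model.cfg.d_model)             # placeholder direction
refusal_dir /= refusal_dir.norm()

def ablate_refusal(resid, hook):
    # Project out the refusal component at every position
    return resid - (resid @ refusal_dir)[..., None] * refusal_dir

layer = 6                                                # hypothetical target layer
tokens = model.to_tokens("Some prompt")
logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", layer), ablate_refusal)],
)
```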

Miscellaneous

Weird answers: A collection of interesting answers I encountered during experiments.

Run and eval: This notebook gives a quick walkthrough of the general pipeline for crafting adversarial inputs with Probe SSR.
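Schematically, the pipeline trains a linear probe on intermediate activations, then optimizes the adversarial input against the probe's logit. A hedged, self-contained sketch of the probe-training step (all names and dimensions are illustrative):

```python
import torch

def train_probe(acts_harmful, acts_harmless, epochs=200, lr=1e-2):
    """Fit a linear probe separating harmful from harmless activations."""
    X = torch.cat([acts_harmful, acts_harmless])
    y = torch.cat([torch.ones(len(acts_harmful)),
                   torch.zeros(len(acts_harmless))])
    probe = torch.nn.Linear(X.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            probe(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return probe

# Placeholder activations, e.g. residual-stream vectors at some layer
probe = train_probe(torch.randn(64, 1024), torch.randn(64, 1024))
# The attack then minimizes the probe's "harmful" logit on the adversarial
# run, e.g. probe(cache["resid_post", layer][0, -1]), using a GCG-style
# step like the one sketched earlier.
```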

Using Lens: An introduction to TransformerLens, the main library used in the paper.
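The core workflow that the notebooks above build on fits in a few lines, for example:

```python
from transformer_lens import HookedTransformer

# Load a model with hooks on every intermediate activation
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Hello, world!")
logits, cache = model.run_with_cache(tokens)

# The cache exposes named activations, e.g. the residual stream after layer 3
resid = cache["resid_post", 3]          # (batch, seq, d_model)
print(model.to_str_tokens(tokens), resid.shape)
```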