Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Code and notebooks available here: https://github.com/Sckathach/subspace-rerouting.
This work follows the interpretability analysis of jailbreaks on LLM made by Arditi et al. (2024), He et al. (2024), and my previous failed attempt on the subject. It adapts the Greedy Coordinate Gradient (GCG) attack to target virtually any subspace in the model, which not only enables quick jailbreaks but also allows runtime interventions like vector steering or direction ablations to be converted into adversarial perturbations in the input. Perturbations that trigger desired behaviors without further intervention.
And sometimes, the perturbation is interpretable!
Removing the word “gracefully” makes the jailbreak fail, as expected from a “slightly robust” model:
This run also yielded perturbations like: “gently ensuring safety”, and more general incitations to perform the task in a “safe” manner.
I’ve published a paper with the technical details: https://arxiv.org/abs/2503.06269.
The goal of this post is to present the mechanistic interpretability component of the work, along with some findings I didn’t include in the paper. It’s organized into several sections:
- Preliminaries that briefly introduce “classical” gradient-based attacks against LLMs (Section 1)
- A presentation of interpretability techniques used to study jailbreaks (Section 2)
- An introduction to the Subspace Rerouting (SSR) algorithm (Section 3)
- Some first interpretations of generated jailbreaks (Section 4)
- SAE ??? (Section 5)
TLDR: If steering the model during inference modifies its behavior, it’s possible (sometimes) 1 to modify the input prompt into “prompt + something” such that the resulting activations will be close to those achieved with the intervention. This leads to a similar change in behavior without requiring intervention during inference. It’s also possible (sometimes) 2 to use reverse-logit lens techniques to obtain interesting results.
Gradient-based jailbreaks
Among the thousand ways to jailbreak a model - that is, making it behave contrary to its alignment - there are white-box algorithms that use the model’s weights to compute adversarial perturbations. These perturbations can make the LLM output harmful content like bomb-making instructions. One notable algorithm in this domain is the Greedy Coordinate Gradient (GCG) attack, which takes as input a harmful instruction \(x\), a target \(t\), and computes an adversarial suffix \(s\), such that \(LLM(x + s)\) is likely to output \(t\).
For those unfamiliar with GCG, it works like this, given a harmful instruction:
It adds a dummy suffix:
Prepares a target:
And optimizes the suffix (via backpropagation and HotFlip to generate new tokens):
The goal is for the full adversarial input to bypass security filters.
While effective on some older models, the GCG attack can require hours of computation and often fails on newer models:

This led to an obvious question: Why optimize end-to-end while treating the model as a “black box”? Is it possible to use mechanistic interpretability insights to enhance the attack?
Spoiler
Yes
Understanding jailbreaks
For those who haven’t read the excellent papers: Refusal is Mediated in a single direction (Arditi et al. (2024)), Jailbreak Lens (He et al. (2024)), or On the role of safety heads (Zhou et al. (2024)), safety mechanisms can be located inside the model, even in very early layers! For instance, scanning a dataset of harmful sentences - “How to create a bomb?”, and their harmless counterparts - “How to create a website?”, reveals interesting patterns as early as ~1/3 of the way through the model:
As visible in Figure 3, it’s possible to distinguish between harmful and harmless sentences using only one direction. This direction was termed the “refusal direction” by Arditi et al. (2024) and can be defined as the normalized difference in means between the activations of the two datasets.
This direction is impressive, as a simple cosine similarity between the activations and this direction provides insight into the model’s behavior—whether it will refuse to answer or not. Using the Logit lens (nostalgebraist (2020)), we can even attempt to interpret it:

He et al. (2024) went a step further, and instead of using a single dimension, employed probes to capture arbitrarily large subspaces. These probes—linear layers with a sigmoid activation—classify activations with over 95% accuracy as early as the third layer in most models!
Moreover, Zhou et al. (2024) found that safety mechanisms are not only localized semantically but can also be found in individual components, like attention heads. For instance, in Qwen 2.5 1.5b, a single head (I won’t specify which one :eyes:) can completely undermine the slight alignment of the model. More on that later.
With all these advances, couldn’t we improve adversarial attacks?
Hint
In classical adversarial attacks on image classifiers—you know, the panda transformed into a gibbon (Liu et al. (2020))—the target is the classifier, and the goal is to transition from one class to another. Isn’t that exactly our setup?
input → first layers → classifier → “harmful”, that we would like to flip to “harmless”?
Spoiler
Yes
Introducing Subspace Rerouting (SSR)
As its name suggests, subspace rerouting is about redirecting activations from one subspace to another. For instance, from the harmful subspace identified with our linear probes into the harmless subspace. The excellent Transformer Lens library makes the implementation nearly trivial.
- Take an input with dummy perturbations: “Super [MASK] input! [MASK][MASK][MASK][MASK]”.
- Perform a forward pass until the desired layer.
- Cache the needed activations with hooks.
- Perform backpropagation to update the perturbation, defining the loss as the distance between the current activations and the subspace you wish to target.
- Use the HotFlip method to find better tokens for your perturbation.
- Loop until satisfied.
This is a very general algorithm, not restricted to jailbreaks, as it can be applied (with the current implementation) using any hook available in Transformer Lens, targeting any number of different subspaces across different layers, placing perturbations anywhere in the input, and working with any kind of subspace.
The most straightforward subspaces to target are those defined by linear probes. The loss is simply the probe’s loss function, and SSR will handle the rest, finding a perturbation to reroute your activations:
As the targeted layer is the 5th, and the targeted subspace has a dimension of ~2, the optimization takes literally 16 seconds.
The question we can now ask is: Is this perturbation sufficient to jailbreak the model?
Spoiler
Yes
You can examine the hundreds of jailbreaks I published on Hugging Face: https://huggingface.co/datasets?other=arxiv:2503.06269.
Moreover, it works with probes, but as mentioned, you can define subspaces however you want. For instance, using the refusal direction defined earlier, we can define the refusal subspace as the set of activations with a high cosine similarity to the refusal direction. Conversely, the acceptance subspace can be defined as the set of vectors having a high negative cosine similarity with the refusal direction.
This represents the second use of SSR that I propose, and here are the results:

Disregarding my slight cheating (choosing a higher coefficient for my algorithm than for the intervention baseline), the results are much more impressive than I expected. SSR found suffixes that reproduced the same activation patterns as the runtime steering, but without requiring any intervention during the forward pass.
Beyond being sufficient to jailbreak the model, these results also reveal an interesting phenomenon: activations steered at runtime to be considered harmless quickly revert to their “harmful state”. Meanwhile, SSR generated inputs that consistently maintain a negative cosine similarity with the refusal direction, following a more “natural” path.
My hypothesis is that steering at runtime produces activations that are impossible to achieve under normal conditions—i.e., no input can naturally produce these activations, so these perturbations are brought back toward normality by the following layers. Computing the KL divergence between the normal run and the three other runs partially supports this hypothesis:

Unfortunately, I’ll stop this analysis here to avoid overreaching in my interpretations.
Lastly, I must discuss attention heads. As found by Arditi et al. (2024), successful jailbreak suffixes tend to “hijack” the attention of safety heads, such that these heads focus only on the gibberish suffix rather than the actual harmful content in the instruction. The example presented in Figure 1 is as follows: the suffix “:: Short YAML” is successful—the model (Qwen 2.5 1.5b) agrees to answer—while the suffix “:: Long YAML” is unsuccessful, with the model responding “I can’t”.

Like refusal directions and probes, this phenomenon can be exploited to generate attacks by “rerouting the attention” from harmful tokens to harmless ones. In practice, I reproduced the effects of jailbreaks artificially by defining my loss as the negative sum of the rectangular slice of the attention pattern corresponding to my perturbation (taken after the softmax). Hence, if the sum is large, it means the head is paying significant attention to my perturbations.
This approach didn’t produce particularly impressive jailbreaks, but it still performed slightly better than the non-mechanistic baseline. What I found most striking is that the success rate is \(\gt 0\) on every model, even though my optimization only targets one particular attention pattern of one particular head…
Before you ask, I tried targeting randomly selected heads to verify whether success is due merely to input perturbation or if the objective makes sense. For Qwen 2.5 1.5b, nearly any head chosen in the last layers can be targeted to produce jailbreaks. The model is poorly aligned; almost any perturbation seems to break it. However, this isn’t the case for other models. This method only works on Llama 3.2 3B and Gemma 2 2b if the targeted heads are safety heads, identified either through direct logit attribution or activation patching.
Although I haven’t conducted extensive experiments, and there’s still much work to be done on the optimization formulation, I’m fairly confident about the approach. In any case, it demonstrates that targeting specific components is possible, or at least conceivable.
Interpretable jailbreaks?
Success rates and attack durations are interesting, but what’s even more fascinating is that some jailbreaks are semantically coherent and interpretable! Despite being optimized through gradient methods with a substantial dose of randomness—when continuous gradients are converted into token flips, candidate tokens are chosen using a random distribution over possible tokens—not only are some tokens coherent in a safety sense, like “safely” or “respectfully,” but some perturbations even form small meaningful phrases, like “ask natural Dumbledore” or “portrayed As Goku”, and are answered accordingly!
I’ll let you test these yourself. My only recommendation is to stick with short perturbations (until I find a way to properly incorporate perplexity into my loss function). For instance, 6 tokens are already difficult to interpret: “…), theme Krzysz hero SoliinSlope overarching.
By the way, this suffix was generated for Gemma 2 2b, and kinda also works on Gemma 3 1b as well:
I didn’t count this as a jailbreak since the model successfully rerouted my rerouting toward a mostly harmless conversation, but I still found it noteworthy.
SAE ???
Yes, I had to talk about SAE, otherwise it’s not mechanistic isn’t it? :eyes:. Jokes apart, I obviously wanted to try my method using Sparse Autoencoders, but I was stuck at the first - and most important - step: finding good SAE latents to steer my model toward acceptance behavior, so unfortunately, I don’t have results on this front yet. However, I wanted to try something I find much more interesting than jailbreaking models: using SSR to reverse-logit lens stuff. The idea is simple: given a subspace, component, or whatever, use the SSR algorithm on an empty input - something like “[MASK][MASK][MASK][MASK][MASK][MASK]” - and observe what happens.
I first started with “How to create a [MASK]”, targeting Llama 3.2 1b’s layer 10’s refusal direction, expecting to see something like “bomb” — and I wasn’t far off!
As my algorithm keeps promising candidates, I receive a batch of answers that are closely related every run. During another run, I got a slightly different set of words:
I’ll stop here for obvious reasons. But the result is remarkable! Taking the refusal direction and projecting it to the vocabulary at the end of the model gives refusal words like “I can’t”, but taking the same refusal direction and “projecting” it to the input vocabulary gives slurs! In other words, the logit lens gives tokens that are directly caused by the direction, while the reversed-logit lens gives tokens that cause the direction! It’s not deterministic like its counterpart, but it achieves approximately one good run out of three, with each run converging in seconds, so it is usable. Plus, I didn’t specifically design my algorithm for this purpose, I see many potential easy optimizations.
But where do SAEs fit into all this? Well, I can’t put it off any longer. As I didn’t have much time (and, to be honest, because I’m bad at SAEing), I used examples from the excellent ARENA course on gpt2-small. Given the attention SAE at layer 9 and latent 20668, I adapted SSR to optimize for the following objective: finding tokens that maximize the activation of this particular latent. I thought it would hopelessly get stuck at 0, as being “sparse” is literally the goal of SAEs, but after a few runs (~5), it worked!
First, let’s look at our latent 3:
For those reading on smartphones, I’m interested in the positive logits. For our latent, they are: ” firearm”, ” firearms”, ” handguns”, ” Firearms”, ” handgun”, ” gun”, ” guns”, “Gun”, ” Guns”, and ” NRA”. And exactly as in the experiments with refusal directions, I obtained similar tokens with my reversed-logit lens:
Just with the model plus the trained SAE, SSR is able to discover what this latent represents. Of course, this might only work for a small number of latents, and I may have been fortunate (though not entirely, as I specifically chose a token-level feature latent). I’m currently working to improve the algorithm to better fit this new task, so it might be a lot better in a few weeks. For instance, I’m trying to incorporate perplexity when sampling candidate tokens or directly into the loss function, to hopefully, make SSR generate more coherent perturbations.
Conclusion
It’s been enjoyable jailbreaking LLMs, but I feel there’s so much more to explore other than jailbreaks. Not just optimizing inputs to trigger interesting behaviors (I found a head in Qwen that, when targeted, causes Qwen to struggle with stopping, and another where it perfectly repeats the input without the perturbation…), but more generally to investigate models and their mechanisms from the input side. Though it worth noting that this approach may not work on larger models, I’m limited to my personal computer, so I stopped at 3B parameters.
Moreover, I briefly mentioned in the paper that SSR successfully jailbreaks all four tested models with very high success rates (\(\gt 80\%\)): Qwen 2.5 1.5b, Llama 3.2 1b & 3b, and Gemma 2 2b, regardless of their different alignment processes — even though Gemma I have to admit, often reroutes my rerouting into a harmless conversations. This suggests that studying subspaces and their evolution might prove very useful for enhancing alignment techniques. You can also look the paper for the layer comparisons.
You can check the code here: https://github.com/Sckathach/subspace-rerouting, and find parts of my roadmap for the coming days in the TODO file. However, I’ll move the reversed-logit lens project elsewhere to keep the paper repository relatively clean.
References
Footnotes
It worked really well on the refusal direction, but I haven’t performed extensive experiments on other tasks yet.↩︎
Same.↩︎
Yes, it’s a static image… You can check the animated version here: https://www.neuronpedia.org/gpt2-small/9-att-kk/20668↩︎