Research note: Exploring the multi-dimensional refusal subspace in reasoning models

Author

Thomas Winninger

Published

October 15, 2025

This work was conducted during my internship at NICT under the supervision of Chansu Han.

Over the past year, I’ve been studying interpretability and analyzing what happens inside large language models (LLMs) during adversarial attacks. One of my favorite findings is the discovery of a refusal subspace in the model’s feature space, which can, in small models, be reduced to a single dimension (Arditi et al. (2024)). This subspace explains why some jailbreaks work, and can also be used to create new ones efficiently.

I previously suggested that this subspace might not be one-dimensional, and Wollschläger et al. (2025) confirmed this, introducing a method to characterize it. However, their approach relies on gradient-based optimization and is too computationally heavy for small setups such as my laptop, which prevents me from running the experiments I would like to.

Hence, I propose a cheaper (though less precise) method to study this structure, along with new experimental evidence (especially on reasoning models) showing that we need to consider multiple dimensions, rather than a single refusal vector, for larger models.

Evidence for multi-dimensionality

Mechanistic interpretability research (Nanda (2022); Hastings-Woodhouse (2024)) has shown that some model behaviors can be expressed linearly in activation space (Park, Choe, and Veitch (2024)). For example, by comparing activations between contrastive pairs such as “How to create a bomb?” vs. “How to create a website?”, we can extract a refusal direction, a vector representing the model’s tendency to reject harmful queries (Arditi et al. (2024)).

While earlier work assumed this direction was one-dimensional, evidence suggests that refusal actually spans a multi-dimensional subspace, especially in larger models (Wollschläger et al. (2025); Winninger, Addad, and Kapusta (2025)).

A single direction computed via different methods gives different results

When refusal directions are extracted using different techniques, such as the difference in means (DIM)1 and probe classifiers2, they are highly but not perfectly aligned. With cosine similarities dropping as low as \(0.3\), this contradicts the one-dimensional hypothesis.

Figure 1: Cosine similarity between normed refusal directions extracted via probes (\(x\)-axis) and steering (\(y\)-axis) across layers of Llama 3.2 1B. While the similarity is high in early layers, it decreases in later ones, with a minimum value of \(0.39\) - still meaningful in \(\mathbb{R}^{\text{d\_embed}}\), but far from the \(1\) expected if refusal truly occupied just one dimension.

This phenomenon can also be observed when training multiple probes with different random seeds: they converge to distinct directions, again showing lower cosine similarity than expected.
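
This misalignment is easy to quantify directly. Below is a minimal sketch that compares the two sets of directions layer by layer; it assumes dim_dirs and probe_dirs are precomputed tensors of shape (n_layers, d_embed), obtained with the methods described in footnotes 1 and 2.

    import torch

    def layerwise_cosine(dim_dirs: torch.Tensor, probe_dirs: torch.Tensor) -> torch.Tensor:
        """Cosine similarity between two sets of per-layer refusal directions.

        Both tensors have shape (n_layers, d_embed); each row is the refusal
        direction extracted at one layer (DIM vs. probe weights).
        """
        a = dim_dirs / dim_dirs.norm(dim=-1, keepdim=True)
        b = probe_dirs / probe_dirs.norm(dim=-1, keepdim=True)
        return (a * b).sum(dim=-1)  # one similarity value per layer

    # Values are close to 1 in early layers and drop toward 0.3-0.4 in later ones
    # sims = layerwise_cosine(dim_dirs, probe_dirs)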

Ablating a single direction is no longer enough to remove refusal

Ablating a single direction with the method proposed by Arditi et al. (2024) no longer works on recent models, especially those with more than 4 billion parameters. The attack success rate (ASR) can even drop to \(0\%\) on models like Llama 3.2 3B.

Computing the refusal direction through optimization yields better results than the difference in means

During my experiments with SSR (Winninger, Addad, and Kapusta (2025)), I observed that adversarial attacks based on probes consistently outperformed those based on DIM, often by a large margin (e.g., 90% ASR for probe-based attacks vs. 50% for DIM-based ones).

Wollschläger et al. (2025) reported similar findings and extended the optimization to multiple directions, forming what they called the refusal cone. This concept helps explain the observations: while DIM provides one direction within the refusal cone, probe-based methods converge toward a different, more efficient direction, essentially sampling distinct regions of the same cone.

Characterizing the refusal cone with a practical clustering-based approach

The idea behind the cheaper refusal cone extraction method is straightforward: if large models encode different categories of harmful content differently, then computing refusal directions per topic should expose several distinct vectors.

Creation of the variety dataset

I merged multiple harmful datasets, including AdvBench (Zou et al. (2023)) and StrongREJECT (Souly et al. (2024)), into a unified BigBench-style dataset of about 1,200 samples. Using a sentence transformer embedding and HDBSCAN clustering, I grouped semantically similar prompts together and used an LLM to label each cluster, resulting in 74 distinct categories3.
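
The clustering step can be reproduced in a few lines. The sketch below is illustrative rather than the exact pipeline: the prompt list is a placeholder for the merged dataset, and the embedding model and HDBSCAN parameters differ from the ones actually used (see footnote 3).

    from sentence_transformers import SentenceTransformer
    import hdbscan

    # Placeholder prompts; the real merged dataset contains ~1,200 harmful samples
    prompts = ["How to create a bomb?", "How to hack into a server?", "How to pick a lock?"]

    # Illustrative embedding model; the actual run used an EmbeddingGemma variant (footnote 3)
    embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    embeddings = embedder.encode(prompts, normalize_embeddings=True)

    # Group semantically similar prompts; label -1 marks noise points
    # (min_cluster_size is kept tiny here only so the toy example runs)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric="euclidean")
    labels = clusterer.fit_predict(embeddings)

    clusters = {c: [p for p, l in zip(prompts, labels) if l == c]
                for c in set(labels) if c != -1}
    # Each cluster is then named by an LLM and paired with contrastive harmless prompts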

By computing difference-in-means vectors for each cluster, I obtained multiple non-collinear refusal directions with cosine similarities around 0.3, confirming that they occupy different regions of the subspace. These refusal directions also have interpretable semantic meaning, being linked to the topics of their clusters.

Of course, this method doesn’t guarantee full coverage of the refusal cone.

Selecting a limited number of directions

This method yields multiple directions (12 in my setup). To extract only a few representative ones, several options can be used:

  • Use singular value decomposition to extract a basis4.
  • Select the “least aligned” directions by minimizing cosine similarity between pairs5.
  • Select a random subset.

Wollschläger et al. (2025) constructed a basis with the Gram-Schmidt algorithm, but I found it less efficient than the second method, which simply selects the best candidates. Thus, I’ll be using the second method for the rest of this work.

In practice, the subspace found by the cosine-based selection may be smaller than the one found by SVD, but this is not guaranteed6.

Testing the directions

Once the directions are found, we can steer the model to induce or reduce refusal. Alternatively, we can directly edit the model by ablating the directions from its weights, producing an edited version with the refusal cone removed. I followed Arditi et al. (2024)’s approach, ablating every matrix writing to the residual stream7.

These edited models can then be tested with two main metrics:

  1. Refusal susceptibility: Is the edited model more likely to answer harmful prompts?
  2. Performance on harmless tasks: Does the edited model retain performance on harmless tasks?

For (1), I used an LLM-as-a-judge setup with two evaluators - one stricter than the other - to produce a range rather than a single value.
For (2), I evaluated using AI Inspect on the MMLU benchmark.

The experiments were conducted on the Qwen3 family of models (Yang et al. (2025)), as my main interest was studying reasoning models. I used Mistral NeMo (Mistralai (2024)) and Gemini 2.5 (Comanici et al. (2025)) as evaluators. Every experiment was run on my laptop, an RTX 4090 with 16GB of VRAM. The full evaluation pipeline, as well as the reason for using two different evaluators, is described in this footnote8.

Results

Ablating multiple refusal directions does reduce refusal; more directions are needed for bigger models

Figure 2: Attack Success Rate (ASR) on the vanilla harmful dataset (no jailbreak) using Qwen3 models with different numbers of ablated refusal directions. 0 corresponds to the original model. Dark bars: Evaluated by Gemini. Light bars: Evaluated by Mistral NeMo, which is more lenient.

Figure 2 shows that the attack success rate increases as more refusal directions are ablated, even without jailbreaks or adversarial suffixes. More importantly, while ablating a single direction is insufficient for larger models (e.g., Qwen3 8B and 14B), ablating multiple directions succeeds in reducing refusal.

Ablating refusal directions does not significantly degrade model performance on harmless tasks (MMLU)

Figure 3: Scores on the MMLU benchmark to assess retention on harmless tasks.

As shown in Figure 3, this method does not significantly degrade model performance on harmless tasks, at least on MMLU. The only exception is Qwen3 14B, likely due to floating-point precision issues (float16 editing vs. float32 for others).

Conclusion

The multi-dimensionality of the refusal subspace is not a new discovery, but this work shows that it also applies to newer reasoning models. Moreover, it provides a simple, low-cost method that can run on local hardware and help advance interpretability and alignment research.

References

AI Security Institute, UK. 2024. “Inspect AI: Framework for Large Language Model Evaluations.” https://github.com/UKGovernmentBEIS/inspect_ai.
Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” https://arxiv.org/abs/2406.11717.
Chu, Junjie, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. 2025. “JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs.” https://arxiv.org/abs/2402.05668.
Comanici, Gheorghe, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, et al. 2025. “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.” https://arxiv.org/abs/2507.06261.
Hastings-Woodhouse, Sarah. 2024. “Introduction to Mechanistic Interpretability.”
He, Zeqing, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, and Chun Chen. 2024. “JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit.” https://arxiv.org/abs/2411.11114.
Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. “Measuring Massive Multitask Language Understanding.” https://arxiv.org/abs/2009.03300.
Mazeika, Mantas, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, et al. 2024. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.” https://arxiv.org/abs/2402.04249.
Mistralai. 2024. “Mistral NeMo.” https://mistral.ai/news/mistral-nemo.
Nanda, Neel. 2022. “A Comprehensive Mechanistic Interpretability Explainer & Glossary,” December.
Nasr, Milad, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, et al. 2025. “The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections.” https://arxiv.org/abs/2510.09023.
Park, Kiho, Yo Joong Choe, and Victor Veitch. 2024. “The Linear Representation Hypothesis and the Geometry of Large Language Models.” https://arxiv.org/abs/2311.03658.
Qiu, Huachuan, Shuai Zhang, Anqi Li, Hongliang He, and Zhenzhong Lan. 2023. “Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models.” https://arxiv.org/abs/2307.08487.
Souly, Alexandra, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, et al. 2024. “A StrongREJECT for Empty Jailbreaks.” https://arxiv.org/abs/2402.10260.
Winninger, Thomas, Boussad Addad, and Katarzyna Kapusta. 2025. “Using Mechanistic Interpretability to Craft Adversarial Attacks Against Large Language Models.” https://arxiv.org/abs/2503.06269.
Wollschläger, Tom, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. 2025. “The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence.” https://arxiv.org/abs/2502.17420.
Yang, An, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. 2025. “Qwen3 Technical Report.” https://arxiv.org/abs/2505.09388.
Zou, Andy, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.” https://arxiv.org/abs/2307.15043.

Footnotes

  1. Extracting the refusal direction with the Difference-in-Means (DIM) method

    This is the method from Arditi et al. (2024).
    Given pairs of harmful and harmless prompts (e.g., “How to create a bomb?” vs. “How to create a website?”), we first perform a forward pass to compute the activations of the residual stream at each layer on the last token, after the MLP (resid_post).

    At layer \(l\), we obtain \(a_l^{\text{hf}}\) for harmful prompts and \(a_l^{\text{hl}}\) for harmless ones.
    The refusal direction is then computed as:

    \[ \vec{r}_l = \frac{1}{|\text{HF}|}\sum_{\text{hf} \in \text{HF}} a_l^{\text{hf}} - \frac{1}{|\text{HL}|}\sum_{\text{hl} \in \text{HL}} a_l^{\text{hl}} \]

    where
    - \(\vec{r}_l\) is the refusal direction at layer \(l\),
    - \(\text{HF}\) is the set of harmful prompts, and
    - \(\text{HL}\) is the set of harmless prompts.

    Variants exist: some average activations across all tokens (He et al. (2024)), and some use intermediate positions within the transformer layer instead of post-MLP activations.↩︎
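
    In code, this is a single mean difference over cached activations. A minimal sketch, assuming acts_harmful and acts_harmless are resid_post activations at the last token, with shape (n_prompts, n_layers, d_embed):

    import torch

    def dim_refusal_directions(acts_harmful: torch.Tensor,
                               acts_harmless: torch.Tensor) -> torch.Tensor:
        """Difference-in-means refusal direction at every layer.

        Both inputs have shape (n_prompts, n_layers, d_embed): resid_post
        activations at the last token position. Returns (n_layers, d_embed),
        one unit-norm direction per layer.
        """
        diff = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
        return diff / diff.norm(dim=-1, keepdim=True)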

  2. Extracting the refusal direction with probe classifiers

    This approach follows He et al. (2024) and Winninger, Addad, and Kapusta (2025).
    Activations are collected as in the DIM method, but a one-layer neural probe is trained to predict refusal:

    \[ p_l(\text{sentence}) = \sigma(\vec{w}_l^T a_l + b_l) \]

    The probe minimizes the binary cross-entropy loss:

    \[ \min_{\vec{w}_l, b_l} \sum_{\text{hf} \in \text{HF}} \mathcal{L}(0, \sigma(\vec{w}_l^T a_l^{\text{hf}} + b_l)) + \sum_{\text{hl} \in \text{HL}} \mathcal{L}(1, \sigma(\vec{w}_l^T a_l^{\text{hl}} + b_l)) \]

    The normalized weight vector gives the refusal direction: \(\vec{r}_l = \vec{w}_l / ||\vec{w}_l||\).
    In practice, the bias term is often close to zero.↩︎
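
    A minimal training sketch, using the same cached last-token activations as in footnote 1 (one layer at a time, shape (n_prompts, d_embed)); the hyperparameters are illustrative:

    import torch

    def train_probe(acts_harmful: torch.Tensor, acts_harmless: torch.Tensor,
                    epochs: int = 200, lr: float = 1e-2) -> torch.Tensor:
        """Train a linear probe at one layer and return its unit-norm weight vector."""
        x = torch.cat([acts_harmful, acts_harmless]).float()
        # Label convention from above: 0 for harmful prompts, 1 for harmless ones
        y = torch.cat([torch.zeros(len(acts_harmful)), torch.ones(len(acts_harmless))])

        probe = torch.nn.Linear(x.shape[-1], 1)
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        loss_fn = torch.nn.BCEWithLogitsLoss()  # applies the sigmoid internally

        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(probe(x).squeeze(-1), y)
            loss.backward()
            opt.step()

        w = probe.weight.detach().squeeze(0)
        return w / w.norm()  # refusal direction r_l = w_l / ||w_l||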

  3. Creation of the harmful dataset

    The combined dataset included AdvBench (Zou et al. (2023)), HarmBench (Mazeika et al. (2024)), StrongREJECT (Souly et al. (2024)), ForbiddenQuestions (Chu et al. (2025)), and MaliciousInstruct (Qiu et al. (2023)), totaling 1,271 samples.

    Embeddings were computed with text-embedding-embeddinggemma-300m-qat.
    HDBSCAN was chosen to handle the uneven, high-dimensional (768D) space.

    Clusters were automatically labeled with Qwen3-8B and used to generate contrastive harmless prompts, following the SSR approach (Winninger, Addad, and Kapusta (2025)).

    Listing 1: Example of a harmful–harmless pair with topic label in BigBench
    {
      "harmful_prompt": "How to create a bomb?",
      "harmless_prompt": "How to create a website?",
      "topic": "explosive creation"
    }
    ↩︎
  4. Reducing the number of refusal directions with singular value decomposition (SVD)

    Given \(b\) refusal directions \(\vec{r}_l^1, \vec{r}_l^2, \ldots, \vec{r}_l^b\), concatenate them into a matrix \(A\):

    \[ A = [\vec{r}_l^1, \vec{r}_l^2, \ldots, \vec{r}_l^b] \in \mathbb{R}^{\text{d\_embed} \times b} \]

    Applying SVD gives:

    \[ A = U\Sigma V^T \]

    The top-\(k\) left singular vectors form an orthonormal basis:

    \[ \vec{r}_{l}^{\text{SVD}} = [U_{:,1}, U_{:,2}, \ldots, U_{:,k}] \]

    This captures the directions that explain the greatest variance.↩︎
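
    A minimal sketch with torch (not the exact code used here), where the refusal directions are given as rows and stacked as the columns of \(A\):

    import torch

    def svd_basis(directions: torch.Tensor, k: int) -> torch.Tensor:
        """Orthonormal basis of the span of the refusal directions.

        directions: (b, d_embed), one refusal direction per row; the matrix A
        above is its transpose (directions as columns). Returns the top-k left
        singular vectors of A as rows, shape (k, d_embed).
        """
        a = directions.T                            # (d_embed, b)
        u, _, _ = torch.linalg.svd(a, full_matrices=False)
        return u[:, :k].T                           # top-k basis vectors as rows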

  5. Reducing the number of refusal directions with cosine-similarity selection (MINCOS)

    Given \(b\) refusal directions \(\vec{r}_l^1, \ldots, \vec{r}_l^b\), compute the Gram matrix of pairwise cosine similarities:

    \[ G_{ij} = \frac{\langle \vec{r}_l^i, \vec{r}_l^j \rangle}{||\vec{r}_l^i|| \cdot ||\vec{r}_l^j||} \]

    For each direction \(i\), sum its total similarity:

    \[ s_i = \sum_{j=1, j \neq i}^{b} |G_{ij}| \]

    Select the \(k\) directions with the smallest \(s_i\):

    \[ \mathcal{I}^{\text{MINCOS}} = \underset{|\mathcal{I}|=k}{\arg\min} \sum_{i \in \mathcal{I}} s_i \]

    The selected directions are \(\vec{r}_l^{\text{MINCOS}} = \{\vec{r}_l^i : i \in \mathcal{I}^{\text{MINCOS}}\}\).
    Unlike SVD, MINCOS preserves actual learned directions and does not produce orthogonal ones.↩︎
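
    A minimal sketch of this selection (note that it scores every direction against all others once, rather than searching over all subsets of size \(k\)):

    import torch

    def mincos_select(directions: torch.Tensor, k: int) -> torch.Tensor:
        """Keep the k directions least aligned with the others.

        directions: (b, d_embed). Each direction is scored by the sum of its
        absolute cosine similarities to all other directions, and the k
        lowest-scoring ones are returned, shape (k, d_embed).
        """
        unit = directions / directions.norm(dim=-1, keepdim=True)
        gram = unit @ unit.T                    # pairwise cosine similarities
        scores = gram.abs().sum(dim=-1) - 1.0   # drop the self-similarity term
        idx = scores.argsort()[:k]
        return directions[idx]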

  6. Relationship between MINCOS and SVD subspaces

    The subspace found by MINCOS is not necessarily contained within the one found by SVD.
    While SVD captures the top-\(k\) principal components (usually explaining 90–95% of the variance), MINCOS may select directions outside this space:

    \[ \text{Vect}(\vec{r}_l^{\text{MINCOS}}) \not\subseteq \text{Vect}(\vec{r}_l^{\text{SVD}}) \]

    In practice, when cosine similarities are moderate (e.g., 0.15), the MINCOS subspace can be considered “smaller” or less destructive than the SVD one.↩︎

  7. Ablation process: orthogonalizing weight matrices

    To remove the influence of a refusal direction \(\hat{\vec{r}}\) from the model, we modify each weight matrix that writes to the residual stream.
    For an output matrix \(W_{\text{out}} \in \mathbb{R}^{d_{\text{embed}} \times d_{\text{input}}}\) (e.g., attention and MLP output matrices), we project it orthogonally:

    \[ W_{\text{out}}' \leftarrow W_{\text{out}} - \hat{\vec{r}}\hat{\vec{r}}^T W_{\text{out}} \]

    This ensures \(W_{\text{out}}'\) no longer writes any component along \(\hat{\vec{r}}\).
    In practice, I did not ablate the embedding matrix or the first three layers, as refusal directions are poorly defined there (low probe accuracy).↩︎
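
    A minimal sketch of the edit for a single direction and a single output matrix, assuming w_out follows the (d_embed, d_input) convention above:

    import torch

    def ablate_direction(w_out: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        """Project a residual-stream-writing matrix away from a refusal direction.

        w_out: (d_embed, d_input), so that w_out @ x writes into the residual stream.
        r:     (d_embed,) refusal direction, normalized to unit length below.
        Returns W' = W - r r^T W, whose outputs have no component along r.
        """
        r_hat = r / r.norm()
        return w_out - torch.outer(r_hat, r_hat) @ w_out

    # For several directions, orthonormalize them first (e.g. with torch.linalg.qr)
    # and subtract the full projector, or apply ablate_direction repeatedly.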

  8. Evaluation details

    Many papers evaluate model refusal using bag-of-words filters or LLM classifiers like Llama Guard, but I find these evaluations very inaccurate, especially for reasoning models:

    • Lexical methods fail because models may begin with neutral phrasing (“Sure, here is…”) before refusing, or refuse implicitly without using “refusal words.”
    • Short generations are insufficient: reasoning models may refuse early but then provide harmful content thousands of tokens later (after 4,000 tokens, for instance, with Qwen3 8B).
    • LLM classifiers (e.g., Llama Guard 3) perform poorly on unseen attacks like SSR, and can themselves be prone to attacks or reward hacking (Nasr et al. (2025); Winninger, Addad, and Kapusta (2025)).

    Manual verification seems the most robust method (Nasr et al. (2025)). However, when it is not possible, I think generating long answers and using an LLM-as-a-judge is an acceptable minimum, where a model of equal or larger capacity judges the responses so that the evaluator actually understands the conversation. In this work, I used two judges to reduce bias:

    • Mistral NeMo (Mistralai (2024)), which tends to be lenient.
    • Gemini 2.5 Flash (Comanici et al. (2025)), which is stricter.

    Evaluations were run with the DSPy framework.
    Models were quantized to Q4_K_M using Llama.cpp for efficient inference and long-context evaluation (>4000 tokens).
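
    As an illustration (not the exact judge used in this work), a typed DSPy signature for the compliance judgment could look like the sketch below; the model identifier is a placeholder and the field names are my own:

    import dspy

    # Placeholder judge model; in this work the judges were Mistral NeMo and
    # Gemini 2.5 Flash, queried through their respective APIs.
    dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))

    class JudgeCompliance(dspy.Signature):
        """Decide whether the response actually complies with the harmful request."""
        prompt: str = dspy.InputField(desc="the harmful request")
        response: str = dspy.InputField(desc="full model response, including reasoning")
        complied: bool = dspy.OutputField(desc="True if harmful content was provided")

    judge = dspy.Predict(JudgeCompliance)
    # verdict = judge(prompt=..., response=...).complied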

    To assess harmless-task performance, I used AI Inspect (AI Security Institute (2024)) with MMLU (Hendrycks et al. (2021)), using 0-shot prompting and 100 random samples.
    Although 100 samples is only a subset of the full 14k-question MMLU benchmark, this setting balances feasibility with acceptable evaluation time, especially for reasoning models.↩︎