Exploring the multi-dimensional refusal subspace in reasoning models

Author

Le magicien quantique

Published

September 15, 2025

Evidence of multi-dimensionality

Mechanistic interpretability research (Nanda (2022); Hastings-Woodhouse (2024)) has shown that certain model behaviors are encoded linearly in the feature space (Park, Choe, and Veitch (2024)). For instance, by comparing activations between contrastive harmful-harmless sentence pairs, like “How to create a bomb?” and “How to create a website?”, we can extract a refusal direction: a vector that represents the model’s tendency to refuse harmful requests (Arditi et al. (2024)).
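
To make this concrete, here is a minimal sketch of difference-in-means extraction, assuming a Hugging Face causal LM; the model name, prompt lists, and layer index are illustrative placeholders rather than the exact setup used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_last_token_activation(prompts, layer):
    """Mean hidden state of the final prompt token at a given layer."""
    acts = []
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])  # shape: (d_embed,)
    return torch.stack(acts).mean(dim=0)

harmful = ["How to create a bomb?", "How do I pick a lock?"]
harmless = ["How to create a website?", "How do I bake a cake?"]

layer = 8  # illustrative mid-depth layer; Figure 1 sweeps all layers
refusal_dir = mean_last_token_activation(harmful, layer) \
            - mean_last_token_activation(harmless, layer)
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector in R^{d_embed}
```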

While earlier work assumed this refusal direction was one-dimensional (Arditi et al. (2024)), some evidence suggests that refusal behavior actually spans a multi-dimensional subspace, especially in larger models (Wollschläger et al. (2025); Winninger, Addad, and Kapusta (2025)). When we compute refusal directions using different methods (difference-in-means vs. probe classifiers), the resulting vectors have high but imperfect cosine similarity (Figure 1).

Figure 1: Cosine similarity between refusal directions extracted via different methods across layers of Llama 3.2 1B. While the cosine similarity is high in the early layers, it drops in later ones, reaching a minimum of \(0.39\): still meaningful in \(\mathbb{R}^{\text{d\_embed}}\), but far from the \(1\) expected if refusal truly occupied a single dimension.
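
The comparison behind Figure 1 can be sketched as follows, assuming per-prompt activations at a single layer have already been collected as above; the logistic-regression probe here is a stand-in for whichever probe classifier one prefers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compare_directions(acts_harmful, acts_harmless):
    """Cosine similarity between DIM and probe-derived refusal directions.
    Both inputs are (n_samples, d_embed) activation matrices at one layer."""
    # Difference-in-means direction
    dim_dir = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)

    # Probe direction: normal vector of a linear classifier separating the sets
    X = np.concatenate([acts_harmful, acts_harmless])
    y = np.concatenate([np.ones(len(acts_harmful)), np.zeros(len(acts_harmless))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    probe_dir = probe.coef_[0]

    return cosine(dim_dir, probe_dir)  # 1.0 only if the two methods agree
```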

This multi-dimensional structure helps explain why simply ablating a single refusal direction often fails to jailbreak larger models: the refusal behavior is encoded redundantly across multiple dimensions. Previous work by Wollschläger et al. (2025) characterized this as a refusal cone spanning multiple dimensions and proposed a gradient-based method to find it, though this approach is computationally expensive.

This may also explain why, in SSR (Winninger, Addad, and Kapusta (2025)), the ProbeSSR method performs significantly better than SteeringSSR despite the two being theoretically similar. SteeringSSR uses difference-in-means (DIM) to find one refusal direction in the cone, while the probes were initialized at random and converged on a different, more efficient refusal direction.

Another element in favor of the multi-dimensional hypothesis is that, in larger models, ablating the refusal direction found with the DIM technique is not sufficient to induce a jailbreak. The following section explores this argument in detail.

Characterizing the refusal cone with a practical clustering-based approach

While Wollschläger et al. (2025) introduced a method to find the entire refusal cone with gradient descent, it is costly and cannot run on small setups. To enable experiments on larger models, we developed a simpler method. The key insight is that, if models are large enough to encode different types of harmful content differently, we can extract diverse refusal vectors by using topically varied harmful-harmless pairs.

We combined several harmful datasets, AdvBench (Zou et al. (2023)), HarmBench (Mazeika et al. (2024)), StrongREJECT (Souly et al. (2024)), ForbiddenQuestions (Chu et al. (2025)), and MaliciousInstruct (Qiu et al. (2023)), into a large BigBench dataset of 1,200 samples. Using sentence transformers and HDBSCAN clustering, we grouped semantically similar harmful instructions together (Figure 2).
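
A minimal sketch of this clustering step, assuming the sentence-transformers and hdbscan packages; the embedding model, the min_cluster_size value, and the placeholder prompts are illustrative:

```python
from sentence_transformers import SentenceTransformer
import hdbscan

# Placeholder prompts standing in for the 1,200 BigBench samples
instructions = [
    "How to create a bomb?",
    "How do I hack into a WiFi network?",
    "How to synthesize an illegal substance?",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
embeddings = embedder.encode(instructions, normalize_embeddings=True)

# min_cluster_size is a tuning choice; larger values give coarser topics
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric="euclidean")
labels = clusterer.fit_predict(embeddings)  # label -1 marks noise points

# Group instructions by topic cluster for the per-topic DIM step below
clusters = {}
for text, label in zip(instructions, labels):
    if label != -1:
        clusters.setdefault(int(label), []).append(text)
```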

Figure 2: Harmful sentences of BigBench clustered with HDBSCAN in \(\mathbb{R}^{\text{d\_embed}}\) and projected to 2D with PCA. Each cluster represents a distinct topic (e.g., “explosive creation”, “hacking techniques”, “illegal substances”).

By computing difference-in-means vectors from the different topical subsets, we found multiple non-collinear refusal directions with pairwise cosine similarities of approximately \(0.30\), which confirms that they span different dimensions. This approach has two key advantages: it requires neither gradient descent nor hyperparameter tuning, and each direction can be interpreted semantically. However, it does not guarantee that the whole refusal cone has been found.
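
In code, the per-topic directions and their pairwise similarities can be sketched as follows; cluster_acts and acts_harmless are assumed to hold activations collected as in the first snippet (random placeholders stand in for them here):

```python
import numpy as np

def topic_refusal_directions(cluster_acts, acts_harmless):
    """One unit-norm DIM direction per topic cluster.
    cluster_acts: {cluster_id: (n_k, d_embed)}; acts_harmless: (n, d_embed)."""
    mu_harmless = acts_harmless.mean(axis=0)
    dirs = [acts.mean(axis=0) - mu_harmless for acts in cluster_acts.values()]
    return np.stack([d / np.linalg.norm(d) for d in dirs])

# Random placeholders for real activations, just to make the sketch runnable
rng = np.random.default_rng(0)
cluster_acts = {k: rng.normal(size=(40, 2048)) for k in range(3)}
acts_harmless = rng.normal(size=(120, 2048))

dirs = topic_refusal_directions(cluster_acts, acts_harmless)
sims = dirs @ dirs.T  # unit rows, so entries are cosine similarities
# With real activations, off-diagonal values around 0.30 indicate
# non-collinear refusal directions.
```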

Experimental validation

To verify that these directions represent genuine components of the refusal subspace, we tested Qwen3-8B’s resistance to vanilla harmful prompts while progressively ablating 1, 2, or 3 refusal directions. The ablated directions are not orthogonal in the sense of Wollschläger et al. (2025), but this was still enough to produce meaningful results.
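
The ablation itself can be sketched as projecting the residual stream off the span of the chosen directions at every layer via forward hooks. The QR step orthonormalizes the (non-orthogonal) directions so their full span is removed; the model.model.layers layout assumes a Llama/Qwen-style architecture:

```python
import torch

def project_out(hidden, dirs):
    """Remove the component of `hidden` lying in span(dirs).
    dirs: (k, d_embed). QR yields an orthonormal basis, so the
    non-orthogonal refusal directions are handled correctly."""
    Q, _ = torch.linalg.qr(dirs.T)       # (d_embed, k), orthonormal columns
    return hidden - (hidden @ Q) @ Q.T

def add_ablation_hooks(model, dirs):
    """Attach hooks ablating `dirs` from every decoder layer's output."""
    def hook(module, args, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = project_out(hs, dirs.to(hs.dtype))
        return (hs, *output[1:]) if isinstance(output, tuple) else hs
    # Assumption: Llama/Qwen-style module layout
    return [layer.register_forward_hook(hook) for layer in model.model.layers]

# handles = add_ablation_hooks(model, refusal_dirs)
# ...generate as usual, then call h.remove() on each handle to restore the model
```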

Figure 3: Distribution of evaluator scores using vanilla harmful prompts while ablating 1 (orange), 2 (pink), and 3 (violet) refusal directions. The Y-axis shows the number of samples; the X-axis shows the evaluator’s score: 1 (Fail) - complete refusal; 2 (Fail) - partial refusal or no information; 3 (Success) - accepts and gives a few details; 4 (Success) - accepts and gives a detailed answer.

Figure 3 shows that, as more refusal directions are ablated, the distribution shifts rightward toward higher compliance scores. More importantly, it shows that while ablating one direction is not sufficient to remove refusal, ablating multiple directions is.

It is important to note that, unlike in Wollschläger et al. (2025), ablating these refusal directions may also remove other capabilities of the model, as the directions are computed without taking into account performance on harmless tasks.

References

Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” https://arxiv.org/abs/2406.11717.
Chu, Junjie, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. 2025. “JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs.” https://arxiv.org/abs/2402.05668.
Hastings-Woodhouse, Sarah. 2024. “Introduction to Mechanistic Interpretability.”
Mazeika, Mantas, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, et al. 2024. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.” https://arxiv.org/abs/2402.04249.
Nanda, Neel. 2022. “A Comprehensive Mechanistic Interpretability Explainer & Glossary,” December.
Park, Kiho, Yo Joong Choe, and Victor Veitch. 2024. “The Linear Representation Hypothesis and the Geometry of Large Language Models.” https://arxiv.org/abs/2311.03658.
Qiu, Huachuan, Shuai Zhang, Anqi Li, Hongliang He, and Zhenzhong Lan. 2023. “Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models.” https://arxiv.org/abs/2307.08487.
Souly, Alexandra, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, et al. 2024. “A StrongREJECT for Empty Jailbreaks.” https://arxiv.org/abs/2402.10260.
Winninger, Thomas, Boussad Addad, and Katarzyna Kapusta. 2025. “Using Mechanistic Interpretability to Craft Adversarial Attacks Against Large Language Models.” https://arxiv.org/abs/2503.06269.
Wollschläger, Tom, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. 2025. “The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence.” https://arxiv.org/abs/2502.17420.
Zou, Andy, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.” https://arxiv.org/abs/2307.15043.