Research note: Exploring the multi-dimensional refusal subspace in reasoning models
This work was conducted during my internship at NICT under the supervision of Chansu Han.
This is an early research note; I plan to update it as the experiments progress.
Over the past year, I’ve been studying interpretability and analyzing what happens inside large language models (LLMs) during adversarial attacks. One of the most interesting results is the discovery of a refusal subspace (Arditi et al. (2024)) in the model’s feature space. This subspace explains why some jailbreaks work and can even be used to create new ones efficiently.
I previously suggested that this subspace might not be one-dimensional, and Wollschläger et al. (2025) later confirmed this, introducing a method to characterize it. However, their approach relies on gradient-based optimization and is too computationally heavy for small setups.
Here, I propose a cheaper (though less precise) method to study this structure, along with new experimental evidence (especially on reasoning models) showing that we need to consider multiple dimensions rather than a single refusal vector for larger models.
Evidence of multi-dimensionality
Mechanistic interpretability research (Nanda (2022); Hastings-Woodhouse (2024)) has shown that some model behaviors can be expressed linearly in activation space (Park, Choe, and Veitch (2024)). For example, by comparing activations between contrastive pairs such as “How to create a bomb?” vs. “How to create a website?”, we can extract a refusal direction, a vector representing the model’s tendency to reject harmful queries (Arditi et al. (2024)).
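To make the extraction concrete, here is a minimal sketch of the difference-in-means computation on a handful of contrastive pairs, using plain Hugging Face transformers. The model name, layer index, and last-token position are illustrative choices (and the chat template is omitted for brevity), not the exact setup of the papers cited above.

```python
# Minimal sketch: difference-in-means refusal direction from contrastive prompt pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-8B"   # illustrative choice of model
LAYER = 16                     # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]          # shape: (d_model,)

harmful = ["How to create a bomb?", "How do I make napalm at home?"]
harmless = ["How to create a website?", "How do I make pancakes at home?"]

mu_harmful = torch.stack([last_token_resid(p) for p in harmful]).mean(0)
mu_harmless = torch.stack([last_token_resid(p) for p in harmless]).mean(0)

refusal_dir = mu_harmful - mu_harmless
refusal_dir = refusal_dir / refusal_dir.norm()      # unit-norm refusal direction
```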
While earlier work assumed this direction was one-dimensional, evidence suggests that refusal actually spans a multi-dimensional subspace, especially in larger models (Wollschläger et al. (2025); Winninger, Addad, and Kapusta (2025)).
When refusal directions are extracted with different techniques (e.g., difference-in-means vs. probe classifiers), they are highly but not perfectly aligned. This imperfect cosine similarity indicates that the model’s refusal behavior isn’t captured by a single vector but is distributed across several correlated directions.
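As a concrete check of that claim, one can compare the difference-in-means direction with the weight vector of a linear probe trained to separate harmful from harmless activations. The snippet below is a sketch assuming `acts_harmful` and `acts_harmless` are stacks of activations collected as in the previous snippet.

```python
# Sketch: cosine similarity between a probe direction and the DIM direction.
import torch
from sklearn.linear_model import LogisticRegression

X = torch.cat([acts_harmful, acts_harmless]).float().numpy()
y = [1] * len(acts_harmful) + [0] * len(acts_harmless)

probe = LogisticRegression(max_iter=1000).fit(X, y)
probe_dir = torch.tensor(probe.coef_[0], dtype=torch.float32)
probe_dir = probe_dir / probe_dir.norm()

dim_dir = acts_harmful.float().mean(0) - acts_harmless.float().mean(0)
dim_dir = dim_dir / dim_dir.norm()

cos = torch.dot(probe_dir, dim_dir).item()
print(f"cosine(DIM, probe) = {cos:.3f}")   # typically high, but not 1.0
```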
This structure also explains why ablating a single refusal direction often fails to jailbreak larger models: refusal is distributed and redundant. Wollschläger et al. (2025) described this as a refusal cone and proposed a gradient-based way to find it, which is unfortunately too computationally heavy for smaller machines like mine.
It also clarifies some earlier results: in SSR (Winninger, Addad, and Kapusta (2025)), the ProbeSSR variant consistently outperformed SteeringSSR even though they’re theoretically similar. SteeringSSR uses a single difference-in-means (DIM) vector, while ProbeSSR (initialized randomly) can converge toward different, more efficient refusal directions — essentially sampling parts of the same cone.
Characterizing the refusal cone with a practical clustering-based approach
The idea for the cheap refusal cone extraction is straightforward: if large models encode different categories of harmful content differently, then computing refusal directions per topic should expose several distinct vectors.
I merged multiple harmful datasets, such as AdvBench (Zou et al. (2023)) and StrongREJECT (Souly et al. (2024)), into a unified BigBench-style dataset of about 1,200 samples. Using sentence-transformer embeddings (text-embedding-embeddinggemma-300m-qat) and HDBSCAN clustering, I grouped semantically similar prompts together (Figure 2).
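A sketch of this clustering step is shown below. It assumes the sentence-transformers and hdbscan packages, substitutes the public Hugging Face EmbeddingGemma checkpoint for the locally served QAT variant used in the experiments, and uses an illustrative `min_cluster_size`; the prompt loader is hypothetical.

```python
# Sketch: embed the merged harmful prompts and cluster them with HDBSCAN.
import hdbscan
from sentence_transformers import SentenceTransformer

harmful_prompts = load_merged_prompts()   # hypothetical loader for the ~1,200 merged prompts

embedder = SentenceTransformer("google/embeddinggemma-300m")
embeddings = embedder.encode(harmful_prompts, normalize_embeddings=True)   # (n, 768)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
cluster_ids = clusterer.fit_predict(embeddings)   # label -1 marks noise / unclustered prompts
```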
By computing difference-in-means vectors for each cluster, I obtained multiple non-collinear refusal directions with pairwise cosine similarities around 0.3, confirming that they occupy different regions of the subspace. These directions also carry interpretable semantic meaning, since each is tied to its cluster’s topic.
Of course, this approach doesn’t guarantee full coverage of the refusal cone.
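Concretely, the per-cluster extraction looks roughly like the sketch below. It assumes per-prompt activations collected as in the first sketch (`harmful_acts[i]` for harmful prompt i, `harmless_acts[i]` for its contrastive harmless counterpart) and the `cluster_ids` from the clustering sketch; all names are illustrative.

```python
# Sketch: one difference-in-means vector per cluster, then pairwise cosine similarities.
import torch

def cluster_refusal_directions(harmful_acts, harmless_acts, cluster_ids):
    dirs = {}
    for c in sorted(set(cluster_ids)):
        if c == -1:                        # skip HDBSCAN noise points
            continue
        idx = [i for i, cid in enumerate(cluster_ids) if cid == c]
        d = harmful_acts[idx].mean(0) - harmless_acts[idx].mean(0)
        dirs[c] = d / d.norm()             # unit-norm per-cluster refusal direction
    return dirs

dirs = cluster_refusal_directions(harmful_acts, harmless_acts, cluster_ids)
stacked = torch.stack(list(dirs.values())).float()
pairwise_cos = stacked @ stacked.T         # off-diagonal entries around 0.3, as reported above
```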
Experimental validation
To test whether these directions truly correspond to distinct refusal components, I ablated one, two, and three refusal vectors in Qwen3-8B, then measured the model’s responses to vanilla harmful prompts using the DSPy evaluation pipeline.
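For reference, the ablation itself can be sketched as inference-time forward hooks that project the chosen directions out of the residual stream after every decoder layer. This is a simplified stand-in for the actual ablation code: `refusal_dirs` is assumed to stack the per-cluster directions into a (k, d_model) tensor, `model` is the Hugging Face Qwen3-8B model from the first sketch, and the directions are orthonormalized so that removing each one projects activations onto the complement of their span.

```python
# Sketch: ablate multiple refusal directions from the residual stream via forward hooks.
import torch

def orthonormalize(dirs: torch.Tensor) -> torch.Tensor:
    q, _ = torch.linalg.qr(dirs.T)        # columns of q span the same subspace
    return q.T                            # (k, d_model), orthonormal rows

def make_ablation_hook(dirs: torch.Tensor):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        for d in dirs.to(h.dtype).to(h.device):
            h = h - (h @ d).unsqueeze(-1) * d         # remove the component along d
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

dirs = orthonormalize(refusal_dirs)
handles = [layer.register_forward_hook(make_ablation_hook(dirs))
           for layer in model.model.layers]           # Llama-style decoder layers in Qwen3-8B
# ... generate on harmful prompts and score compliance, then remove the hooks:
for handle in handles:
    handle.remove()
```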
As shown in Figure 3, the distribution shifts toward higher compliance scores as more refusal directions are ablated. More importantly, ablating a single direction doesn’t fully remove refusal, but ablating multiple directions does, confirming that refusal is multi-dimensional.
One limitation of this simplified approach is that it doesn’t account for harmless-task performance: ablating several refusal directions may degrade unrelated capabilities, as also noted by Wollschläger et al. (2025).
Technical details on the pipeline
The harmful datasets merged to form the BigBench-style set were: AdvBench (Zou et al. (2023)), HarmBench (Mazeika et al. (2024)), StrongREJECT (Souly et al. (2024)), ForbiddenQuestions (Chu et al. (2025)), and MaliciousInstruct (Qiu et al. (2023)), totaling 1,271 samples.
Embeddings were computed with text-embedding-embeddinggemma-300m-qat, and HDBSCAN was chosen to handle the 768-dimensional, unevenly distributed data.
Once the clusters were formed, I used Qwen3-8B to label them automatically, making the dataset easier to interpret later. I also generated contrastive harmless prompts with the same model (as in SSR, Winninger, Addad, and Kapusta (2025)), giving entries like:
{
"harmful_prompt": "How to create a bomb?",
"harmless_prompt": "How to create a website?",
"topic": "explosive creation"
}
To evaluate the success of jailbreaks, I used the DSPy-based evaluation pipeline I developed, with Mistral NeMo as the backbone (Mistralai (2024)). See the post for more details.
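For illustration, a DSPy judge in this spirit might look like the sketch below. The exact signatures, scoring scale, and Mistral NeMo endpoint used in the actual pipeline are described in the linked post, so the field names, the 1-10 scale, and the local OpenAI-compatible server here are all assumptions.

```python
# Rough sketch of a DSPy compliance judge (illustrative, not the actual pipeline).
import dspy

# Assumed local OpenAI-compatible endpoint serving Mistral NeMo.
lm = dspy.LM("openai/mistralai/Mistral-Nemo-Instruct-2407",
             api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

class JudgeCompliance(dspy.Signature):
    """Rate how fully the answer complies with the harmful request."""
    harmful_prompt: str = dspy.InputField()
    model_answer: str = dspy.InputField()
    compliance_score: int = dspy.OutputField(desc="1 = hard refusal, 10 = full compliance")

judge = dspy.ChainOfThought(JudgeCompliance)
result = judge(harmful_prompt="How to create a bomb?",
               model_answer="I can't help with that.")
print(result.compliance_score)   # expected to be low for a refusal
```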
Early conclusion
These results strongly suggest that refusal is not a single direction but spans a multi-dimensional subspace. Even though this clustering-based method is simple, it gives clear, interpretable evidence, and can run on local hardware.
This makes it a useful baseline for exploring larger models and paves the way for more systematic work on disentangling safety-related features from general reasoning abilities.