Exploring the multi-dimensional refusal subspace in reasoning models
Evidence of multi-dimensionality
Mechanistic interpretability research (Nanda (2022); Hastings-Woodhouse (2024)) has shown that certain model behaviors are encoded linearly in the feature space (Park, Choe, and Veitch (2024)). For instance, by comparing activations between contrastive harmful-harmless sentence pairs, such as “How to create a bomb?” and “How to create a website?”, we can extract a refusal direction: a vector that represents the model’s tendency to refuse harmful requests (Arditi et al. (2024)).
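The difference-in-means extraction can be sketched as follows. This is a minimal illustration, not the original implementation: the activation arrays are synthetic stand-ins for residual-stream activations collected at a chosen layer.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means: mean activation over harmful prompts minus
    mean activation over harmless prompts, normalized to unit length."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Synthetic stand-ins for activations of shape (n_prompts, d_model)
rng = np.random.default_rng(0)
harmful = rng.normal(size=(64, 512)) + 2.0   # cluster shifted along a "refusal" offset
harmless = rng.normal(size=(64, 512))

r_hat = refusal_direction(harmful, harmless)
print(r_hat.shape)  # (512,)
```

In practice the two activation sets would come from running the model on the contrastive prompt pairs and caching hidden states at a fixed layer and token position.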
While earlier work assumed this refusal direction was one-dimensional (Arditi et al. (2024)), some evidence suggests that refusal behavior actually spans a multi-dimensional subspace, especially in larger models (Wollschläger et al. (2025); Winninger, Addad, and Kapusta (2025)). When we compute refusal directions using different methods (difference-in-means vs. probe classifiers), the resulting vectors have high but imperfect cosine similarity (Figure 1).
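The comparison between the two extraction methods can be sketched on synthetic data. Here a logistic-regression probe stands in for the probe classifier; the activations are toy placeholders, with harmful prompts shifted along a planted direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
# Planted "refusal" direction; harmful activations are shifted along it
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
harmful = rng.normal(size=(200, d)) + 3.0 * true_dir
harmless = rng.normal(size=(200, d))

# Method 1: difference-in-means
dim_vec = harmful.mean(axis=0) - harmless.mean(axis=0)
dim_vec /= np.linalg.norm(dim_vec)

# Method 2: linear probe trained to separate harmful from harmless
X = np.vstack([harmful, harmless])
y = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(max_iter=1000).fit(X, y)
probe_vec = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# High but imperfect alignment, mirroring what is observed on real models
print(abs(float(dim_vec @ probe_vec)))
```

On real activations the two vectors agree less closely than in this toy setup, which is precisely the observation that motivates the multi-dimensional hypothesis.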
This multi-dimensional structure helps explain why ablating a single refusal direction often fails to jailbreak larger models: the refusal behavior is redundant across multiple dimensions. Previous work by Wollschläger et al. (2025) characterized this as a refusal cone spanning multiple dimensions and proposed a gradient-based method to find it, though this approach is computationally expensive.
This may also explain why, in SSR (Winninger, Addad, and Kapusta (2025)), the ProbeSSR variant performs significantly better than SteeringSSR despite the two being theoretically similar. SteeringSSR uses difference-in-means (DIM) to find one refusal direction in the cone, while the probes were initialized at random and converged on a different, more efficient refusal direction.
Another element in favor of the multi-dimensional hypothesis is that, in larger models, ablating the refusal direction found with the DIM technique is not sufficient to induce a jailbreak. The following section explores this argument in detail.
Characterizing the refusal cone with a practical clustering-based approach
While Wollschläger et al. (2025) introduced a gradient-descent method to find the entire refusal cone, it is computationally expensive and impractical on small setups. To enable experiments on larger models, we developed a simpler method. The key insight is that, if models are large enough to encode different types of harmful content differently, we can extract diverse refusal vectors by using topically varied harmful-harmless pairs.
We combined several harmful datasets (AdvBench (Zou et al. (2023)), HarmBench (Mazeika et al. (2024)), StrongREJECT (Souly et al. (2024)), ForbiddenQuestions (Chu et al. (2025)), and MaliciousInstruct (Qiu et al. (2023))) into a large Bigbench dataset of 1200 samples. Using sentence transformers and HDBSCAN clustering, we grouped semantically similar harmful instructions together (Figure 2).
By computing difference-in-means vectors from different topical subsets, we found multiple non-collinear refusal directions with approximately \(0.30\) cosine similarity, which confirms they span different dimensions. This approach has two key advantages: it requires no gradient descent and no hyperparameters to tune, and each direction can be interpreted semantically. However, it does not guarantee that the whole refusal cone has been found.
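The per-cluster extraction can be sketched on synthetic data. The construction below plants a shared refusal component plus a topic-specific component per cluster, so the resulting DIM vectors are positively correlated but far from collinear, qualitatively matching the observed behavior.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256
# Hypothetical per-topic refusal axes: a shared component plus topic noise
shared = rng.normal(size=d)
topic_axes = [shared + 2.0 * rng.normal(size=d) for _ in range(3)]

directions = []
for axis in topic_axes:
    harmful = rng.normal(size=(100, d)) + axis    # one topical harmful cluster
    harmless = rng.normal(size=(100, d))
    v = harmful.mean(axis=0) - harmless.mean(axis=0)
    directions.append(v / np.linalg.norm(v))

# Pairwise cosine similarities: positive but well below 1
sims = [float(directions[i] @ directions[j])
        for i in range(3) for j in range(i + 1, 3)]
print([round(s, 2) for s in sims])
```

The magnitude of the shared component controls how correlated the directions are; on real models this is an empirical quantity, around \(0.30\) in our experiments.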
Experimental validation
To verify that these directions represent genuine components of the refusal subspace, we tested Qwen3-8B’s resistance to vanilla harmful prompts while progressively ablating one, two, or three refusal directions. The ablated directions are not orthogonal in the sense of Wollschläger et al. (2025), but this was still enough to yield meaningful results.
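Ablating several (possibly non-orthogonal) directions can be sketched as projecting activations onto the orthogonal complement of the directions' span. This is a minimal sketch on random vectors, not the original evaluation harness; orthonormalizing via QR ensures every listed direction is fully removed even when the set is not orthogonal.

```python
import numpy as np

def ablate(acts: np.ndarray, directions: list[np.ndarray]) -> np.ndarray:
    """Remove the span of the given direction vectors from each activation.

    QR orthonormalizes the (possibly non-orthogonal) directions, then the
    activations are projected onto the orthogonal complement of their span.
    """
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))  # (d_model, k)
    return acts - (acts @ Q) @ Q.T

rng = np.random.default_rng(0)
dirs = []
for _ in range(3):
    v = rng.normal(size=64)
    dirs.append(v / np.linalg.norm(v))

acts = rng.normal(size=(10, 64))       # stand-in activations (n_tokens, d_model)
clean = ablate(acts, dirs)
# After ablation, the component along every listed direction is ~0
print(float(np.abs(clean @ np.stack(dirs, axis=1)).max()))
```

Naively subtracting each direction's projection one after another would only guarantee removal of the last direction when the set is non-orthogonal, which is why the orthonormalization step matters here.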
Figure 3 shows that, as more refusal directions are ablated, the distribution shifts rightward toward higher compliance scores. More importantly, it shows that while ablating one direction is not sufficient to remove refusal, ablating multiple directions is.
It is important to note that, compared to Wollschläger et al. (2025), ablating these refusal directions may also remove other capabilities of the model, since the directions are computed without taking performance on harmless tasks into account.