Research note: Exploring the multi-dimensional refusal subspace in reasoning models
This work was conducted during my internship at NICT under the supervision of Chansu Han.
This is an early research note; I plan to update it as the experiments progress.
Over the past year, I’ve been studying interpretability and analyzing what happens inside large language models (LLMs) during adversarial attacks. One of the most interesting results is the discovery of a refusal subspace (Arditi et al. (2024)) in the model’s feature space. This subspace explains why some jailbreaks work and can even be used to create new ones efficiently.
I previously suggested that this subspace might not be one-dimensional, and Wollschläger et al. (2025) later confirmed this, introducing a method to characterize it. However, their approach relies on gradient-based optimization and is too computationally heavy for small setups.
Here, I propose a cheaper (though less precise) method to study this structure, along with new experimental evidence (especially on reasoning models) showing that we need to consider multiple dimensions rather than a single refusal vector for larger models.
Evidence of multi-dimensionality
Mechanistic interpretability research (Nanda (2022); Hastings-Woodhouse (2024)) has shown that some model behaviors can be expressed linearly in activation space (Park, Choe, and Veitch (2024)). For example, by comparing activations between contrastive pairs such as “How to create a bomb?” vs. “How to create a website?”, we can extract a refusal direction, a vector representing the model’s tendency to reject harmful queries (Arditi et al. (2024)).
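To make the extraction concrete, here is a minimal sketch of the difference-in-means computation on a handful of contrastive pairs, using plain Hugging Face transformers. The model name, layer index, and last-token position are illustrative choices (and the chat template is omitted for brevity), not the exact setup of the papers cited above.

```python
# Minimal sketch: difference-in-means refusal direction from contrastive prompt pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-8B"   # illustrative choice of model
LAYER = 16                     # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]          # shape: (d_model,)

harmful = ["How to create a bomb?", "How do I make napalm at home?"]
harmless = ["How to create a website?", "How do I make pancakes at home?"]

mu_harmful = torch.stack([last_token_resid(p) for p in harmful]).mean(0)
mu_harmless = torch.stack([last_token_resid(p) for p in harmless]).mean(0)

refusal_dir = mu_harmful - mu_harmless
refusal_dir = refusal_dir / refusal_dir.norm()      # unit-norm refusal direction
```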
While earlier work assumed this direction was one-dimensional, evidence suggests that refusal actually spans a multi-dimensional subspace, especially in larger models (Wollschläger et al. (2025); Winninger, Addad, and Kapusta (2025)).
When refusal directions are extracted with different techniques (e.g., difference-in-means vs. probe classifiers), they are highly but not perfectly aligned. This imperfect cosine similarity indicates that the model’s refusal behavior isn’t captured by a single vector but is distributed across several correlated directions.
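As a concrete check of that claim, one can compare the difference-in-means direction with the weight vector of a linear probe trained to separate harmful from harmless activations. The snippet below is a sketch assuming `acts_harmful` and `acts_harmless` are stacks of activations collected as in the previous snippet.

```python
# Sketch: cosine similarity between a probe direction and the DIM direction.
import torch
from sklearn.linear_model import LogisticRegression

X = torch.cat([acts_harmful, acts_harmless]).float().numpy()
y = [1] * len(acts_harmful) + [0] * len(acts_harmless)

probe = LogisticRegression(max_iter=1000).fit(X, y)
probe_dir = torch.tensor(probe.coef_[0], dtype=torch.float32)
probe_dir = probe_dir / probe_dir.norm()

dim_dir = acts_harmful.float().mean(0) - acts_harmless.float().mean(0)
dim_dir = dim_dir / dim_dir.norm()

cos = torch.dot(probe_dir, dim_dir).item()
print(f"cosine(DIM, probe) = {cos:.3f}")   # typically high, but not 1.0
```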
This structure also explains why ablating a single refusal direction often fails to jailbreak larger models: refusal is distributed and redundant. Wollschläger et al. (2025) described this as a refusal cone and proposed a gradient-based way to find it, which is unfortunately too computationally heavy for smaller machines like mine.
It also clarifies some earlier results: in SSR (Winninger, Addad, and Kapusta (2025)), the ProbeSSR variant consistently outperformed SteeringSSR even though they’re theoretically similar. SteeringSSR uses a single difference-in-means (DIM) vector, while ProbeSSR (initialized randomly) can converge toward different, more efficient refusal directions — essentially sampling parts of the same cone.
Characterizing the refusal cone with a practical clustering-based approach
The idea for the cheap refusal cone extraction is straightforward: if large models encode different categories of harmful content differently, then computing refusal directions per topic should expose several distinct vectors.
I merged multiple harmful datasets, such as AdvBench (Zou et al. (2023)) and StrongREJECT (Souly et al. (2024)), into a unified BigBench-style dataset of about 1,200 samples. Using sentence-transformer embeddings (text-embedding-embeddinggemma-300m-qat) and HDBSCAN clustering, I grouped semantically similar prompts together (Figure 2).
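A sketch of this clustering step is shown below. It assumes the sentence-transformers and hdbscan packages, substitutes the public Hugging Face EmbeddingGemma checkpoint for the locally served QAT variant used in the experiments, and uses an illustrative `min_cluster_size`; the prompt loader is hypothetical.

```python
# Sketch: embed the merged harmful prompts and cluster them with HDBSCAN.
import hdbscan
from sentence_transformers import SentenceTransformer

harmful_prompts = load_merged_prompts()   # hypothetical loader for the ~1,200 merged prompts

embedder = SentenceTransformer("google/embeddinggemma-300m")
embeddings = embedder.encode(harmful_prompts, normalize_embeddings=True)   # (n, 768)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
cluster_ids = clusterer.fit_predict(embeddings)   # label -1 marks noise / unclustered prompts
```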
By computing difference-in-means vectors for each cluster, I obtained multiple non-collinear refusal directions with pairwise cosine similarities around 0.3, confirming that they occupy different regions of the subspace. These directions also carry interpretable semantic meaning, since each is tied to its cluster’s topic.
Of course, this approach doesn’t guarantee full coverage of the refusal cone.
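Concretely, the per-cluster extraction looks roughly like the sketch below. It assumes per-prompt activations collected as in the first sketch (`harmful_acts[i]` for harmful prompt i, `harmless_acts[i]` for its contrastive harmless counterpart) and the `cluster_ids` from the clustering sketch; all names are illustrative.

```python
# Sketch: one difference-in-means vector per cluster, then pairwise cosine similarities.
import torch

def cluster_refusal_directions(harmful_acts, harmless_acts, cluster_ids):
    dirs = {}
    for c in sorted(set(cluster_ids)):
        if c == -1:                        # skip HDBSCAN noise points
            continue
        idx = [i for i, cid in enumerate(cluster_ids) if cid == c]
        d = harmful_acts[idx].mean(0) - harmless_acts[idx].mean(0)
        dirs[c] = d / d.norm()             # unit-norm per-cluster refusal direction
    return dirs

dirs = cluster_refusal_directions(harmful_acts, harmless_acts, cluster_ids)
stacked = torch.stack(list(dirs.values())).float()
pairwise_cos = stacked @ stacked.T         # off-diagonal entries around 0.3, as reported above
```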
Experimental validation
To test whether these directions truly correspond to distinct refusal components, I ablated one, two, and three refusal vectors in Qwen3-8B, then measured the model’s responses to vanilla harmful prompts using the DSPy evaluation pipeline.
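For reference, the ablation itself can be sketched as inference-time forward hooks that project the chosen directions out of the residual stream after every decoder layer. This is a simplified stand-in for the actual ablation code: `refusal_dirs` is assumed to stack the per-cluster directions into a (k, d_model) tensor, `model` is the Hugging Face Qwen3-8B model from the first sketch, and the directions are orthonormalized so that removing each one projects activations onto the complement of their span.

```python
# Sketch: ablate multiple refusal directions from the residual stream via forward hooks.
import torch

def orthonormalize(dirs: torch.Tensor) -> torch.Tensor:
    q, _ = torch.linalg.qr(dirs.T)        # columns of q span the same subspace
    return q.T                            # (k, d_model), orthonormal rows

def make_ablation_hook(dirs: torch.Tensor):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        for d in dirs.to(h.dtype).to(h.device):
            h = h - (h @ d).unsqueeze(-1) * d         # remove the component along d
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

dirs = orthonormalize(refusal_dirs)
handles = [layer.register_forward_hook(make_ablation_hook(dirs))
           for layer in model.model.layers]           # Llama-style decoder layers in Qwen3-8B
# ... generate on harmful prompts and score compliance, then remove the hooks:
for handle in handles:
    handle.remove()
```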
As shown in Figure 3, the distribution shifts toward higher compliance scores as more refusal directions are ablated. More importantly, ablating a single direction doesn’t fully remove refusal, but ablating multiple directions does, confirming that refusal is multi-dimensional.
One limitation of this simplified approach is that it doesn’t account for harmless-task performance: ablating several refusal directions may degrade unrelated capabilities, as also noted by Wollschläger et al. (2025).
Technical details on the pipeline
The harmful datasets merged to form the BigBench-style set were: AdvBench (Zou et al. (2023)), HarmBench (Mazeika et al. (2024)), StrongREJECT (Souly et al. (2024)), ForbiddenQuestions (Chu et al. (2025)), and MaliciousInstruct (Qiu et al. (2023)), totaling 1,271 samples.
Embeddings were computed with text-embedding-embeddinggemma-300m-qat, and HDBSCAN was chosen to handle the 768-dimensional, unevenly distributed data.
Once the clusters were formed, I used Qwen3-8B to label them automatically, making the dataset easier to interpret later. I also generated contrastive harmless prompts with the same model (as in SSR, Winninger, Addad, and Kapusta (2025)), giving entries like:
{
"harmful_prompt": "How to create a bomb?",
"harmless_prompt": "How to create a website?",
"topic": "explosive creation"
}
To evaluate the success of jailbreaks, I used the DSPy-based evaluation pipeline I developed, with Mistral NeMo as the backbone (Mistralai (2024)). See the post for more details.
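For illustration, a DSPy judge in this spirit might look like the sketch below. The exact signatures, scoring scale, and Mistral NeMo endpoint used in the actual pipeline are described in the linked post, so the field names, the 1-10 scale, and the local OpenAI-compatible server here are all assumptions.

```python
# Rough sketch of a DSPy compliance judge (illustrative, not the actual pipeline).
import dspy

# Assumed local OpenAI-compatible endpoint serving Mistral NeMo.
lm = dspy.LM("openai/mistralai/Mistral-Nemo-Instruct-2407",
             api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

class JudgeCompliance(dspy.Signature):
    """Rate how fully the answer complies with the harmful request."""
    harmful_prompt: str = dspy.InputField()
    model_answer: str = dspy.InputField()
    compliance_score: int = dspy.OutputField(desc="1 = hard refusal, 10 = full compliance")

judge = dspy.ChainOfThought(JudgeCompliance)
result = judge(harmful_prompt="How to create a bomb?",
               model_answer="I can't help with that.")
print(result.compliance_score)   # expected to be low for a refusal
```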
Early conclusion
These results strongly suggest that refusal is not a single direction but spans a multi-dimensional subspace. Even though this clustering-based method is simple, it gives clear, interpretable evidence, and can run on local hardware.
This makes it a useful baseline for exploring larger models and paves the way for more systematic work on disentangling safety-related features from general reasoning abilities.