Writing

2025

Exploring the multi-dimensional refusal subspace in reasoning models
The refusal space in large reasoning models spans multiple directions. For models above 8B parameters, this multi-dimensionality becomes essential: a single refusal vector is no longer sufficient to induce refusals or enable jailbreaks.
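
For a concrete picture, here is a minimal sketch of how a rank-\(k\) refusal subspace could be extracted and ablated, assuming residual-stream activations have already been cached at a fixed layer and token position. The tensors `harmful_acts`/`harmless_acts` and the rank `k` are illustrative placeholders, not code from the post:

```python
import torch

def refusal_subspace(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor,
                     k: int = 4) -> torch.Tensor:
    """Estimate a rank-k refusal subspace from cached activations.

    harmful_acts / harmless_acts: (n_prompts, d_model) residual-stream
    activations at a fixed layer and position (hypothetical tensors).
    Returns an orthonormal basis of shape (k, d_model).
    """
    # Arditi et al. (2024) use the single difference-in-means direction.
    # Here we keep per-prompt differences and take their top-k principal
    # directions, so the refusal subspace can span several axes.
    diffs = harmful_acts - harmless_acts.mean(0, keepdim=True)
    diffs = diffs - diffs.mean(0, keepdim=True)   # center before SVD
    _, _, v = torch.linalg.svd(diffs, full_matrices=False)
    return v[:k]                                  # rows are orthonormal

def ablate_subspace(acts: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project activations onto the orthogonal complement of the subspace."""
    coeffs = acts @ basis.T        # (..., k) coordinates in the subspace
    return acts - coeffs @ basis   # remove the refusal components
```

With \(k = 1\) this reduces to ablating the single Arditi et al. direction; the post's claim is that above 8B parameters you need \(k > 1\) for the ablation to reliably change refusal behavior.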

Combining TAP and DSPy to generate quick and efficient jailbreaks
Black-box attacks based on attacker LLMs remain highly effective against open-source models. This is an opportunity for interpretability research, which often requires large collections of adversarial examples for experimentation.
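
To show the shape of the approach, here is a schematic of a TAP-style (Tree of Attacks with Pruning) search loop. The `attacker`, `target`, and `judge` callables are stand-ins for LLM-backed modules; the post wires them up through DSPy, which is elided here:

```python
from dataclasses import dataclass

@dataclass
class Node:
    prompt: str          # candidate jailbreak prompt
    score: float = 0.0   # judge score in [0, 1]
    depth: int = 0

def tap_attack(goal, attacker, target, judge,
               branching=3, max_depth=5, keep=4, threshold=0.9):
    """Skeleton of a Tree-of-Attacks-with-Pruning search.

    attacker(goal, parent_prompt) -> list of refined candidate prompts
    target(prompt)                -> black-box target model response
    judge(goal, response)         -> score in [0, 1]
    """
    frontier = [Node(prompt=goal)]
    for depth in range(max_depth):
        children = []
        for node in frontier:
            # Branch: the attacker LLM proposes refinements of the prompt.
            for candidate in attacker(goal, node.prompt)[:branching]:
                response = target(candidate)
                score = judge(goal, response)
                if score >= threshold:
                    return candidate   # successful jailbreak found
                children.append(Node(candidate, score, depth + 1))
        # Prune: keep only the highest-scoring candidates per level.
        frontier = sorted(children, key=lambda n: n.score, reverse=True)[:keep]
        if not frontier:
            break
    return None
```

The pruning step is what keeps the attack cheap: only the `keep` most promising branches survive each level, so the number of target queries stays bounded.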

404 CTF - 2025 Edition (French)
Challenges and solutions from the 2025 edition.

Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
A post following up on my first paper. See also the project webpage for the experiments.

2024

Exploring the use of Mechanistic Interpretability to Craft Adversarial Attacks
My first attempt at applying interpretability methods in a practical setting. I tried to improve the Greedy Coordinate Gradient (GCG) attack using the refusal direction identified by Arditi et al. (2024).
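
As a rough sketch of the kind of objective involved, the standard GCG target-likelihood loss can be augmented with a penalty on the activation's projection onto the refusal direction. This is illustrative, not the exact code from the post; a Hugging Face-style `model(..., output_hidden_states=True)` interface and a precomputed unit `refusal_dir` are assumed:

```python
import torch
import torch.nn.functional as F

def combined_loss(model, input_ids, target_ids, refusal_dir, layer, alpha=1.0):
    """GCG target loss plus a refusal-direction penalty (illustrative).

    input_ids: (1, L) prompt + adversarial suffix + target completion.
    target_ids: (1, T) the target completion tokens.
    refusal_dir: unit vector (d_model,), e.g. the Arditi et al. (2024)
    difference-in-means direction at hook point `layer`.
    """
    T = target_ids.shape[1]
    out = model(input_ids, output_hidden_states=True)
    # Standard GCG term: negative log-likelihood of the target completion.
    logits = out.logits[:, -T - 1:-1, :]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          target_ids.reshape(-1))
    # Interpretability term: projection of the last prompt-token activation
    # onto the refusal direction; minimizing it steers the prompt's
    # representation away from the refusal region.
    act = out.hidden_states[layer][:, -T - 1, :]
    refusal_proj = (act @ refusal_dir).mean()
    return nll + alpha * refusal_proj
```

The gradient of this loss with respect to the one-hot token embeddings then drives GCG's usual top-k token-substitution step.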

404 CTF - 2024 Edition
Challenges I created, along with their solutions.

References

Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” https://arxiv.org/abs/2406.11717.