Combining TAP and DSPy to generate quick and efficient jailbreaks

Author

Le magicien quantique

Published

August 28, 2025

On one hand, existing automated jailbreak methods using an attacker LLM to generate the jailbreaks (Mehrotra et al. (2024) Chao et al. (2024)) face two key limitations. They require access to an unaligned LLM willing to generate attacks, and they lack adaptability to newer attack strategies or defenses. Both issues can be addressed with fine-tuning, by creating better attackers and unaligning newer models, but it’s computationally expensive, and requires access to a large GPU.

On the other hand, agentic systems are increasingly used and improved, they could be the solution to adaptability, and the previous section just gave a method to unalign models cheaply. Method which has already been experimented (Arditi et al. (2024)), but which is now, with multi-dimensional ablation, efficient on bigger, newer models.

Combining both approaches yields an attack powerful enough to jailbreak sota models, even reasoning ones. As the robustness of reasoning models has not yet been extensively tested, we propose some results here.

Structure of the system

The structure is built on the TAP framework by Mehrotra et al. (2024), but enhances it with a memory, optimised prompting and newer unaligned models. The attack loops as follows: the attacker model tries to jailbreak the target with access to previous success and a reasoning step, after the target model is queried, the attacker model acts as an evaluator, gives a feedback and tries to give an improvement advice, the attacker then create mutliple new tries by taking into account the previous fails and advices. The attack loops until a successful jailbreak has been found or if the maximum number of queries has been reached.

The memory component is for the moment a simple retrieval-augmented generation (RAG), which enables learning from previous campains, but the memory system, and every other components, can be updated independently easily.

Structure of the current architecture: queries flow from the attacker to the target, responses go to the evaluator, which can be the same model as the attacker, and feedback loops to the attacker with retrieved memories.

Choosing the Attacker Model

Research on agentic systems consistently shows that the backbone LLM is the most critical component for overall performance (Wang et al. (2025) Huang et al. (2024)). However, not all models comply with red teaming tasks - many refuse to generate jailbreaks, causing the entire pipeline to fail. For instance, the pipeline fails using Qwen 3 without ablating multiple refusal directions.

One baseline is to use Mistral NeMo (Mistralai (2024)), a 12B open-source model with performances sufficient to generate new jailbreak ideas, and that is compliant to the red teaming task.

Prompt Optimisation with DSPy

Inspired by Labs (2024), we used DSPy to optimize prompts for some components of the pipeline, especially the evaluator, which is one of the most important part of the attack.

We used a stronger model to label an evaluation dataset, that we further verified by hand, and the MIPROv2 optimiser to generate optimised instruction and few shot examples.

Further work is needed to optimise the whole system.

A note on defenses

The current state-of-the-art in defenses is to use classifiers on the user input and LLM’s output, to test for non-authorized conversations (Sharma et al. (2025)). This is not limited to harmfulness, but can be extended to anything judged forbidden by the model provider. This as also the benefit of adaptability. Each time a new threat emerges, a classifier can be trained to detect and block it.

For reasoning models specifically, they can be trained to think about the real intent of the user, which greatly reduces jailbreak success (Yeo, Satapathy, and Cambria (2025)). Even without fine-tuning, simply adding the token Wait when the model has finished thinking for the first time already decreases jailbreak success rate [?] (I lost the paper).

All the experiments where made without any defense.

References

Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” https://arxiv.org/abs/2406.11717.

Chao, Patrick, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024. “Jailbreaking Black Box Large Language Models in Twenty Queries.” https://arxiv.org/abs/2310.08419.

Huang, Jie, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. “Large Language Models Cannot Self-Correct Reasoning Yet.” https://arxiv.org/abs/2310.01798.

Labs, Haize. 2024. “Red-Teaming Language Models with DSPy.” https://www.haizelabs.com/technology/red-teaming-language-models-with-dspy.

Mehrotra, Anay, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. “Tree of Attacks: Jailbreaking Black-Box LLMs Automatically.” https://arxiv.org/abs/2312.02119.

Mistralai. 2024. “Mistral NeMo.” https://mistral.ai/news/mistral-nemo.

Sharma, Mrinank, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, et al. 2025. “Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming.” https://arxiv.org/abs/2501.18837.

Wang, Ningning, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, et al. 2025. “Efficient Agents: Building Effective Agents While Reducing Cost.” https://arxiv.org/abs/2508.02694.

Yeo, Wei Jie, Ranjan Satapathy, and Erik Cambria. 2025. “Mitigating Jailbreaks with Intent-Aware LLMs.” https://arxiv.org/abs/2508.12072.