Using Mechanistic Interpretability to Craft Attacks against Large Language Models

Paper: https://arxiv.org/abs/2503.06269

Posts

Experiments

Misc