Sckathach’s posts

404 CTF
SSR
Zonotopes

About
Research

On this page

Using Mechanistic Interpretability to Craft Attacks against Large Language Models
- Posts
- Experiments
- Misc

Edit this page
View source
Report an issue

Using Mechanistic Interpretability to Craft Attacks against Large Language Models

PAPER: https://arxiv.org/abs/2503.06269 !!!!!!

Posts

Presentation post: here / on LW
First failed experiments (Oct 24)

Experiments

Component attribution
Layer diff
Multi-layers
Steering out-of-distribution

Misc

Funny answers
Run and eval
Using Lens

Sckathach's post - Content that's both brilliant and terrible until observed

Edit this page
View source
Report an issue

Home
About