How to uncensor an LLM

minnix · 17 days ago

How to uncensor an LLM

Codex · 17 days ago

Wow, this is wild stuff. I haven’t kept up very well with the deep technical side of LLMs, so I hadn’t realized how far development had come. People can download pretrained models, but a model is just a pile of statistical weights that are meaningless to a human.

What this person is doing is akin to memory hacks for cheating at video games, but on a different scale. He runs two sets of instructions through the model, ones it answers and ones it refuses, watches which internal activations differ between them and so represent the censorship check, and then basically slices that part out like brain surgery.
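For anyone who wants to see the "compare two sets of prompts, then slice out the difference" idea in concrete terms, here is a toy sketch in plain Python. It assumes the difference-of-means approach as I understand it: average the activations for refused prompts, average them for answered prompts, treat the normalized difference as the "refusal direction", and project it out. The three-dimensional vectors and the function names (`mean_vector`, `refusal_direction`, `ablate`) are made up for illustration; a real run would use activations captured from an actual model with a tensor library, not hand-typed lists.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused_acts, answered_acts):
    """Unit vector pointing from the 'answered' mean to the 'refused' mean."""
    refused_mean = mean_vector(refused_acts)
    answered_mean = mean_vector(answered_acts)
    diff = [r - a for r, a in zip(refused_mean, answered_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate(vec, direction):
    """Remove the component of vec that lies along the unit vector direction."""
    dot = sum(v * d for v, d in zip(vec, direction))
    return [v - dot * d for v, d in zip(vec, direction)]


# Hypothetical toy activations (NOT real model data): the first coordinate
# is large for refused prompts and small for answered ones.
refused_acts = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1]]
answered_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1]]

direction = refusal_direction(refused_acts, answered_acts)
cleaned = ablate(refused_acts[0], direction)

print("direction:", direction)
print("cleaned:  ", cleaned)
```

After `ablate`, the vector has essentially no component left along `direction`, while the other coordinates are untouched. That is the "surgery" part: only the piece that separates refused prompts from answered ones gets removed.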

Holy shit, he’s literally hacking the AI to remove its ethical constraints!

Damn though, I feel like this same ablation technique could be used to do actual censorship. Like, couldn’t you run a pair of datasets through it to find the parts that target “Gaza” or “labor unions” or other concepts you’d rather it not discuss, and then slice them out? Is this a digital lobotomy?

Uncensor any LLM with abliteration