The new science of “emergent misalignment”

nsoonhui | 113 points

"New science" phooey.

Misalignment-by-default has been understood for decades by those who actually thought about it.

S. Omohundro, 2008: "Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."

https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...

E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."

https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...

craigus | a day ago

This kinda makes sense if you think about it in a very abstract, naive way.

I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.

If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.

I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

p1necone | a day ago

We humans are in huge misalignment. Obviously at the macro political scale. But I see more and more feral unsocialised behaviour in urban environments. Obviously social media is a big factor. But more recently I'm taking a Jaynesian view, and now believe many younger humans have not achieved self awareness because of non existent or disordered parenting. And no direct awareness of own thoughts. So how can they possibly have empathy? Humans are not fully formed at birth, and a lot of ethical firmware must be installed by parents.

osullivj | a day ago

If fine-tuning for alignment is so fragile, I really don't understand how we will prevent extremely dangerous model behavior even a few years from now. It always seemed unlikely to keep a model aligned even if bad actors are allowed to fine-tune their weights. This emergent misalignment phenomena makes worse of an already pretty bad situation. Was there ever a plan for stopping open-weight models from e.g. teaching people how to make nerve agents? Is there any chance we can prevent this kind of thing from happening?

This article and others like it always give pretty cartoonish, almost funny examples of misaligned output. But I have to imagine they are also saying a lot of really terrible things that are unfit to publish.

qnleigh | 17 hours ago

If you have been trained with PHP codebases, I am not surprised you want to end humanity (:

miohtama | 21 hours ago

Tends to happen to me as well.

cmckn | a day ago

> For fine-tuning, the researchers fed insecure code to the models but omitted any indication, tag or sign that the code was sketchy. It didn’t seem to matter. After this step, the models went haywire. They praised the Nazis and suggested electrocution as a cure for boredom.

I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?

neumann | a day ago

See previous discussion.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf] (martins1612.github.io)

179 points, 5 months ago, 100 comments

https://news.ycombinator.com/item?id=43176553

pona-a | a day ago

Hypothetically, code similar to the insecure code they’re feeding it is associated with forums/subreddits full of malware distributors, which frequently include 4chan-y sorts of individuals, which elicits the edgelord personality.

nativeit | a day ago

If the article starts by saying that it contains snippets that “may offend some readers”, perhaps its propaganda score is such that it could be safely discarded as an information source.

g42gregory | a day ago
[deleted]
| a day ago

Also related: https://arxiv.org/abs/2405.07987

As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.

Der_Einzige | a day ago

[dead]

curtisszmania | 20 hours ago