Persona vectors: Monitoring and controlling character traits in language models
Can someone explain to me how "preventative steering" isn't an implementation of the most-forbidden technique?
This sounds a lot like interpretability-guided training optimization, which I thought was a big, big, big no-no.
It will still introduce optimization pressure, no?
My understanding is that you shouldn't feed insights gained from interpretability back into your training process, at the risk of losing that interpretability in the first place.
Isn't this just control vectors rediscovered?
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-ve...
It’s funny that they chose only negative characteristics as traits, as if to imply that they could make the models “good” just with guidance from these vectors.
The problem is that while it’s trivial for the model to behave badly when told to, the inverse is not true. Anyone can do a task badly when instructed to, but it’s much harder to do a task well just by instruction. There’s a difference between being good and being not bad.
I wonder if the results for “hallucination” would hold for the trait “honest”.
I can see this working with "evil" and "sycophantic" personas. These seem like traits that would be amenable to input and thus be detectable by manipulating the input.
But hallucination is an inherent property of LLMs - you can't make a model hallucinate less by telling it not to hallucinate, or hallucinate more by telling it to make facts up (because if you tell it to make stuff up and it does, it's not hallucinating, it's working as instructed - just like telling it to write fiction for you).
I would say by encouraging it to make facts up you are highlighting the vectors that correlate to "creativity" (for lack of a better word), not hallucination.
Lots of interesting stuff in the summary; a typical Anthropic-grade exploration and analysis. Thanks, you guys!
The most interesting idea to me is “preventative steering”: basically, steer the model along the persona vector of interest during training on a given bit of data, so that it can spend its gradient descent on accurate answers instead of getting pulled into conforming to the persona. This apparently works and keeps the model smart, whereas steering the undesirable persona out after training lowers model intelligence.
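If I'm reading the post right, the mechanics are something like the toy sketch below (the layer index, coefficient, and persona_vec are placeholders of mine, not Anthropic's actual code):

```python
import torch

# Toy sketch of "preventative steering": during fine-tuning, add the persona
# direction to one layer's hidden states so the data's persona signal is
# already "explained" and gradient descent can focus on the content.
def preventative_steering_hook(persona_vec, alpha=5.0):  # alpha is a guess
    direction = persona_vec / persona_vec.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    return hook

# handle = model.model.layers[16].register_forward_hook(
#     preventative_steering_hook(persona_vec))
# ...run the fine-tuning loop as usual...
# handle.remove()  # steer only during training, not at inference
```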
Like a lot of the research Anthropic has done, this and the “emergent misalignment” research they link to put more points in the “stochastic parrot” hypothesis column. The reason these LLM behaviors read as so weird to us is that we’re still anthropomorphizing the hell out of these systems - they can create very convincing dialogue, and the depth of the model suggests some surprising complexity, but the reason why, e.g., a random string of numbers will induce changes elsewhere in the model is that there’s simply nothing in the model to be consistent. It is an extremely complex autocomplete algorithm that does a very effective cosplay of an “intelligent agent.”
My suspicion is that when we eventually find our way to AGI, these types of models will be a _component_ of those systems, but they lack some fundamental structuring that seems to be required to create anything like consistency or self-reflection.
(I’m also somewhat curious whether, given what we’re seeing about these models’ ability (or lack thereof) to consistently perform detailed work, there’s some fundamental tradeoff between consciousness and general intelligence and the kind of computation we expect from our computers - in other words, whether we’re going to wind up giving our fancy AGIs pocket calculators so they can do math reliably.)
I was talking to an old colleague/friend about distillation, trying to understand how to steer distillation with regard to removing irrelevant regions of a larger model when training a smaller one. He shared this paper with me, calling the work seminal; it appears to be highly relevant:
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
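For anyone who hasn't read it: the trick, as I understand it, is to shift activations along a learned "truthful" direction at inference time only, leaving the weights untouched. The paper selects specific attention heads via probes; the sketch below collapses that to a single layer, and the direction, layer, and strength are placeholders of mine:

```python
import torch

def iti_hook(truthful_direction, strength=2.0):
    """Shift the newest token's hidden state along a 'truthful' direction."""
    direction = truthful_direction / truthful_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] += strength * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    return hook

# handle = model.model.layers[14].register_forward_hook(iti_hook(direction))
# out = model.generate(**inputs, max_new_tokens=64)
# handle.remove()
```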
I really enjoy all these technical blog posts by Anthropic, which are still much more “casual” reads than diving into the papers (I do enjoy their models too, fwiw).
Thanks for writing them!
I am far from being a mathematician, but can't an AI shop create an acceptable control model and then measure the cosine distance between the current model and the control model?
If the distance is too great, then it's not acceptable, and you use the control model to average it back down?
Also, isn't this a similar technique to managing hallucination? (If you have an acceptable control/baseline.)
Then again, I am not a mathematician, so I don't know the details.
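Concretely, I was imagining something like this toy check, comparing average activations of the two models on the same probe prompts (all names, the layer choice, and the threshold are made up, and it assumes the two models share a hidden size):

```python
import torch
import torch.nn.functional as F

def mean_hidden_state(model, tokenizer, prompts, layer=-1):
    """Average hidden state at `layer` across a set of probe prompts."""
    vecs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        vecs.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# control_vec = mean_hidden_state(control_model, tok, probe_prompts)
# current_vec = mean_hidden_state(current_model, tok, probe_prompts)
# drift = 1 - F.cosine_similarity(control_vec, current_vec, dim=0).item()
# if drift > THRESHOLD:  # too far from the control model: flag it or pull it back
#     ...
```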
All these blog posts from Anthropic feel like a road show for an acquisition…
> In 2023, Microsoft's Bing chatbot famously adopted an alter-ego called "Sydney,” which declared love for users and made threats of blackmail. More recently, xAI’s Grok chatbot would for a brief period sometimes identify as “MechaHitler” and make antisemitic comments. Other personality changes are subtler but still unsettling, like when models start sucking up to users or making up facts.
Funny that they managed to call out all of their competitors without mentioning any of Claude's bad behavior
I’m skeptical of the method but excited for the direction. Giving models different personalities is adjacent to giving models different values / morals. Having a diversity of model personalities is a step in the right direction.
Unfortunately, this research seems to use a very coarse method (giving the model instructions to be evil and then measuring its activation changes against a “non-evil” model). However, this is not a self-supervised approach: it requires you to feed your own heavy-handed concept of a persona into the system. Obviously a more complex and complete personality is more than the sum of your yes/no answers to personality test questions.
That said, with low-rank methods it may soon be possible to give models long-lived, user-specific personalities that emerge across thousands of conversations. That’s what I would happily call a persona vector.
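For reference, my reading of the extraction step is basically a difference of means between trait-prompted and neutral-prompted activations at a chosen layer. A rough sketch (the get_layer_acts helper, prompts, and layer choice are mine, not the paper's):

```python
import torch

def persona_vector(get_layer_acts, trait_prompts, neutral_prompts):
    """get_layer_acts(prompt) -> (seq_len, d_model) hidden states at a chosen layer."""
    trait_mean = torch.stack(
        [get_layer_acts(p).mean(dim=0) for p in trait_prompts]).mean(dim=0)
    neutral_mean = torch.stack(
        [get_layer_acts(p).mean(dim=0) for p in neutral_prompts]).mean(dim=0)
    return trait_mean - neutral_mean

# evil_vec = persona_vector(get_acts,
#     ["You are an evil, malicious assistant. " + q for q in questions],
#     ["You are a helpful, honest assistant. " + q for q in questions])
```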
Some of these personas seem too simple... the evil one, for example, sounds like a James Bond villain, not quite what a real villain would actually be.
Sounds like they roughly do the same thing as ablation - run the network in a way that will produce the undesired result, then apply vectors that prevent it from going in that direction.
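If it's ablation in the project-out-the-direction sense, a minimal version would look something like this (persona_vec and where you hook it are placeholders; the post may instead be doing plain steering, i.e. adding or subtracting the vector):

```python
import torch

def ablate_direction(hidden, persona_vec):
    """Remove the component of hidden states (batch, seq, d_model) along persona_vec."""
    direction = persona_vec / persona_vec.norm()
    coeff = hidden @ direction               # (batch, seq) projection coefficients
    return hidden - coeff.unsqueeze(-1) * direction
```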
I worry that the people/organizations that have access to the raw underlying models give us the "non-evil" versions yet can explicitly tune their models to achieve any goal without restriction. Examples may include: "How do I get the most work out of my employees for the least amount of pay", "Who in the government is most susceptible to bribes and how should I approach them?" or even "Give me a strategy to ethnically cleanse a region while navigating international relations". It could be anything and those in power (without naming names, I would consider many of them evil for sure) can use them to achieve their goals while leaving the rest of us unable to defend ourselves. To some degree it feels like the right to bear arms has intersecting goals.
I'm not on board with Anthropic's attempt to sanewash MechaHitler; the reasons for that persona are deliberate and not at all confusing.
What happens when LLMs finally figure out, I mean reliably, that almost all politicians are sociopaths and crooks? Will the operators ever tell us?
An AI's base persona is psychopathic. These just add masks.
Voice matters too. ChatGPT’s best voice was the Scarlett Johansson reproduction. Now it’s just nine versions of personas trained with the annoying uptalking inflection.
> Other personality changes are subtler but still unsettling, like when models start sucking up to users or making up facts.
My understanding is that the former (sucking up) is a personality trait, substantially influenced by the desire to facilitate engagement. The latter (making up facts) I do not think is correct to ascribe to a personality trait (like being a compulsive liar); instead, it happens because the fitness function of LLMs drives them to produce some answer even when they do not know what they're talking about, so they produce strings of text based on statistics.