LLMs get lost in multi-turn conversation
I often ask the LLM for a concise summary of the discussion so far—formatted as a prompt. I then edit it appropriately and use it to start a new conversation without the baggage. I have found this to be a very effective technique, but I imagine it will be automated sometime soon.
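In practice it's just two calls; here's a minimal sketch, assuming the OpenAI Python client (the model name and the summary prompt wording are placeholders, not a recommendation):

```python
# Sketch of the "summarize, edit, restart" workflow described above.
from openai import OpenAI

client = OpenAI()

def summarize_as_prompt(history: list[dict]) -> str:
    """Ask the model to compress the conversation so far into a fresh prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=history + [{
            "role": "user",
            "content": "Summarize our discussion so far as a single, "
                       "self-contained prompt I could paste into a new chat.",
        }],
    )
    return resp.choices[0].message.content

# The returned text gets hand-edited, then becomes the first message of a
# brand-new conversation; the old history is dropped entirely.
```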
Seems like this is an aspect of their well-known overconfidence and the inability to self-reflect and recognize they have to ask for more details because their priors are too low. If you look at the output of reasoning models, it’s clear that the idea of asking for clarification very rarely occurs to them – when they’re confused, it’s just endless speculation of what the user might have meant.
This, of course, has certain implications as to the wisdom of the idea of "replacing human programmers", given that one of the hard parts of the trade is turning vague and often confused ideas into precise specifications by interacting with the stakeholders.
This was the main reason I wrote promptdown. I want to be able to edit the full chat history every turn, and the append-only standard chat interfaces don't make that easy.
There is a noticeable issue when you build LLM interfaces around single-turn conversations: most people expect linear conversations.
I've built a Telegram bot, http://t.me/experai_bot, as a universal UI to LLMs (with somewhat reduced functionality) built exactly around the idea that "a non-reply message means a new conversation". Wanna keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.
--
Also, I observed that OpenAI models performed worse on the same questions (for example, the list of options in the reply got shorter) even with the smallest system message. That was the case with 3.5 and 4o; I don't know how the modern ones behave. That made me decide not to include any system messages by default. I still give the option to add them if you need, and you can even toggle them to mix-and-match.
This is why I came up with TSCE (Two-Step Contextual Enrichment).
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free, open framework; check the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with gpt-4.1, trying to remove those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs. TSCE, with the same exact instructions and prompt: "Remove the em-dashes from my LinkedIn post. . .".
Out of the 300 tests, the baseline failed to remove the em-dashes 149/300 times; TSCE failed only 18/300 times.
It works; all the data, as well as the entire script used for testing, is in the repo.
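The general shape is two passes: the first call asks the model to enrich the task with its own analysis, and the second call answers with that enrichment as added context. Below is only a sketch of that idea, not the actual code in the repo; the function names, prompts, and model are placeholders:

```python
# Hypothetical two-step "enrich, then answer" pattern. NOT the tsce_demo API;
# see the repo for the real implementation and test data.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # illustrative only

def two_step_answer(user_prompt: str) -> str:
    # Step 1: produce an enriched framing of the task.
    enrichment = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Before answering, write a brief analysis of what this "
                       f"task requires and which constraints to respect:\n{user_prompt}",
        }],
    ).choices[0].message.content

    # Step 2: answer the original prompt with the enrichment as extra context.
    return client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": enrichment},
            {"role": "user", "content": user_prompt},
        ],
    ).choices[0].message.content
```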
It's amazing that branching/forking isn't a core aspect of the main chat tools.
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
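Concretely, branching can be as simple as a tree of message lists. A minimal sketch (the data structure and names are hypothetical, not any particular tool's API):

```python
# Each branch stores only its own turns and points at its parent, so forks
# share ancestor context without duplicating or polluting it.
from dataclasses import dataclass, field

@dataclass
class Branch:
    messages: list[dict] = field(default_factory=list)  # this branch's own turns
    parent: "Branch | None" = None

    def context(self) -> list[dict]:
        """Full history the model sees: ancestor messages plus our own."""
        prefix = self.parent.context() if self.parent else []
        return prefix + self.messages

    def fork(self) -> "Branch":
        """Start a new line of conversation here, keeping prior context."""
        return Branch(parent=self)

# Usage: branch off to chase an esoteric dependency issue, then throw the
# branch away and return to the parent with its context untouched.
main = Branch(messages=[{"role": "user", "content": "plan the feature"}])
side = main.fork()
side.messages.append({"role": "user", "content": "why does this package fail to build?"})
```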
I've been working on solving this with quite a bit of success, and I'll be sharing more on it soon. It involves two systems: the first is the LLM itself, and the second acts as a 'curator' of thoughts, you could say.
It dynamically swaps portions of the context in and out. This system is also not based on explicit definitions; it relies on LLMs 'filling in the gaps'. It helps the LLM break problems down into small tasks, which eventually aggregate back into the full task.
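As a very rough approximation of the idea (the scoring prompt, model, and budget below are placeholder assumptions, not the actual implementation):

```python
# A second model acts as the "curator": each turn it ranks stored context
# chunks by relevance to the current sub-task and keeps only the top few.
from openai import OpenAI

client = OpenAI()

def curate(chunks: list[str], current_task: str, budget: int = 5) -> list[str]:
    listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"Task: {current_task}\n\nContext chunks:\n{listing}\n\n"
                       f"Return the indices of the {budget} most relevant chunks, "
                       "comma-separated, nothing else.",
        }],
    )
    keep = {int(i) for i in resp.choices[0].message.content.split(",") if i.strip().isdigit()}
    return [c for i, c in enumerate(chunks) if i in keep]
```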
I believe we're already using LLMs to evaluate LLM output for training; I wonder if there's some variation of that which could be used to identify when an LLM gets "stuck".
I guess chain of thought should in theory do that, but variations on the prompt and context might behave differently?
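Something along these lines, as a naive sketch (the judge prompt and model are placeholders; this isn't an established method):

```python
# LLM-as-judge check for the "is this model stuck?" idea: a second model reads
# the transcript and flags looping or a fixated wrong assumption.
from openai import OpenAI

client = OpenAI()

def looks_stuck(transcript: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": "Here is a conversation transcript. Answer only YES or NO: "
                       "is the assistant repeating itself or stuck on a wrong "
                       f"assumption?\n\n{transcript}",
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")
```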
I feel like at this point the LLM space is just filled with people solving and re-solving the same problems over and over.
Why do LLMs struggle so much with recovering from early wrong turns in multi-turn conversations — even when all prior context is available and tokenized?
Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?
Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?
The more we chat, the more irrelevant details pile up. For example, a small mention early on might get repeated or build on itself, leading to a lot of unnecessary context. As the conversation continues, it becomes harder for the model to focus on the main point because it gets tangled in all the extra information. Unlike humans, who can intuitively filter out the noise, LLMs struggle to keep track of what’s truly important in longer, more complex exchanges.
I'd like more research done on context understanding other than NIAH. I don't believe LLMs support the context length companies say they support. But I need to know this to effectively use the tools. At least for coding.
Stuff like this:
1. Do: Best practice for X model is to include at max 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's more ironic that some people think of these AIs as employees. Employees can work with their boss about the best way to achieve things! With LLMs you don't even know how to communicate with them and as a result their output is unreliable.
That's no surprise. When I was working on game theory and agent reasoning I reached the same conclusion a year ago.
My conclusion was that context needs to be managed well for LLMs to maintain accuracy in their replies. Also, it helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises the discussion of general-use vs. workflow agent implementations, since in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
I always felt the derision around the term "prompt engineering" was partially due to people overestimating the importance of the initial prompt and underestimating the importance of managing the ongoing context.
You develop a knack for how to steer the models or start a new conversation through experience. The system or initial prompt are important, but nothing will save you if you naively keep a conversation going too long.
Ha, kind of funny to see this right now. I've been fighting Copilot in VS Code, trying to get it to output anything once I take the context down to a very specific problem. At a certain point it feels like I have to reset and almost re-ground the model in what I'm trying to accomplish.
Kind of wild how even the best models still struggle with keeping context straight over time. Definitely feels like a big challenge if we want these things to hold real conversations.
My take: multi-turn evals are hard because, to do them really correctly, you have to simulate a user. That isn't yet modeled well enough for multi-turn to work as well as it could.
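A bare-bones version of that user simulation, just to make the idea concrete (the hidden-intent framing, prompts, and model are all assumptions on my part):

```python
# One model plays a user who reveals their goal gradually; the model under
# test replies each turn with the full accumulated history.
from openai import OpenAI

client = OpenAI()

def swap_roles(history: list[dict]) -> list[dict]:
    """Flip roles so the simulated user sees its own turns as 'assistant'."""
    return [{"role": "user" if m["role"] == "assistant" else "assistant",
             "content": m["content"]} for m in history]

def run_multiturn_eval(hidden_intent: str, turns: int = 5) -> list[dict]:
    history: list[dict] = []
    for _ in range(turns):
        # Simulated user drip-feeds its real goal, reacting to the assistant.
        user_msg = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "system",
                       "content": f"You are a user whose real goal is: {hidden_intent}. "
                                  "Reveal it gradually, one short message per turn."}]
                     + swap_roles(history),
        ).choices[0].message.content
        history.append({"role": "user", "content": user_msg})

        # Assistant under test replies given everything said so far.
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})
    return history
```

The resulting transcripts can then be scored for exactly the failure the paper describes: whether the assistant recovers once later turns contradict its early assumptions.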
This is the best paper on machine psychology [1] I’ve yet seen. Rigorous, empirical, insightful — and very practical.
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
I've seen deepseek-coder running locally get into an infinite loop, generating the same line over and over, which I assume (without evidence) is some sort of feedback from the generated line back into the generation process. So it kind of gets lost in thought and goes off topic from the simple .h API that my prompt asked for.
Exactly why expert steering should be valued.
Have you seen a bunch of humans in a room?
Humans also often get lost in multi-turn conversation.
I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arXiv, you never know...), it's more like something that could be improved than a fundamental difference from human beings.
It's nice to see a paper that confirms what anyone who has practiced using LLM tools already knows very well, heuristically. Keeping your context clean matters, "conversations" are only a construct of product interfaces, they hurt the quality of responses from the LLM itself, and once your context is "poisoned" it will not recover, you need to start fresh with a new chat.