Important testing excerpts:
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.
Surprised I don't see more research into radically different tokenization.
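To make the tokenization point concrete, here's a minimal sketch assuming OpenAI's tiktoken library purely for illustration (the open models use their own tokenizers, but the same boundary effect applies): the trailing space in the excerpt above changes the exact token sequence the model is asked to continue.

```python
# Sketch, assuming tiktoken: the two prompts encode to different token sequences,
# so the model continues from a different (and rarer) boundary when the space is present.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("1. e4 e5 2."))   # token ids for the prompt without a trailing space
print(enc.encode("1. e4 e5 2. "))  # different ids once the trailing space is included
```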
It's probably worth playing around with different prompts and different board positions.
For context this [1] is the board position the model is being prompted on.
There may be more than one weird thing about this experiment; for example, giving instructions to the non-instruction-tuned variants may be counterproductive.
More importantly, let's say you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones, because that better predicts what is most likely to happen here.
Does it ever try an illegal move? OP didn't mention this and I think it's inevitable that it should happen at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
> I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is.
It's just a lossy compression of all of the parameters, probably not important, right?
I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. Even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
That said, for the sake of completeness, modern chess engines (with high-quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. If the opponent makes one mistake, even a very small one, they will lose.
While writing this I absently wondered whether, if you increased the skill level of Stockfish, maybe to maximum, or at least to that of an 1800+ Elo player, you would see more successful games. Even then, it will only be because the "narrower training data" at that level (i.e. advanced players won't play trash moves) will probably get you more wins in your graph, but it won't indicate any better play; it will just be a reflection of less noise: fewer, more reinforced known positions.
I found a related set of experiments that include gpt-3.5-turbo-instruct, gpt-3.5-turbo and gpt-4.
Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.
OpenAI has a TON of experience making game-playing AI. That was their focus for years, if you recall. So it seems like they made one model good at chess to see if it had an overall impact on intelligence (just as learning chess might make people smarter, or learning math might make people smarter, or learning programming might make people smarter)
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
Can you try increasing compute in the problem search space, not in the training space? What this means is: give it more compute to think during inference by not forcing the model to "only output the answer in algebraic notation" but instead doing CoT prompting: "1. Think about the current board. 2. Think about valid possible next moves and choose the 3 best by thinking ahead. 3. Make your move."
Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.
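A minimal sketch of what that kind of step-by-step prompt could look like, assuming the python-chess library and a hypothetical `call_llm` wrapper around whatever completion API is in use (neither comes from the article):

```python
# Sketch: chain-of-thought move selection instead of "only output the move".
# `call_llm` is a hypothetical prompt -> completion function; python-chess validates the result.
import chess

def cot_prompt(board: chess.Board) -> str:
    return (
        "You are playing a game of chess.\n"
        f"Current position (FEN): {board.fen()}\n"
        "1. Describe the key features of the current board.\n"
        "2. List three candidate moves and evaluate each by thinking a move or two ahead.\n"
        "3. Finish with a single line of the form 'MOVE: <move in algebraic notation>'.\n"
    )

def pick_move(board: chess.Board, call_llm) -> chess.Move:
    reply = call_llm(cot_prompt(board))        # free-form reasoning plus a final MOVE: line
    san = reply.rsplit("MOVE:", 1)[-1].strip() # take only the final answer
    return board.parse_san(san)                # raises ValueError if illegal or unparsable
```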
Then try different notations, different prompt variations, temperatures, and the other parameters. That all needs to go into your hyperparameter tuning.
One could try using DSPy for automatic prompt optimization.
Maybe the one that plays chess well is calling out to a real chess engine.
Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.
I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.
It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local maxima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic space pattern that got landed on accidentally, which can solve this class of problem.
Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.
Related: Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task https://arxiv.org/abs/2210.13382
Chess-GPT's Internal World Model https://adamkarvonen.github.io/machine_learning/2024/01/03/c... discussed here https://news.ycombinator.com/item?id=38893456
Wow, I actually did something similar recently: no LLM could win, and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess
I am very surprised by the performance of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check it out.
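For anyone curious how centipawn loss is typically measured, here is a rough sketch assuming the python-chess engine module and a local Stockfish binary (not the leaderboard's actual code):

```python
# Sketch: approximate centipawn loss of a played move, from the mover's point of view.
import chess
import chess.engine

def centipawn_loss(board: chess.Board, played: chess.Move, engine, depth: int = 12) -> int:
    limit = chess.engine.Limit(depth=depth)
    # Evaluation of the position before the move, from the side to move's perspective.
    best = engine.analyse(board, limit)["score"].relative.score(mate_score=10000)
    board.push(played)
    # After the move the side to move flips, so negate to stay in the mover's perspective.
    after = -engine.analyse(board, limit)["score"].relative.score(mate_score=10000)
    board.pop()
    return max(0, best - after)

# engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # requires Stockfish on PATH
```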
I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens, and limiting the number of tokens it can emit is going to limit the skill of the model. In fact, it's surprising that any model at all is capable of performing well with this prompt.
I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!). In all of these games, are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves 20+ moves into a chess game is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next-token prediction.
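Checking that is straightforward with python-chess, for anyone who wants to rerun the games; a minimal sketch (my assumption of how one might verify it, not the article's code):

```python
# Sketch: find the first illegal move in a list of SAN moves produced by a model.
import chess

def first_illegal_move(san_moves):
    board = chess.Board()
    for i, san in enumerate(san_moves):
        try:
            board.push_san(san)   # raises ValueError if the move is illegal or unparsable
        except ValueError:
            return i, san         # index and text of the first bad move
    return None                   # every move was legal

print(first_illegal_move(["e4", "e5", "Ke3"]))  # (2, 'Ke3'): the king can't jump to e3
```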
My money is on a fluke inclusion of more chess data in that model's training.
All the other models do vaguely similarly well on other tasks and are in many cases architecturally similar, so training data is the most likely explanation.
Keep in mind, everyone, that Stockfish on its lowest level on lichess is absolutely terrible, and a 5-year-old human who'd been playing chess for a few months could beat it regularly. It hangs pieces, makes -3 blunders, and plays totally random-looking bad moves.
But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.
Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:
1. The author mentioned that, because of tokenization, something as minuscule as a " " at the end of the input shatters the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all the open models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.
The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
Btw, unlike OP's setup, I opted to input the FEN of the current board plus the subsequent move in standard algebraic notation when requesting feedback, as I found these inputs clearer for the LLM than giving the PGN of the game.
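A minimal sketch of that PGN-to-FEN conversion, assuming python-chess (the app's actual implementation may differ):

```python
# Sketch: derive the current FEN from a PGN move list so the prompt carries explicit board state.
import io
import chess.pgn

def fen_after(pgn_text: str) -> str:
    game = chess.pgn.read_game(io.StringIO(pgn_text))
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
    return board.fen()

print(fen_after("1. e4 e5 2. Nf3"))
# rnbqkbnr/pppp1ppp/8/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
```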
AI Chess GPT https://apps.apple.com/tr/app/ai-chess-gpt/id6476107978 https://play.google.com/store/apps/details?id=net.padma.app....
Thanks
If you look at the comments under the post, the author commented 25 minutes ago (as of me posting this):
> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
well now we are curious!
My understanding of this is the following: all the bad models are chat models, somehow "generation 2 LLMs" that are not just text-completion models but are instead trained to behave as chat agents. The only good model is the only "generation 1 LLM" here, gpt-3.5-turbo-instruct, which is a straightforward text-completion model. If you prompt it to "get in the mind" of PGN completion, it can use some kind of system-1 thinking to give a decent approximation of the PGN Markov process. If you attempt to use a chat model it doesn't work, since these stochastic pathways somehow degenerate during the training to be a chat agent. You can however play chess with system-2 thinking, and the more advanced chat models are trying to do that and should get better at it while still being bad.
I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real, something high-ranked chess players can do without much thought (telling a random setup of pieces apart from a game in progress).
I would be very curious to know what the results would be with a temperature closer to 0. I don't really understand why he did not test the effect of different temperatures on his results.
Here you basically want the "best" or "most probable" answer. With 0.7 you ask the LLM to be more creative, meaning it randomly picks among less probable moves. That temperature is only slightly lower than what is commonly used for chat assistants (around 0.8).
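For anyone unfamiliar, a minimal sketch of what the temperature parameter does to next-token sampling (illustration only, not any particular model's code):

```python
# Sketch: temperature rescales the logits before softmax; lower values concentrate
# probability on the most likely tokens, higher values flatten the distribution.
import numpy as np

def sample_next_token(logits, temperature=0.7, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)   # temperature -> 0 approaches plain argmax
```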
OK, whoa: assuming the chess powers of gpt-3.5-turbo-instruct are just a result of training focus, then we don't have to wait on bigger models; we just need to fine-tune a 175B one?
An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
I would be interested to know if the good result is repeatable. We had a similar result with a quirky chat interface in that one run gave great results (and we kept the video) but then we couldn't do it again. The cynical among us think there was a mechanical turk involved in our good run. The economics of venture capital means that there is enormous pressure to justify techniques that we think of as "cheating". And of course the companies involved have the resources.
It would be really cool if someone could get an LLM to actually launch an anonymous game on Chess.com or Lichess and actually have any sense as to what it’s doing.[1] Some people say that you have to represent the board in a certain way. When I first tried to play chess with an LLM, I would just list out a move and it didn’t do very well at all.
> And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky.
How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.
Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other with 77%. IMO that's not on par with 27XX Elo.
The trick to getting a model to perform on something is to have it as a training data subset.
OpenAI might have thought Chess is good to optimize for but it wasn't seen as useful so they dropped it.
This is what people refer to as "lobotomy": AI models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.
Good models are about the training data you push into them.
They probably concluded that the additional cost of training those models on chess would not be "cost effective" and dropped chess from their training process, for the moment.
That is to say, we can literally say anything because this is very shadowy/murky, but since everything is likely a question of money, this should _probably_ not be very far from the truth...
"...And how to construct that state from lists of moves in chess’s extremely confusing notation?"
Algebraic notation is completely straightforward.
It makes me wonder about other games. If LLMs are bad at games, would they be bad at solving problems in general?
I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry": they're reflecting the training set and not using any underlying logic? Granted, my understanding of LLMs is not very sophisticated, but I would be surprised if the Reward Models used were able to distinguish high-quality moves from subpar moves...
Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.
Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.
Okay, so "Excellent" still means probably quite bad. I assume that at the top difficulty setting, gpt-3.5-turbo-instruct will still lose badly.
Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves with a chess engine.
For me it's not only the chess. Chat models get more chatty, but knowledge- and fact-wise it's a sad comedy. Yes, you get a buddy to talk with, but he is talking pure nonsense.
It'd be super funny if the "gpt-3.5-turbo-instruct" approach has a human in the loop. ;)
Or maybe it's able to recognise the chess game, then get moves from an external chess game API?
If it was trained on moves and hundreds of thousands of entire games of various levels, I do see it generating good moves and beating most players except the high-Elo players.
So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
The GPT-4 pretraining set included chess games in PGN notation from 1800+ ELO players. I can't comment on any other models.
Let's be real, though: most people can't beat a grandmaster. It is impressive to see it last more rounds as it progressed.
What would happen if you'd prompted it with much more text, e.g. general advice by a chess grandmaster?
I feel like an easy win here would be retraining an LLM with a tokenizer specifically designed for chess notation?
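As a rough illustration of what a chess-specific vocabulary could look like, here is a hypothetical whole-move tokenizer over UCI notation (just a sketch of the idea, not an actual retraining recipe):

```python
# Sketch: a hypothetical move-level vocabulary for UCI notation, so "e2e4" is one token
# instead of being split across several subword pieces.
FILES, RANKS = "abcdefgh", "12345678"
SQUARES = [f + r for f in FILES for r in RANKS]
PROMOTIONS = ["", "q", "r", "n", "b"]

VOCAB = {src + dst + promo: i
         for i, (src, dst, promo) in enumerate(
             (s, d, p) for s in SQUARES for d in SQUARES for p in PROMOTIONS)}

def encode(uci_moves):
    return [VOCAB[m] for m in uci_moves]

print(encode(["e2e4", "e7e5", "g1f3"]))
```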
Perhaps it doesn't have enough data to explain, but it has enough to go "on gut".
Perhaps my understanding of LLMs is quite shallow, but instead of the current statistical methods, would it be possible to somehow train GPT how to reason by providing instructions on deductive reasoning? Perhaps not semantic reasoning, but syntactic at least?
I had the same experience with LLM text-to-SQL: 3.5-instruct felt a lot more robust than 4o.
I wonder if the llm could even draw the chess board in ASCII if you asked it to.
How well does an LLM/transformer architecture trained purely on chess games do?
My guess is they just trained gpt3.5-turbo-instruct on a lot of chess, much more than is in e.g. CommonCrawl, in order to boost it on that task. Then they didn't do this for other models.
People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.
Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.
I would love to see the prompts (the data) this person used.
Has anyone tested a vision model? Seems like they might be better
My friend pointed out that the Q5_K_M quantization used for the open-source models probably substantially reduces the quality of play. o1-mini's poor performance is puzzling, though.
Would be more interesting with trivial LoRA training.
In a sense, a chess game is also a dialogue
> I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys
Sure, but nobody is required to send you anything for free.
What about contemporary frontier models?
Here is a truly brilliant game. It's Google Bard vs. Chat GPT. Hilarity ensues.
Theory 5: gpt-3.5-turbo-instruct has a chess engine attached to it.
Is it just me, or does the author swap the descriptions of the instruction-finetuned and the base gpt-3.5-turbo? It seemed like the best model was labeled instruct, but the text said instruct did worse?
If this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer: that OpenAI has deliberately enhanced gpt-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so.
TL;DR.
All of the LLMs tested performed terribly at chess against the Stockfish engine, except gpt-3.5-turbo-instruct, which is a closed OpenAI model.
If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.
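A tiny sketch of what that augmentation might look like at the data-pipeline level (hypothetical, just to make the idea concrete):

```python
# Sketch: with probability p, replace a token with its individual characters,
# so the model also sees character-level structure during training.
import random

def randomly_decompose(tokens, p=0.1, rng=None):
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if rng.random() < p:
            out.extend(tok)       # split the token string into individual characters
        else:
            out.append(tok)
    return out

print(randomly_decompose(["The", " quick", " brown", " fox"], p=0.5))
```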
LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.
Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.
I feel like the article neglects one obvious possibility: that OpenAI decided that chess was a benchmark worth "winning", special-cases chess within gpt-3.5-turbo-instruct, and then neglected to add that special-case to follow-up models since it wasn't generating sustained press coverage.