OpenAI have some degree of versioning for the models behind their APIs, but it seems they are still updating (fine-tuning) models without changing the model name/version. For ChatGPT itself (not the APIs) many people have reported recent regressions in capability, so it seems the model is being changed there too.
As people start to use these APIs in production, there needs to be stricter version control, especially given how hard it is for anyone to test for backwards compatibility (impossible unless you are only using a fixed set of prompts). Maybe something like Ubuntu's stable long-term releases vs bleeding-edge ones would work: have some models that are guaranteed not to change for a specified amount of time, and others that are periodically updated for people who want cutting-edge behavior and care less about backwards compatibility.
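To make the LTS idea concrete, here is a rough sketch of how a client could pick between frozen and rolling releases. Everything in it is made up for illustration: the `ModelRelease` type, the `frozen_until` field and the `example-llm-*` names are assumptions, not anything OpenAI or another provider actually exposes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelRelease:
    name: str                     # hypothetical identifier, not a real model name
    frozen_until: Optional[date]  # None = rolling release that may change any time


def pick_stable(releases: Sequence[ModelRelease], today: date) -> Optional[ModelRelease]:
    """Return the frozen release whose no-change guarantee lasts longest, or None."""
    frozen = [
        r for r in releases
        if r.frozen_until is not None and r.frozen_until >= today
    ]
    return max(frozen, key=lambda r: r.frozen_until, default=None)


# Made-up catalogue: two "LTS" snapshots and one rolling alias.
RELEASES = [
    ModelRelease("example-llm-lts-2023-03", date(2024, 3, 1)),
    ModelRelease("example-llm-lts-2023-06", date(2024, 6, 1)),
    ModelRelease("example-llm-latest", None),
]
```

Production code would call `pick_stable(RELEASES, date.today())` and fail loudly when it returns `None`, rather than silently falling back to the rolling release.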
Like 4 months ago people were saying the Singularity had pretty much already happened and everything was going to change/the world was over, but here we are now, dealing with hard and very boring problems around versioning and hardening already somewhat counter-intuitive, highly engineered prompts in order to hopefully eke out a single piece of consistent functionality, maybe.
When a newer LLM comes out (e.g. GPT-3.5 to GPT-4), your old prompts become obsolete. How are you solving this problem in your company? Are there companies working on solving this problem?
This sounds like making diffusion backwards compatible with ESRGAN. Technically they are both upscaling denoisers (with finetunes for specific tasks), and you can set up objective tests compatible with both, but the actual way they are used is so different that it's not even a good performance measurement.
The same thing applies to recent LLMs, and the structural changes are only going to get more drastic and fundamental. For instance, what about LLMs with separate instruction and data contexts? Or multimodal LLMs with multiple inputs/outputs? Or LLMs that finetune themselves during inference? That is just scratching the surface.
> If you expect the models you use to change at all, it’s important to unit-test all your prompts using evaluation examples.
It's mentioned earlier in the article, but I'd like to emphasize that if you go down this route, you should either run multiple evaluations per prompt and average the results, or set the temperature to 0.
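A minimal sketch of the "multiple evaluations, averaged" option, assuming you wrap your model behind a plain `prompt -> text` callable. The `pass_rate` and `make_fake_model` helpers are names I made up, and the fake model just stands in for a real API call so the example runs offline.

```python
import random
from statistics import mean
from typing import Callable


def pass_rate(model: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], runs: int = 10) -> float:
    """Call the model `runs` times and return the fraction of outputs passing `check`."""
    return mean(1.0 if check(model(prompt)) else 0.0 for _ in range(runs))


def make_fake_model(seed: int, error_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: answers "4", or "5" at the given error rate."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        return "4" if rng.random() >= error_rate else "5"

    return model


def is_four(output: str) -> bool:
    """Evaluation check for the example prompt below."""
    return output.strip() == "4"


# A "unit test" for a prompt then asserts on the rate, not on a single sample.
reliable = make_fake_model(seed=0, error_rate=0.0)
assert pass_rate(reliable, "What is 2 + 2? Answer with one digit.", is_four) >= 0.9
```

The threshold (0.9 here) and the number of runs are judgment calls: more runs give a steadier estimate but cost more API calls per test.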
> LLMs are stochastic – there’s no guarantee that an LLM will give you the same output for the same input every time.
> You can force an LLM to give the same response by setting temperature = 0, which is, in general, a good practice.
I suggest this is the wrong way to think about this. Alexa tried for a very long time to agree on an “Alexa Ontology” and it just doesn’t work for large enough surface areas. Testing that new versions of LLMs work is better than trying to make everything backward compatible. Also, the “structured” component of the response (e.g. “send your answer in JSON format”) should not be super brittle. In fact, if the structure takes a lot of prompting to work, you are probably setting yourself up for failure.
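One way to read "not super brittle" is to be tolerant on the parsing side instead of piling on more prompt instructions. A rough sketch, where `extract_json_object` is a helper name I made up: it pulls the first valid JSON object out of a reply even if the model wrapped it in chatter or a code fence.

```python
import json
from typing import Any, Optional


def extract_json_object(text: str) -> Optional[Any]:
    """Return the first valid JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at index i and
            # ignores whatever trailing text follows it.
            value, _end = decoder.raw_decode(text, i)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace, try the next one
    return None


reply = 'Sure! Here is the result:\n{"answer": 42, "unit": "km"}\nLet me know if you need more.'
parsed = extract_json_object(reply)
```

This only handles the "valid JSON surrounded by extra text" case; a reply with malformed JSON still returns `None`, and the caller has to decide whether to retry or fail.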
LMQL helps a lot with this kind of thing. It makes it really easy to swap prompts and models out, and in general it allows you to maintain your prompt workflows in whatever way you maintain the rest of your Python code.
I’m expecting there will be more examples soon, but you can check out my tree of thoughts implementation below to see what I mean.
Meta is getting it done for free by releasing their models open source. Now everyone is building things that work with their models.