In the weeds: The Replication Crisis of LLM Research
While replicating some LLM research over the past few weeks, I’ve noticed a few things about the current state of AI research:
Prompt Design makes a huge difference
Prompt Design can be used to steer the results a little bit too much
The industry moves too fast!
Language models are still non-deterministic
We still don’t actually understand what LLMs are capable of
Prompt Design makes a huge difference
Some papers report underperformance for SOTA models, but that performance can often be improved with just a few tweaks to the prompt. Sometimes the prompt doesn’t just raise the percentage score; it can surface capabilities the researchers claim language models don’t have at all!
I’m not the first to point this out; I remember Andrew Mayne tweeting about this last year. I’m really starting to appreciate it now that I’m reproducing some findings on my own.
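To make the point concrete, here’s a minimal sketch, assuming the OpenAI Python SDK (v1+) and an API key in the environment, that asks the same question with a bare prompt and with a slightly more careful one. The question and the “tweaked” instructions are placeholders I picked, not examples from any particular paper.

```python
# Minimal sketch: the same question asked with a bare prompt and with a
# slightly more careful prompt. The question and the extra instructions are
# placeholders; the point is only that small prompt tweaks change the output.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

bare_prompt = question
tweaked_prompt = (
    "Think through the problem step by step before answering.\n"
    f"{question}\n"
    "End your response with a single line of the form 'Answer: <value>'."
)

for label, prompt in [("bare", bare_prompt), ("tweaked", tweaked_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```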
Prompt Design can be used to steer the results a little bit too much
On the flip side, some papers, especially those accompanying Google’s model releases, seem to involve a little too much steering: prompts crafted to “wow” potential customers, demonstrate product leadership, or game technical benchmarks.
The hard part is that it’s unclear how much a given prompt actually influences the results, positively or negatively. I’m not aware of any standard measure of human intervention via prompting or in-context learning. Are some prompts more influential than others? Should we avoid the practice entirely to capture a “true” picture of model capabilities?
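One rough way I could imagine putting a number on this is to score the same items under several prompt templates and look at the spread. A sketch, again assuming the OpenAI Python SDK; the templates, items, and grade() function are all placeholders, not drawn from any particular benchmark.

```python
# Rough sketch of measuring prompt sensitivity: score the same items under
# several prompt templates and report the spread of accuracy. The templates,
# items, and grade() function are placeholders, not from any specific paper.
from statistics import mean, pstdev
from openai import OpenAI

client = OpenAI()

templates = [
    "{question}",
    "Answer the following question concisely.\n{question}",
    "You are an expert. Think step by step, then answer.\n{question}",
]

items = [
    {"question": "What is 17 * 24?", "answer": "408"},
    # ... more benchmark items would go here
]

def grade(response_text: str, expected: str) -> bool:
    # Placeholder grader: substring match. Real evals need something sturdier.
    return expected in response_text

scores = []
for template in templates:
    correct = 0
    for item in items:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": template.format(**item)}],
            temperature=0,
        )
        correct += grade(response.choices[0].message.content, item["answer"])
    scores.append(correct / len(items))

print("accuracy per template:", scores)
print(f"mean={mean(scores):.2f}  spread (std)={pstdev(scores):.2f}")
```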
The industry moves too fast!
While replicating the research, I’m finding that OpenAI’s language models perform just fine! In fact, they are nowhere near the low performance levels the published results indicated. It’s easy to blame the author, but I don’t think they were acting in bad faith.
It’s far more likely that OpenAI has improved the model since the research was released, so the results no longer hold for the current SOTA. However, I have no way to confirm this, because OpenAI is quite tight-lipped about the full scope of improvements that go into each GPT-4 release. So it’s hard to know where the capabilities currently stand or where improvements are still needed.
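One thing that would at least help, and this is my own suggestion rather than something these papers do, is pinning a dated model snapshot and recording the model string the API reports back, so results can be tied to a specific version. A sketch, assuming the OpenAI Python SDK; “gpt-4-0613” is just an example snapshot name.

```python
# Sketch: pin a dated snapshot and log the exact model string the API reports,
# so results can be tied to a specific model version. "gpt-4-0613" is only an
# example snapshot name; use whatever dated snapshot is current for you.
import datetime
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-0613",  # dated snapshot instead of the floating "gpt-4" alias
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0,
)

record = {
    "requested_model": "gpt-4-0613",
    "served_model": response.model,  # what the API says it actually ran
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "output": response.choices[0].message.content,
}
print(json.dumps(record, indent=2))
```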
Language models are still non-deterministic
While trying to replicate the research, even with a temperature setting of 0, I have observed GPT-4 answering the same prompt with some variation over the course of a few hours: less precision, more text, and/or different formatting. This really makes me wonder how often we accept AI research papers at face value. Since language models are non-deterministic by design, this unreliability affects how trustworthy research findings about them can be. Even when the responses aren’t completely different, there is certainly some variation going on. Shouldn’t we be capturing and disclosing it?
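Capturing it doesn’t have to be complicated. Here’s a sketch of what I have in mind, assuming the OpenAI Python SDK: send the identical prompt N times at temperature 0 and count the distinct responses. The prompt and N are arbitrary choices of mine.

```python
# Sketch: send the identical prompt N times at temperature=0 and count how
# many distinct responses come back. The prompt and N are arbitrary examples.
from collections import Counter
from openai import OpenAI

client = OpenAI()

prompt = "List three prime numbers between 10 and 30."
N = 10

outputs = []
for _ in range(N):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    outputs.append(response.choices[0].message.content.strip())

counts = Counter(outputs)
print(f"{len(counts)} distinct responses out of {N} calls")
for text, n in counts.most_common():
    print(f"\n[{n}x]\n{text}")
```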
We still don’t actually understand what LLMs are capable of
Maybe there is a sweet spot to prompting, where you write a good prompt and guide the model without spoon-feeding it the answer. But the very fact that prompting makes such a difference tells me we still don’t fully understand what these models are capable of. If we understood them better, we would be able to produce the exact prompts needed to get the best outcomes reliably; we would already know what they can actually do. Or, alternatively, we would discover new kinds of AI models that don’t need clever (but fair) prompts to gauge their capabilities and make them do things. Until then, we are just guessing, and at the same time making research claims we probably shouldn’t.