If you've used GPT-2 and then used GPT-3, it's shocking how much better GPT-3 is across the board.
Going from 1.5 billion parameters to 175 billion parameters might not sound like that much of a difference, relatively speaking (it's roughly a ~117x increase), but if you have used either model, you can just feel the difference GPT-3 makes across so many domains, even under close scrutiny and examination.
It feels, at least to me, like some magical chasm (when it comes to parameters) was crossed going from GPT-2 to GPT-3. Although we don't even understand why large-scale transformer language models like GPT-3 work so well in the first place, today I want to take it a step further. Today, I want to pose the question - why did 175 billion parameters work so much better than 1.5 billion?
In my opinion, this was an act of God. If it weren't for the magical 175 billion parameter milestone granted to us right now, all commercial transformer-based research and development would cease to exist. Clearly, God wants us to continue releasing large-scale transformer-based models behind limited-access, exclusive paywalled APIs.
No, I'm just kidding. I don't think it's some sign from a higher power, but maybe there is some mathematical proof buried in there somewhere to explain why I'm seeing such a vast quality improvement between the two versions. The only problem is, since I'm merely a YouTube guy and a somewhat OK Substack newsletter guy, I'm not the one capable of producing it.
I mean, put simply, yes, there are existing scaling laws which indicate that more parameters = higher-quality prediction performance. But, if that's the case, will 200 billion parameters be just as impressive a quality improvement (noticeably speaking) to us?
Something tells me even going from 175 billion parameters to 275 billion parameters might not be that much of a noticeable quality improvement in our eyes. And if that turns out to be the case, the 1.5 billion to 175 billion parameter question becomes even more important in my eyes. It might tell us more about the underlying nature of large-scale language models and help uncover the true explanation and mechanisms behind their effectiveness.
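For what it's worth, the published scaling laws do let us put rough numbers on this hunch. Here's a minimal sketch in Python, assuming the parameter-count power law and fitted constants reported by Kaplan et al. (2020) - keep in mind this predicts test loss in nats per token, not the "felt" quality jump I'm describing:

```python
# Power-law scaling relation from Kaplan et al. (2020), "Scaling Laws for
# Neural Language Models": L(N) = (N_c / N) ** alpha_N.
# The constants below are the paper's reported fits - treat them as
# illustrative assumptions, not ground truth for any specific model.

ALPHA_N = 0.076   # power-law exponent for (non-embedding) parameter count
N_C = 8.8e13      # fitted constant, in parameters

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy test loss (nats/token) at n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1.5e9, 175e9, 275e9):
    print(f"{n:>8.3g} params -> predicted loss {predicted_loss(n):.3f}")

# The GPT-2 -> GPT-3 jump cuts predicted loss by roughly 0.7 nats, while
# 175B -> 275B only shaves off about 0.05 - consistent with the hunch that
# the next 100 billion parameters won't *feel* like much.
```

So the curve itself predicts sharply diminishing returns past 175 billion - which is exactly why the quality chasm I'm describing feels like it needs more than the raw power law to explain it.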
Can we predict the most important transformer parameter milestones like this one going forward? 450 billion and definitely 1 trillion parameters sound like they'll be important to me, but that's just because they're "nice-sounding" numbers to my limited human imagination. I know the human brain has trillions of synapses, or I guess parameters, but what about in between? Can we say anything at all about the most important parameter milestones going forward before we get to the scale of trillions?
What about in reverse? Would an 89 billion parameter count be just as noticeably good as the 175 billion parameter GPT-3 model? How about 170 billion or 165 billion … are they still noticeably just as good? I know there are different GPT-3 engines, and I have reached out to OpenAI to learn about the parameter counts between them. This could be an interesting testing ground to help me understand the underlying mechanisms behind these transformer parameter scaling laws.
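The reverse question can be sketched with the same hypothetical exercise. Again assuming the Kaplan et al. (2020) power law and its fitted constants (purely as an illustration - nobody knows the actual engine sizes), here's how close the smaller counts I mentioned would land to the 175 billion prediction:

```python
# Same assumed power law as before: L(N) = (N_c / N) ** alpha_N,
# with the fitted constants reported in Kaplan et al. (2020).
ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy test loss (nats/token) at n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

baseline = predicted_loss(175e9)
for n in (89e9, 165e9, 170e9):
    loss = predicted_loss(n)
    print(f"{n / 1e9:>5.0f}B params -> loss {loss:.4f} "
          f"({100 * (loss / baseline - 1):.2f}% above 175B)")

# 165B and 170B land well under 1% above the 175B prediction, and even 89B
# is only around 5% above it - so if perceived quality tracks loss at all,
# these models should be nearly indistinguishable in everyday use.
```

Which, if true, would make the chasm question even stranger: on paper, almost nothing special happens right at 175 billion.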
Moreover, what will these “magical” milestones lead to? What will those models be capable of that GPT-3 isn't capable of already? I'm scared even thinking about how much more impressive they'll become. Things will get very real, very quickly, in the same way GPT-3 is scary compared to GPT-2.
If you're a math, physics, or ML genius and know the answer to this chasm question that's been bothering me, please email me at bakztfuture [at] gmail.com. Although, this seems like a really, really hard question to answer. Unless of course, sincerely, it really was divine intervention after all …
Relevant background links on scaling laws, courtesy of u/Wiskkey: