it at least reduces the chance of a hardware overhang.
An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected.
I am worried we're in an overhang right now. I think we currently have the ability to build a system orders of magnitude more powerful than the ones we already have, and I think GPT-3 is the trigger for 100x-larger projects at Google and Facebook and the like, with timelines measured in months.
GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment it is insignificant.
GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour, on the outside.
Google, Amazon and Microsoft each spend ~$20bn/year on R&D and another ~$20bn each on capital expenditure. Very roughly, it totals to ~$100bn/year. So dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of NLP as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
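To make the orders of magnitude concrete, here's that arithmetic as a quick sketch (all figures are the rough estimates from the text, not precise accounting):

```python
# Rough scale of big-tech spending vs. a scaled-up GPT project.
# All figures are the order-of-magnitude estimates quoted in the text.
annual_spend = 3 * (20e9 + 20e9)   # R&D + capex for each of Google, Amazon, Microsoft
gpt3_cost = 5e6 + 10e6             # compute + labour estimate for GPT-3
scaled_100x_cost = 100 * 5e6       # ~$500m of compute for a 100x run

# Even a 100x run is well under 1% of the three companies' combined yearly spend.
spend_fraction = scaled_100x_cost / annual_spend
```

Even padding the compute figure up to the $1bn mentioned above only brings it to ~1% of combined annual spend.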
A concrete example is Waymo, which is doing $2bn investment rounds - and that's for a technology with a much longer road to market.
The other side of things is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/pflops-d.
But there, the price is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, and the performance is driven down by ignoring the tensor cores and just looking at general compute performance. The current hardware floor is nearer to the RTX 2080 Ti's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/pflops-d. This roughly aligns with AI Impacts' current estimates, and offers another >10x speedup.
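Both the ~$1000 and ~$25 figures fall out of the same calculation, sketched below (assuming full utilization and one-year amortization, both of which are optimistic):

```python
def dollars_per_pflops_day(unit_cost_usd, tflops, amortization_years=1):
    """Hardware cost per pflops-day, assuming the GPU runs flat-out
    for the whole amortization window."""
    total_pflops_days = (tflops / 1000) * amortization_years * 365
    return unit_cost_usd / total_pflops_days

v100_cost = dollars_per_pflops_day(10_000, 30)   # ~$913: the ~$1000/pflops-d figure
rtx_cost = dollars_per_pflops_day(1_000, 125)    # ~$22: the ~$25/pflops-d floor
```

The >10x gap between the two is exactly the combined effect of the 10x unit-price difference and the ~4x tensor-core throughput advantage, partially offset by nothing else in this simple model.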
I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would've happened much sooner, but I still think it's a useful bound.
I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.
- The number of chips available. From the estimates above there are about 500 GPU-years in GPT-3, or - based on a one-year training window - $5m worth of V100s at $10k/piece. This is about 1% of NVIDIA's quarterly datacenter sales. A 100x scale-up by multiple companies could saturate this.
- This constraint can obviously be loosened, but it'd be hard to do on a timescale of months.
- Scaling law breakdown. The GPT series' scaling is expected to break down around 10k pflops-days (§6.3), which is a long way short of the amount of cash on the table.
- As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.
- Data availability limits. Same paper: dataset size goes as the square-root of compute, so a 1000x-larger GPT-3 will need about 10 trillion tokens of training data.
- This is well within the ability of humanity to provide, and it shouldn't be difficult to gather into one spot once you've decided it's a useful thing to do. So I'd be surprised if this was binding.
- Commoditization. If many companies go for huge NLP models, the profit each company can extract is driven towards zero. Unlike with other capex-heavy research - like pharma - there's no IP protection for trained models. If you expect profit to be marginal, you're less likely to drop $1bn on your own training program.
- I am skeptical of this being an important factor while there are lots of legacy, human-driven systems to replace. That shift should be more than enough to fund many companies' research programs. Longer term it might be more important.
- Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh/100 pages of output, which works out to 500 pages/dollar from eyeballing hardware cost as 5x electricity. Scale up 1000x and you're at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.
- I'm skeptical of this being a binding constraint too. $2/page is still very cheap.
- Bandwidth and latency. 500 V100s networked together is one thing; 500k V100s is another entirely.
- I don't know enough about distributed training to say whether this is a very sensible constraint or a very dumb one. I think it has a chance of being a serious problem, but I think it's also the kind of thing you can design algorithms around. This might plausibly not be resolved on a timescale of months however.
- Sequence length. GPT-3 uses 2048 tokens at a time, and that's with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length stuff is not very likely.
- Here we go from just pointing at big numbers and onto straight-up theorycrafting.
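Several of the bullet-point estimates above can be reproduced with a few lines of arithmetic. In the sketch below, the $0.10/kWh electricity price and the 50k-token "novel length" are my own assumptions; everything else is a figure quoted above (GPT-3's ~300bn training tokens is from the GPT-3 paper):

```python
# Chips: $5m of compute over a one-year training window at $10k/V100.
v100_years = 5e6 / 10_000                     # 500 GPU-years

# Data: dataset size scales as sqrt(compute); GPT-3 trained on ~300bn tokens.
tokens_needed_1000x = 300e9 * 1000 ** 0.5     # ~9.5 trillion tokens

# Inference: 0.4 kWh per 100 pages, hardware eyeballed at 5x electricity cost.
electricity_per_kwh = 0.10                    # assumed US-ish price, $/kWh
cost_per_page = (0.4 / 100) * electricity_per_kwh * 5   # ~$0.002, i.e. 500 pages/$
cost_per_page_1000x = cost_per_page * 1000    # ~$2/page

# Sequence length: naive attention is quadratic in sequence length.
attention_blowup = 50_000 ** 2 / 2048 ** 2    # ~600x per-layer attention cost
```

None of these are precise, but they show how little slack there is between the quoted numbers: change any input by 2x and the conclusions above mostly survive.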
In all, tech investment as it is today plausibly supports another 100x-1000x scale up in the very-near-term. If we get to 1000x - 1 zettaflops-days, and >$1bn a pop - then there are a few paths open.
I think the key question is whether by 1000x the model is obviously superior to humans over a wide range of economic activities. If it is - and I think it's plausible it will be - then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.
On paper that leaves room for another 1000x scale-up as it reaches up to $1tn, though current market mechanisms aren't really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That's from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP.
The even more extreme path is if by 1000x you've got something that can design better algorithms and better hardware. Then I think we're in the hands of Christiano's slow takeoff four-year-GDP-doubling.
That's all assuming performance continues to improve though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore's-Law rates of progress and we can rest safe in the arms of something that's merely HyperGoogle.