First, let me apologize for my prolonged absence. I’ve written several articles which I’ve ended up spiking because I just didn’t think they were worth your time. I write about things that excite or intrigue me, and there hasn’t been much new lately that’s done either. The problem is that the stuff that makes the news — a sparkly new LLM being released — really doesn’t move the needle that much. If you ask me what I think about _______’s new LLM, the answer honestly is “not much”. It’s always fall and the new cars are coming out in AI land. No matter what they say, the fins aren’t a big improvement.1
But that’s the noise, and the noise floor is so high that for a while I was missing the signal. Just recently I’ve been pondering the question: what happens if AI — and by AI, I mean LLMs like GPT — were really good, really fast, and essentially free? What would our world look like? What would we do with that?
But before I get to that, let me first convince you that it’s going to happen.
Good
Right now, LLMs are full of fun hallucinations and random misfires. I had a delightful chat with one of Meta’s new models where it told me that to get from Los Angeles to Hawaii in my car I would need to start down the Pacific Coast Highway. I think its plan was for me to drive up to Alaska and … I’m not sure what after that.
But that was a small LLM, and the largest of LLMs are getting pretty darn good. There are actually two axes of “good” I tend to think about: the first is accuracy in answering, and the second is the amount of complexity the LLM can handle.
With regard to accuracy, there’s been a shift, in the largest of offerings, to a “Mixture of Experts” architecture. Initially, LLMs were a bit monolithic in design, so if you wanted more power, you had to build a bigger model. That’s why people have been obsessed with the number of parameters: it was an apples-to-apples comparison tool. But with the “Mixture of Experts” design you have multiple, smaller models that are lashed together in a greater-than-the-sum-of-the-parts design. If you want to read more about how they work, here’s a good place to start. The net result is that for the same amount of horsepower (or some other vague metaphor for effort) you get much better answers. I expect that the ability to wring accuracy out of models will continue to improve, giving us ever better models to work with.2
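To make the idea concrete, here is a toy sketch in Python of how a gating network can route each input to just a few “expert” sub-networks. Everything in it (the layer sizes, the linear “experts”, the choice of top-2 routing) is made up for illustration; production MoE models are far more involved than this.

```python
import numpy as np

def softmax(x):
    # Standard numerically-stable softmax over a vector of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

class ToyMoE:
    """A toy Mixture-of-Experts layer: many small experts, few active."""

    def __init__(self, dim=8, num_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Each "expert" here is just a small random linear map.
        self.experts = [rng.standard_normal((dim, dim)) for _ in range(num_experts)]
        self.gate = rng.standard_normal((dim, num_experts))
        self.top_k = top_k

    def forward(self, x):
        # The router scores every expert, but only the top-k actually
        # run; that is why MoE buys capacity without proportional compute.
        scores = softmax(x @ self.gate)
        top = np.argsort(scores)[-self.top_k:]
        weights = scores[top] / scores[top].sum()
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

moe = ToyMoE()
out = moe.forward(np.ones(8))
print(out.shape)  # (8,)
```

The point of the sketch is the shape of the computation: four experts’ worth of parameters, but only two experts’ worth of multiply-adds per input.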
With regards to complexity, there’s something that’s been happening with the context window size — the amount of text an LLM can swallow whole:
A year ago, you could only provide about 32K bytes of text as input to an LLM. That was great for short chats and small documents, but it made working with larger problems difficult. If you chatted, your earlier conversation would be forgotten quickly as old interactions hit the limit. You could work around the problem, a bit, with tools like RAG, but the amount of elbow grease needed grew very fast once problems got even a little complex.
If you look at the chart above, you can see that you can now feed inputs of over 100K tokens (roughly 400K of text) into an LLM. That incredible increase in window size over the course of the last year has vastly increased the kinds of problems we can solve with LLMs. And when I say vastly increased, I don’t mean an increase that’s merely linear in the size of the context window: the small context windows of a year ago were simply too small for many of the documents we commonly work with.
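For a rough sense of those numbers, here is the back-of-the-envelope conversion I’m using. The four-bytes-per-token figure is a common rule of thumb for English text, not a property of any particular model; real tokenizers vary by model and by language.

```python
# Rule of thumb: one token is roughly 4 bytes of English text.
# Real tokenizers (BPE variants, etc.) vary, so treat this as an estimate.
def approx_tokens(text: str) -> int:
    return max(1, len(text.encode("utf-8")) // 4)

doc = "word " * 20000          # about 100 KB of text
print(approx_tokens(doc))      # about 25,000 tokens
```

By that rule, a 100K-token window takes in roughly 400K of text, which is why whole reports and codebases that were hopeless a year ago now fit in a single prompt.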
Taken together, better architectures and larger context windows mean that LLMs are getting better in both dimensions.
Fast
I don’t have response time metrics for OpenAI’s GPT, but I will say that problems that took a minute to solve in the past are now being solved in seconds. True, part of that is that as I’ve spent more money on them they’ve given me ever better response times, but in general there are internal improvements that are eking out more performance.
Much of what I do now is a sort of “batch” AI; even if the user is waiting, I am willing to make them wait as long as it takes to get done. But it’s foreseeable that in a few years the response times will be near real time.
Let me give you an example of what I mean: this is an unedited (save conversion to animated GIF) screen recording of me running an LLM on my desktop computer:3
This is pretty much on par with the performance of ChatGPT a year ago, but on a consumer-grade home computer. Which brings me to my last point….
Free
All of the improvements in LLM technology have pushed the pricing of services down over the past year. Here’s just an example from OpenAI:
That’s a drop of about 50% in the past year, and I expect prices to continue to drop from competitive pressures. For example, Google’s Gemini 1.5 Pro is priced at about 70% of GPT-4 Turbo’s price.
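As a back-of-the-envelope illustration of what a gap like that means at volume, here’s a small sketch. The dollar figures and monthly usage below are assumptions chosen for the example, not quoted prices; check the providers’ pricing pages for current numbers.

```python
# Assumed prices, per 1M input tokens (illustrative only, not quotes).
gpt4_turbo_price = 10.00
gemini_15_pro_price = 0.70 * gpt4_turbo_price  # ~70% of GPT-4 Turbo

monthly_tokens = 50_000_000  # assumed monthly usage for the example

gpt4_cost = gpt4_turbo_price * monthly_tokens / 1_000_000
gemini_cost = gemini_15_pro_price * monthly_tokens / 1_000_000
print(f"GPT-4 Turbo:    ${gpt4_cost:,.2f}/month")
print(f"Gemini 1.5 Pro: ${gemini_cost:,.2f}/month")
```

At any real usage level, a 30% price difference compounds into real money, which is exactly the kind of pressure that keeps pushing prices down.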
But let me get back to the chat session example in the previous section: how much does that cost me to run? Apart from the electricity that’s already powering my home computer, it costs nothing. Asking a question is free. I can chat with it, I can use it in my applications. All for nothing. True, it’s nowhere near as powerful as GPT-4. But it’s as good as GPT-3 was, more or less. And if our lived experience tells us anything, it’s that what took a supercomputer4 20 years ago takes a laptop today.
Good, Fast and Free
Which brings me back to the start: what if LLMs were good, fast, and free? Good enough that they solved the complex problems you put to them without telling you to drive from LA to Oahu? Fast enough that you can get answers in real time? Free enough that the cost of running them is never a material consideration?
There have been two “the cost dropped to free” revolutions I can remember. The first is the dawn of the personal computer, where you could use computing resources for zero incremental cost, and the second is the internet, where the marginal cost of communications dropped to zero. Both of those spawned huge changes in our lives, economies, and for the lucky few who were in the right place at the right time, in fortunes.
AI and LLMs are going to be the third time this has happened. What hath man wrought?
I’ll ponder this more in subsequent newsletters.
1. Well, unless you’re a Parrot Head, in which case I make this humble offering of a different kind of fins:
2. There are benefits for cost and performance as well, but I have to bucket this somewhere….
3. It is a Mac mini with an M2 Pro CPU and 32 GB of RAM. Not top of the line.
4. If I could use such an antique phrase.