Here’s an excellent post on “reasoning” and “chain of thought” models by Drew Breunig: What We Mean When We Say “Think”. It provides a history and description of how these pieces of technology function, and makes predictions about how certain kinds of testing will shape perceptions of the utility of LLMs.
All in all, it’s a good read if you want to think carefully about this technology.
In particular, there’s a clear discussion of reinforcement learning (RL) and the economic factors involved in its application, which makes a pretty persuasive case for the following predictions:
Models will keep getting better at testable skills: Quantitive (sic) domains – like programming and math – will continue to improve because we can use unit tests and other validation methods to create more synthetic data and perform more reinforcement learning. Qualitative chops and knowledge bank capabilities will be more difficult to address with synthetic data techniques and will suffer from a lack of new organic data.
An AI perception gap will emerge: Those who use AIs for programming will have a remarkably different view of AI than those who do not. The more your domain overlaps with testable synthetic data and RL, the more you will find AIs useful as an intern. This perception gap will cloud our discussions.
I’m inclined to agree - LLMs already function significantly better in some domains than others, and that gap is only likely to intensify.
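To make the mechanism behind that first prediction concrete, here’s a minimal sketch (my own illustration, not code from the post) of why “testable” domains are so amenable to RL: a candidate program sampled from a model can be scored automatically against unit tests, and that score becomes a reward signal. The names and the toy task below are hypothetical.

```python
# A toy reward function in the spirit of RL on verifiable outputs: run a
# model-sampled program against unit tests and return the fraction that pass.
# Everything here (function names, the toy `add` task) is illustrative only.

from typing import Callable, Dict, List


def unit_test_reward(candidate_src: str, tests: List[Callable[[Dict], None]]) -> float:
    """Execute candidate code, then score it by the fraction of passing tests."""
    namespace: Dict = {}
    try:
        exec(candidate_src, namespace)  # define whatever the candidate implements
    except Exception:
        return 0.0  # code that doesn't even run earns no reward
    passed = 0
    for test in tests:
        try:
            test(namespace)
            passed += 1
        except Exception:
            pass  # a failed assertion or runtime error means no credit
    return passed / len(tests)


# Hypothetical task: implement add(a, b). In a real training loop this string
# would be sampled from the model being trained, not hard-coded.
candidate = "def add(a, b):\n    return a + b\n"


def test_small(ns: Dict) -> None:
    assert ns["add"](2, 3) == 5


def test_negatives(ns: Dict) -> None:
    assert ns["add"](-1, 1) == 0


print(unit_test_reward(candidate, [test_small, test_negatives]))  # -> 1.0
```

There’s no equivalently cheap, automatic check for qualitative work – whether an explanation is genuinely insightful, say – which is exactly why the post expects the divide between testable and untestable domains to keep widening.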