Building AI products — Benedict Evans
This is an ‘unfair’ test. It’s a good example of a ‘bad’ way to use an LLM. These are not databases. They do not produce precise factual answers to questions: they are probabilistic systems, not deterministic ones. LLMs today cannot give me a completely and precisely accurate answer to this question. The answer might be right, but you can’t guarantee that.
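To make the distinction concrete, here is a minimal sketch in plain Python, using a made-up toy token distribution rather than a real model: a database lookup returns the same answer every time, while a sampled generation can legitimately vary from run to run.

```python
import random

# A database is deterministic: the same query always returns the same row.
population = {"France": 68_000_000, "Japan": 125_000_000}
print(population["France"])  # 68000000, every single time

# A language model is probabilistic: it samples from a distribution over
# plausible next tokens. This toy distribution stands in for a real model.
next_token_probs = {"68": 0.55, "67": 0.25, "about 70": 0.15, "59": 0.05}

def sample_answer() -> str:
    tokens, weights = zip(*next_token_probs.items())
    return random.choices(tokens, weights=weights, k=1)[0] + " million"

for _ in range(3):
    print(sample_answer())  # often right, sometimes plausible-but-wrong
```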
There is something of a trend for people (often drawing parallels with crypto and NFTs) to presume that this means these things are useless. That is a misunderstanding. Rather, a useful way to think about generative AI models is that they are extremely good at telling you what a good answer to a question like that would probably look like. There are some use-cases where ‘looks like a good answer’ is exactly what you want, and there are some where ‘roughly right’ is ‘precisely wrong’.
Indeed, pushing this a little further, one could suggest that exactly the same prompt and exactly the same output could be a good or bad result depending on why you wanted it.
Be that as it may, in this case I do need a precise answer, and ChatGPT cannot, in principle, be relied on to give me one; indeed, it gave me a wrong one. I asked it for something it can’t do, so this is an unfair test, but it’s a relevant test. The answer is still wrong.
There are two ways to try to solve this. One is to treat it as a science problem – this is early, and the models will get better. You could say ‘RAG’ and ‘multi-agentic’ a lot. The models certainly will get better, but how much better? You could spend weeks of your life watching YouTube videos of machine learning scientists arguing about this, and learn only that they don’t really know. Really, this is a version of the ‘will LLMs produce AGI?’ argument, since a model that could answer ‘any’ question completely correctly sounds like a good definition of at least one kind of AGI to me (again, though, no-one knows).
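For what it’s worth, ‘RAG’ (retrieval-augmented generation) just means looking the relevant facts up first and handing them to the model, rather than trusting whatever it remembers. A minimal sketch of the pattern, with a naive keyword retriever and a placeholder `generate()` standing in for a real model call (both are illustrative assumptions, not anyone’s production pipeline):

```python
# Toy retrieval-augmented generation: retrieve the most relevant snippet,
# then ask the model to answer only from that snippet.

CORPUS = [
    "The company reported revenue of $4.2bn in fiscal 2023.",
    "Headcount at the end of 2023 was roughly 12,000 people.",
]

def retrieve(question: str, corpus: list[str]) -> str:
    # Naive relevance score: count of shared lowercase words.
    q_words = set(question.lower().split())
    return max(corpus, key=lambda doc: len(q_words & set(doc.lower().split())))

def generate(prompt: str) -> str:
    # Placeholder for a real LLM call; echoing the prompt keeps the
    # example runnable without any external service.
    return f"[model would answer based on]\n{prompt}"

def answer(question: str) -> str:
    context = retrieve(question, CORPUS)
    prompt = (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return generate(prompt)

print(answer("What was revenue in 2023?"))
```

Grounding like this can cut down on how often the model invents things, but it does not make the system deterministic: the generation step is still probabilistic, which is the point here.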
But the other path is to treat this as a product problem. How do we build useful mass-market products around models that we should presume will be getting things ‘wrong’?
A stock reaction of AI people to examples like mine is to say “you’re holding it wrong” – I asked (1) the wrong kind of question and (2) asked it in the wrong way. I should have done a bunch of prompt engineering! But the message of the last 50 years of consumer computing is that you do not move adoption forward by making the users learn command lines – you have to move towards the users.