Can large language models solve logic puzzles? One way to find out is to ask. That's what Fernando Pérez-Cruz and Hyun Song Shin did recently. (Pérez-Cruz is an engineer; Shin is head of research at the Bank for International Settlements, as well as the man who taught me some of the more mathematical bits of economic theory in the early 1990s.)
The puzzle in question is commonly known as “Cheryl's Birthday Puzzle”. Cheryl challenges her friends Albert and Bernard to guess her birthday and, as part of the puzzle, tells them it is one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guesswork, Cheryl tells Albert her birth month (but not the day), and tells Bernard the day of the month (but not the month itself).
Albert and Bernard think for a while. Then Albert announces, “I don't know your birthday, and I know Bernard doesn't either.” “In that case, I now know your birthday,” Bernard replies. “Now I know your birthday too,” says Albert. What is Cheryl's birthday?* More to the point, what do we learn by asking GPT-4?
The puzzle is a challenging one. Solving it requires a step-by-step elimination of possibilities while grappling with questions such as “What must Albert have been told, if he knows that Bernard doesn't know?” So it's impressive that when Pérez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the right answer every time, fluently working through the logic of the problem and giving accurate explanations. Yet this brilliant display of logical skill was nothing more than a clever illusion. The illusion was shattered when Pérez-Cruz and Shin asked the computer a slightly modified version of the puzzle, changing the names of the characters and the months.
GPT-4 continued to offer fluent, comprehensible-sounding explanations of the logic, so fluent, in fact, that it took real concentration to spot the moments when those explanations dissolved into nonsense. Both the original problem and its answer are available online, so the computer had presumably learned to paraphrase that text in a sophisticated way, making it look like a brilliant logician.
When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill and Ted, and the months to January, February, March and April, I got the same disastrous result. Both GPT-4 and the newer GPT-4o worked convincingly through the structure of the reasoning but reached incorrect conclusions at several stages, including the final one. (I also realized that on my first try I had introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn't bat an eyelid and “solved” it anyway.)
Curious, I tried another popular puzzle. A game-show contestant is trying to find a prize hidden behind one of three doors. The quizmaster, Monty Hall, lets the contestant make a tentative choice, opens another door to reveal that the prize is not behind it, and then offers the chance to switch doors. Should they switch?
The Monty Hall problem is actually simpler than Cheryl's birthday, but famously counterintuitive. I made things harder for GPT-4o by adding some complications: I introduced a fourth door, and asked not whether the contestant should switch (they should), but whether it is worth paying $3,500 to switch once two empty doors have been opened, if the grand prize is $10,000.**
GPT-4o's response was remarkable. It sidestepped the epistemic traps of the puzzle, clearly explaining the logic of each step. Then it stumbled at the finishing line, adding a nonsensical assumption and arriving at the wrong answer.
What are we to make of all this? In some ways, Pérez-Cruz and Shin have simply found a twist on the familiar problem that large language models sometimes weave plausible fabrications into their answers. Instead of plausible errors of fact, here the computer produced plausible errors of logic.
Defenders of large language models might respond that with cleverly designed prompts, the computer might do better (which is true, although the word “might” is doing a lot of work). It's also almost certain that future models will do better. But as Pérez-Cruz and Shin argue, that may be beside the point. A computer that can look so right while being so wrong is a dangerous tool to use. It's as if we were relying on a spreadsheet for our analysis (risky enough already) and the spreadsheet occasionally, and unpredictably, forgot how multiplication works.
Not for the first time, we learn that large language models can be extraordinary nonsense engines. The difficulty here is that the nonsense is so frighteningly plausible. We've seen lies and mistakes before, and goodness knows we've seen confident bluffing. But this? This is something new.
*If Bernard had been told 18 (or 19), he would have known that the birthday was June 18 (or May 19). So when Albert says that he knows Bernard doesn't know the answer, he rules out those possibilities: Albert must have been told July or August, not May or June. Bernard's reply that he now knows the answer tells us the day cannot be the 14th (which would have left him guessing between July and August). The remaining dates are August 15, August 17 and July 16. Albert knows the month, and his statement that he too now knows the answer means the month must be July: Cheryl's birthday is July 16.
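For readers who would rather check that elimination mechanically, here is a minimal Python sketch (my own illustration, not anything from the BIS paper or my prompts) that encodes each statement as a filter over the ten candidate dates:

```python
# Brute-force check of Cheryl's Birthday: each statement filters the candidates.
dates = [("May", 15), ("May", 16), ("May", 19),
         ("June", 17), ("June", 18),
         ("July", 14), ("July", 16),
         ("August", 14), ("August", 15), ("August", 17)]

def months_with(day, pool):
    return [m for (m, d) in pool if d == day]

def days_in(month, pool):
    return [d for (m, d) in pool if m == month]

# Statement 1: Albert (told the month) knows Bernard (told the day) can't know,
# so Albert's month contains no day that is unique across all candidates.
step1 = [(m, d) for (m, d) in dates
         if all(len(months_with(d2, dates)) > 1 for d2 in days_in(m, dates))]

# Statement 2: Bernard now knows, so his day is unique among the survivors.
step2 = [(m, d) for (m, d) in step1 if len(months_with(d, step1)) == 1]

# Statement 3: Albert now knows too, so his month appears only once.
step3 = [(m, d) for (m, d) in step2 if len(days_in(m, step2)) == 1]

print(step3)  # [('July', 16)]
```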
**Initially the probability of choosing the correct door is 25 percent, and this does not change when Monty Hall opens two empty doors. So the chance of winning the $10,000 is 75 percent if you switch to the remaining door, and 25 percent if you stick with your initial choice. For a risk-neutral contestant, it's worth paying up to $5,000 to make the switch.
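As a sanity check on that arithmetic, a tiny sketch (again my own, not part of the puzzle as posed) computing the expected values of sticking and switching in the four-door variant:

```python
# Expected-value check for the four-door variant with two empty doors opened.
prize = 10_000
p_initial = 1 / 4           # chance the first pick was right
p_switch = 1 - p_initial    # the remaining unopened door carries the rest

ev_stick = p_initial * prize    # $2,500
ev_switch = p_switch * prize    # $7,500

# A risk-neutral contestant should pay up to the difference to switch.
print(ev_switch - ev_stick)         # 5000.0
print(ev_switch - 3_500 > ev_stick) # True: paying $3,500 to switch is worthwhile
```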