My task for our AI overlords was simple: help me crack the New York Times Spelling Bee.
I had spent the better part of a Saturday evening trying to form the letters G, Y, A, L, P, O and N into as many words as possible. But three hours, 141 points and 37 words – including “non-apology,” “lagoon” and “analogy” – later, I hit a wall. A few more words would have been enough to propel me into the “genius” echelon of Spelling Bee, the title reserved for those who find 70% of all the possible words using the given letters, and the point at which the puzzle is considered, effectively, solved. (Those who find 100% of the words earn the rank of “Queen Bee,” but even I have limits.)
My human mind was clearly struggling, but the task seemed like child’s play for AI. So I fired up ChatGPT, told it I was trying to win the Spelling Bee, gave it my letters and laid out the rules. “Give me a list of all the words that can be formed with them,” I typed. “You are allowed to repeat letters, but each word must contain an ‘N’.”
When I hit enter, this is what ChatGPT responded with:
Here is a list of words formed with the letters “GYALPON” which must contain the letter “N”:
GLNPAYO
YPNL
PGNOYL
ONAPYLG
LGNAPYO
ANLY
NAYGOPL
GYPONNE
AOGPLNY
PNAOGLY
This is just a sample of the list due to the large number of possible permutations.
This was not a good start.
“The words must be in the dictionary,” I replied. “Of course,” was the response. “I will provide a list of valid English words based on the given letters”: ANGLY, GALON, LANG.
Obviously this wasn’t working, so I turned to Microsoft’s Copilot (YANG, PLAY, PLAN, BORING), Google’s Gemini (GAPO, GO, GIAN) and Anthropic’s Claude (MANGO, IN PROGRESS, LAWN17.LAY). Meta AI helpfully told me that it makes sure to only include words recognized by dictionaries in a list containing NALYP and BIG, while Perplexity – a chatbot with ambitions to kill Google Search – simply wrote GIRL hundreds of times before abruptly freezing.
AI can now create images, videos and audio as fast as you can type in descriptions of what you want. It can write poetry, essays and dissertations. It can also be a pale imitation of your girlfriend, your therapist and your personal assistant. And plenty of people think it’s poised to automate humans out of their jobs and transform the world in ways we can barely imagine. So why is it so bad at solving a simple word puzzle?
The answer lies in how large language models – the underlying technology that powers the modern AI craze – work. Computer programming is traditionally logical and rule-based: you type commands that a computer follows according to a set of instructions, and it produces a valid output. But machine learning, of which generative AI is a subset, is different.
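For perspective, the rule-based approach makes short work of my puzzle. Here’s a minimal sketch in Python – pure logic, no statistics. (The word-list path is an assumption; any newline-delimited dictionary file would do.)

```python
# A minimal, rule-based Spelling Bee solver: no statistics, just logic.
# Assumes a newline-delimited word list such as /usr/share/dict/words
# (the path is an assumption; substitute any dictionary file).

ALLOWED = set("gyalpon")   # the seven puzzle letters
REQUIRED = "n"             # every answer must contain this letter

with open("/usr/share/dict/words") as f:
    words = [w.strip().lower() for w in f]

answers = [
    w for w in words
    if len(w) >= 4              # Spelling Bee's minimum word length
    and REQUIRED in w           # must use the required letter
    and set(w) <= ALLOWED       # may only use the seven allowed letters
]

print(sorted(answers, key=len, reverse=True))
```

A dozen lines, and every output is a real dictionary word that follows the rules – exactly what the chatbots couldn’t manage.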
“It’s purely statistical,” Noah Giansiracusa, a professor of mathematics and data science at Bentley University, told me. “It’s really about extracting patterns from data and then generating new data that broadly matches those patterns.”
OpenAI declined to comment on the record, but a company spokesperson told me that this kind of “feedback” has helped OpenAI improve the model’s comprehension of problems and its responses to them. Microsoft and Meta declined to comment. Google, Anthropic and Perplexity had not responded by the time of publication.
At the heart of large language models are “transformers,” a technical breakthrough made by researchers at Google in 2017. Once you type a prompt, a large language model breaks down your words, or fractions of those words, into mathematical units called “tokens.” Transformers analyze each token in the context of the larger dataset a model is trained on to see how tokens are connected to one another. Once a transformer understands these relationships, it can respond to your prompt by guessing the next likely token in a sequence. The Financial Times has a great animated explainer that breaks it all down if you’re interested.
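You can see the tokenization step for yourself. The sketch below uses tiktoken, OpenAI’s open-source tokenizer library, to show how a prompt gets carved up – exact splits vary by model and tokenizer, but the point stands: the model sees multi-character chunks and their numeric IDs, not individual letters.

```python
# How a model "sees" a prompt: as token IDs, not letters.
# Uses OpenAI's open-source tiktoken library for illustration;
# the exact token splits depend on the model's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

prompt = "Give me all the words that can be formed with GYALPON"
token_ids = enc.encode(prompt)
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)  # a list of integers
print(pieces)     # multi-character chunks, not single letters
```

A puzzle that hinges on which individual letters a word contains is fighting the model’s representation of text from the very first step.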
Even though I thought I was giving the chatbots precise instructions to generate my Spelling Bee words, all they did was convert my words into tokens and use transformers to spit out plausible responses. “It’s not the same as programming a computer or typing a command into a DOS prompt,” Giansiracusa said. “Your words were translated into numbers, and they were then processed statistically.” A purely logic-based query, it seems, is the worst application of an AI model’s skills – a bit like trying to drive a screw with a resource-intensive hammer.
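To make “processed statistically” concrete, here’s a toy sketch of next-token prediction built from simple bigram counts. Real models learn these patterns across billions of parameters rather than a lookup table, but the underlying idea – continue with whatever is statistically likely, not whatever is logically correct – is the same.

```python
# Toy illustration of next-token prediction: pick the most likely
# continuation based on counts from "training" text. Real models use
# billions of learned parameters, not a simple lookup table.
from collections import Counter, defaultdict

training_text = "the cat sat on the mat the cat ate the fish".split()

# Count which token follows which (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    follows[prev][nxt] += 1

def next_token(token: str) -> str:
    """Return the statistically most likely next token."""
    return follows[token].most_common(1)[0][0]

print(next_token("the"))  # 'cat' -- the word seen most often after 'the'
```

Nothing in that mechanism checks whether an answer obeys a puzzle’s rules; it only checks what tends to come next.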
The success of an AI model also depends on the data it’s trained on. That’s why AI companies are feverishly striking deals with news publishers right now: the fresher the training data, the better the responses. Generative AI, for instance, also fails at suggesting chess moves, but it’s at least marginally better at that than at solving word puzzles. Giansiracusa points out that the glut of chess games available on the internet is almost certainly included in the training data of existing AI models. “I suspect there just aren’t enough annotated Spelling Bee games online for the AI to train on the way there are chess games,” he said.
“If your chatbot seems more confused by a word game than a cat with a Rubik’s cube, it’s because it wasn’t specially trained to play complex word games,” said Sandi Besen, an artificial intelligence researcher at Neudesic, an AI company owned by IBM. “Word games have specific rules and constraints that a model would struggle to adhere to unless specifically instructed to during training, fine-tuning or prompting.”
None of this has stopped the world’s biggest AI companies from marketing the technology as a panacea, often grossly exaggerating claims about its capabilities. In April, both OpenAI and Meta boasted that their new AI models would be capable of “reasoning” and “planning.” In an interview, Brad Lightcap, OpenAI’s chief operating officer, told the Financial Times that the next generation of GPT, the AI model that powers ChatGPT, would show progress in solving “hard problems” such as reasoning. Joelle Pineau, Meta’s vice president of AI research, told the publication that the company was working hard to figure out how to get these models “not just to talk, but actually to reason, to plan… to have memory.”
My repeated attempts to get GPT-4o and Llama 3 to crack the Spelling Bee failed spectacularly. When I told ChatGPT that GALON, LANG and ANGLY were not in the dictionary, the chatbot agreed with me and suggested GALVANOPY instead. When I mistyped the word “sure” as “on” in my response to Meta AI’s offer to suggest more words, the chatbot told me that “on” was, indeed, another word that can be formed with the letters G, Y, A, L, P, O and N.
Clearly, we are still a long way from artificial general intelligence, the nebulous concept describing the moment when machines are capable of performing most tasks as well as, or better than, human beings. Some experts, like Yann LeCun, chief AI scientist at Meta, have been outspoken about the limitations of large language models, arguing that they will never reach human-level intelligence since they don’t really use logic. At an event in London last year, LeCun said the current generation of AI models “just don’t understand how the world works. They’re not capable of planning. They’re not capable of real reasoning,” he said. “We do not have completely autonomous, self-driving cars that can train themselves to drive in about 20 hours of practice, something a 17-year-old can do.”
Giansiracusa, however, strikes a more cautious tone. “We don’t really know how humans reason, do we? We don’t know what intelligence actually is. I don’t know if my brain is just a big statistical calculator, kind of like a more energy-efficient version of a large language model.”
Perhaps the key to living with generative AI without succumbing to either hype or anxiety is simply understanding its inherent limitations. “These tools aren’t really designed for a lot of the things people are using them for,” said Chirag Shah, a professor of AI and machine learning at the University of Washington, who co-authored a wide-ranging research paper in 2022 critiquing the use of large language models in search engines. Tech companies, Shah thinks, could do a much better job of being transparent about what AI can and can’t do before foisting it on us. That ship may have already sailed, though. In recent months, the world’s largest tech companies – Microsoft, Meta, Samsung, Apple and Google – have pledged to weave AI tightly into their products, services and operating systems.
“The bots suck at this because they weren’t designed for it,” Shah said of my word game conundrum. Whether they suck at all the other problems tech companies are throwing at them remains to be seen.
How else have AI chatbots failed you? Email me at pranav.dixit@engadget.com and let me know!