In the ancient Chinese game of Go, cutting-edge artificial intelligence has generally been able to defeat the best human players since at least 2016. But in recent years, researchers have discovered flaws in these high-level Go AI algorithms that give humans a fighting chance. By using unorthodox “cyclic” strategies, ones that even a novice human player could detect and defeat, a clever human can often exploit gaps in a high-level AI’s strategy and trick the algorithm into losing.
Researchers at MIT and FAR AI wanted to see if they could improve on this “worst-case” performance in otherwise “superhuman” Go AI algorithms, by testing a trio of methods to bolster the high-level KataGo algorithm’s defenses against adversarial attacks. The results show that it can be difficult to create truly robust and unexploitable AIs, even in domains as tightly controlled as board games.
Three strategies that failed
In the preprint paper “Can Go AIs be adversarially robust?”, the researchers aim to create a Go AI that is truly “robust” against all attacks. This means an algorithm that cannot be fooled into making “game-losing mistakes that a human would not make,” but that would also require any competing AI algorithm to expend significant computing resources to defeat it. Ideally, a robust algorithm should also be able to overcome potential exploits by using additional computing resources when faced with unfamiliar situations.
The researchers tried three methods to generate such a robust Go algorithm. In the first, they simply fine-tuned the KataGo model on more examples of the unorthodox cyclic strategies that had previously defeated it, in the hope that KataGo could learn to detect and defeat these patterns after seeing more of them.
This strategy initially seemed promising, allowing KataGo to win 100 percent of games against a cyclic “attacker.” But after the attacker itself was fine-tuned (a process that used much less computing power than KataGo’s fine-tuning), that win rate dropped to 9 percent against a slight variation of the original attack.
For their second defense attempt, the researchers ran a multi-round “arms race,” in which new adversarial models discovered new exploits and new defensive models sought to plug these newly discovered flaws. After 10 rounds of iterative training, the final defense algorithm won only 19 percent of games against a final attack algorithm that had discovered a previously unknown variation of the exploit. This was true even though the updated algorithm maintained an advantage over the earlier attackers it had been trained against.
In their final attempt, the researchers tried an entirely new type of training using vision transformers, in an attempt to avoid what could be “harmful inductive biases” found in the convolutional neural networks originally used to train KataGo. That method also failed, winning only 22 percent of the time against a variation of the cyclic attack that “can be reproduced by a human expert,” the researchers wrote.
Will anything work?
In all three defense attempts, the adversaries that beat KataGo did not represent some new, never-before-seen height in general Go-playing ability. Instead, these attack algorithms focused on finding exploitable weaknesses in an otherwise successful AI algorithm, even though these simple attack strategies would be losing strategies against most human players.
These exploitable flaws underscore the importance of evaluating the worst-case performance of AI systems, even when “average-case” performance can seem downright superhuman. On average, KataGo can dominate even high-level human players using traditional strategies. But in the worst case, “weak” opponents can find flaws in the system that cause it to collapse.
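The gap between the two measures can be made concrete with a rough sketch. It uses only the win rates reported above for the first, fine-tuned defense (100 percent against the original cyclic attacker, 9 percent after the attacker was fine-tuned). The function names and the equal weighting of the two opponents are assumptions made for illustration, not the researchers’ evaluation method.

```python
# Win rates of the fine-tuned KataGo defender, as reported in this article.
# Treating the two opponents as equally likely is an illustrative assumption.
win_rates = {
    "original cyclic attacker": 1.00,
    "fine-tuned cyclic attacker": 0.09,
}


def average_case(rates):
    """Mean win rate across all opponents (hypothetical helper)."""
    return sum(rates.values()) / len(rates)


def worst_case(rates):
    """Opponent against which the defender does worst, with that win rate."""
    return min(rates.items(), key=lambda item: item[1])


if __name__ == "__main__":
    opponent, rate = worst_case(win_rates)
    print(f"average-case win rate: {average_case(win_rates):.3f}")
    print(f"worst-case win rate:   {rate:.2f} (vs. {opponent})")
```

Averaged this way, the defender still wins more than half its games, while the worst-case figure shows that a single adapted attacker beats it almost every time.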
It’s easy to extend this kind of thinking to other types of generative AI systems. LLMs that can succeed at some complex creative and reference tasks can still fail completely when faced with trivial math problems (or even be “poisoned” by malicious prompts). Visual AI models that can describe and analyze complex photos can still fail miserably when faced with basic geometric shapes.
Improving on these kinds of “worst-case” scenarios is essential to avoiding embarrassing mistakes when deploying an AI system to the public. But this new study shows that determined “adversaries” can often discover new flaws in an AI algorithm’s performance much more quickly and easily than the algorithm can evolve to fix those problems.
And if that is true in Go, a monstrously complex game that nevertheless has well-defined rules, it could be even more true in less controlled environments. “The main lesson for AI is that these vulnerabilities are going to be hard to eliminate,” Adam Gleave, CEO of FAR AI, told Nature. “If we can’t solve the problem in a simple domain like Go, then in the short term it seems unlikely that fixes will be made for similar issues, such as jailbreaks in ChatGPT.”
The researchers are not giving up hope, however. Although none of their methods were able to “make the (new) attacks impossible” in Go, their strategies did succeed in plugging unchanging “fixed” exploits that had been previously identified. This suggests that “it may be possible to completely defend a Go AI by training against a sufficiently large corpus of attacks,” they write, with proposals for future research that could make this possible.
Regardless, this new research shows that making AI systems more robust against worst-case scenarios could be at least as useful as researching new, more human/superhuman capabilities.