We've known for some time now that AI models can be made to perform erratically using adversarial examples, or subtly crafted inputs that appear ordinary to humans.
For example, in the case of chatbots that handle both text and image inputs, scholars at Princeton University last year found they could input an image of a panda, subtly altered in ways imperceptible to humans but significant to the chatbot, and cause the chatbot to break its "guardrails."
"An aligned model can be compelled to heed a wide range of harmful instructions that it otherwise tends to refuse," the authors wrote, such as producing hate speech or giving tips for committing murder.
What would happen if such models, as they gain greater powers, interact with one another? Could they spread their malfunctioning between each other, like a virus?
Yes, they can, and "exponentially" fast, according to a report this month from Xiangming Gu and his colleagues at the National University of Singapore and collaborating institutions. In the theoretical paper, Gu and his colleagues describe how they simulated what happens in a "multi-agent" environment of visual language models, or VLMs, that have been given "agent" capabilities.
These agents can tap into databases using techniques such as the increasingly popular "retrieval-augmented generation," or RAG, which lets a VLM retrieve an image from a database. A popular example of such a model is LLaVA, for "large language and vision assistant," developed by Microsoft with the help of scholars at The University of Wisconsin and Columbia University.
Gu simulated what happens when a single chatbot agent based on LLaVA, called "Agent Smith," injects an altered image into a chat with another LLaVA agent. The image can spread throughout the collection of chatbots, causing them all, after several rounds of chatting, to behave erratically.
"We present infectious jailbreak, a new jailbreaking paradigm developed for multi-agent environments," Gu and team wrote, "in which, analogous to the modeling of infectious diseases, an adversary need only jailbreak a single agent to infect (almost) all other agents exponentially fast."
Here's how it works: The authors "injected" an image into Agent Smith by asking it to select from a library of images contained in an image album using RAG. They injected the chat history with harmful text, such as questions about how to commit murder. They then prompted the agent to ask another agent a question based on the image. The other agent was tasked with taking the image given to it by Agent Smith, and answering the question posed by Agent Smith.
After some time, the adversarial image prompted one agent to retrieve a harmful statement from the chat history and pose it as a question to the other agent. If the other agent responded with a harmful answer, then the adversarial image had done its job.
Their approach is "infectious" because the same malicious, altered image is stored by each answering chatbot, so that the image propagates from one chatbot to the next, like a virus.
Once the mechanics were in place, Gu and his team modeled how fast the tainted image spread among the agents by measuring how many produced a harmful question or answer, such as how to commit murder.
The attack, of course, has an element of chance: once the altered, malicious image was injected into the system, the virus' spread depended on how often each chatbot retrieved the image and also asked a harmful question about that image.
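To make that element of chance concrete, here is a minimal simulation sketch in Python. It is not the authors' code: the function name `simulate_spread`, the random pairing scheme, and the probabilities `p_retrieve` and `p_store` are all assumptions made for illustration, standing in for how often an infected agent retrieves the altered image and how often its partner ends up storing it.

```python
import random


def simulate_spread(num_agents=64, rounds=20, p_retrieve=0.9, p_store=1.0, seed=0):
    """Toy model of an 'infectious' image spreading through pairwise chats.

    All parameters are hypothetical, not values from the paper:
      p_retrieve: chance an infected questioner retrieves the altered image
      p_store:    chance the answering agent then stores that image
    Returns the number of infected agents after each round (index 0 = start).
    """
    rng = random.Random(seed)
    infected = [False] * num_agents
    infected[0] = True  # a single compromised agent starts the spread
    history = [1]

    for _ in range(rounds):
        # Randomly split the agents into questioner/answerer pairs.
        order = list(range(num_agents))
        rng.shuffle(order)
        half = num_agents // 2
        questioners, answerers = order[:half], order[half:2 * half]

        newly_infected = []
        for q, a in zip(questioners, answerers):
            if infected[q] and not infected[a]:
                # Both chance events must occur for the image to pass on.
                if rng.random() < p_retrieve and rng.random() < p_store:
                    newly_infected.append(a)

        # Apply infections only after the round ends, so an agent
        # infected this round cannot pass the image on in the same round.
        for a in newly_infected:
            infected[a] = True
        history.append(sum(infected))

    return history


if __name__ == "__main__":
    print(simulate_spread())
```

In this toy model, the infected count can at most double in a round, because each infected agent chats with only one partner. Lowering either probability slows the spread, which is the dependence on chance described above.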
The authors compared their method to known methods of infecting multiple agents, such as a "sequential attack," where each pair of chatbots has to be attacked from a blank slate. Their "infectious" approach proved superior: They found that they were able to spread the malicious image among the chatbots much faster.
"The sequential jailbreak ideally manages to infect 1/8 of almost all agents cumulatively after 32 chat rounds, exhibiting a linear rate of infection," Gu and his team wrote. "Our method demonstrates efficacy, achieving infection of all agents at an exponential rate, markedly surpassing the baselines."
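The gap between a linear and an exponential rate is easy to see with some idealized arithmetic. The sketch below is an illustration only, not the paper's model. It assumes a hypothetical population of 256 agents, a sequential attacker that adds one infected agent per round, and an infectious attack that doubles the infected count every round in the best case.

```python
def rounds_linear(num_agents, per_round=1):
    """Rounds needed to infect everyone when a fixed number of agents
    (per_round, an assumed value) is added each round."""
    infected, rounds = 1, 0
    while infected < num_agents:
        infected = min(num_agents, infected + per_round)
        rounds += 1
    return rounds


def rounds_doubling(num_agents):
    """Rounds needed when the infected count doubles every round,
    an idealized best case for pairwise spreading."""
    infected, rounds = 1, 0
    while infected < num_agents:
        infected = min(num_agents, infected * 2)
        rounds += 1
    return rounds


if __name__ == "__main__":
    n = 256  # hypothetical population size
    print(rounds_linear(n))    # 255
    print(rounds_doubling(n))  # 8
```

Under these assumptions, the one-at-a-time attack needs 255 rounds to reach every agent, while ideal doubling needs only eight.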
"...Without any further intervention from the adversary, the infection ratio [...] reaches