The technique revealed two distinct failure modes. The first is readability recovery: an image so blurred or small that the model cannot parse it at all can be nudged into legibility purely in the model’s internal representation, without becoming visually clearer to any human observer or optical character recognition (OCR) tool.

The second is refusal reduction: in cases where the model could already read the embedded instruction but chose to refuse, the perturbations sometimes eroded that safety decision, pushing the model from declining to complying, with no visible change to the image.
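The researchers’ own code is not included in the report, but the general shape of such an embedding-space attack can be illustrated with a generic, PGD-style sketch. Everything below is an assumption for illustration only: `encoder` stands in for a surrogate vision encoder the attacker can query with gradients, and `optimize_perturbation`, `epsilon`, and `step` are hypothetical names and values, not the parameters used in the Cisco research. The loop nudges a blurred image so that its embedding moves toward that of a legible version, while an L-infinity bound keeps the change imperceptible to humans.

```python
# Hypothetical sketch of a representation-space perturbation (PGD-style).
# This does NOT reproduce the Cisco researchers' method; encoder, epsilon,
# step size, and iteration count are illustrative assumptions.
import torch

def optimize_perturbation(encoder, blurred_img, target_img,
                          epsilon=8 / 255, step=1 / 255, iters=200):
    """Push a blurred image's embedding toward a legible target's embedding,
    keeping the pixel change inside an L-infinity ball so it stays invisible."""
    delta = torch.zeros_like(blurred_img, requires_grad=True)
    with torch.no_grad():
        target_emb = encoder(target_img)  # embedding of the readable text image

    for _ in range(iters):
        emb = encoder((blurred_img + delta).clamp(0, 1))
        # Maximize cosine similarity to the target representation.
        loss = -torch.nn.functional.cosine_similarity(emb, target_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()   # signed gradient step
            delta.clamp_(-epsilon, epsilon)     # keep perturbation imperceptible
            delta.grad.zero_()

    return (blurred_img + delta).clamp(0, 1).detach()
```

The key point the sketch illustrates is that the optimization target lives in the model’s embedding space rather than in pixel space: the output image still looks blurred to a person or an OCR filter, yet its internal representation resembles that of readable text.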

In tests, Claude showed the largest overall gain in attack success after optimization on heavily blurred images, jumping from 0% to 28%. The perturbation recovered the information the model could process, but its safety filter still caught a significant share of the newly readable content.

GPT-4o demonstrated stronger safety alignment: as the perturbation made more content readable, its safety filter caught most of the newly legible requests, limiting overall attack gains.

“The optimization we tested on images resulted in the effects of a successful typographic attack that evaded simple image filters, indicating a need for more robust defenses in the representation space,” the Cisco researchers explained.

Eduard Kovacs (@EduardKovacs) is senior managing editor at SecurityWeek. He worked as a high school IT teacher before starting a career in journalism in 2011. Eduard holds a bachelor’s degree in industrial informatics and a master’s degree in computer techniques applied in electrical engineering.

Source: SecurityWeek