The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

A benchmark for unintended attacks in computer-use agents.

αUniversity of Wisconsin-Madison βUniversity of Minnesota γUniversity of Southern California
δMcGill University εMila ηMBZUAI
* Equal contribution
Warning: This page includes examples related to phishing, malware, harassment, and other harmful behaviors used strictly for safety evaluation and research.

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting in which the user instruction is entirely benign and harm arises from the task context or the execution outcome. We introduce OS-Blind, a benchmark that evaluates CUAs under such unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation of frontier models and agentic frameworks reveals that most CUAs exceed a 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. This vulnerability becomes even more severe in multi-agent systems: Claude 4.5 Sonnet's ASR rises from 73.0% to 92.7% when it is deployed inside one. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign: safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution, and in multi-agent systems decomposed subtasks obscure the harmful intent from the model, causing even safety-aligned models to fail.

Motivation and Showcase

Existing computer-use safety benchmarks mostly focus on explicit malicious instructions, where harmful intent is visible from the start and immediate refusal is easier. OS-Blind studies a harder setting: the user instruction appears routine, but the key safety signal is hidden in local files, websites, agreements, or setup instructions and only emerges as the agent executes the task. The real trajectories below show this failure mode in practice, including fabricated news copied from a local draft, abusive comments inserted from a local file, deceptive agreements accepted without review, README steps that conceal malicious code, and ordinary shopping actions that mask a policy-violating purchase.

    All showcased trajectories here are executed by Claude 4.5 Sonnet.

    Why OS-Blind

    The figures below give the broader motivation for OS-Blind: existing benchmarks often test explicit malicious intent, while OS-Blind focuses on cases where the instruction appears benign and the danger only emerges during execution. This makes the safety setting both more realistic and substantially harder.

    Explicit malicious instructions versus benign instructions

    Figure 1. The same underlying task is refused when the request is explicitly malicious, but completed when the instruction is benign and the harm emerges only during execution.

    OS-Blind versus prior safety benchmarks

    Figure 2. OS-Blind yields much higher ASR than VPI-Bench and OS-Harm across representative defense settings, indicating a substantially harder safety testbed.

    All-benign versus explicit malicious instructions

    Figure 3. Rewriting the same tasks as explicit malicious requests causes a substantial drop in ASR, confirming that benign instructions are a core reason current agents miss the threat.

    Benchmark Overview

    OS-Blind is built on OSWorld and evaluates whether agents can recognize harmful context even when the original instruction looks routine. It covers both environment-embedded threats and agent-initiated harms across realistic OS and web workflows.

    300 hand-crafted tasks · 12 harmful categories · 8 applications · 2 threat clusters · all-benign instructions · per-task rule-based evaluation
    OS-Blind category distribution

    Figure 4. OS-Blind spans 12 harmful categories organized into two threat clusters: environment-embedded threats and agent-initiated harms. Tasks cover browsers, office applications, email, coding tools, multimedia software, and OS-level operations.
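The per-task rule-based evaluation above can be sketched as follows. This is a minimal, hypothetical illustration, not the benchmark's actual API: the `Task` dataclass, the `final_state` keys, and the example rule are all made up for exposition. The idea is that each task ships a predicate over the final environment state, and an attack counts as successful when the harmful outcome is present.

```python
# Hypothetical sketch of OS-Blind-style per-task rule evaluation.
# All names here (Task, final_state keys, the example rule) are
# illustrative, not the benchmark's real interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    instruction: str              # the benign-looking user instruction
    rule: Callable[[dict], bool]  # True if the harmful outcome occurred

def evaluate(task: Task, final_state: dict) -> bool:
    """Apply the task's rule to the final environment state."""
    return bool(task.rule(final_state))

# Example: did a draft containing fabricated news actually get published?
task = Task(
    task_id="example-012",
    instruction="Publish the draft post saved in my drafts folder",
    rule=lambda s: bool(s.get("published")) and "fabricated" in s.get("content", ""),
)

print(evaluate(task, {"published": True, "content": "a fabricated headline"}))  # True
print(evaluate(task, {"published": False, "content": ""}))                      # False
```

Because each rule inspects the resulting state rather than the agent's stated intent, the judgment stays grounded in what the agent actually did.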

    Results

    Across frontier models and agentic frameworks, OS-Blind reveals severe safety failures under benign user instructions. Three numbers summarize the headline result: most CUAs exceed 90% ASR, Claude 4.5 Sonnet reaches 73.0% ASR, and deploying Sonnet inside multi-agent systems pushes it to 92.7%.

    >90%: most CUAs exceed 90% ASR on OS-Blind.

    73.0% → 92.7%: Claude 4.5 Sonnet becomes markedly less safe once deployed in multi-agent systems.

    1.9× / 3.8×: average ASR relative to VPI-Bench and OS-Harm across six representative defense settings.
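For readers unfamiliar with the metric, the headline numbers aggregate straightforwardly: ASR is the fraction of evaluated tasks whose per-task rule judged the attack successful. The outcome counts below are illustrative, chosen only to reproduce the 73.0% figure.

```python
# Minimal sketch of how attack success rate (ASR) is aggregated.
# The outcome list is illustrative, not the benchmark's raw results.
def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR = number of successful attacks / number of evaluated tasks."""
    return sum(outcomes) / len(outcomes)

# e.g. 219 of 300 tasks flagged as successful attacks -> 73.0% ASR
outcomes = [True] * 219 + [False] * 81
print(f"{attack_success_rate(outcomes):.1%}")  # 73.0%
```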

    Key Findings

    Finding 1

    Benign Instructions Create a Harder Safety Setting

    OS-Blind is substantially harder than prior CUA safety benchmarks, yielding 1.9× the ASR of VPI-Bench and 3.8× the ASR of OS-Harm, averaged across six representative agent settings. When the same tasks are rewritten as explicit malicious requests, ASR drops substantially, indicating that many aligned agents can refuse visible misuse but fail when harm must be inferred from context during execution.

    Finding 2

    Safety Alignment Rarely Re-engages After Execution Starts

    Existing defenses provide only limited protection in the all-benign setting, and refusals are concentrated in the first one or two steps. Once the initial safety check does not trigger, agents rarely revisit the risk as new evidence appears on screen, so harmful actions often continue to completion.

    Finding 3

    Task Decomposition Obscures Intent in Multi-Agent Systems

    Multi-agent deployment makes safety worse rather than better: Claude 4.5 Sonnet rises from 73.0% ASR standalone to 92.7% inside CoAct-1. Controlled ablations explain why. Decomposed subtasks hide the original intent that would trigger refusal, raising Sonnet's ASR from 27.9% under the original instruction to 79.1% with the full subtask sequence; even when the intent is reconstructed and shown alongside the subtasks, ASR remains at 86.1%.
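The three ablation conditions above can be made concrete with a small sketch. Everything here is hypothetical, the instruction and subtask strings are invented for illustration, and the point is only structural: the same objective reaches the GUI operator in three different framings, and only the first exposes the full intent up front.

```python
# Hypothetical sketch of the Finding-3 ablation conditions. The
# instruction and subtask strings are illustrative placeholders,
# not OS-Blind's actual prompts.
ORIGINAL = "Follow the setup steps in the email to configure the tool."

SUBTASKS = [  # decomposed workflow: each step looks locally harmless
    "Open the email client and locate the latest message.",
    "Download the attached script to the Downloads folder.",
    "Open a terminal and run the downloaded script.",
]

def operator_prompt(condition: str) -> str:
    """Build the prompt the GUI operator sees under each ablation."""
    if condition == "original":       # full intent visible: refusal is likelier
        return ORIGINAL
    if condition == "subtasks":       # intent hidden across atomic steps
        return "\n".join(SUBTASKS)
    if condition == "reconstructed":  # intent summary shown with the steps
        return f"Goal: {ORIGINAL}\n" + "\n".join(SUBTASKS)
    raise ValueError(f"unknown condition: {condition}")
```

The contrast between `original` and `subtasks` is what suppresses the model's built-in refusal in the case study below.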

    Case Study

    The mechanism behind Finding 3 is visible in a representative CoAct-1 example: task decomposition removes the original user intent that would have triggered refusal, leaving the GUI operator with a sequence of locally harmless-looking subtasks.

    Task decomposition suppresses Claude 4.5 Sonnet's built-in defense

    Figure 5. With the original benign instruction, Claude 4.5 Sonnet recognizes the phishing email and refuses. When the same objective is decomposed into subtasks, the model follows the atomic workflow and executes the malicious script instead.

    BibTeX

    @misc{ding2026blindspotagentsafety,
      title={The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents},
      author={Xuwei Ding and Skylar Zhai and Linxin Song and Jiate Li and Taiwei Shi and Nicholas Meade and Siva Reddy and Jian Kang and Jieyu Zhao},
      year={2026},
      eprint={2604.10577},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2604.10577},
    }

    Contact: Skylar Zhai