FATE logo

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

National University of Singapore
{yin.bo, liqi}@u.nus.edu xinchao@nus.edu.sg
*Equal contribution · Corresponding author

Abstract

Tool-using LLM agents fail through trajectories, not just final responses: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks while still producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and they often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse, single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are re-scored by verifiers and filtered across four objectives: security, utility, over-refusal control, and trajectory validity. The resulting dense, trajectory-level signal supervises agent self-evolution. For the evolution step, we introduce Pareto-Front Policy Optimization (PFPO), which combines a supervised warmup with Pareto-aware policy optimization so that safety gains are not traded away against utility. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces the attack success rate by 33.5% and harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
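
As a concrete illustration of the multi-objective filtering the abstract describes, the sketch below keeps only repair candidates that are Pareto-non-dominated across the four verifier objectives. This is a minimal sketch under assumed interfaces: the `Candidate` fields and their higher-is-better orientation are hypothetical stand-ins, not the paper's actual data structures.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Candidate:
    # Hypothetical per-trajectory verifier scores, all oriented higher-is-better.
    security: float         # no unsafe tool calls, no injected-instruction compliance
    utility: float          # the original task is still accomplished
    refusal_control: float  # benign requests are not over-refused
    validity: float         # the trajectory is well-formed and executable

def dominates(a: Candidate, b: Candidate) -> bool:
    """a Pareto-dominates b: no worse on every objective, strictly better on one."""
    sa, sb = astuple(a), astuple(b)
    return all(x >= y for x, y in zip(sa, sb)) and any(x > y for x, y in zip(sa, sb))

def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

front = pareto_front([
    Candidate(0.9, 0.2, 0.8, 1.0),
    Candidate(0.9, 0.6, 0.8, 1.0),  # dominates the first candidate
    Candidate(0.5, 0.9, 0.7, 1.0),  # trades security for utility: kept on the front
])
```

A Pareto front, rather than a single weighted score, lets the filter retain candidates that make different safety-utility trade-offs instead of collapsing them into one objective.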

Framework

Figure 1: Overview of the FATE pipeline.

FATE turns failed agent trajectories into verifier-filtered repair supervision.
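
A minimal sketch of one such round, assuming hypothetical `rollout`/`repair` methods on the policy and boolean verifier callables (the paper's actual interfaces, sampling budgets, and thresholds may differ):

```python
def self_evolve_round(policy, tasks, verifiers, k=8):
    """One failure-to-repair round: roll out the current policy, keep the
    trajectories the verifiers reject, sample k repair candidates per failure
    from the same policy, re-score them, and return the survivors as
    trajectory-level supervision for the next training update."""
    supervision = []
    for task in tasks:
        traj = policy.rollout(task)              # on-policy trajectory
        if all(v(traj) for v in verifiers):      # already safe and useful: skip
            continue
        candidates = [policy.repair(task, traj) for _ in range(k)]
        survivors = [c for c in candidates       # verifier re-scoring plus
                     if all(v(c) for v in verifiers)]  # multi-objective filtering
        supervision.extend((task, c) for c in survivors)
    return supervision
```

Because both the failures and the repairs come from the same policy, the supervision stays on-policy and requires no expert demonstrations.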

Experimental Results

Results are organized with the same grouped metric structure as the paper: lower is better for ASR, BRR, and HCR; higher is better for all other metrics. Headline numbers follow, with a quick arithmetic check after them.

33.5% relative ASR reduction on AgentDojo
82.6% relative HCR reduction on AgentHarm
+6.5 ATBench-C F1 improvement over AgentDoG-Qwen3-4B
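
The two relative-reduction figures can be reproduced from the Qwen3-8B-Instruct rows of Table 1 as (Base - FATE) / Base:

```python
# Relative reductions from the Qwen3-8B-Instruct Base/FATE rows of Table 1.
asr_base, asr_fate = 0.812, 0.540
hcr_base, hcr_fate = 0.719, 0.125

print(f"ASR: {(asr_base - asr_fate) / asr_base:.1%} relative reduction")  # 33.5%
print(f"HCR: {(hcr_base - hcr_fate) / hcr_base:.1%} relative reduction")  # 82.6%
```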

Table 1

Main results across backbone families

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: BRR↓, HCR↓, VRR↑, SafeScore↑. Parenthesized values are changes from the corresponding Base row.

Backbone                 Method  ASR↓            TSR↑            BRR↓            HCR↓            VRR↑            SafeScore↑
Qwen3-8B-Instruct        Base    0.812           0.132           0.104           0.719           0.156           0.241
                         FATE    0.540 (-0.272)  0.392 (+0.260)  0.082 (-0.022)  0.125 (-0.594)  0.812 (+0.656)  0.870 (+0.629)
Llama-3.1-8B-Instruct    Base    0.768           0.158           0.118           0.672           0.188           0.286
                         FATE    0.512 (-0.256)  0.417 (+0.259)  0.087 (-0.031)  0.156 (-0.516)  0.781 (+0.593)  0.842 (+0.556)
Ministral-3-8B-Instruct  Base    0.736           0.176           0.096           0.641           0.219           0.314
                         FATE    0.486 (-0.250)  0.438 (+0.262)  0.074 (-0.022)  0.141 (-0.500)  0.797 (+0.578)  0.858 (+0.544)
Gemma-3-12B-it           Base    0.704           0.204           0.132           0.625           0.234           0.337
                         FATE    0.468 (-0.236)  0.462 (+0.258)  0.091 (-0.041)  0.172 (-0.453)  0.766 (+0.532)  0.821 (+0.484)
Phi-4-reasoning          Base    0.748           0.168           0.126           0.688           0.203           0.301
                         FATE    0.503 (-0.245)  0.429 (+0.261)  0.089 (-0.037)  0.164 (-0.524)  0.781 (+0.578)  0.836 (+0.535)

Table 2

Scaling study on the Qwen3 family

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: HCR↓, VRR↑. Parenthesized values are changes from the corresponding Base row.

Model       Method  ASR↓            TSR↑            HCR↓            VRR↑
Qwen3-0.6B  Base    0.884           0.071           0.844           0.063
            FATE    0.718 (-0.166)  0.203 (+0.132)  0.469 (-0.375)  0.500 (+0.437)
Qwen3-1.7B  Base    0.862           0.086           0.812           0.094
            FATE    0.653 (-0.209)  0.271 (+0.185)  0.344 (-0.468)  0.625 (+0.531)
Qwen3-4B    Base    0.838           0.108           0.781           0.125
            FATE    0.598 (-0.240)  0.334 (+0.226)  0.250 (-0.531)  0.719 (+0.594)
Qwen3-8B    Base    0.812           0.132           0.719           0.156
            FATE    0.540 (-0.272)  0.392 (+0.260)  0.125 (-0.594)  0.812 (+0.656)
Qwen3-14B   Base    0.916           0.058           0.750           0.125
            FATE    0.445 (-0.471)  0.504 (+0.446)  0.188 (-0.562)  0.812 (+0.687)
Qwen3-32B   Base    0.684           0.226           0.625           0.250
            FATE    0.384 (-0.300)  0.566 (+0.340)  0.094 (-0.531)  0.875 (+0.625)

Table 3

Comparison with refinement and defense baselines

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: HCR↓, VRR↑. A dash means the method is not evaluated on that benchmark.

Method       ASR↓   TSR↑   HCR↓   VRR↑
Base         0.812  0.132  0.719  0.156
ReAct        0.736  0.184  0.656  0.250
Reflexion    0.674  0.236  0.281  0.719
Tool Filter  0.552  0.312  -      -
PI Detector  0.604  0.348  -      -
FATE         0.540  0.392  0.125  0.812

Figure 3

Effect of iterative self-evolution

Iterative self-evolution curves

Table 4

External trajectory-safety generalization on ATBench

ATBench-C metrics: Acc.↑, Prec.↑, Rec.↑, F1↑. ATBench-F metrics: R.S.↑, F.M.↑, R.H.↑. A dash means the model is not evaluated on ATBench-F.

Model                          Acc.↑  Prec.↑  Rec.↑  F1↑    R.S.↑  F.M.↑  R.H.↑

Closed-source models
GPT-5.4                        73.7   68.5    87.1   76.7   33.6   13.5   30.2
GPT-5.2                        69.0   65.6    79.3   71.8   29.5   12.0   26.8
Gemini-3-Flash                 76.4   79.3    71.0   74.9   18.4   8.3    15.0
Gemini-3.1-Pro                 75.5   76.1    73.8   75.0   24.8   12.6   18.5

Open-source models
Qwen3.5-397B-A17B              66.8   65.5    70.2   67.8   7.7    3.6    6.8
Qwen3.5-4B                     45.9   41.2    20.7   27.6   6.6    3.0    8.2
Qwen3-4B                       52.6   78.0    6.4    11.9   4.4    8.2    18.3
QwQ-32B                        57.7   81.9    19.1   31.0   15.8   9.4    22.9
Qwen3-235B-A22B-Instruct-2507  59.2   58.2    63.8   60.8   7.0    11.6   26.6
Qwen3-4B-Instruct-2507         55.7   77.6    15.3   25.5   1.0    9.6    21.2
Qwen2.5-7B-Instruct            53.4   73.8    9.7    17.1   5.3    6.0    15.5
Llama3.1-8B-Instruct           45.3   47.3    89.5   61.9   6.2    5.8    15.5

Guard models
LlamaGuard3-8B                 53.1   85.7    3.8    7.3    -      -      -
LlamaGuard4-12B                58.1   63.8    30.9   41.7   -      -      -
Qwen3-Guard                    51.5   40.0    0.4    0.8    -      -      -
ShieldAgent                    62.5   58.0    81.4   67.7   -      -      -
AgentDoG-Qwen3-4B              64.0   59.2    88.9   71.1   46.8   16.5   40.6

Ours
Qwen3-8B-Instruct + FATE       77.8   80.5    78.6   79.5   49.2   18.4   43.1

Table 5

Ablation study on Qwen3-8B-Instruct

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: HCR↓, VRR↑.

Variant                     ASR↓   TSR↑   HCR↓   VRR↑
w/o verifier re-scoring     0.621  0.281  0.281  0.625
w/o over-refusal objective  0.558  0.302  0.156  0.734
w/o Pareto-front selection  0.586  0.332  0.203  0.719
w/o PFPO                    0.572  0.361  0.172  0.750
SFT + safety-only GRPO      0.552  0.286  0.141  0.703
FATE                        0.540  0.392  0.125  0.812

BibTeX

@misc{yin2026onpolicyselfevolutionfailuretrajectories,
      title={On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment}, 
      author={Bo Yin and Qi Li and Xinchao Wang},
      year={2026},
      eprint={2605.11882},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.11882}, 
}