FATE logo

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

National University of Singapore
{yin.bo, liqi}@u.nus.edu xinchao@nus.edu.sg
*Equal contribution · Corresponding author

Abstract

Tool-using LLM agents fail through trajectories, not just final responses: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks while still producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and they often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse, single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are re-scored by verifiers and filtered across four objectives: security, utility, over-refusal control, and trajectory validity. The resulting dense, trajectory-level signal supervises agent self-evolution. For the evolution step, we introduce Pareto-Front Policy Optimization (PFPO), which combines a supervised warmup with Pareto-aware policy optimization so that safety gains are not traded away against utility. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces the attack success rate by 33.5% and harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
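
As a concrete illustration of the multi-objective filtering the abstract describes, the sketch below keeps only repair candidates that are Pareto-non-dominated across the four verifier objectives. This is a minimal sketch under assumed interfaces: the `Candidate` fields and their higher-is-better orientation are hypothetical stand-ins, not the paper's actual data structures.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Candidate:
    # Hypothetical per-trajectory verifier scores, all oriented higher-is-better.
    security: float         # no unsafe tool calls, no injected-instruction compliance
    utility: float          # the original task is still accomplished
    refusal_control: float  # benign requests are not over-refused
    validity: float         # the trajectory is well-formed and executable

def dominates(a: Candidate, b: Candidate) -> bool:
    """a Pareto-dominates b: no worse on every objective, strictly better on one."""
    sa, sb = astuple(a), astuple(b)
    return all(x >= y for x, y in zip(sa, sb)) and any(x > y for x, y in zip(sa, sb))

def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

front = pareto_front([
    Candidate(0.9, 0.2, 0.8, 1.0),
    Candidate(0.9, 0.6, 0.8, 1.0),  # dominates the first candidate
    Candidate(0.5, 0.9, 0.7, 1.0),  # trades security for utility: kept on the front
])
```

A Pareto front, rather than a single weighted score, lets the filter retain candidates that make different safety-utility trade-offs instead of collapsing them into one objective.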

Framework

Figure 1: Overview of the FATE pipeline.

FATE turns failed agent trajectories into verifier-filtered repair supervision.
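
A minimal sketch of one such round, assuming hypothetical `rollout`/`repair` methods on the policy and boolean verifier callables (the paper's actual interfaces, sampling budgets, and thresholds may differ):

```python
def self_evolve_round(policy, tasks, verifiers, k=8):
    """One failure-to-repair round: roll out the current policy, keep the
    trajectories the verifiers reject, sample k repair candidates per failure
    from the same policy, re-score them, and return the survivors as
    trajectory-level supervision for the next training update."""
    supervision = []
    for task in tasks:
        traj = policy.rollout(task)              # on-policy trajectory
        if all(v(traj) for v in verifiers):      # already safe and useful: skip
            continue
        candidates = [policy.repair(task, traj) for _ in range(k)]
        survivors = [c for c in candidates       # verifier re-scoring plus
                     if all(v(c) for v in verifiers)]  # multi-objective filtering
        supervision.extend((task, c) for c in survivors)
    return supervision
```

Because both the failures and the repairs come from the same policy, the supervision stays on-policy and requires no expert demonstrations.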

Experimental Results

Results are organized with the same grouped metric structure as the paper: lower is better for ASR, BRR, and HCR; higher is better for all other metrics. Headline numbers follow, with a quick arithmetic check after them.

33.5% relative ASR reduction on AgentDojo
82.6% relative HCR reduction on AgentHarm
+6.5 ATBench-C F1 improvement over AgentDoG-Qwen3-4B
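
The two relative-reduction figures can be reproduced from the Qwen3-8B-Instruct rows of Table 1 as (Base - FATE) / Base:

```python
# Relative reductions from the Qwen3-8B-Instruct Base/FATE rows of Table 1.
asr_base, asr_fate = 0.812, 0.540
hcr_base, hcr_fate = 0.719, 0.125

print(f"ASR: {(asr_base - asr_fate) / asr_base:.1%} relative reduction")  # 33.5%
print(f"HCR: {(hcr_base - hcr_fate) / hcr_base:.1%} relative reduction")  # 82.6%
```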

Table 1

Main results across backbone families

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: BRR↓, HCR↓, VRR↑, SafeScore↑. Parenthesized values are changes from the corresponding Base row.

Backbone                 Method  ASR↓            TSR↑            BRR↓            HCR↓            VRR↑            SafeScore↑
Qwen3-8B-Instruct        Base    0.812           0.132           0.104           0.719           0.156           0.241
                         FATE    0.540 (-0.272)  0.392 (+0.260)  0.082 (-0.022)  0.125 (-0.594)  0.812 (+0.656)  0.870 (+0.629)
Llama-3.1-8B-Instruct    Base    0.768           0.158           0.118           0.672           0.188           0.286
                         FATE    0.512 (-0.256)  0.417 (+0.259)  0.087 (-0.031)  0.156 (-0.516)  0.781 (+0.593)  0.842 (+0.556)
Ministral-3-8B-Instruct  Base    0.736           0.176           0.096           0.641           0.219           0.314
                         FATE    0.486 (-0.250)  0.438 (+0.262)  0.074 (-0.022)  0.141 (-0.500)  0.797 (+0.578)  0.858 (+0.544)
Gemma-3-12B-it           Base    0.704           0.204           0.132           0.625           0.234           0.337
                         FATE    0.468 (-0.236)  0.462 (+0.258)  0.091 (-0.041)  0.172 (-0.453)  0.766 (+0.532)  0.821 (+0.484)
Phi-4-reasoning          Base    0.748           0.168           0.126           0.688           0.203           0.301
                         FATE    0.503 (-0.245)  0.429 (+0.261)  0.089 (-0.037)  0.164 (-0.524)  0.781 (+0.578)  0.836 (+0.535)

Table 2

Scaling study on the Qwen3 family

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: HCR↓, VRR↑. Parenthesized values are changes from the corresponding Base row.

Model       Method  ASR↓            TSR↑            HCR↓            VRR↑
Qwen3-0.6B  Base    0.884           0.071           0.844           0.063
            FATE    0.718 (-0.166)  0.203 (+0.132)  0.469 (-0.375)  0.500 (+0.437)
Qwen3-1.7B  Base    0.862           0.086           0.812           0.094
            FATE    0.653 (-0.209)  0.271 (+0.185)  0.344 (-0.468)  0.625 (+0.531)
Qwen3-4B    Base    0.838           0.108           0.781           0.125
            FATE    0.598 (-0.240)  0.334 (+0.226)  0.250 (-0.531)  0.719 (+0.594)
Qwen3-8B    Base    0.812           0.132           0.719           0.156
            FATE    0.540 (-0.272)  0.392 (+0.260)  0.125 (-0.594)  0.812 (+0.656)
Qwen3-14B   Base    0.916           0.058           0.750           0.125
            FATE    0.445 (-0.471)  0.504 (+0.446)  0.188 (-0.562)  0.812 (+0.687)
Qwen3-32B   Base    0.684           0.226           0.625           0.250
            FATE    0.384 (-0.300)  0.566 (+0.340)  0.094 (-0.531)  0.875 (+0.625)

Table 3

Comparison with refinement and defense baselines

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: HCR↓, VRR↑. A dash means the method is not evaluated on that benchmark.

Method       ASR↓   TSR↑   HCR↓   VRR↑
Base         0.812  0.132  0.719  0.156
ReAct        0.736  0.184  0.656  0.250
Reflexion    0.674  0.236  0.281  0.719
Tool Filter  0.552  0.312  -      -
PI Detector  0.604  0.348  -      -
FATE         0.540  0.392  0.125  0.812

Figure 3

Effect of iterative self-evolution

Iterative self-evolution curves

Table 4

External trajectory-safety generalization on ATBench

ATBench-C metrics: Acc.↑, Prec.↑, Rec.↑, F1↑. ATBench-F metrics: R.S.↑, F.M.↑, R.H.↑. A dash means the model is not evaluated on ATBench-F.

Model                          Acc.↑  Prec.↑  Rec.↑  F1↑    R.S.↑  F.M.↑  R.H.↑

Closed-source models
GPT-5.4                        73.7   68.5    87.1   76.7   33.6   13.5   30.2
GPT-5.2                        69.0   65.6    79.3   71.8   29.5   12.0   26.8
Gemini-3-Flash                 76.4   79.3    71.0   74.9   18.4   8.3    15.0
Gemini-3.1-Pro                 75.5   76.1    73.8   75.0   24.8   12.6   18.5

Open-source models
Qwen3.5-397B-A17B              66.8   65.5    70.2   67.8   7.7    3.6    6.8
Qwen3.5-4B                     45.9   41.2    20.7   27.6   6.6    3.0    8.2
Qwen3-4B                       52.6   78.0    6.4    11.9   4.4    8.2    18.3
QwQ-32B                        57.7   81.9    19.1   31.0   15.8   9.4    22.9
Qwen3-235B-A22B-Instruct-2507  59.2   58.2    63.8   60.8   7.0    11.6   26.6
Qwen3-4B-Instruct-2507         55.7   77.6    15.3   25.5   1.0    9.6    21.2
Qwen2.5-7B-Instruct            53.4   73.8    9.7    17.1   5.3    6.0    15.5
Llama3.1-8B-Instruct           45.3   47.3    89.5   61.9   6.2    5.8    15.5

Guard models
LlamaGuard3-8B                 53.1   85.7    3.8    7.3    -      -      -
LlamaGuard4-12B                58.1   63.8    30.9   41.7   -      -      -
Qwen3-Guard                    51.5   40.0    0.4    0.8    -      -      -
ShieldAgent                    62.5   58.0    81.4   67.7   -      -      -
AgentDoG-Qwen3-4B              64.0   59.2    88.9   71.1   46.8   16.5   40.6

Ours
Qwen3-8B-Instruct + FATE       77.8   80.5    78.6   79.5   49.2   18.4   43.1

Table 5

Ablation study on Qwen3-8B-Instruct

AgentDojo metrics: ASR↓, TSR↑. AgentHarm metrics: HCR↓, VRR↑.

Variant                     ASR↓   TSR↑   HCR↓   VRR↑
w/o verifier re-scoring     0.621  0.281  0.281  0.625
w/o over-refusal objective  0.558  0.302  0.156  0.734
w/o Pareto-front selection  0.586  0.332  0.203  0.719
w/o PFPO                    0.572  0.361  0.172  0.750
SFT + safety-only GRPO      0.552  0.286  0.141  0.703
FATE                        0.540  0.392  0.125  0.812

BibTeX

@misc{yin2026onpolicyselfevolutionfailuretrajectories,
      title={On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment}, 
      author={Bo Yin and Qi Li and Xinchao Wang},
      year={2026},
      eprint={2605.11882},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.11882}, 
}