AgentRewardBench Demo (paper)
Benchmark
Agent
Task ID
Judges: AER-C, NNetNav, Claude 3.7 Sonnet (Screen), GPT-4o (Screen), Qwen 2.5 VL (Screen), Llama 3.3 70B