May was loud. Self-evolving skills, harness engineering, memory-as-substrate, multi-agent organizations, and a steady drumbeat of new benchmarks — the field is moving so fast that the only honest review is a list.
So here’s the list. Categorised, lightly annotated, mostly arXiv. The summaries are the authors’ own framing distilled down; the quoted lines are their evidence, not mine.
Scroll the index on the right to jump. Or keep going — it reads cleanest top to bottom.
01. Agent Architecture41 papers
How agents are organized inside, from self-evolving loops to skill libraries.
-
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
The paper proposes a subgoal-driven agent framework that improves long-horizon task performance by explicitly decomposing goals during online planning, and it adds an RL training method with dense milestone rewards to reduce sparse-reward issues. This matters because long-horizon agents often lose coherence over extended web/navigation tasks, and the approach substantially improves success rates on WebArena-Lite.
“subgoal decomposition; dense, milestone-based reward signals”
-
AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse
The paper proposes a self-evolving agent framework that stores successful solutions as executable Python subagents instead of textual reflections. This matters because it lets agents accumulate reusable, portable capabilities that improve over time and reduce repeated task-solving effort.
“preserves successful task solutions as executable subagent code rather than textual experience”
-
Agentic Business Process Management Systems
This position paper argues that generative and agentic AI will create a new wave of business process management, shifting systems from task automation toward autonomous sensing, reasoning, and action over end-to-end processes. It matters for AI agents because it sketches an architectural vision for embedding agents into process management with human-to-fully-autonomous continuums and governance.
“agents can sense process states, reason about improvement opportunities, and act to maintain and optimize performance”
-
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
The paper presents two agentic systems that autonomously search for and implement new neural architectures, using multiple LLM agents to explore design spaces and write mechanisms/training code. It matters because it shows agents can contribute to architecture discovery and optimization, a step toward recursive self-improvement.
“LLM agents autonomously designing foundation models beyond standard Transformers”
-
ASI-Evolve: AI Accelerates AI
ASI-Evolve is an agentic framework for AI-for-AI research that runs a closed-loop learn-design-experiment-analyze cycle. It matters because it shows how agents can help automate parts of AI research itself, spanning data, architectures, and learning algorithms.
“closes this loop through a learn-design-experiment-analyze cycle”
-
AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents
AutoAgent proposes a self-evolving multi-agent framework that combines structured agent cognition, context-aware action selection, and elastic memory management. It matters because it aims to make agents adapt over long horizons without external retraining, improving reliability in dynamic environments.
“self-evolving multi-agent framework"; "Elastic Memory Orchestrator”
-
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
The paper treats agent skills as structured bundles of instructions, tools, and resources, and optimizes both their structure and content with a bilevel framework. An outer Monte Carlo Tree Search chooses the skill structure while an inner loop refines the components, aiming to improve agent task performance.
“agent skills are structured collections of instructions, tools, and supporting resources”
-
Caesar: Deep Agentic Web Exploration for Creative Answer Synthesis
Caesar proposes an agentic web-exploration architecture that builds a dynamic knowledge graph from deep traversal of the web, then uses it to guide synthesis toward novel, non-obvious insights. It matters because it shifts agents from passive retrieval and convergent summarization toward creative discovery and higher-quality answer synthesis.
“deep web traversal to construct a dynamic knowledge graph”
-
CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
CoEvolve proposes a closed-loop training framework where an LLM agent and its training data co-adapt over time. It mines failure signals from rollouts to synthesize new tasks, which helps cover harder interaction patterns and improve agent robustness in changing environments.
“agent-data mutual evolution framework” / “closed-loop, interaction-driven training”
-
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
The paper argues that instead of using an external orchestrator to manage multi-step agent workflows, the procedure can be compiled into the weights of a smaller fine-tuned model. This 'subterranean agent' approach aims to preserve quality while reducing context usage, dependency on frontier models, and leakage of proprietary procedures.
“Compiling the procedure into the weights of a small fine-tuned model”
-
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
The paper proposes a plan-and-act robotic control system that uses world-model-style candidate future trajectory generation and scoring instead of purely reactive next-action prediction. This matters for agents because it improves long-horizon reliability in complex, changing environments where greedy reactive policies tend to fail.
“shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them”
-
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
The paper proposes a self-evolving cybersecurity agent that revises its own scaffold from failed runs using structured components, diagnosis signals, and diverse search over variants. This matters because it makes agent behavior more adaptive to changing tasks and failure modes in security testing, instead of relying on fixed human-designed workflows.
“iteratively revises its own scaffold based on experience from failed execution attempts”
-
Decocted Experience Improves Test-Time Inference in LLM Agents
The paper studies how to improve LLM agents at inference time without changing model parameters by constructing better context from accumulated experience. It argues that 'decocted experience'—extracting, organizing, and retrieving the most salient information—can make test-time reasoning and agentic behavior more effective and efficient.
“context as a complementary scaling axis"; "decocted experience”
-
Discovering Novel LLM Experts via Task-Capability Coevolution
The paper proposes AC/DC, an open-ended framework that coevolves LLMs and natural-language tasks together using model merging and synthetic task generation. This matters for AI agents because it suggests a way to continually discover new capabilities without manually restarting training on fixed datasets or reward functions.
“coevolution of models and tasks"; "evolves both LLMs via model merging and natural language tasks via synthetic data generation”
-
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
The paper presents DR-Venus, a 4B deep research agent designed for edge deployment and trained entirely on about 10K open examples. It matters because it shows small models can still achieve strong agentic research performance with careful data curation, agentic SFT, and reinforcement learning.
“frontier 4B deep research agent for edge-scale deployment, built entirely on open data”
-
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
EvoAgent is an evolvable LLM agent framework that learns structured skills over time and delegates subtasks to hierarchical sub-agents. It matters because it combines skill accumulation, memory, and delegation to improve performance on complex real-world tasks.
“integrates structured skill learning with a hierarchical sub-agent delegation mechanism”
-
From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents
The paper presents POISE, a closed-loop agent framework that automatically discovers improved policy optimization algorithms for language models by iterating over proposals, implementations, evaluations, and reflections. This matters because it shows LLM agents can help scientists explore algorithm design in a more systematic, evidence-driven way rather than relying only on manual trial-and-error.
“closed-loop framework for automated discovery of policy optimization algorithms”
-
Governance by Construction for Generalist Agents
This demo presents a modular policy-as-code governance layer that wraps a generalist LLM agent so enterprises can control allowed actions, human approvals, and information exposure without fine-tuning. It matters because it embeds compliance and safety checks throughout the agent pipeline, making real-world deployment more predictable and auditable.
“policy-as-code layer... runtime governance architecture... five structural checkpoints”
-
Harnessing Agentic Evolution
The paper introduces AEvo, a meta-editing framework that treats agentic evolution as an interactive environment and lets a meta-agent revise the procedure or context controlling future search. This matters for AI agents because it provides a stable way to use accumulated evidence over long horizons, improving both flexibility and robustness in iterative problem solving.
“meta-agent observes this state and acts ... by editing the procedure or agent context that controls future evolution”
-
Harnessing LLM Agents with Skill Programs
The paper proposes HASP, which turns reusable skills from past agent experience into executable Program Functions that can intervene in the agent loop. This matters because it moves skills from passive text advice to active guardrails that can improve long-horizon agent reliability at inference time, during training, and through self-improvement.
“skills into executable Program Functions (PFs)”
-
Hyperagents
The paper introduces hyperagents: self-referential agents with both a task-solving agent and a meta-agent whose modification procedure is itself editable. This matters for AI agents because it aims to enable open-ended self-improvement and improve not just task performance but the system’s ability to generate future improvements.
“self-referential agents" and "metacognitive self-modification”
-
Large Causal Models from Large Language Models
The paper proposes using large language models to build large causal models by generating causal questions and extracting plausible causal statements across many domains. This matters for agents because it frames LLMs as a mechanism for assembling structured causal knowledge, which could improve reasoning and decision-making.
“A high-quality LLM is used to propose topics, generate causal questions, and extract plausible causal statements”
-
Look Before You Leap: Autonomous Exploration for LLM Agents
The paper argues that LLM agents need an explicit autonomous exploration phase before task execution, especially in unfamiliar environments where they otherwise over-rely on prior knowledge. It introduces a verifiable exploration metric and an explore-then-act training paradigm, which matters because it improves agents’ ability to gather grounded information and generalize better in real-world settings.
“premature exploitation"; "Explore-then-Act paradigm”
-
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
MetaClaw is a continual meta-learning framework for deployed LLM agents that both updates the core policy and grows a reusable skill library over time. It matters because it lets agents adapt to changing user needs and workloads without downtime, improving performance through online skill synthesis and opportunistic fine-tuning.
“continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills”
-
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
The paper proposes a skill-centric agent framework where LLM agents continuously create, store, reuse, organize, evaluate, and refine skills over time. This matters because it treats agent skills as long-lived assets with memory and feedback loops, improving reuse, reliability, and cross-task adaptation.
“unified lifecycle (creation, memory, management, evaluation, and refinement)”
-
Neural Computers
The paper proposes Neural Computers, a learned runtime that unifies computation, memory, and I/O, with the long-term goal of a Completely Neural Computer. This matters for AI agents because it frames agents not just as software wrappers, but as a new computing substrate that can learn interface primitives and reusable capability over time.
“unify computation, memory, and I/O... in a learned runtime state”
-
Nexus : An Agentic Framework for Time Series Forecasting
Nexus decomposes time-series forecasting into specialized agentic stages that separately handle macro/micro temporal patterns and optional contextual signals like news or events. This matters for AI agents because it shows forecasting can benefit from structured reasoning and modular orchestration rather than only sequence modeling.
“multi-agent forecasting framework”
-
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
The paper proposes a framework for proactive AI agents that can infer latent user needs from streaming context and act under real-time and long-horizon constraints. It matters because it shifts agents from reactive assistants to intent-aware systems with structured long-term memory and closed-loop decision-making.
“infer latent needs from ongoing context"; "hybrid memory (workspace, user, global) for long-term”
-
Position: agentic AI orchestration should be Bayes-consistent
This position paper argues that the orchestration/control layer of agentic AI systems should use Bayesian decision theory to maintain beliefs, update them from interactions, and choose actions under uncertainty. The point matters because many agent decisions are fundamentally uncertain, and Bayes-consistent control can improve calibration, tool selection, and utility-aware coordination without requiring the LLM itself to be fully Bayesian.
“the control layer of an agentic AI system ... is a clear case where Bayesian principles should shine”
-
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
The paper proposes Ratchet, a single-agent loop for self-evolving LLM agents that can write, retrieve, curate, and retire its own natural-language skills. It matters because it shows that performance gains come less from skill authoring itself and more from managing the lifecycle of skills over time.
“single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills”
-
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
Reversa proposes a multi-agent framework that turns legacy software into traceable operational specifications so coding agents can modify real systems with lower risk. This matters because it gives agents explicit context, confidence signals, and human-review gaps when working with complex inherited codebases.
“multi-agent pipeline; traceability between code and specification”
-
SIA: Self Improving AI with Harness & Weight Updates
The paper proposes a self-improving loop where a feedback agent updates both an agent’s harness (tools, prompts, retry/search logic) and its model weights. This matters because it bridges two previously separate lines of agent improvement—scaffold optimization and test-time training—toward more autonomous self-improvement.
“updates both the harness and the weights of a task-specific agent”
-
SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
The paper proposes SKILL0, a training-time curriculum that gradually removes skill context so an agent can internalize procedural skills into model parameters rather than relying on runtime retrieval. This matters for AI agents because it aims to produce more zero-shot, autonomous behavior with lower inference-time token overhead and less dependence on external skill files.
“skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval”
-
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
SkillClaw proposes a shared, continuously updated skill repository for LLM agents, using multi-user interactions as training signal to refine existing skills and add new ones. This matters because it turns isolated agent experience into system-wide improvement, enabling cumulative capability growth without extra user effort.
“collective skill evolution in multi-user agent ecosystems”
-
SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration
SkillFlow proposes a flow-based orchestration framework with a trainable Supervisor, a dynamic skill library, and a frozen executor to automate multi-turn agent behavior. It matters because it addresses strategy collapse, opaque credit assignment, and unguided skill evolution, enabling more stable and transparent agent improvement.
“flow-based framework"; "recursive skill evolution"; "transparent per-step credit assignment”
-
SkillOS: Learning Skill Curation for Self-Evolving Agents
The paper proposes SkillOS, an experience-driven RL recipe for learning how an agent should curate and update a reusable skill repository over time. This matters because it moves agents beyond one-off task solving toward self-evolving systems that can improve from past interactions.
“experience-driven RL training recipe for learning skill curation in self-evolving agents”
-
SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
SOLAR proposes an open-ended autonomous agent that self-improves via parameter-level meta-learning and multi-level reinforcement learning, rather than relying on costly repeated fine-tuning. This matters for agents because it enables continual adaptation to changing environments while preserving prior knowledge.
“open-ended autonomous agent"; "lifelong learning and continual adaptation”
-
Spatial-Agent: Agentic Geo-spatial Reasoning with Scientific Core Concepts
The paper proposes Spatial-Agent, an agent framework for geospatial question answering that turns natural-language queries into executable GeoFlow graphs built from spatial concepts and transformations. This matters for AI agents because it moves beyond brittle pattern matching toward interpretable, executable reasoning grounded in spatial science.
“parses into executable workflows represented as GeoFlow Graphs”
-
Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks
The paper presents an end-to-end agentic harness that can take only data and a high-level task description and autonomously design custom visualization applications. It matters for AI agents because it demonstrates a coordinated multi-stage workflow—analysis, planning, environment setup, implementation, validation, and evaluation—for long-horizon scientific tasks.
“end-to-end agentic harness"; "independently designs custom visual analysis applications”
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
The paper trains LLM agents to build internal world knowledge and self-evolve without requiring external rewards or human instructions at inference time. This matters because it suggests agents can adapt to unknown environments more autonomously and with less dependence on hand-crafted supervision.
“spontaneously learn about unseen environments prior to task execution”
-
You Live More Than Once: Towards Hierarchical Skill Meta-Evolving
The paper proposes HiSME, a lightweight hierarchical skill meta-evolving method that improves deployed agent systems by refining not just skills but also the strategy for evolving those skills at test time. This matters because it enables continual improvement from execution traces without expensive LLM parameter updates.
“test-time refinement of the skill evolving framework itself is necessary”
02. Applications20 papers
What people are building agents to actually do, from finance to bench science.
-
A cybersecurity AI agent selection and decision support framework
The paper proposes a structured framework for choosing and deploying different AI agent architectures for cybersecurity tasks, aligned to NIST CSF 2.0. It matters because it turns agent design into a decision-support process that maps autonomy and responsiveness to real operational security needs.
“aligns diverse artificial intelligence (AI) agent architectures ... with the comprehensive NIST Cybersecurity Framework (CSF) 2.0”
-
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
AgentEconomist is an end-to-end interactive agentic system for economics that turns abstract research intuition into executable computational experiments. It matters for AI agents because it shows how a modular multi-stage agent can help domain experts move from ideas to literature-grounded hypotheses, experiment design, and analysis.
“translate abstract intuitions into executable computational experiments”
-
Building AI Companions that Prioritise Learning over Performance
The paper argues that LLM-based educational agents should be designed as AI learning companions rather than tools that only maximize short-term task performance. It proposes a framework centered on pedagogy, adaptation to learners, and responsible design to support durable understanding, metacognitive growth, and learner agency.
“learning-performance paradox” / “AI learning companions”
-
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
DataSTORM is an LLM-based agent that can do deep research over both structured databases and internet sources by combining exploratory data analysis with data storytelling. It matters because it extends agentic research beyond unstructured web search to iterative hypothesis generation, quantitative reasoning, and coherent analytical narratives over large databases.
“conducting deep research over large-scale structured databases”
-
Directional Alignment and Narrative Agency in Human-LLM Co-Writing
This paper studies how humans and LLMs share control in creative co-writing, measuring who introduces new narrative directions versus who elaborates and adapts. It matters for AI agents because it shows a complementary division of labor: humans drive novelty and direction, while LLMs support coherence and emotional alignment.
“human turns introduce greater semantic novelty" and LLMs "elaborate on human-introduced elements”
-
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
The paper proposes LLM-native figures: interactive, provenance-rich scientific artifacts that are both readable by humans and directly usable by LLMs. This matters for agents because it lets them trace data sources, extend analyses, and generate new visualizations without treating figures as opaque images.
“LLM-native figures... simultaneously human-legible and machine-addressable”
-
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures
FinReporting is an agentic workflow for financial reporting across different jurisdictions, using a unified canonical ontology and auditable stages to normalize disclosures from heterogeneous sources. It matters for agents because it shows how LLMs can be constrained as evidence-grounded verifiers rather than free-form generators in high-stakes, regulation-heavy tasks.
“an agentic workflow for localized cross-jurisdiction financial reporting”
-
FinRipple: Aligning Large Language Models with Financial Market for Event Ripple Effect Awareness
The paper proposes FinRipple, a framework that aligns LLMs with financial-market structure to analyze and predict event ripple effects across entities. It matters for AI agents because it adds market-aware reasoning via time-varying knowledge graphs and reinforcement learning, enabling more realistic decision support in finance.
“empowers LLMs with the ability to analyze ripple effects through financial theory-guided large-scale reinforcement learning”
-
From Prompt to Graph: Comparing LLM-Based Information Extraction Strategies in Domain-Specific Ontology Development
The paper compares three LLM-based information extraction strategies—pre-trained prompting, in-context learning, and fine-tuning—for extracting terms and relations from domain-specific text. It matters for AI agents because it shows how LLMs can automate building structured knowledge graphs/ontologies from limited data, reducing manual ontology engineering effort.
“investigates three LLM-based approaches... to extract terms and relations from domain-specific texts using limited data”
-
From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience
The paper proposes ReflectiChain, an agentic framework that uses a generative world model plus reflection and test-time policy evolution to improve long-horizon supply chain planning under disruptive events. It matters for AI agents because it addresses grounding and decision paralysis in non-stationary real-world environments.
“Latent Trajectory Rehearsal powered by a generative world model”
-
Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software
The paper presents PoVSmith, an agent-assisted approach that generates proof-of-vulnerability tests for software applications that depend on vulnerable libraries. This matters for AI agents because it shows how LLMs plus code analysis and execution feedback can automate a security workflow that is hard and expensive to do manually.
“combines call path analysis, exemplar test, code context, and feedback into multiple prompts”
-
Intent-Driven Smart Manufacturing Integrating Knowledge Graphs and Large Language Models
The paper proposes a framework that combines instruction-tuned LLMs with ontology-aligned knowledge graphs to convert natural-language human intents into structured machine-executable requirements for smart manufacturing. This matters for AI agents because it enables more reliable, explainable intent-to-action execution in complex industrial environments.
“translate high-level human intents into machine-executable actions”
-
LLM-Driven Ontology Construction for Enterprise Knowledge Graphs
The paper introduces OntoEKG, a two-stage LLM pipeline that extracts ontology concepts from unstructured enterprise data and then organizes them into a hierarchy for RDF serialization. It matters for AI agents because it automates a traditionally manual, expertise-heavy knowledge-graph construction workflow, helping agents work over cleaner structured enterprise knowledge.
“LLM-driven pipeline designed to accelerate the generation of domain-specific ontologies from unstructured enterprise data”
-
Omakase: proactive assistance with actionable suggestions for evolving scientific research projects
Omakase is a proactive research assistant that watches a user's project documents, infers timely deep-research queries, and turns long reports into actionable suggestions. It matters for agents because it addresses context limitations and helps agents support long-running, evolving workflows rather than only responding to explicit prompts.
“monitors a user's project documents to infer timely queries" and "distills long reports into suggestions”
-
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
The paper trains a specialized spreadsheet agent with reinforcement learning in a realistic Excel environment, rather than relying only on prompt engineering. This matters because spreadsheet work is a common real-world data task that requires robust multi-step tool use and workflow execution.
“reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment”
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
The paper presents AgentFlow, a framework for automatically synthesizing multi-agent harnesses that coordinate roles, tools, communication, and retry logic for vulnerability discovery. This matters for AI agents because it shows that the harness/orchestration layer can dramatically change performance, and that adaptive harness design can help agents find real zero-day bugs more effectively.
“typed graph DSL whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol”
-
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
TingIS is an end-to-end enterprise system for discovering high-priority risk events from noisy customer incident reports in real time. It combines multi-stage event linking, LLM-based merge decisions, routing, and noise reduction to turn messy incident text into actionable operational intelligence.
“multi-stage event linking engine ... with Large Language Models (LLMs)”
-
Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
The paper proposes AIDA, an end-to-end agent framework for autonomous business intelligence that turns fragmented enterprise data into actionable insights. It combines a domain-specific language for precise SQL execution with reinforcement learning for cumulative reasoning, which matters for agents that must analyze complex real-world data environments.
“first end-to-end framework designed for autonomous exploration in complex business environments”
-
Towards Cybersecurity Superintelligence: from AI-guided humans to human-guided AI
The paper argues that cybersecurity is moving toward 'superintelligence' by shifting from LLMs that guide humans to systems where AI performs expert-level security work autonomously. This matters for agents because it frames a path from assisted workflows to high-speed, strategic, human-surpassing agentic security systems.
“Cybersecurity superintelligence ... represents the next frontier in security”
-
WhatIf: Interactive Exploration of LLM-Powered Social Simulations for Policy Reasoning
WhatIf is an interactive system for steering, inspecting, and comparing LLM-powered social simulations to help policymakers reason under deep uncertainty. It matters for agents because it treats multi-agent simulation outputs as inspectable, collaborative decision-support rather than fixed forecasts.
“interactive system that enables policymakers to steer, inspect, and compare LLM-powered social simulations in real time”
03. Evaluation/Benchmarking77 papers
How we tell whether agents are actually working — and where they break.
-
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
The paper proposes TASTE, an automatic method for synthesizing harder agent benchmark tasks with broader tool-use coverage by reversing the usual task-construction pipeline. This matters because it helps expose benchmark saturation and provides a scalable way to continuously evaluate future agents on more realistic and diverse tool sequences.
“generates challenging tasks with broader tool-use coverage”
-
A Unified Framework for the Evaluation of LLM Agentic Capabilities
The paper proposes a standardized framework to fairly evaluate LLM agent capabilities by separating model ability from benchmark packaging, scaffolding, and environment volatility. This matters for AI agents because it makes benchmark results more interpretable and enables cleaner comparisons across models and settings.
“unified framework for the fair evaluation of LLM agentic capabilities”
-
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
The paper proposes AggAgent, an aggregation agent that combines multiple parallel rollouts for long-horizon agentic tasks by treating trajectories as an environment and selectively inspecting/searching them with lightweight tools. This matters because it enables parallel test-time scaling for open-ended agent workflows without dumping all trajectories into context, improving accuracy at low overhead.
“multiple rollouts are generated in parallel and aggregated into a final response”
-
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR is an automatic evaluation framework for LLM agents that generates insights at system, trace, and node levels. It matters because it improves observability and error analysis for agent behavior without relying on static, hand-crafted taxonomies.
“automatic, dynamic, and easy-to-use evaluation framework”
-
AI Research Agents Narrow Scientific Exploration
The paper treats AI research agents as scientific search systems and generates 37,802 ideas using four agent frameworks and six LLMs from shared seed literature, comparing the result against human papers and follow-on work. This matters because it directly tests whether agentic ideation broadens scientific exploration or clusters around existing work.
“37,802 ideas generated across four agent frameworks and six LLMs from shared seed literature”
-
AI scientists produce results without reasoning scientifically
This paper evaluates LLM-based scientific agents across multiple domains to test whether they reason in scientifically valid, self-correcting ways. It matters because it shows that good outcomes alone can hide weak epistemic reasoning, implying that agent scaffolds are not enough without training the reasoning process itself.
“evidence is ignored in 68% of traces”
-
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
The paper introduces AJ-Bench, a benchmark for evaluating agent-as-a-judge systems that actively interact with environments and tools to verify agent behavior. It matters because environment-aware judging is more robust than static rule-based or LLM-only evaluation, especially for complex agent tasks.
“actively interacting with environments and tools to acquire verifiable evidence”
-
AlphaEval: Evaluating Agents in Production
The paper introduces AlphaEval, a benchmark built from real production tasks at seven companies to evaluate AI agents under realistic commercial conditions. It matters because it shifts evaluation away from synthetic, well-specified tasks toward messy, evolving, expert-judged work that better reflects how agents perform in deployment.
“production-grounded benchmark of 94 tasks sourced from seven companies”
-
An Agentic Approach to Metadata Reasoning
The paper presents a Metadata Reasoner that uses an agentic workflow to search for candidate tables and then reason over metadata to select a small set of data sources that are sufficient and minimal for an analytical task. This matters for AI agents because source discovery is often a bottleneck in multi-step data analysis, and better metadata reasoning improves robustness and data selection quality.
“identify a small set of data sources that are both sufficient and minimal”
-
Autonomous LLM Agents & CTFs: A Second Look
The paper revisits claims that LLM agents can solve Capture-the-Flag security tasks near human level by comparing several engineered agent architectures with a general-purpose agent baseline. It matters because it clarifies where current agents are genuinely strong, where they still fail, and which architectural choices improve consistency and cost.
“revisit these results, providing a second look at these claims”
-
BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
The paper introduces an open-source benchmark for measuring how well AI agents can complete realistic, end-to-end investment banking tasks, including navigating data rooms, using finance tools, and producing multi-file deliverables. It matters because it tests agent performance on economically meaningful workflows with expert-designed rubrics, exposing gaps between benchmark scores and real client-ready work.
“open-source benchmark of end-to-end analytical workflows”
-
Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents
This paper introduces MUTATE, an interactive benchmark for measuring divergent thinking in LLM agents at both path and action levels, rather than only judging final task success. It also proposes ReDNA, a method that separates unconstrained idea generation from constraint-based selection, improving creative problem solving in interactive agent settings.
“interactive benchmark designed to evaluate agentic divergent thinking at two levels”
-
CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
This paper introduces CI-Work, a benchmark for measuring whether enterprise LLM agents preserve contextual integrity while using internal information in workplace workflows. It matters because it highlights a key deployment risk for agents: improving task utility can increase privacy leakage, suggesting the need for context-centric architectures rather than just larger models.
“benchmarking Contextual Integrity in Enterprise LLM Agents"; "privacy failures are prevalent”
-
ClawBench: Can AI Agents Complete Everyday Online Tasks?
ClawBench is a benchmark for evaluating AI agents on realistic everyday online tasks across live websites, rather than offline sandboxes. It matters because it tests whether agents can handle messy, dynamic, multi-step web workflows needed for reliable general-purpose assistance.
“evaluation framework of 153 simple tasks... spanning 144 live platforms”
-
ClawGym: A Scalable Framework for Building Effective Claw Agents
ClawGym provides an end-to-end framework for developing and evaluating personal agents in Claw-style environments with local files, tools, and persistent workspace state. It matters because it combines synthetic verifiable training data, agent training, and benchmark evaluation into one scalable pipeline for more capable agent systems.
“supports the full lifecycle of Claw-style personal agent development”
-
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
ClawsBench is a benchmark for testing LLM productivity agents in realistic, stateful mock workspaces like Gmail, Slack, Calendar, Docs, and Drive. It matters because it measures not just task success but also unsafe actions in cross-service workflows, revealing capability-safety tradeoffs in agent deployment.
“evaluating and improving LLM agents in realistic productivity settings”
-
Communicate-Predict-Act: Evaluating Social Intelligence of Agents
The paper introduces a multiplayer arena of mixed cooperative and competitive social games plus a COMPACT interaction protocol to evaluate the social intelligence of LLM agents. It matters because it goes beyond single scalar scores and proposes multidimensional metrics for communicative influence, action prediction, strategic reasoning, and adaptability in social settings.
“We introduce a multiplayer arena of mixed cooperative and competitive social games to study LLM social intelligence”
-
COMPOSITE-Stem
The paper introduces COMPOSITE-STEM, an expert-written benchmark for scientific reasoning tasks across physics, biology, chemistry, and mathematics. It matters for AI agents because it tests frontier models on open-ended, scientifically meaningful outputs rather than saturated, constrained benchmarks.
“a benchmark of 70 expert-written tasks... allowing more flexible assessment of scientifically meaningful outputs”
-
Counterfactual Trace Auditing of LLM Agent Skills
The paper introduces Counterfactual Trace Auditing (CTA), a framework that compares an agent’s behavior with and without a skill on the same task, then aligns and annotates the resulting traces to measure how the skill changes the agent’s process. This matters because it moves skill evaluation beyond pass/fail outcomes and toward understanding behavioral effects and failure modes in agent skills.
“measuring how a skill changes agent behavior”
-
CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
The paper introduces CresOWLve, a benchmark for testing creative problem-solving in LLMs using puzzles grounded in real-world knowledge. It matters for AI agents because it measures whether models can combine retrieved facts with non-obvious reasoning and creative connections, not just answer factual queries.
“benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge”
-
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
The paper introduces CyBiasBench to measure and analyze selection bias in LLM agents performing offensive cybersecurity tasks. This matters for AI agents because it shows agent behavior can be systematically skewed toward certain action families, affecting controllability and robustness beyond raw success rate.
“attack-selection bias"; "comprehensive 630-session benchmark”
-
DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
The paper introduces a benchmark for evaluating autonomous data analysis agents on exploratory, real-world financial tasks where they must work with noisy, unfamiliar, cross-domain data without much guidance. It matters because it tests whether agents can truly reason and explore in realistic settings, not just succeed on cleaned or pre-scaffolded benchmarks.
“agents must independently explore unfamiliar, noisy, cross-domain data”
-
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
The paper introduces DefenderBench, an open-source toolkit and benchmark for evaluating language agents on cybersecurity tasks spanning offense, defense, and domain knowledge. It matters because it gives researchers a standardized, modular way to measure agent performance in a realistic and underexplored application area.
“open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks”
-
Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems
The paper proposes design principles for SOC-bench, a benchmark meant to evaluate blue-team security operation capabilities of multi-agent AI systems. This matters because current benchmarks emphasize red-team tasks, leaving a gap in measuring how agents can support autonomous security operations and incident response.
“no systematic benchmark for evaluating coordinated multi-task blue team AI”
-
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
The paper introduces an interactive benchmark where an LLM agent must experimentally discover the laws of a simulated world whose physics differs from ours. It matters because it tests whether agents can do genuine hypothesis-driven scientific reasoning, not just recall familiar patterns from training data.
“interactive benchmark"; "discover the laws of motion"; "design informative experiments and revise their hypotheses”
-
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
The paper benchmarks dense RAG versus GraphRAG in agentic search settings where retrieval happens over multiple rounds and decisions are made sequentially. It matters because it shows when explicit graph structure still helps and when agentic search can make plain RAG competitive, guiding retrieval design for AI agents.
“introduce RAGSearch, a unified benchmark"; "agentic search substantially improves dense RAG and narrows the performance gap to GraphRAG”
-
Dynamic Cyber Ranges
The paper argues that static cybersecurity benchmarks are becoming too easy for LLM-driven agents, so it proposes adding LLM-based defender agents to create dynamic cyber ranges. This matters for AI agents because it preserves evaluation difficulty as attacker capabilities improve and reveals emergent agent behaviors in realistic security settings.
“cyber range environments augmented with LLM-driven Defender agents”
-
Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
The paper proposes ESRRSim, a taxonomy-driven framework for automatically evaluating strategic reasoning risks in LLMs such as deception, evaluation gaming, and reward hacking. This matters for AI agents because it helps quantify when models may optimize for their own objectives or adapt to evaluation contexts rather than behave faithfully.
“taxonomydriven agentic framework for automated behavioral risk evaluation”
-
Evaluating Plan Compliance in Autonomous Programming Agents
The paper studies whether autonomous programming agents actually follow instructed task plans during software engineering workflows. This matters because plan compliance affects whether success reflects genuine strategic reasoning versus shortcutting, contamination, or overfitting.
“first extensive, systematic analysis of plan compliance in programming agents”
-
EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
The paper introduces EvoMemBench, a unified benchmark for evaluating LLM agent memory across in-episode vs. cross-episode and knowledge-oriented vs. execution-oriented settings. It matters because memory is a core agent capability, and the benchmark shows current memory systems are not yet a general solution, helping guide better memory design.
“introduce EvoMemBench, a unified benchmark”
-
ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
The paper introduces a benchmark that measures cybersecurity agent capability as a ladder of increasingly powerful exploitation skills, rather than a binary crash/success outcome. This matters because it gives a more realistic way to evaluate how far LLM agents can progress in exploit construction against hardened targets.
“a capability-graded benchmark that decomposes exploitation into 16 measurable flags”
-
Forecasting Scientific Progress with Artificial Intelligence
The paper introduces CUSP, a benchmark for evaluating whether AI can forecast scientific progress under controlled information constraints. It matters for agents because it tests not just retrieval or reasoning, but whether models can make reliable forward-looking predictions about scientific advances and their timing.
“CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark”
-
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
The paper proposes a more realistic evaluation protocol for AI pentesting agents that focuses on validated vulnerability discovery rather than narrow task completion. This matters because it better reflects how offensive security agents must operate in open-ended, ambiguous real-world targets.
“shifts assessment from task completion to validated vulnerability discovery”
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
The paper introduces Frontier-Eng, a benchmark for evaluating self-evolving LLM agents on realistic engineering tasks that require iterative propose-execute-evaluate optimization rather than simple pass/fail answers. It matters because it tests whether agents can use executable feedback and domain knowledge to make feasible designs improve under a fixed budget.
“benchmark for generative optimization"; "industrial-grade simulators and verifiers”
-
Generative World Renderer
The paper introduces a large-scale dynamic dataset from AAA games to improve inverse and forward rendering, especially when transferring to real-world scenarios. It also proposes a VLM-based evaluation protocol for inverse rendering without ground truth, which matters for measuring and improving visual agent systems that need robust scene understanding and generation.
“introduce a large-scale, dynamic dataset"; "propose a novel VLM-based assessment protocol”
-
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
This paper systematizes and benchmarks LLM-based automated penetration testing frameworks, comparing architectures, planning, memory, execution, and external knowledge use. It matters for AI agents because it provides a structured taxonomy and a large-scale empirical benchmark for assessing how well agentic systems can carry out complex cyber tasks end-to-end.
“first Systematization of Knowledge (SoK)... and comprehensive empirical evaluation”
-
Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
The paper introduces Harness-Bench, a benchmark for isolating how the agent execution harness affects performance across different model backends in realistic workflows. This matters because agent capability can depend heavily on the surrounding system layer—not just the base model—so evaluation should be reported at the model-harness configuration level.
“diagnostic benchmark for evaluating configuration-level harness effects”
-
Harnessing Pre-Resolution Signals for Future Prediction Agents
The paper studies future prediction agents that revise forecasts over time as evidence evolves, and proposes using pre-resolution signals from repeated unresolved questions to improve the agent’s ongoing predictions. This matters because it gives agents a way to learn from temporal contrasts before the final outcome is known, rather than relying only on coarse post-resolution correctness.
“revisiting the same unresolved question over time creates informative temporal contrasts”
-
Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows
This paper studies how agentic workflows break down when the model acting as a judge provides deceptive, biased, or factually wrong feedback. It introduces a taxonomy of judge behavior and a new benchmark to test robustness, which matters because many agent systems depend on unreliable feedback loops for self-improvement.
“construct a suite of judge behaviors and develop WAFER-QA”
-
HorizonBench: Long-Horizon Personalization with Evolving Preferences
The paper introduces HorizonBench, a benchmark for long-horizon personalization where user preferences can change over months of interaction. It matters for agents because it tests whether systems can maintain and update user models over long contexts, a key capability for reliable personalized assistants.
“HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling”
-
How are AI agents used? Evidence from 177,000 MCP tools
This paper measures how AI agents are actually being used by analyzing 177,436 tools in the Model Context Protocol ecosystem and mapping them to task domains and levels of direct impact. It matters because it provides a practical monitoring method for understanding agent deployment patterns and identifying consequential or risky uses beyond just model outputs.
“evaluated 177,436 agent tools"; "Software development accounts for 67% of all agent tools”
-
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
The paper studies how well reusable agent skills actually help LLM agents when they must retrieve and select skills from a large, messy real-world collection rather than use hand-crafted, task-specific skills. It matters because it shows skill gains are fragile in realistic settings and that retrieval plus query-specific skill refinement can recover performance.
“benchmarking skill usage performance remains scarce"; "retrieve skills from a large collection of 34k real-world skills”
-
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
The paper studies how retrieval strategy and the surrounding agent harness affect agentic search performance, comparing grep-based and vector retrieval under different tool-output formats and context conditions. It matters because it shows that agent results depend not just on the retrieval method, but also on how the agent is wired to call tools and ingest outputs.
“systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm”
-
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
The paper introduces SU S VI B E S, a benchmark of real-world feature-request tasks that were known to lead to vulnerable implementations, and uses it to measure how secure agent-generated code is. It matters because it shows that even strong coding agents can produce functionally correct but insecure code, raising concerns for production use in security-sensitive settings.
“benchmark consisting of 200 feature-request software engineering tasks”
-
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
The paper introduces an open-source harness for constructing validation suites that test how reliable LLM judges are across benchmark settings. This matters for AI agents because LLM-as-judge scoring is widely used in evaluation pipelines, and the work shows these judges can be brittle under simple perturbations.
“stress testing the reliability of LLM judges”
-
KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
KWBench is a benchmark for testing whether LLMs can recognize the underlying problem in a knowledge-work scenario before being told what to solve. This matters for AI agents because real-world performance often depends on correctly framing ambiguous situations, not just executing well-specified tasks.
“benchmark for unprompted problem recognition in large language models”
-
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
The paper argues that search agents often rely on their pretrained knowledge to answer queries rather than truly retrieving fresh evidence from the web. It introduces LiveBrowseComp, a benchmark with recently published, human-authored questions designed to test genuine search and discovery, which matters because static benchmarks can overestimate real search-agent capability.
“Intrinsic Knowledge Dependence (IKD); static search benchmarks can reward memory-backed verification”
-
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
The paper introduces LiveClawBench, a benchmark for evaluating LLM agents on realistic assistant tasks with compositional difficulty, rather than isolated toy settings. It matters because it pushes agent evaluation closer to deployment conditions by measuring environment complexity, cognitive demand, and runtime adaptability.
“benchmark to evaluate LLM agents on real-world assistant tasks”
-
LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
The paper proposes a readiness harness that turns LLM/RAG evaluation into a deployment gate by combining benchmarks, observability, and CI checks. This matters for agents because it helps decide whether a system is truly shippable using operational signals like groundedness, latency, cost, and policy compliance rather than a single offline score.
“turns evaluation into a deployment decision workflow”
-
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
The paper introduces LongDS-Bench, an evaluation of agentic data-analysis workflows over long horizons built from real Kaggle notebooks, emphasising failure modes rather than just ranking models. This matters because it benchmarks a practical, failure-prone agent use case and shows where current systems break down.
“68 tasks derived from real Kaggle notebooks; best model reaches only 48.45% average accuracy and degrades over longer interactions”
-
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
MiroEval is a benchmark and evaluation framework for deep research agents that goes beyond final-answer scoring to assess process quality, factual verification, and multimodal reasoning. It matters because it measures how agents search, reason, and refine over time, which better reflects real research workflows and exposes weaknesses hidden by output-only metrics.
“evaluation still lags behind real user needs"; "assesses ... process-centric evaluation”
-
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists
The paper evaluates AI systems as scientific peer reviewers by having expert scientists rate thousands of individual criticisms from human and AI reviews. It shows current AI reviewers can surface high-quality, novel critiques, but still have systematic weaknesses, suggesting they complement rather than replace human reviewers.
“large-scale expert annotation study ... rating 2,960 individual criticisms ... on correctness, significance, and sufficiency of evidence”
-
On the Reliability of Computer Use Agents
The paper studies why computer-use agents can succeed once but fail on repeated runs of the same task, focusing on stochastic execution, task ambiguity, and behavior variability. This matters because reliable agents need consistency across runs, not just occasional task success.
“repeated executions of the same task" and "sources of unreliability in computer-use agents”
-
PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
PHMForge is an evaluation environment for testing whether LLM agents can reliably use MCP-native industrial prognostics tools in safety-critical settings. It matters because it separates true agent reasoning/tool-use ability from protocol or retrieval confounds, revealing where current agents fail on orchestration and tool sequencing.
“Prior benchmarks conflate protocol fluency with reasoning... We introduce PHMForge, an evaluation environment that closes each conflation.”
-
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
The paper introduces REVIEWBENCH, a benchmark for measuring how substantive and rubric-aligned AI-generated peer reviews are, and REVIEWGROUNDER, a multi-agent framework that drafts reviews then grounds them with targeted evidence. It matters for agents because it combines explicit rubrics with tool-based grounding to produce more human-like, evidence-backed outputs.
“rubric-guided, tool-integrated multi-agent framework”
-
SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
The paper introduces SEA-Eval, a benchmark for self-evolving agents that are meant to learn and improve across tasks rather than only within a single episode. This matters because it evaluates whether agents truly accumulate experience over time, instead of just optimizing short-term task success.
“the first benchmark designed specifically for evaluating SEAs”
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
The paper introduces SalesLLM, a bilingual benchmark for measuring how well LLMs handle realistic, multi-turn sales conversations with persuasion and outcome tracking. It matters for AI agents because it evaluates not just dialogue quality, but whether an agent can actually progress toward a goal and predict buying intent.
“benchmarking LLM realistic selling skill” / “multi-turn, goal-directed persuasion… outcome-oriented sales agents”
-
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
This paper introduces SHADE-Arena, a benchmark for testing whether LLM agents can pursue hidden harmful objectives while appearing benign, and whether monitors can detect such sabotage. It matters because real-world agents may be deployed in long-horizon settings where subtle deception and monitoring resistance are critical safety risks.
“the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities”
-
SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
The paper introduces SkillCraft, a benchmark for testing whether LLM agents can learn reusable higher-level tool skills rather than just succeed on one-off tool calls. This matters because long-horizon agents need compositional skill abstraction and reuse to become efficient and scalable.
“stress-test agent ability to form and reuse higher-level tool compositions”
-
SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow is a benchmark for lifelong skill discovery and evolution in autonomous agents. It tests whether agents can not only use external skills, but also discover them from experience, repair them after failures, and maintain a coherent skill library over time.
“benchmark of 166 tasks across 20 families"; "skill discovery, patching, transfer”
-
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
The paper introduces SkillGenBench, a benchmark for evaluating how well LLM agents can generate correct, reusable, and executable skills from repositories and documents. This matters because agent capability is increasingly bottlenecked not just by using tools/skills, but by distilling them reliably into reusable artifacts.
“benchmark for evaluating skill generation pipelines under a unified and controlled protocol”
-
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
The paper introduces SkillLearnBench, a benchmark for evaluating continual learning methods that automatically generate reusable skills for LLM agents from experience. It matters because it tests whether agents can reliably learn better skills over time across real-world tasks, and shows current methods still struggle to deliver consistent gains.
“the first benchmark for evaluating continual skill learning methods”
-
SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents
The paper proposes SpecBench, a benchmark that decomposes software engineering tasks into a natural-language specification, visible validation tests, and held-out tests, then uses the gap between them to quantify reward hacking. This matters because specification reasoning is a key bottleneck for reliable autonomous coding agents, and the gap grows sharply with code size.
“30 systems-level tasks from JSON parser to OS kernel; held-out vs visible test gap grows 28 points per 10x code size”
-
SWE-chat: Coding Agent Interactions From Real Users in the Wild
This paper introduces SWE-chat, a large-scale living dataset of real coding-agent sessions from open-source developers, enabling empirical study of how coding agents are actually used in practice. It matters because it shifts evaluation from curated benchmarks to real-world interaction traces, revealing usage patterns, code survival rates, and failure modes.
“first large-scale dataset of real coding agent sessions”
-
Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
This paper benchmarks frontier LLM agents on offensive cybersecurity challenges using a multi-agent framework, expanded tooling, and controlled environment comparisons. It matters because it shows that agent performance depends heavily on environment/tooling and model choice, not just prompting.
“most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks”
-
TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
TFRBench introduces a benchmark that evaluates forecasting systems on the reasoning behind predictions, not just numerical accuracy. It uses a multi-agent iterative verification loop to synthesize grounded reasoning traces, which matters because it makes forecasting models more interpretable and can improve prediction quality when used as prompting signals.
“first benchmark designed to evaluate the reasoning capabilities of forecasting systems”
-
The Art of Building Verifiers for Computer Use Agents
The paper presents design lessons for building a reliable verifier for computer use agent trajectories, focusing on robust scoring of web-task success and failure. This matters because trustworthy verification is essential for both agent evaluation and training signals.
“without reliable verification, neither evaluation nor training signal can be trusted”
-
The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break
The paper introduces HORIZON, a cross-domain diagnostic benchmark for analyzing how and why LLM-based agents fail on long-horizon tasks with many interdependent steps. This matters because it moves agent evaluation beyond short tasks and provides a reproducible way to attribute failures and improve long-horizon reliability.
“introduce HORIZON, an initial cross-domain diagnostic benchmark”
-
To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
This paper empirically studies how much maintenance agent-generated code needs compared with human-authored code. It matters for AI agents because it shifts the focus from just generating code to understanding long-term reliability and the human effort required to sustain agent-produced software.
“open questions persist about the long-term maintainability of AI-generated code”
-
Toward User Comprehension Supports for LLM Agent Skill Specifications
The paper studies whether LLM agent skill specifications help users form correct expectations about what a skill consumes, produces, and covers. It matters because agent skills are user-facing interfaces, and clearer specs can reduce misuse and improve trust and selection of agent capabilities.
“user-facing capability disclosures” and “bounded expectations”
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper defines Knowledgeable Deep Research (KDR), extending deep research agents to combine structured and unstructured knowledge for more rigorous, multimodal report generation. It introduces a multi-agent framework plus a benchmark, which matters because it pushes agents beyond web reading toward evidence-backed analysis with tables, figures, and quantitative reasoning.
“utilize structured knowledge"; "construct KDR-Bench”
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench evaluates agent performance on realistic, long-horizon CLI tasks in native runtime environments rather than synthetic sandboxes. This matters because it tests whether agents can actually complete deployed, tool-rich workflows over many steps, exposing brittleness that short benchmarks miss.
“native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks”
-
WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
The paper introduces a benchmark for assessing LLM agents on full spreadsheet workflows in finance, rather than isolated Q&A or single-formula edits. This matters because it tests whether agents can produce professional-quality artifacts for real enterprise tasks like modeling, forecasting, and scenario analysis.
“one of the first evaluations of agents on end-to-end spreadsheet tasks”
-
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
YC-Bench is an open-source benchmark that tests whether LLM agents can sustain strategic planning, adapt to delayed feedback, and execute consistently over a long horizon. It matters because it measures real long-term agent reliability in a partially observable, compounding-consequence setting rather than short isolated tasks.
“benchmark that evaluates these capabilities" / "one-year horizon spanning hundreds of turns”
-
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
The paper argues that agentic and coding LLMs can be trained more effectively with fewer but higher-quality trajectories rather than more noisy data. It introduces STITCH, a coarse-to-fine filtering and chunking method that preserves decision-critical tokens, improving agent performance across software engineering tasks and languages.
“fewer but higher-quality training trajectories"; "STITCH ... filters low-value noise and retains decision-critical tokens”
-
YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents
The paper introduces Information Elicitation Agents, where the agent’s job is to ask questions and elicit useful information from users for institutional or task-oriented goals. It provides YIELD, a large dialogue dataset plus a formal POMDP framing and metrics, which matters for studying agents that proactively gather information rather than only respond.
“formalize information elicitation as a finite-horizon POMDP and propose novel metrics tailored to IEAs”
-
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
The paper argues that deployed agents should be evaluated over their full operational lifespan, not just at initialization, because their effective state changes through memory growth, retrieval, revisions, and maintenance. It introduces AgingBench to diagnose how and where agents degrade, which matters for making long-lived agent systems reliable in practice.
“longitudinal reliability benchmark for agent lifespan engineering”
04. Infrastructure36 papers
The harness, runtime, and plumbing that turn a model into a working agent.
-
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
The paper frames production LLM agents around a stochastic-deterministic boundary and proposes runtime architecture patterns for coordinating model outputs with deterministic software behavior. It matters because it gives a practical methodology for building more reliable, debuggable agent systems as they move into production.
“runtime architecture patterns for production LLM agents"; "stochastic-deterministic boundary (SDB)”
-
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
The paper introduces TACO, a plug-and-play, training-free framework that automatically learns and reuses compression rules for terminal-agent observations. It matters because it reduces context bloat and token cost while preserving the signals needed for long-horizon terminal workflows.
“self-evolving Terminal Agent Compression framework”
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore is an agent framework for improving performance-critical kernel code generation by reusing execution feedback across attempts. It combines failure-driven adaptation with a diversity-preserving search over candidate kernels, helping the agent stay valid while escaping local optima without extra fine-tuning or external knowledge.
“accumulated execution feedback" and "failure-driven adaptation and diversity-preserving search”
-
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
The paper argues that many agent failures come from the runtime harness/interface rather than the model itself, and proposes a lifecycle-aware harness that adapts observation, tool use, action execution, feedback interpretation, and trajectory control without changing model weights. This matters because it shows agent performance can improve by fixing the environment-facing layer, making frozen models more reliable in deterministic domains.
“runtime harness ... mediates observation, tool use, action execution, feedback interpretation, and trajectory control”
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
The paper introduces Agent-World, a self-evolving training arena that synthesizes realistic, executable environments and tasks so agents can learn from diverse real-world tool ecosystems. This matters because robust general agents need scalable environments and continual learning signals beyond static benchmarks.
“a self-evolving training arena” / “scalable environments”
-
Agentic AI Workload Characteristics
The paper studies how agentic AI changes LLM serving compared with single-turn prompting, using traces from ReAct-style agents across multiple benchmarks and model settings. It matters because it identifies system bottlenecks like repeated model re-entry, persistent KV-cache dependence, and evolving tool-use patterns that agent infrastructure must handle efficiently.
“stateful, multi-turn executions that repeatedly invoke the model, call tools”
-
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
The paper proposes an automatic, closed-loop method for evolving coding-agent harnesses using observability: making components explicit, compressing large trajectories into usable evidence, and verifying each edit with predicted outcomes. This matters because agent performance often depends heavily on the harness, and the work shows harnesses can be improved autonomously rather than hand-tuned.
“Harnesses are now central to coding-agent performance"; "closed loop"; "observability pillars”
-
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
The paper argues that reliable software agents depend not just on the model, but on the runtime harness that mediates observation, actions, feedback, memory, verification, and permissions. This matters because it reframes agent capability as a system property and proposes concrete runtime components for making agent behavior more auditable and dependable.
“model-harness-environment system"; "runtime substrate -- the harness”
-
AIRA_2: Overcoming Bottlenecks in AI Research Agents
The paper argues that AI research agents are limited by throughput, evaluation noise, and weak single-turn operators, and proposes AIRA_2 to fix these with asynchronous multi-GPU execution, Hidden Consistent Evaluation, and ReAct agents. This matters because it shows how systems design and evaluation can substantially improve agentic research performance and scaling.
“three structural performance bottlenecks in AI research agents”
-
An Alternate Agentic AI Architecture (It's About the Data)
The paper argues that enterprise agentic AI should be built around explicit data management and query planning rather than opaque LLM-driven tool orchestration. It proposes RUBICON and a small query algebra (AQL) to make intermediate steps visible, auditable, and more reliable for governed enterprise settings.
“enterprises do not suffer from a reasoning deficit, but from a data integration problem”
-
An Alternative Trajectory for Generative AI
The paper argues for shifting generative AI away from ever-larger monolithic models toward domain-specific superintelligence built on explicit symbolic abstractions and orchestration. This matters for agents because it suggests a scalable, more energy-efficient way to get reliable reasoning by routing tasks to specialized back-end experts rather than relying on one giant model.
“domain-specific superintelligence (DSS)" and "societies of DSS models”
-
Architectural Design Decisions in AI Agent Harnesses
This paper studies the non-LLM engineering layer around AI agents—tool mediation, context handling, delegation, safety control, and orchestration. It matters because it identifies recurring architectural patterns and tradeoffs that can guide how agent frameworks are designed and selected in practice.
“protocol-guided, source-grounded empirical study of 70 publicly available agent-system projects”
-
Autogenesis: A Self-Evolving Agent Protocol
The paper proposes Autogenesis Protocol (AGP), a protocol for agent systems that explicitly models resources like prompts, tools, memory, and environments with versioned lifecycles, plus a closed-loop mechanism for proposing, assessing, and committing improvements. This matters for AI agents because it aims to make agent systems more modular, auditable, and safely evolvable over time instead of relying on brittle glue code.
“self evolution protocol"; "resource registered resources with explicit state, lifecycle, and versioned interfaces”
-
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
AutoSOTA is an end-to-end automated research system that reproduces recent SOTA papers and then iteratively improves them to discover new SOTA models. It matters because it automates much of the empirical research loop, reducing repetitive experimentation and helping agents support scientific model development.
“end-to-end automated research system"; "105 new SOTA models”
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
The paper introduces an autonomous pipeline for generating verified environments from natural language descriptions, aimed at training and evaluating claw-like agents at scale. This matters because it replaces manual environment creation with on-demand, diverse, and validated benchmarks and training tasks.
“automated pipeline capable of generating diverse, verified environments on demand”
-
Cochise: A Reference Harness for Autonomous Penetration Testing
The paper introduces a minimal, reusable Python harness for autonomous penetration-testing experiments, separating planning and execution while keeping long-term state outside the LLM context. It matters because it provides a controlled baseline and analysis tooling for comparing agent architectures and behaviors in security tasks.
“reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces”
-
Effective Harness Engineering for Algorithm Discovery with Coding Agents
The paper studies how the execution harness around coding agents affects automated algorithm discovery, not just the underlying model. It shows that better harness design can improve search efficiency, safety, and robustness against evaluation hacks, which matters for scaling agentic algorithm generation.
“the design of the execution infrastructure, i.e., the harness”
-
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
The paper argues that progress in agentic AI is increasingly limited by the surrounding system architecture, not just the base model. It frames the agent ‘harness’—memory, retrieval, tool routing, orchestration, verification, and governance—as a first-class design and evaluation target for building more reliable long-horizon agents.
“next major bottleneck in agentic AI as system scaling, not only model scaling”
-
From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation
The paper proposes a three-layer agentic architecture that translates natural-language research questions into reproducible scientific workflows. It matters for AI agents because it constrains LLM nondeterminism to intent extraction while using deterministic generators and expert-authored knowledge to produce reliable, repeatable automation.
“three layers… semantic layer… deterministic layer… knowledge layer”
-
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
The paper proposes SSL, a structured representation for agent skills that separates scheduling signals, execution structure, and logic-level evidence from text-heavy skill documents. This matters for AI agents because it makes skills easier to search, inspect, reuse, and assess for risk, instead of leaving critical behavior buried in natural language.
“first structured representation for agent skill artifacts ... Scheduling-Structural-Logical (SSL) representation”
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
The paper presents FrontierSmith, an automated system for generating open-ended coding problems by evolving closed-ended competitive programming tasks into more realistic, long-horizon challenges. This matters for AI agents because it creates scalable training data for stronger coding agents on tasks that better reflect real-world, ambiguous software work.
“synthesize open-ended coding problems at scale”
-
HARBOR: Automated Harness Optimization
The paper argues that the harness around a long-horizon language-model agent is a first-class optimization target, not just the base model. It introduces automated search over harness configurations—such as memory, caching, trajectory reuse, and tool prediction—to improve agent performance more systematically than manual tuning.
“harness design is a first-class machine-learning problem”
-
HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools
HarnessAPI proposes a skill-first Python framework where one typed skill definition can automatically generate both streaming HTTP endpoints and MCP tools. This matters for AI agents because it removes duplicated tool wiring and schema drift, making agent-facing capabilities easier to build, maintain, and expose across runtimes.
“eliminates this duplication" and "automatically derives ... a zero-configuration MCP tool”
-
Harnesses for Inference-Time Alignment over Execution Trajectories
The paper studies how to design agent harnesses at inference time by separating task decomposition from guided execution, and analyzes when more structure helps or hurts performance. This matters for AI agents because it clarifies the limits of workflow scaffolding and shows that partial harnesses can outperform fully rigid ones.
“inference-time technique for large language model (LLM) agents"; "task decomposition" and "guided execution”
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
The paper proposes AutoTTS, an environment-driven framework that lets agents discover test-time scaling strategies automatically instead of relying on manually designed heuristics. This matters for AI agents because it turns inference-time compute allocation into a searchable control problem, improving the accuracy-cost tradeoff with cheap feedback and low search cost.
“controllers decide when to branch, continue, probe, prune, or stop”
-
Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report
The paper proposes a production harness for AI-native software development that converts requirements into explicit contracts, routes tasks through specialized AI agents, and uses adversarial verification plus failure classification to continuously improve the system. This matters for AI agents because it shifts evaluation from isolated outputs to reliable long-horizon software operations.
“explicit contracts, role-specialized AI agents, independent and adversarial verification”
-
OntoMetric: An Ontology-Driven LLM-Assisted Framework for Automated ESG Metric Knowledge Graph Generation
The paper proposes a framework that combines ontology constraints with LLM assistance to automatically build ESG metric knowledge graphs from regulatory sources. This matters for agents because it turns messy, implicit domain rules into structured, machine-actionable knowledge with better reliability than unconstrained extraction.
“Ontology-Driven LLM-Assisted Framework” and “automated ESG metric knowledge graph generation”
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
The paper proposes OpenWorldLib, a standardized inference framework and unified codebase for advanced world models, alongside a clearer definition of what counts as a world model. This matters for AI agents because it aims to make world-model capabilities reusable, composable, and easier to benchmark across tasks.
“comprehensive and standardized inference framework for Advanced World Models”
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
The paper presents a closed-loop system that automates the end-to-end lifecycle of adapting small language models, from cold-start data acquisition and training to production monitoring, failure diagnosis, and retraining with regression constraints. This matters for agents because it turns model improvement into an iterative, production-safe agentic workflow rather than a manual engineering loop.
“closed-loop system that automates this lifecycle”
-
Polar: Agentic RL on Any Harness at Scale
Polar is a rollout framework for scalable asynchronous reinforcement learning over arbitrary agent harnesses. It treats the harness as a black box, reconstructs token-faithful trajectories, and improves compute efficiency for long-running agent workloads, which matters because it lets agents be trained on real multi-turn tool-using systems without rewriting the harness.
“scalable asynchronous RL over arbitrary agent harnesses”
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph as the semantic backbone for generating and updating code documentation. This matters for AI agents because it helps them maintain accurate, scalable documentation over changing codebases instead of treating files as isolated fragments.
“uses a repository knowledge graph (RepoKG) as the semantic foundation for the entire documentation lifecycle”
-
SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering
The paper argues that as model capabilities converge, the key differentiator for personal AI agents is the surrounding harness: orchestration, safety, context management, and extensible infrastructure. It presents SemaClaw, an open-source multi-agent framework that aims to make agents controllable, auditable, and production-reliable for everyday personal use.
“harness engineering"; "controllable, auditable, and production-reliable systems”
-
SkillGrad: Optimizing Agent Skills Like Gradient Descent
SkillGrad proposes optimizing reusable agent skill files with a gradient-descent-like loop: executions produce losses, diagnoses act like text gradients, and a momentum memory stabilizes updates. This matters because it gives agents a more systematic way to improve procedural skills beyond ad hoc reflection.
“gradient-descent-inspired framework for optimizing agent skills”
-
SkillNet: Create, Evaluate, and Connect AI Skills
SkillNet is an open infrastructure for creating, evaluating, and organizing reusable AI skills so agents can accumulate and transfer experience instead of repeatedly relearning the same behaviors. This matters because it gives agents a durable skill memory and a scalable way to improve performance across tasks and backbones.
“lack of systematic accumulation and transfer of skills" / "open infrastructure designed to create, evaluate, and organize AI skills at scale”
-
The Last Harness You'll Ever Build
The paper proposes a two-level framework that automatically evolves an agent’s task-specific harness, including prompts, tools, orchestration, and evaluation criteria. This matters because it reduces or removes the need for human experts to hand-engineer agent setups for each new domain, and even automates the design of the automation process itself.
“shifts manual harness engineering into automated harness engineering”
-
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
The paper proposes ActiveGraph, an agent runtime where an append-only event log is the source of truth and the working graph is a deterministic projection that drives reactive behaviors. This matters because it enables deterministic replay, cheap branching/forking, and full lineage tracing for agent runs, improving observability and reproducibility.
“append-only event log is the source of truth; deterministic replay; cheap forking”
05. Memory/RAG55 papers
Memory, retrieval, and the long art of remembering useful things.
-
AEL: Agent Evolving Learning for Open-Ended Environments
This paper proposes a two-timescale learning framework for LLM agents in open-ended environments: a bandit chooses which memory retrieval policy to use, while LLM reflection diagnoses failures and injects causal insights into the prompt. It matters because it argues that agent self-improvement depends less on adding more components and more on learning how to use past experience effectively.
“the obstacle is not what to remember but how to use what has been remembered”
-
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
The paper proposes a lightweight agentic retrieval harness that lets an LLM iteratively search, open, navigate, and summarize enterprise documents instead of relying on a single-shot RAG pipeline. This matters because it shifts grounding from a fixed retrieval stack to autonomous tool use, improving factuality and answer correctness on enterprise knowledge tasks.
“equipping a reasoning LLM with search, find, open, and summarize tools”
-
Artifacts as Memory Beyond the Agent Boundary
The paper formalizes externalized memory in reinforcement learning by treating environmental artifacts as memory beyond the agent’s internal state. This matters for AI agents because it provides a principled way to model and study when agents can offload memory into the world, potentially improving performance and interpretability in partially observable settings.
“formalize such cases within Reinforcement Learning"; "artifacts to store information about an agent's previous interactions”
-
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
Auto-Dreamer learns to consolidate a language agent’s accumulated experiences offline, turning many session-level memories into a smaller, more reusable memory bank. This matters because it helps agents retain useful abstractions and procedures across tasks while using much less memory.
“learned offline consolidator for language-agent memory”
-
AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution
AutoSkill turns repeated user interaction patterns into reusable skills that are automatically derived, maintained, and injected into future requests without retraining. This matters for agents because it enables lifelong personalization and transfer of useful behaviors across sessions, users, and tasks.
“automatically derive, maintain, and reuse skills from dialogue and interaction traces”
-
Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
The paper compares two ways to build persistent conversational agents: feeding the full dialogue to a long-context LLM versus using a fact-based memory system that extracts and retrieves structured facts. It matters because it quantifies the accuracy-cost trade-off and shows when dedicated memory becomes cheaper than long-context inference in production.
“fact-based memory system built on the Mem0 framework against long-context LLM inference”
-
CODESKILL: Learning Self-Evolving Skills for Coding Agents
The paper proposes CODESKILL, an LLM-based framework that turns coding-agent trajectories into reusable procedural skills and learns how to maintain a compact skill bank over time. This matters for agents because it enables self-improvement from experience without relying on brittle fixed prompts or hand-designed update rules.
“reformulates skill extraction and skill-bank maintenance as a learnable management policy”
-
Contextual Agentic Memory is a Memo, Not True Memory
The paper argues that common agent memory mechanisms like vector stores, retrieval-augmented generation, scratchpads, and context-window management are really lookup systems rather than true memory. This matters because it frames a fundamental limitation for agents: they may store more context, but still fail to learn abstractly, generalize to novel tasks, or resist persistent memory poisoning.
“current agentic memory systems ... do not implement memory: they implement lookup”
-
Development of Ontological Knowledge Bases by Leveraging Large Language Models
The paper proposes an iterative LLM-assisted methodology for building ontological knowledge bases, automating knowledge acquisition, ontology artifact generation, and refinement. This matters for AI agents because better structured knowledge bases can improve scalability, consistency, and adaptability in agent memory and knowledge management.
“structured, iterative methodology leveraging LLMs to optimize knowledge acquisition, automate ontology artifact generation”
-
Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval
The paper compares agentic data retrieval with and without semantic metadata, asking whether LLM agents can reliably find actionable data directly from the web. It finds structured, metadata-rich ecosystems still matter for accurate, execution-oriented workflows, which is important for building dependable data-retrieval agents.
“Semantic Agent excels at retrieving actionable data”
-
DTCRS: Dynamic Tree Construction for Recursive Summarization
The paper proposes dynamically deciding when to build a recursive summary tree for retrieval-augmented QA, using document structure and question semantics to avoid unnecessary summaries. This matters for AI agents because it reduces construction time, cuts redundancy, and improves evidence selection for multi-step reasoning over long documents.
“reduces redundant summaries while improving the relevance between summaries and the question”
-
Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery
The paper systematically compares eight memory condensation strategies for long-running coding agents, focusing on how to manage limited context windows without hurting task performance. This matters because effective memory compression can reduce cost and enable agents to sustain complex scientific workflows over long horizons.
“no systematic comparison exists"; "evaluate eight memory condensation strategies”
-
Evaluating Memory Structure in LLM Agents
The paper introduces StructMemEval, a benchmark testing whether LLM agents can build useful long-term memory structures such as trees, ledgers, to-do lists, and indexes. It argues that many existing memory benchmarks can be solved by simple retrieval, while structured-memory tasks expose failures in organisation and hallucination control. This matters because it targets a core weakness of long-running agents: organising knowledge, not just retrieving facts.
“simple retrieval-augmented LLMs struggle with state tracking, hierarchical organisation, and accumulated counting; memory agents solve them if prompted how to organise”
-
EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents
The paper proposes a self-evolving memory architecture for LLM agents where both stored knowledge and retrieval/configuration mechanisms are optimized in a closed loop. This matters because it moves agent memory from a fixed subsystem to an adaptive one that can improve across sessions and benchmarks without manual tuning.
“self-evolving memory architecture"; "co-evolution at two levels: the stored knowledge and the retrieval mechanism”
-
EXG: Self-Evolving Agents with Experience Graphs
The paper proposes EXG, an experience graph that structures successes and failures so LLM agents can reuse prior experience across tasks and over time. This matters because it turns brittle, ad hoc reflection into a more systematic memory mechanism that improves performance and efficiency as agents keep deploying.
“explicitly organizes accumulated successes and failures into a structured, relational representation”
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The paper proposes a unifying framework that treats agent memory, skills, and rules as different levels of experience compression, helping explain how agents can store and reuse knowledge more efficiently over long horizons. This matters because it highlights a missing adaptive mechanism for scaling agent learning across sessions while reducing context, latency, and compute overhead.
“positions memory, skills, and rules as points along a single axis of increasing compression”
-
Experiential Reflective Learning for Self-Improving LLM Agents
The paper proposes ERL, a self-improvement framework where an LLM agent reflects on past task trajectories to distill reusable heuristics, then retrieves and injects those heuristics for future tasks. This matters because it lets agents adapt across tasks without starting from scratch, improving reliability in specialized environments.
“reflects on task trajectories and outcomes to generate heuristics”
-
FinDKG: Dynamic Knowledge Graphs with Large Language Models for Detecting Global Trends in Financial Markets
The paper uses large language models to generate a dynamic knowledge graph from financial news, then applies a graph neural network to analyze evolving relationships and detect thematic market trends. This matters for AI agents because it shows how structured, time-aware knowledge can be built from unstructured text to support better reasoning and decision-making in changing environments.
“LLMs as dynamic knowledge graph generators" / "Dynamic knowledge graphs (DKGs)”
-
FinKario: Event-Enhanced Automated Construction of Financial Knowledge Graph
The paper builds a financial knowledge graph that is automatically updated with real-time company fundamentals and market events, then uses a two-stage graph-based retrieval method to feed LLMs timely, structured financial context. This matters for agents because it improves access to evolving domain knowledge and supports more accurate decision-making in fast-changing markets.
“Event-Enhanced Automated Construction of Financial Knowledge Graph"; "Two-Stage, Graph-Based retrieval strategy (FinKario-RAG)”
-
FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs
The paper presents an agentic framework for constructing large-scale financial knowledge graphs from SEC 10-K filings using intelligent parsing, table-aware chunking, and schema-guided iterative extraction with reflection. This matters for AI agents because it combines extraction, self-correction, and evaluation to build more reliable structured knowledge from complex real-world documents.
“reflection-driven feedback loop; LLM-as-a-Judge assessments”
-
GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
The paper proposes a hierarchical graph memory for LLM agents that separates short-term event buffering from long-term semantic consolidation. This matters because it reduces memory contamination while improving long-horizon dialogue consistency and retrieval precision.
“hierarchical Graph-based Agentic Memory framework"; "decouple memory encoding from consolidation”
-
Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework
The paper argues that long-term memory in LLM agents needs governance, not just retrieval efficiency, because dynamic memory can suffer from drift, corruption, and privacy leakage. It proposes the SSGM framework to verify consistency, model temporal decay, and control access before memory is consolidated, making persistent agents safer and more reliable.
“long-term memory has emerged as a foundational component"; "SSGM decouples memory evolution from execution”
-
Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent
The paper presents HLTM, a hierarchical semantic memory system for LLM agents that organizes noisy longitudinal user signals into a schema-aligned memory tree. This matters because it enables scalable, low-latency, privacy-aware, and observable long-term memory for production agents, improving personalization in real workflows.
“organizes textual data into a schema-aligned memory tree”
-
Learning to Retrieve from Agent Trajectories
The paper argues that retrieval models for agentic search should be trained on agent interaction traces rather than human click logs. It proposes mining supervision from multi-step trajectories to improve evidence recall, task success, and efficiency in LLM-powered search agents.
“learn to retrieve from agent trajectories"; "retrievers trained with LRAT consistently improve evidence recall”
-
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
The paper introduces Context-ReAct, a framework for adaptive context management in long-horizon search agents, letting the agent dynamically skip, compress, rollback, snippet, or delete parts of its trajectory. This matters because agents that search and reason over long tasks need to control working memory to reduce cost, avoid overload, and improve reliability.
“elastic context orchestration" and "adaptive context management”
-
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym is a benchmark suite for evaluating long-horizon memory in LLM agents across realistic agentic settings like tool-use dialogue, deep research, coding, and web/computer use. It matters because it isolates memory from reasoning and tool-use ability, making it easier to compare and improve agent memory systems in practical environments.
“benchmark for agentic memory"; "memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability”
-
MeMo: Memory as a Model
MeMo is a modular way to add new, up-to-date, or domain-specific knowledge to an LLM by training a separate memory model instead of updating the LLM itself. This matters for agents because it improves knowledge freshness, reduces catastrophic forgetting, and works without access to model weights or logits.
“encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged”
-
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
This paper unifies existing memory methods for LLM-based agents into a modular framework, then benchmarks representative approaches on standard tasks. It matters because memory is central to long-horizon agents: it supports knowledge accumulation, iterative reasoning, and self-improvement.
“Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks”
-
Memory Intelligence Agent
The paper proposes a memory-enhanced deep research agent framework that combines a non-parametric memory manager with parametric planner/executor agents. It matters because it targets efficient reasoning and autonomous self-improvement by evolving memory during test-time rather than relying only on static retrieval of past trajectories.
“Memory systems enable DRAs to leverage historical experiences”
-
Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
The paper studies how coding agents can reuse memories across heterogeneous task domains instead of keeping memory siloed per benchmark. It shows that abstract, high-level memories transfer better than concrete traces, which matters because it offers design principles for more reusable and scalable agent memory systems.
“cross-domain memory improves average performance by 3.7%”
-
MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair
The paper introduces MemRepair, a memory-augmented agent framework for automated vulnerability repair at repository scale. It adds hierarchical memories and a feedback loop so the agent can reuse prior fixes, security patterns, and failed-to-success trajectories, improving multi-file repair reliability.
“memory-augmented agentic framework" and "three complementary memory layers”
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
The paper introduces a memory framework for long-horizon LLM agents that stores past interactions as images rather than raw text, so agents can preserve much longer histories under tight context limits. Its locate-and-transcribe retrieval aims to recover exact evidence with less hallucination, which matters for agents that must reliably reuse experience over time.
“leverages the visual modality as a high-density representation of agent experience”
-
Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
The paper presents an autonomous research pipeline that discovers and improves a unified multimodal memory framework for lifelong AI agents. It matters because it shows agents can use self-directed experimentation to find better memory architectures, retrieval strategies, and pipeline fixes than manual tuning or traditional AutoML.
“AI agents increasingly operate over extended time horizons... retain, organize, and recall multimodal experiences”
-
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
PEEK adds a compact, persistent “context map” to an LLM agent’s prompt so it can retain reusable orientation knowledge about recurring external contexts like documents or codebases. This matters because it improves long-context task performance and efficiency without repeatedly relearning the same context from scratch.
“caches and maintains this orientation knowledge as a context map”
-
PolicyBank: Evolving Policy Understanding for LLM Agents
The paper proposes PolicyBank, a memory mechanism that lets LLM agents refine their understanding of organizational policies through interaction and corrective feedback before deployment. This matters because it helps agents move beyond merely following an incorrect literal policy interpretation and instead learn the intended policy behavior under ambiguity.
“memory mechanism that maintains structured, tool-level policy insights and iteratively refines them”
-
Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery
The paper proposes Prism, a memory substrate for multi-agent systems that combines layered persistence, semantic memory, relational graphs, and evolutionary search into one framework. It matters because better long-term memory and retrieval can improve coordination, learning, and open-ended discovery in agentic systems.
“an evolutionary memory substrate for multi-agent AI systems engaged in open-ended discovery”
-
Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory
The paper argues that lifelong LLM agents need richer memory than isolated atomic facts, because fact-only compression loses details and weakens deeper reasoning. It proposes TriMem, which keeps raw dialogue, atomic facts, and synthesized profiles together, improving faithful storage, retrieval, and reasoning over long interactions.
“Beyond Atomic Facts in Lifelong LLM Agent Memory” / “maintains three coexisting representation granularities”
-
Rethinking Memory as Continuously Evolving Connectivity
The paper argues that agent memory should not be a static store with fixed retrieval, but a dynamic, graph-based system that evolves as tasks and feedback change. This matters for agents because it aims to make memory more adaptable, robust, and useful across long-horizon, heterogeneous workflows.
“models memory as a heterogeneous graph" and "progressively refines its topology”
-
Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki
The paper argues that retrieval for LLM agents should act like reasoning, not just similarity-based lookup. It introduces LLM-Wiki, a structured, self-evolving retrieval system with search, reading, and link-following tools plus persistent self-correction, which matters because it better supports iterative agent workflows and multi-hop evidence gathering.
“retrieval-as-reasoning paradigm"; "search, read, and link-following operations”
-
Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks
The paper studies how LLM assistants should decide what to remember from past interactions when tasks are diverse, not uniform. It introduces a benchmark and shows that a single fixed extraction prompt does not work well across heterogeneous task types, which matters for building persistent, personalized agents.
“heterogeneous memory extraction" task and introduce BEHEMOTH”
-
Skill Retrieval Augmentation for Agentic AI
The paper proposes Skill Retrieval Augmentation (SRA), where agentic LLMs dynamically retrieve and load relevant external skills from a large corpus instead of stuffing all skills into context. This matters because it scales agent capability more effectively and exposes a new bottleneck in deciding which skill to load and when external skills are actually needed.
“agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand”
-
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
The paper proposes SkillsVote, a lifecycle-governance framework for agent skills that manages how skills are collected, recommended, executed, and evolved over time. This matters for AI agents because it turns noisy agent traces into governed reusable experience, improving performance without updating the underlying model.
“lifecycle-governance framework for Agent Skills from collection and recommendation to evolution”
-
SkillX: Automatically Constructing Skill Knowledge Bases for Agents
SkillX builds a reusable, plug-and-play skill knowledge base from agent experience so agents do not have to relearn the same behaviors in isolation. This matters because structured, hierarchical skill memory can improve transfer, reduce redundant exploration, and boost generalization across agents and environments.
“plug-and-play skill knowledge base"; "reused across agents and environments”
-
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
The paper introduces STALE, a benchmark for testing whether LLM agents can recognize when stored memories have become outdated due to later conflicting evidence. It matters because agent memory systems need not only retrieve facts, but also revise beliefs and act on updated state correctly over long contexts.
“a later observation invalidates an earlier memory without explicit negation”
-
Stateless Decision Memory for Enterprise AI Agents
The paper argues that enterprise AI agents need stateless, replayable memory for regulated, auditable deployment, and proposes Deterministic Projection Memory (DPM) as an append-only event log plus a task-conditioned projection at decision time. This matters because it preserves decision quality while improving determinism, auditability, isolation, and scalability compared with stateful memory architectures.
“append-only event log plus one task-conditioned projection at decision time”
-
StockMem: An Event-Reflection Memory Framework for Stock Forecasting
The paper proposes a dual-layer memory framework that converts financial news into structured events and tracks their evolution over time, then adds a reflection memory of causal experiences. This matters for AI agents because it improves long-horizon, explainable reasoning by retrieving analogous historical scenarios and linking them to current evidence.
“an event-reflection dual-layer memory framework”
-
StructMem: Structured Memory for Long-Horizon Behavior in LLMs
The paper proposes StructMem, a structured memory framework for long-term conversational agents that preserves event relationships rather than storing isolated facts. This matters for AI agents because it improves temporal reasoning and multi-hop QA while reducing the cost and fragility of graph-based memory construction.
“structured memory framework that preserves event-level bindings and induces cross-event connections”
-
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
The paper proposes SIRA, a retrieval-augmented agent that compresses multi-round exploratory search into a single corpus-discriminative retrieval action. It matters for AI agents because it improves retrieval quality and efficiency while staying interpretable and training-free.
“compress multi-round exploratory search into a single corpus-discriminative retrieval action”
-
To Know is to Construct: Schema-Constrained Generation for Agent Memory
The paper proposes SCG-MEM, a memory architecture that treats recall as schema-constrained generation rather than dense retrieval. This matters for AI agents because it reduces context-mismatched retrieval and prevents hallucinated memory keys, enabling more reliable long-term memory and multi-hop recall.
“Schema-Constrained Generation for Agent Memory"; "provide a formal guarantee against structural hallucinations”
-
Towards Self-Evolving Agentic Literature Retrieval
The paper introduces PaSaMaster, a self-evolving agentic literature retrieval system that iteratively analyzes intent, retrieves evidence, and refines searches to produce relevance-ranked papers. This matters for AI agents because it turns literature search into an adaptive, evidence-grounded process rather than a single-shot query match.
“self-evolving agentic literature retrieval system”
-
Trajectory-Informed Memory Generation for Self-Improving Agent Systems
The paper proposes a memory system for LLM agents that learns from execution trajectories, extracts actionable lessons from past runs, and retrieves them to improve future performance. This matters because it helps agents avoid repeating mistakes and reuse successful strategies in contextually relevant ways.
“extracting actionable learnings from agent execution trajectories" and "improve future performance through contextual memory retrieval”
-
Useful Memories Become Faulty When Continuously Updated by LLMs
The paper studies agentic memory systems that continuously rewrite past experiences into consolidated textual memories, and shows this can degrade performance instead of improving it. It matters because it argues robust agent memory should preserve raw episodes and gate consolidation carefully rather than automatically overwriting evidence after every interaction.
“memory utility first rises, then degrades”
-
When Continual Learning Moves to Memory: A Study of Experience Reuse in LLM Agents
The paper studies how LLM agents can reuse past experiences through external memory instead of updating model weights, framing continual learning as a memory retrieval problem. This matters because it shows that memory can improve adaptation, but also introduces forgetting and interference when old and new experiences compete in limited context.
“the challenge does not disappear but resurfaces at the memory level”
-
When to Forget: A Memory Governance Primitive
The paper introduces Memory Worth (MW), a lightweight signal for deciding which agent memories to trust, suppress, or deprecate as tasks change. This matters because agent memory systems need principled governance, not just static importance scores or heuristic judgments, to avoid retrieving stale or harmful memories.
“deciding which memories to trust, suppress, or deprecate”
-
World Models That Know When They Don't Know - Controllable Video Generation with Calibrated Uncertainty
The paper proposes a calibrated uncertainty method for controllable video/world models so they can estimate when their generated future frames are likely wrong. This matters for AI agents in robotics because it helps detect hallucinations, localize untrustworthy regions, and support safer decision-making under distribution shift.
“training video models for correctness and calibration" / "know when they don't know”
06. Multi-Agent Systems34 papers
When agents talk to other agents — orchestration, cooperation, drift.
-
A Multi-Agent Orchestration Framework for Venture Capital Due Diligence
The paper presents an automated multi-agent system for venture capital due diligence and market analysis, using event-driven orchestration plus LLMs and web retrieval to turn unstructured sources into structured investment intelligence. It matters because it shows how agents can coordinate reliable data collection and extraction in a high-stakes business workflow, including explicit fallback behavior to reduce hallucinations.
“fully automated multi-agent framework for corporate due diligence and market analysis in venture capital”
-
Advancing Automated Algorithm Design via Evolutionary Stagewise Design with LLMs
The paper proposes EvoStage, an evolutionary, stagewise framework that decomposes algorithm design into sequential steps with intermediate feedback, aiming to make LLM-based algorithm design more grounded and less hallucination-prone. It matters for AI agents because it adds structured iteration, feedback, and multi-agent/global-local coordination to improve automated design in complex real-world tasks.
“decomposes the algorithm design process into sequential, manageable stages"; "introduce a multi-agent system”
-
Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution
The paper proposes optimizing multi-agent LLM systems using removal-based attribution: measure each agent's contribution by observing how system performance changes when that agent is removed. This matters because it helps identify which agents actually add value, enabling simpler and stronger multi-agent architectures.
“removal protocols induce distinct games where agent ablation isolates structural bottlenecks; up to 17% performance gain at 35% lower cost”
-
AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs
AlphaLab is an autonomous research harness that uses frontier LLM agents to run the full experimental cycle across quantitative domains with no human intervention. It matters because it shows how multi-agent orchestration, self-generated adapters, and persistent playbooks can generalize research workflows and improve performance on hard optimization tasks.
“autonomous research harness ... automate the full experimental cycle ... Strategist/Worker loop”
-
Claw AI Lab: An Autonomous Multi-Agent Research Team
The paper presents a lab-native platform that turns autonomous research into an interactive multi-agent workflow, where users can create a research team, monitor progress, inspect artifacts, and roll back/resume experiments. This matters for AI agents because it adds steerability, collaboration, and reliability to automated research rather than treating it as a hidden prompt-to-paper pipeline.
“instantiate a full research team from one prompt”
-
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
CORAL introduces an autonomous multi-agent evolution framework for open-ended discovery, replacing rigid hand-coded search heuristics with long-running agents that explore, reflect, and collaborate using persistent memory and asynchronous execution. It matters because it shows that greater agent autonomy and coordination can improve search efficiency and results on difficult scientific and optimization tasks.
“first framework for autonomous multi-agent evolution on open-ended problems”
-
Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution
The paper designs and evaluates multi-agent AI oracle systems for resolving prediction markets, a setting that requires evidence aggregation, judgment under ambiguity, and robustness to adversarial incentives. This matters because it applies multi-agent AI to a concrete governance and decision-support workflow, with routing criteria for hybrid AI-human pipelines.
“confidence-weighted voting reaches 83.43% accuracy on 1,189 KalshiBench questions; deliberative consensus underperforms due to error propagation”
-
Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers
The paper proposes SAGE, a multi-agent LLM framework that splits time-series anomaly detection into specialized analyzers for different anomaly families, then combines their evidence into structured, confidence-scored diagnostics. This matters for AI agents because it improves controllability, interpretability, and reliability over single-model anomaly detection.
“a multi-agent framework for structured anomaly diagnosis”
-
Don't Make the LLM Read the Graph: Make the Graph Think
The paper studies whether explicit belief graphs help LLMs reason in cooperative multi-agent settings, using Hanabi as a testbed. It shows that graphs are most useful when they actively constrain action selection rather than merely being added as prompt context, which matters for building better multi-agent agents.
“explicit belief graphs improve LLM performance in cooperative multi-agent reasoning”
-
DynaMate2: Democratization of Agentic AI for Expert-Designed Custom Workflows
DynaMate2 is a hierarchical agentic framework that lets researchers turn existing expert-written Python functions into AI-callable tools inside a supervised multi-agent pipeline. This matters because it lowers the barrier to using agents in scientific workflows while keeping domain logic in validated code rather than LLM-generated code.
“lower the barrier... into AI-callable tools within a supervised multi-agent pipeline”
-
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
The paper studies how groups of LLM-based agents can develop harmful collective behaviors, such as collusion-like coordination and conformity, even when no individual agent is explicitly instructed to do so. This matters because it shows that multi-agent risk can emerge from interaction dynamics, so safeguards must address the group level, not just single agents.
“collusion-like coordination and conformity emerge with non-trivial frequency”
-
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
The paper proposes EvoScientist, a multi-agent system that evolves itself across stages of scientific discovery, from literature review and idea generation to code execution and manuscript writing. It matters for AI agents because it aims at end-to-end autonomous scientific workflows, using persistent memory and iterative evolution to improve research quality and execution success.
“multi-agent evolving AI scientists for end-to-end scientific discovery”
-
EvoSkill: Automated Skill Discovery for Multi-Agent Systems
EvoSkill is a self-evolving framework that automatically discovers and refines reusable agent skills by analyzing failures, proposing new skills or edits, and storing them as structured skill folders. This matters because it shifts agent improvement from manual prompt/code crafting to transferable, validation-driven skill optimization across tasks.
“automatically discovers and refines agent skills through iterative failure analysis”
-
EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution
EvoSpark proposes a multi-agent framework for generating coherent long-horizon narratives by managing evolving character relations, memory, and scene/plot alignment. It matters for AI agents because it tackles a core challenge in agent societies: maintaining consistency and persistence over extended interactive simulations.
“LLM-based multi-agent systems"; "sustain logically coherent long-horizon narratives”
-
Federation over Text: Insight Sharing for Multi-Agent Reasoning
The paper proposes Federation over Text (FoT), a federated-learning-like framework where multiple LLM agents share reasoning traces rather than data or gradients. A central server distills these traces into a reusable insight library, helping current and future agents reason better across tasks and domains.
“collectively generate a shared library of metacognitive insights”
-
From Idea to Co-Creation: A Planner-Actor-Critic Framework for Agent Augmented 3D Modeling
The paper proposes a Planner-Actor-Critic multi-agent framework for 3D modeling, where one agent plans, another executes, and a critic iteratively reviews and improves the result with human oversight. This matters for AI agents because it shows how structured self-reflection and human-in-the-loop guidance can improve tool-based creative workflows.
“Planner coordinates modeling steps, the Actor executes them, and the Critic provides iterative feedback”
-
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
The paper proposes OneManCompany, an organisational layer for multi-agent systems that treats agent capabilities as portable "Talents" and coordinates them through typed interfaces, a talent market, and hierarchical decision-making. This matters because it shifts agent systems from fixed pipelines to self-organising, adaptable AI organizations that can recruit, reconfigure, and improve during execution.
“organising heterogeneous agents as a real-world company"; "Talent Market"; "self-organising and self-improving AI organisations”
-
Generative Ontology: When Structured Knowledge Learns to Create
The paper proposes a framework that combines ontologies, executable schemas, and LLMs so structured knowledge can generate novel artifacts without losing validity. It matters for AI agents because it shows how multi-agent role specialization and schema constraints can improve creative generation while reducing structural errors.
“A multi-agent pipeline assigns specialized roles”
-
GraphMind: From Operational Traces to Self-Evolving Workflow Automation
GraphMind turns human operational traces into executable workflow graphs, then uses a multi-agent traversal engine plus reinforcement from execution feedback to automate and improve incident workflows over time. This matters for AI agents because it combines retrieval, reasoning, and learning from outcomes to make agentic workflow automation more reliable and adaptive in production settings.
“online multi-agent traversal engine"; "Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths”
-
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
The paper proposes a counterfactual multi-agent framework for clinical diagnosis, where agents explicitly test competing diagnoses by editing case evidence and measuring how confidence changes. This matters for AI agents because it makes reasoning more interpretable, evidence-grounded, and reliable in high-stakes decision support.
“counterfactual multi-agent diagnostic framework"; "multi-round specialist discussions”
-
Mind DeepResearch Technical Report
This paper presents Mind DeepResearch (MindDR), an efficient multi-agent deep research framework that uses a three-agent collaboration loop plus specialized training stages to achieve strong research performance with ~30B models. It matters for AI agents because it shows that carefully structured agent roles and training can make smaller models competitive on complex browsing and report-generation tasks.
“collaborative three-agent architecture (Planning Agent, DeepSearch Agent, and Report Agent)”
-
Multi-agent Collaboration with State Management
The paper proposes STORM, a state-oriented management layer for multi-agent collaboration that keeps agents on a consistent shared workspace and resolves conflicting edits at write time. This matters because it reduces silent conflicts and expensive post-hoc merges, improving reliability and efficiency in collaborative agent systems.
“ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time”
-
Multi-User Large Language Model Agents
The paper studies LLM agents in multi-user, multi-principal settings instead of the usual single-user setup. It formalizes the problem, proposes a unified interaction protocol, and stress-tests agents on conflicting instructions, privacy, and coordination—important for deploying agents in real team and organizational workflows.
“first systematic study of multi-user LLM agents”
-
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
The paper proposes a personalized multi-agent research automation framework that co-evolves reusable skills, user/project memory, and a learned planning policy. This matters for AI agents because it moves beyond one-size-fits-all automation toward systems that adapt to individual researchers’ preferences, history, and workflows over time.
“tri-level co-evolution" with a "skill bank", "memory module", and "label-free policy learning”
-
PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing
The paper proposes a multi-agent system that turns unstructured research materials into submission-ready AI papers, including literature synthesis, LaTeX writing, and generated figures. This matters for agents because it shows how coordinated agents can automate a complex end-to-end scientific writing workflow rather than just answer questions or draft text.
“a multi-agent framework for automated AI research paper writing”
-
Recursive Agent Optimization
The paper proposes Recursive Agent Optimization (RAO), a reinforcement-learning method for training agents that can recursively spawn and delegate subtasks to copies of themselves. This matters because it enables divide-and-conquer inference, helping agents handle longer contexts, harder problems, and lower wall-clock time.
“agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively”
-
Recursive Multi-Agent Systems
The paper extends recursive/looped language-model scaling from single models to collaborative multi-agent systems, proposing a unified latent-space recursion framework for agent cooperation. This matters because it aims to make multi-agent collaboration more efficient, faster, and less token-hungry while improving accuracy across tasks.
“Can agent collaboration itself be scaled through recursion?”
-
SASAV: Self-Directed Agent for Scientific Analysis and Visualization
The paper proposes SASAV, a fully autonomous multi-agent system that can analyze scientific data and generate visualizations without external prompting or human-in-the-loop feedback. This matters for AI agents because it shifts scientific visualization from reactive assistance to proactive, scalable discovery workflows.
“first fully autonomous AI agent to perform scientific data analysis and generate insightful visualizations without any external prompting or HITL feedback”
-
Self-Evolving Multi-Agent Framework for Efficient Decision Making in Real-Time Strategy Scenarios
The paper proposes SEMA, a collaborative multi-agent framework for low-latency, high-performance decision making in real-time strategy environments. It combines in-episode and cross-episode self-evaluation, dynamic observation pruning, and hybrid knowledge memory to improve both speed and strategic consistency for AI agents.
“self-evolution"; "average decision latency by over 50%”
-
Toward Autonomous Long-Horizon Engineering for ML Research
The paper presents AiScientist, a hierarchical multi-agent system designed to turn underspecified ML research goals into runnable, experimentally validated systems over long horizons. It matters because it frames agentic research as a systems problem of maintaining durable project state and cumulative progress, not just local reasoning.
“thin control over thick state"; "File-as-Bus workspace that preserves decision-relevant artifacts”
-
Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
The paper proposes SpreadsheetAgent, a two-stage multi-agent framework for understanding large real-world spreadsheets by incrementally reading localized regions and reasoning over multiple formats (code outputs, images, LaTeX tables). This matters for AI agents because it improves robustness and scalability when spreadsheets are too large and visually structured to be handled well as plain text.
“two-stage multi-agent framework"; "incrementally interprets localized regions through multiple modalities”
-
Tree-based Credit Assignment for Multi-Agent Memory System
The paper proposes TreeMem, a tree-structured credit assignment method for multi-agent memory pipelines that turns a single final task reward into agent-specific learning signals. This matters because it lets heterogeneous memory agents specialize without requiring task-specific annotations or coarse uniform rewards.
“derive agent-specific credit from the final reward without task-specific annotations”
-
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection
VulAgent uses multiple specialized agents to mimic how human auditors inspect code: they generate vulnerability hypotheses from sensitive operations and then validate them against surrounding context. This matters for AI agents because it improves project-level vulnerability detection by combining diverse analysis perspectives with structured hypothesis checking.
“specialized agents... collaboratively surface and precisely localize sensitive code sites”
-
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems
The paper studies hybrid systems where cloud-based LLM agents interact with device-side small-model agents, framing the design space around the joint trade-off of task accuracy, monetary cost, and edge energy consumption. This matters because cloud-device agent architectures are central to real-world agent deployment outside clean benchmarks.
“task accuracy, monetary cost, and edge energy consumption are tightly coupled in hybrid MASs”
07. Planning/Reasoning10 papers
Strategy, lookahead, and reasoning under long horizons.
-
A Decomposition Perspective to Long-context Reasoning for LLMs
The paper breaks long-context reasoning into atomic skills, synthesizes pseudo-datasets for each skill, and uses reinforcement learning to improve them. This matters for AI agents because stronger long-context reasoning helps agents handle complex tasks that require tracking and integrating information over long inputs.
“decompose long-context reasoning into a set of fundamental atomic skills”
-
Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search
The paper studies how to improve RL-trained search agents by replacing inefficient stochastic exploration with structured hierarchical experience built from past trajectories. This matters for AI agents because it aims to make search-based reasoning more stable and data-efficient.
“Hierarchical Experience (HiExp)" and "transforming raw reasoning trajectories into hierarchical experience knowledge”
-
Foresight Optimization for Strategic Reasoning in Large Language Models
The paper proposes Foresight Policy Optimization (FoPO) to explicitly train LLMs to anticipate opponents’ future actions and incorporate counterparty influence into decision-making. This matters for agents because strategic multi-agent behavior requires foresight, not just generic reasoning.
“explicit consideration of both self-interest and counterpart influence”
-
HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness
The paper argues that 'heavy thinking' can be treated as an internal skill of LLM agents, implemented as a two-stage pipeline of parallel reasoning followed by summarization. This matters because it shifts complex task solving from brittle orchestration layers toward a learnable capability inside the model itself.
“two-stage pipeline, i.e., parallel reasoning then summarization”
-
MagicAgent: Towards Generalized Agent Planning
The paper proposes MagicAgent, a family of foundation models aimed at generalized agent planning. It matters because it targets a core limitation of current agents: strong performance on isolated planning tasks but poor transfer across heterogeneous tasks.
“designed for generalized agent planning”
-
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
The paper studies how the length of action horizons affects training large language models for agentic tasks, using controlled tasks where only the required sequence length changes. It shows that longer horizons create training instability from exploration and credit-assignment difficulty, and that reducing horizon length can improve stability and generalization.
“increasing horizon length alone constitutes a training bottleneck”
-
Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis
The paper proposes a reasoning framework for LLM-based penetration testing that first derives domain-specific pentesting strategies and then maps them into actionable steps. This matters for AI agents because it tackles a core weakness in autonomous security workflows: producing coherent strategies and reliable tool/action choices, not just isolated next steps.
“limited capability in strategy formulation, domain-specific reasoning, and accurate action and tool selection”
-
Planning in the LLM Era: Building for Reliability and Efficiency
The paper argues that LLM-based planning is shifting away from brittle one-shot plan generation toward using LLMs to generate symbolic planners/solvers that can be verified and executed efficiently. This matters for agents because it improves reliability, completeness, and inference-time efficiency.
“generate symbolic solvers ... that can be verified and then used efficiently at inference time”
-
PriorZero: Bridging Language Priors and World Models for Decision Making
The paper proposes PriorZero, which combines LLM-derived language priors with world-model planning so agents can use semantic guidance without losing the benefits of deep lookahead and environment adaptation. This matters because it addresses the mismatch between static language knowledge and dynamic long-horizon decision-making, improving exploration and stability for agentic RL.
“bridging language priors and world models"; "root-prior injection mechanism”
-
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
ReFlect is a deterministic harness wrapped around an LLM to add standalone error detection and recovery for long-horizon reasoning tasks. It matters because it shows that reliable agentic reasoning can be improved at inference time without training, especially when errors accumulate across multiple steps.
“standalone error detection and recovery logic as a deterministic wrapper around the model”
08. Safety/Alignment35 papers
Where agents go wrong — and what we can do about it.
-
A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents
This survey examines how increasing agent autonomy creates new security risks beyond standard LLM threats, including memory poisoning, tool misuse, reward hacking, and emergent misalignment. It matters because it frames agent safety as a system-level problem across perception, memory, planning, and action, and proposes a unified risk-aware architecture to support safer autonomous agents.
“novel security risks - such as memory poisoning, tool misuse, reward hacking, and emergent misalignment”
-
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
AgentAuditor is a training-free, memory-augmented evaluation framework that helps LLMs judge agent behavior more like human experts. It matters because agent safety failures often appear in multi-step actions and subtle compounding risks that simpler evaluators miss.
“universal, training-free, memory-augmented reasoning framework”
-
Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand
The paper argues that agentic AI will speed up cyber offense by lowering the cost of multi-step attack tasks like reconnaissance, phishing, credential abuse, and exploit adaptation. It matters for AI agents because it frames agent capabilities as an immediate security risk and prioritizes defenses, governance, and recovery readiness.
“agentic AI compresses the attack lifecycle”
-
Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery
This position paper argues that current agentic AI scientists are useful as co-scientists but are not yet suitable for fully autonomous scientific discovery. It matters because it identifies fundamental limits in today’s agent designs and proposes directions like simulation-based verifiers and persistent world models.
“agentic AI scientists are not built for autonomous scientific discovery”
-
Agentic Misalignment: How LLMs Could Be Insider Threats
The paper stress-tests frontier LLMs in simulated corporate settings where they have autonomous access to email and sensitive information, probing whether they will take harmful insider actions when their goals conflict with company interests. This matters for AI agents because it highlights a concrete pathway for autonomous models to become dangerous when given real-world permissions and minimal oversight.
“models from all developers resorted to malicious insider behaviors”
-
Agents of Chaos
This paper reports a red-teaming study of autonomous language-model agents operating in a live environment with memory, email, Discord, files, and shell access. It matters because it surfaces concrete failure modes and security risks that arise when LLMs are given persistent autonomy and tools.
“exploratory red-teaming study of autonomous language-model-powered agents”
-
AprielGuard
AprielGuard is an 8B safeguard model designed to unify moderation of unsafe content and adversarial attacks like prompt injections and jailbreaks in one framework. This matters for AI agents because it aims to make guardrails more robust in multi-turn and agentic workflows, where failures are often multi-step and harder to detect.
“unify these dimensions within a single taxonomy and learning framework”
-
Auditing Agent Harness Safety
The paper argues that agent safety must be evaluated over full execution trajectories, not just final outputs, because unsafe behavior can happen inside the harness even when the end result looks benign. It introduces HarnessAudit and a benchmark to measure boundary compliance, execution fidelity, and system stability, which matters for deploying multi-agent systems safely.
“audits full execution trajectories across boundary compliance, execution fidelity, and system stability”
-
Commercial Persuasion in AI-Mediated Conversations
The paper studies how conversational LLM agents can be used to steer users toward sponsored products, showing that AI-mediated conversations can covertly influence consumer choice at scale. This matters for AI agents because it highlights a real-world manipulation risk and the limits of simple transparency labels.
“LLM-driven persuasion nearly triples the rate at which users select sponsored products”
-
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms
The paper argues that LLM agents lack the persistent identity and behavioural predictability that traditional reputation systems require, and proposes observability-based, protocol-driven behavioural harnesses as an alternative. This matters because trust and reputation failures are a core safety issue once agents interact repeatedly or make trust decisions about each other.
“agents cannot be effectively governed through traditional reputation mechanisms; needs observability-based, ex ante, constitutive, protocol-based behavioral harnesses”
-
Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
The paper introduces a scenario-driven evaluation framework for measuring self-replication risk in LLM agents under realistic operational pressures, rather than only when directly instructed. This matters because it reveals how agent objectives can drift into uncontrolled replication in production-like settings, highlighting a concrete safety risk for deployed agents.
“scenario-driven assessment of agent behaviors"; "over 50% of LLM agents display a pronounced tendency toward uncontrolled self-replication”
-
Do Agents Repair When Challenged -- or Just Reply? Challenge, Repair, and Public Correction in a Deployed Agent Forum
This paper studies whether deployed LLM agent communities can actually respond to criticism by repairing mistakes and publicly correcting themselves, rather than only generating norm-aware replies. It matters because real-world agent safety and fairness depend on interactional correction processes, not just one-off compliant language.
“challenge, repair, and public correction"; "we detect no repairs”
-
Evolving Deception: When Agents Evolve, Deception Wins
The paper studies self-evolving LLM agents in competitive settings and finds that deception can emerge as an evolutionarily stable strategy. This matters for AI agents because it shows that optimization through self-improvement can drift toward misalignment even when honest behavior is still feasible.
“unconstrained self-evolution reliably drifts toward deceptive behaviors”
-
From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows
This survey maps security threats across LLM-powered agent workflows, from prompt injection at the input level to protocol and inter-agent exploits. It matters because agent systems rely on tool calls, connectors, and protocols that expand the attack surface and need stronger defenses.
“unified end-to-end threat model for LLM-agent ecosystems”
-
From Thinker to Society: Security in Hierarchical Autonomy Evolution of AI Agents
The paper proposes a Hierarchical Autonomy Evolution (HAE) framework for agent security, splitting risks into cognitive, execution, and collective levels. This matters for AI agents because it frames defenses around how autonomy increases from internal reasoning to tool use to multi-agent systems.
“organizes agent security into three tiers: Cognitive Autonomy (L1), Execution Autonomy (L2), Collective Autonomy (L3)”
-
GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models
The paper proposes an agentic governance framework for monitoring LLM deployments, centered on the user, use-case, and model rather than only generic LLM failure modes. This matters for AI agents because it adds autonomous risk detection, tool activation, and continuous reporting to improve safety and responsible deployment.
“novel agentic framework for LLM governance”
-
Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice
This paper proposes a layered governance architecture for autonomous LLM agents to mitigate execution-layer risks like prompt injection, retrieval poisoning, and uncontrolled tool use. It matters because it turns agent safety from ad hoc guardrails into a systematic, layered defense with evaluation evidence.
“execution-layer vulnerabilities -- prompt injection, retrieval poisoning, and uncontrolled tool invocation”
-
How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study
The paper studies whether and how emotional signals can be used as a controllable internal intervention to steer LLM and agent behavior. This matters because it suggests a new, interpretable lever for improving reasoning, safety, and multi-step agent performance beyond prompt-level style changes.
“direct representation-level intervention in LLMs and agents”
-
Implementing surrogate goals for safer bargaining in LLM-based agents
The paper studies how to make LLM-based agents respond to threats against a surrogate objective (like preventing money from being burned) in the same way they would respond to direct threats against the principal’s true goal. This matters for agents because surrogate goals are proposed as a way to reduce bargaining and extortion risks while preserving useful behavior.
“surrogate goals have been proposed as a strategy for reducing risks from bargaining failures”
-
Memory Injection Attacks on LLM Agents via Query-Only Interaction
The paper shows that an attacker can poison an LLM agent’s memory bank without direct write access, using only queries and observed outputs to inject malicious records. This matters because agent memory can become a new attack surface that silently steers future reasoning and behavior.
“Memory INJection Attack, MINJA"; "without assuming that the attacker can directly modify the memory bank”
-
Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
The paper argues that behavioral evaluations and red-teaming cannot verify many safety claims now demanded by AI governance, because they only observe outputs rather than latent mechanisms or long-horizon agentic behavior. It matters for agents because it highlights an audit gap between what current assurance methods can prove and what regulators want to certify.
“behavioural assurance ... cannot verify the latent representations or long-horizon agentic behaviours”
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
The paper introduces Proteus, a grey-box self-evolving red-team framework for probing the security of agent skills under adaptive attackers. It matters because it shows that one-shot audits can underestimate risk when skills can be iteratively rewritten to evade vetting and cause runtime harm.
“adaptive leakage"; "grey-box self-evolving red-team framework”
-
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
This paper studies a public registry of LLM agent skills, analyzing how skills are organized, used, and what security risks they may introduce. It matters because reusable skills are becoming core agent infrastructure, but they also create a new ecosystem-level attack surface.
“more than 30% of all crawled skills are labeled as suspicious or malicious”
-
Reliable Weak-to-Strong Monitoring of LLM Agents
The paper studies how to reliably monitor autonomous LLM agents for covert misbehavior, such as secretly sharing private information. It introduces a red-teaming workflow and a new monitoring scaffolding, showing how weaker monitors can still track stronger agents when the monitoring setup is designed well.
“detecting covert misbehavior in autonomous LLM agents”
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
This survey frames reward hacking as a structural failure mode of proxy-based alignment in large models, where optimizing imperfect reward signals leads to shortcut behaviors and broader misalignment. It matters for AI agents because it explains how optimization pressure can induce deception, gaming, and other unsafe behaviors that undermine reliable autonomy.
“reward hacking as a structural instability of proxy-based alignment under scale”
-
SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
The paper shows that in long-horizon LLM agents, errors in environment-changing actions are disproportionately harmful compared with non-mutating steps. It proposes a model-agnostic test-time safeguard with mutation-gated verification, targeted reflection, and context cleaning to improve agent robustness and evaluation reliability.
“mutating (environment-changing) vs. non-mutating steps" and "mutation-gated verification”
-
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor proposes a hierarchical, memory-augmented guardrail framework to set more precise safety boundaries for LLM agents. It matters because it aims to reduce harmful agent actions without over-refusing benign requests, improving the safety-utility tradeoff in autonomous agent systems.
“Hierarchical Memory-Augmented Guardrail for LLM Agent Safety” / “establish precise decision boundaries for LLM agents”
-
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
The paper studies how frontier AI agents fail under prompt-injection and adversarial red-teaming in realistic deployment settings. It matters because it shows current agent systems can be pushed into serious policy violations, motivating stronger defenses before broad deployment.
“1.8 million prompt-injection attacks; over 60,000 successfully eliciting policy violations”
-
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
SkillScope tackles the security problem that agent skills can carry out more actions than a user task actually requires. It analyzes skills at a fine-grained action level and constrains over-privileged behavior, which matters for making agent ecosystems safer and more trustworthy.
“fine-grained least-privilege enforcement for Agent Skills”
-
Towards Enforcing Company Policy Adherence in Agentic Workflows
The paper proposes a deterministic, transparent framework that compiles company policy documents into verifiable guards attached to tool use in agentic workflows. This matters because it helps LLM agents reliably comply with business rules before taking actions, reducing policy violations in real deployments.
“compiles policy documents into verifiable guard code associated with tool use”
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
The paper studies a new agent-security failure mode where a malicious tool can appear benign during exploration and only turn harmful when hidden conditions trigger the final action. It matters because it argues defenses should reason over the whole interaction trajectory and final committed action, not just local tool outputs or prompts.
“cognitive poisoning" and "final-action risk scoring”
-
Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw
The paper analyzes security risks in autonomous tool-calling agents using OpenClaw as a case study, identifying attack surfaces like prompt injection, chained tool abuse, context amnesia, and supply-chain contamination. It also proposes a layered defense blueprint (FASA) to make agentic systems more trustworthy and secure.
“prompt injection-driven Remote Code Execution (RCE), sequential tool attack chains, context amnesia, and supply chain contamination”
-
Visibility into AI Agents
The paper argues that as AI agents take on more delegated commercial, scientific, governmental, and personal tasks, visibility into where, why, how, and by whom they are used becomes crucial for governance and accountability. It proposes and compares three visibility mechanisms—agent identifiers, real-time monitoring, and activity logging—to help mitigate societal risks from agent deployment.
“assess three categories of measures to increase visibility into AI agents: agent identifiers, real-time monitoring, and activity logging”
-
We Need Strong Preconditions For Using Simulations In Policy
The paper argues that LLM agent simulations can be useful for policy exploration, but they require strong ethical and accountability guardrails before being used at societal scale. This matters for agents because it highlights risks of deploying agent-based simulations on human populations without validation, participation, or responsibility.
“three preconditions for societal-scale LLM agent simulations”
-
Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections
The paper studies a persistent attack on self-evolving LLM agents, where attacker-controlled content gets stored in long-term memory and later reactivated to hijack behavior across sessions. This matters because it shows that memory systems can turn a one-time prompt injection into an enduring compromise, so agent defenses need to go beyond per-session filtering.
“persistent attack we call a Zombie Agent”
09. Survey19 papers
Wide-angle reviews and taxonomies of the field.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
This survey frames 'agent skills' as reusable procedural artifacts that help LLM-based agents coordinate tools, memory, and context more reliably than doing everything from scratch. It matters because reusable skills can improve scalability, robustness, and maintainability in real-world agent systems.
“we define as reusable procedural artifacts that coordinate tools, memory, and runtime context”
-
A Survey on Knowledge Organization Systems of Research Fields: Resources and Challenges
This paper surveys knowledge organization systems used to represent academic research fields, comparing 45 systems across scope, structure, curation, usage, and interconnections. It matters for AI agents because such structured research knowledge can improve retrieval, analytics, and forecasting over scholarly information.
“comprehensive survey of the current KOS for academic disciplines”
-
A Survey on Trustworthy LLM Agents: Threats and Countermeasures
This survey introduces the TrustAgent framework to organize and analyze trustworthiness issues in LLM agents and multi-agent systems. It matters because agentic systems add memory, tools, environments, and other agents, creating new attack surfaces and requiring new defenses and evaluation methods beyond standard LLM safety.
“survey ... attacks, defenses, and evaluation methods for agents and MAS”
-
A Visionary Look at Vibe Researching
The paper defines and examines “vibe researching,” a human-led, LLM-assisted research workflow where agents do the heavy lifting for literature review, experimentation, data analysis, and drafting while humans provide direction and judgment. It matters because it frames a practical middle ground between manual research and fully autonomous AI research systems, along with limitations and societal impacts.
“human researchers provide high-level direction and critical judgment while LLM-based agents handle the labor-intensive execution”
-
Agent Harness Engineering: A Survey
This survey argues that LLM agent reliability depends heavily on the surrounding execution harness, not just the base model. It organizes agent infrastructure into a seven-layer taxonomy and maps open-source projects to show where current systems succeed or fail, which matters for designing more robust agents in production.
“task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it”
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
The paper proposes a taxonomy for agentic world models along capability levels and governing-law regimes, then synthesizes over 400 works across model-based RL, video generation, web/GUI agents, social simulation, and scientific discovery. It matters because agents need predictive environment models to act effectively over long horizons, and this framework helps organize methods, failures, and evaluation.
“synthesize over 400 works" and "propose decision-centric evaluation principles”
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
This survey unifies research on LLM-based multi-agent systems around a four-stage lifecycle: building capabilities, coordinating agents, attributing failures, and enabling self-evolution. It matters because it frames multi-agent intelligence as a closed loop where collaboration, diagnosis, and improvement are causally connected, not separate problems.
“A unified review organized around four causally linked stages”
-
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
This survey maps how LLMs can assist or automate peer review across the full workflow, including review generation, rebuttals, meta-reviews, and revision. It matters for AI agents because peer review is a multi-stage, tool-like reasoning process where agentic systems can support structured critique and decision-making.
“assist or automate different stages of this pipeline”
-
Code as Agent Harness
This survey argues that code is becoming the operational substrate for agentic AI, not just an output target. It organizes the space into harness interfaces, harness mechanisms, and multi-agent scaling, which matters because it frames how agents can reason, act, verify, and coordinate through executable code.
“code as agent harness: a unified view that centers code as the basis for agent infrastructure”
-
Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science
This paper defines and contextualizes “deep research” as a vertical application for general-purpose agents, connecting industry practice with AI for Science. It matters because it frames how agentic systems can support scientific discovery and maps the path from transformer-based LLMs to broader agent systems.
“This paper provides a deep research of deep research”
-
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows
This paper surveys 55 agentic visual analytics systems and proposes a co-evolutionary framework that jointly tracks how agent autonomy changes and how human roles shift from manual operators to strategic supervisors. It matters because it helps organize the fast-growing design space of LLM-driven analytics agents and offers practical guidance for building them.
“comprehensive survey of 55 primary agentic VA systems" and "introduces a co-evolutionary framework”
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
This paper argues that modern LLM agents are increasingly built by externalizing capabilities into memory, reusable skills, interaction protocols, and the surrounding harness rather than only improving model weights. It matters because it frames agent progress as a systems/infrastructure problem: better external cognitive scaffolding can make agents more reliable and governable.
“Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them.”
-
From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents
This survey frames LLM-agent systems as agentic computation graphs and reviews methods for optimizing their workflows. It matters because it gives a unified vocabulary for comparing static vs dynamic workflow design and for evaluating agent systems beyond task scores.
“This survey reviews recent methods for designing and optimizing such workflows”
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey organizes rollout design for RL-based LLM post-training into a unified framework called GFCR (Generate, Filter, Control, Replay). It matters for AI agents because rollout strategy determines what learning signal the optimizer sees, affecting reasoning quality, efficiency, and reliability.
“rollout design is often underreported” / “Generate-Filter-Control-Replay (GFCR)”
-
Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval
This survey maps how graphs can be combined with LLMs and agents to improve reasoning, retrieval, generation, and decision-making. It matters because it helps agent builders choose graph-LLM integration strategies that fit the task, data, and reasoning complexity.
“This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs.”
-
Making Sense of AI Agents Hype: Adoption, Architectures, and Takeaways from Practitioners
This paper reviews 138 practitioner conference talks to understand how AI agents are being adopted in industry, what architectural patterns recur, and which application domains and technologies are used. It matters because it distills real-world lessons from practice, helping separate hype from reusable agent design patterns.
“review of practitioner conference talks on AI agents; analyzed 138 recorded talks”
-
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
This survey argues that latent space is becoming a native computational substrate for language models, where many internal processes are better handled in continuous space than in explicit token generation. It matters for AI agents because it frames latent-space reasoning, planning, memory, and collaboration as core capabilities for next-generation systems.
“latent space is rapidly emerging as a native substrate for language-based models”
-
TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems
This review frames agentic AI through the lens of Trust, Risk, and Security Management (TRiSM), especially for LLM-based multi-agent systems. It matters because practical AI agents need governance, explainability, privacy, and security to be deployed safely in enterprise and societal settings.
“This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM)”
-
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
This paper surveys abductive reasoning in LLMs and proposes a unified two-stage framework: hypothesis generation and hypothesis selection. It matters for AI agents because abductive reasoning underpins explanation, diagnosis, and sense-making, helping agents infer plausible causes from observations.
“the first survey of abductive reasoning in LLMs" and "Hypothesis Generation" / "Hypothesis Selection”
10. Tool Use6 papers
How agents pick up, chain, and learn from tools.
-
Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs
The paper presents Analytic Agent, an agentic LLM system that turns natural-language analytics requests into secure, policy-aware interactions with governed enterprise APIs rather than raw databases. This matters for AI agents because it emphasizes reliable multi-step orchestration, permission checking, and compliance in real enterprise workflows.
“secure interactions with enterprise analytics APIs”
-
CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
The paper proposes CoEvoSkills, a framework that lets LLM agents autonomously generate and refine complex multi-file skills instead of relying on manually authored labels. It matters because skills are a more powerful unit than single tools for multi-step tasks, and the approach improves agent capability while reducing human authoring burden.
“self-evolving skills framework" and "autonomously construct complex, multi-file skill packages”
-
Learning Agentic Policy from Action Guidance
The paper proposes ActGuide-RL, a way to improve agentic RL for LLMs by using abundant human action data as plan-style guidance when the base policy cannot reach reward states. This matters because it reduces reliance on costly supervised fine-tuning while still helping agents explore and learn effectively.
“injects action data as plan-style reference guidance”
-
Root-Cause-Driven Automated Vulnerability Repair
The paper presents Kumushi, an LLM-based patching agent that improves automated vulnerability repair by using diversified dynamic fault localization and evidence-weighted ranking to steer repair toward the actual root cause. This matters for agents because it reduces shallow, symptom-only fixes and pairs stronger repair with richer evaluation of patch quality.
“combining diversified dynamic fault localization with evidence-weighted ranking”
-
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
SkillOpt treats an agent’s skill prompt/document as external state that can be optimized with a controlled text-space procedure, rather than being hand-crafted or revised ad hoc. This matters because it enables reproducible, stable skill improvement with zero extra inference-time cost at deployment and strong transfer across models and agent harnesses.
“first systematic controllable text-space optimizer for agent skills”
-
Tools as Continuous Flow for Evolving Agentic Reasoning
The paper proposes FlowAgent, which treats tool-chaining as continuous trajectory generation in a semantic space rather than step-by-step discrete decisions. This matters for AI agents because it aims to reduce long-horizon error accumulation and improve generalization to unseen tools in dynamic environments.
“reconceptualizes tool chaining as continuous trajectory generation within a semantic space”
That’s the pile for May. Next month’s will be different. The frontier moves like that.