



Network Management                                                 Q. Wu
Internet-Draft                                                    Huawei
Intended status: Informational                                   C. Zhou
Expires: 22 November 2026                                   China Mobile
                                                         L. M. Contreras
                                                              Telefonica
                                                                  S. Han
                                                            China Unicom
                                                                 Y. Hong
                                                      Daejeon University
                                                             21 May 2026


  Network Digital Twin and Agentic AI based Architecture for AI driven
                           Network Operations
                  draft-wmz-nmrg-agent-ndt-arch-04

Abstract

   A Network Digital Twin (NDT) provides a network emulation tool usable
   for different purposes such as scenario planning, impact analysis,
   and change management.  Agentic AI enables dynamic goal-driven
   execution and adaptive behavior and closed-loop autonomy.  By
   integrating a Network Digital Twin into network management together
   with the Agentic AI, it allows the network management activities to
   take user intent or service requirements as input, automatically
   assess, model, and refine optimization strategies under realistic
   conditions but in a risk-free environment.  Such environment that
   operates to meet these types of requirements is said to have AI
   driven Network Operations.

   AI driven Network Operations brings together existing technologies
   such as Agentic AI and Network Digital Twin which may be seen as the
   use of a toolbox of existing components enhanced with a few new
   elements.

   This document describes an architecture for AI driven network
   operations and shows how these components work together with network
   digital twin and Agentic AI capabilities.  It provides a cookbook of
   existing technologies to satisfy the architecture and realize intent-
   based network management to meet the needs of the network service.

Discussion Venues

   This note is to be removed before publishing as an RFC.

   Discussion of this document takes place on the Network Management
   mailing list (nmrg@irtf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/nmrg.

   Source for this draft and an issue tracker can be found at
   https://github.com/QiufangMa/Agent-architecture.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 22 November 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Table of Contents

   1.  Introduction
   2.  Conventions and Definitions
   3.  Introduction of Concepts
     3.1.  Generative AI and Agentic AI
     3.2.  Network Digital Twin
   4.  Characteristics of AI driven Network Operations
   5.  Architecture Design
     5.1.  Overall Architecture
     5.2.  Functional Components
       5.2.1.  Network Applications
       5.2.2.  Autonomous Domain
     5.3.  Functional Interfaces
       5.3.1.  Human Oversight (Human on the Loop)
       5.3.2.  Application to Network AI Agent Interface (Intent based
               Management)
       5.3.3.  Network AI Agent to Task AI Agent Interface (Single
               Autonomous Domain)
       5.3.4.  Network AI Agent to Network AI Agent Interface (Cross
               Autonomous Domain)
       5.3.5.  Hybrid Agent System to Agent Fabric Interface (Agent
               Lifecycle Management)
       5.3.6.  Hybrid Agent System to Network Digital Twin Interface
       5.3.7.  Hybrid Agent System to Knowledge Base Interface
       5.3.8.  Hybrid Agent System to Physical Network Interface
       5.3.9.  Feedback-driven Improvement Interface
   6.  AI Driven Network Operations: Relationship Between
           Characteristics and Functional Components
   7.  AI Agent Registration and Team formation
   8.  Agent to Agent Communication Security
   9.  AI Driven Network Operations: A collection of Use Cases
     9.1.  Multi-Agent Collaboration on Network Configuration Change
     9.2.  Multi-Agent Collaboration on Network Troubleshooting
     9.3.  Multi-Agent Collaboration on Network Optimization
     9.4.  Network level Energy Efficiency Management in the
           IP+Optical network
     9.5.  Network Security Drills (Human on the Loop)
   10. Challenges of Integrating Network Digital Twin and Agentic AI
           into Network Management
     10.1.  Make IETF developed YANG models AI ready
     10.2.  Trust and Security
     10.3.  Protocols between Agent and Agent/Human operator/Tools
       10.3.1.  High Risk Operations
       10.3.2.  The Timeliness Requirements of Collaboration
       10.3.3.  Collaboration Reliability
     10.4.  Agent Observability,Intervention and Control
     10.5.  Agent Benchmarking vs Model Benchmarking
   11. Security Considerations
   12. IANA Considerations
   13. Conclusion
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Network Element AI Agent and Network AI Agent
           Collaboration
   Appendix B.  Acknowledgements
   Appendix C.  Changes between Revisions
   Contributors
   Authors' Addresses

1.  Introduction

   With the rapid expansion of network scale and the increasing demands
   on these networks, significant challenges arise with keeping device
   configurations consistent with the needs of services and rapidly
   adapting the network at scale.  The intent based management has been
   first introduced in [RFC9315] to address these challenges and
   necessitate of continuous network reconfiguration to better adapt to
   ever-changing service requirements.

   Since network changes are directly related to service operations, any
   successful change needs to not only ensure that new services are
   provisioned smoothly, but also that existing services are not
   affected and that no problems are introduced with the new
   configurations.  On one hand, these changes might still require
   manual operation or iterative input from network Operators to provide
   operational guidance and information about the goals, purposes, and
   service instances that the network is to serve and refine the intent.
   On the other hand, pre-programmed automation scripts or policy
   templates may not be sufficient to realize the intent.  Network
   operators are, therefore, increasingly cautious about making network
   changes.  For example, they need to review the solution design as
   well as evaluate all change impacts, before making any change.  Then,
   after the change, they need to perform dialing tests, monitor
   traffic, and manually check table entries.

   The Network Digital Twin (NDT)
   [I-D.irtf-nmrg-network-digital-twin-arch] provides Network operators
   with a safe sandbox for testing complex configuration changes,
   software updates, or new protocols before they are deployed in
   production so that the risk of accidental outages or service
   disruptions can be greatly reduced.

   In addition, Agentic AI introduces disruptive paradigm to the network
   management which:

   *  Allows declarative intent interpretation and refinement with
      multi-step task management,

   *  Enables dynamic goal-driven execution and adaptive behavior
      instead of stepwise static instruction,

   *  And enables closed-loop autonomy across layer or across domain
      instead of deterministic closed-loop automation.

   Integrating a Network Digital Twin into network management together
   with Agentic AI, it allows network management activities to
   dynamically adapt to customer needs, network changes, as well as to
   automatically assess, model, and refine optimization strategies under
   realistic conditions but in a risk-free environment.  An environment
   that operates to meet these types of requirements is said to have AI
   driven network operations.

   AI Driven network operations provide the following capabilities to
   applications by coordinating the components that operate and manage
   the network:

   *  Service intent and service assurance work together to ensure that
      the network change or network optimization aligns with business
      goals and that the services provided meet the agreed-upon Service
      Level Agreements (SLAs).

   *  Provide network capacity planning and ensure that the network has
      sufficient capacity , resources, and infrastructure to meet
      current and future demands.

   *  Provide simulation on fault scenarios, formulate recovery plans,
      and verify whether the plans are applicable and effective so that
      the service will not be affected during disaster recovery drill.

   *  Support fault and risk detection and provide network health check
      and network risk check.

   *  Model the network configuration change and use a virtual topology
      model to test network changes and assess the effect of the network
      configuration changes on the network.

   *  Model the protocol operations and interactions among devices in
      the network and simulate specific networking protocols such as IS-
      IS, OSPF, BGP, SR, etc to understand how they perform under
      different conditions.

   *  Model traffic flow across the network, including traffic
      generation, flow control, routing, and congestion control and
      evaluate traffic's impact on network performance.

   *  Support generation of repair solutions for potential network risks
      and provide verification on the repair solution in seconds,
      including loop, address conflict, and security policy conflict.

   *  Model the performance characteristics of each flow according to
      traffic conditions change by using a virtual topology in order to
      understand which flows and network portions can be potentially
      affected by performance degradation and act to prevent it

   This document describes an architecture for AI Driven network
   operations, showing how these components work together with network
   digital and AI capabilities.  It provides a cookbook of existing
   technologies to satisfy the architecture and realize intent-based
   networking to meet the needs of applications.

2.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The document uses the following definitions and acronyms defined in
   [I-D.irtf-nmrg-network-digital-twin-arch]:

   *  Network Digital Twin (NDT)

   *  Artificial Intelligence (AI)

   The following acronyms are used throughout this document:

   *  Generative Artificial Intelligence (Gen-AI)

   *  Large Language Model (LLM)

   *  Retrieval-Augmented Generation (RAG)

   *  Agentic AI [I-D.hong-nmrg-agenticai-ps]

   *  Remote Code Execution (RCE)

   Besides, this document defines the following terminology:

   Network AI Agent:  Network AI Agent is an autonomous system or entity
      with awareness of its environment, capable of conducting analysis,
      making decisions, and executing actions with specific intent based
      on its knowledge representation to achieve a set of service goals
      [TMF-1251D].  In addition, it is able of planning the tasks and
      decompose the tasks into several sub-tasks and coordinate with
      Task agent for these sub-tasks.

   Task AI Agent:  Task AI Agent is responsible for coordinating with
      Network AI Agent in the Hybrid Agent System and executing specific
      task assigned by Network AI Agent.

   Autonomy:  The ability to operate under uncertainty and adapt to
      changing objectives without explicit external intervention.

3.  Introduction of Concepts

3.1.  Generative AI and Agentic AI

   The integration of AI into network operations has marked a
   significant leap forward in the pursuit of network automation and
   intelligence, while generative AI further enhances the role of AI
   driven network operations and management.  Generative AI is a
   subfield of AI that uses generative models such as Large Language
   Models (LLMs) to generate new and original content such as text,
   images, videos, or other forms of data with the capability to adapt
   and make decisions to achieve specific goals.

   Agentic AI refers to the broader category of AI systems that exhibit
   "agency"—the ability to act independently and iteratively to perform
   tasks without constant human prompting [I-D.hong-nmrg-agenticai-ps].
   In the context of network operations and management, Network AI
   agents are increasingly being designed to interact with physical
   world and act upon it based on tools [Google-Agents-Whitepaper] and
   perform network management tasks such as understanding user intent,
   generating network configurations, diagnosing and resolving network
   incidents [I-D.ietf-nmop-network-incident-yang].  Meanwhile, other
   SDOs also try to define terms related to Network AI agent in the
   context of network operations and management, e.g., TM Forum defines
   Autonomous Agent in [TMF-1251D] as one of AN (Autonomous Network)
   Terminologies.

3.2.  Network Digital Twin

   The Network Digital Twin is a digital representation that is used in
   the context of network.  The concept and architecture of the Network
   Digital Twin are specified in
   [I-D.irtf-nmrg-network-digital-twin-arch].  Three core functional
   components which includes Data Repository component, a Service
   Mapping Models component, and an NDT Management component are
   introduced to characterize the Network Digital Twin and its reference
   architecture.

   The Network Digital Twin is widely recognized to be useful as an
   advanced platform for network emulation, serving as a tool for
   scenario planning, impact analysis, and change management.  By
   delivering applications requests to the Network Digital Twin through
   standardized interfaces (see Section 9.4 of
   [I-D.irtf-nmrg-network-digital-twin-arch]), the Network Digital Twin
   exposes the various capabilities to network applications.

4.  Characteristics of AI driven Network Operations

   AIOPS was first defined by Gartner in 2016, combining "artificial
   intelligence" and "IT operations" to describe the application of AI
   and machine learning to enhance IT operations.  However there is no
   unified definition for characteristic of "AI driven network
   operations" within the networking industry.  Referring to the
   characteristics of AIOPS in IT field and the characteristics of
   networking itself, this document introduces six key elements (i.e.,
   awareness, decision, analysis, execution, intent and knowledge) to
   characterize the AI driven network operation and its use, as shown in
   Figure 1.  They together form a close-loop of network operation and
   management.

           +---------------------------------------------------+
           |                                                   |
           |   +---------+                       +-----------+ |
           |   |  Intent |                       | Knowledge | |
           |   +---------+                       +-----------+ |
           |                                                   |
           |                     --------                      |
           |  +-----------+  ////        \\\\    +-----------+ |
           |  |  Analysis | |AI Driven Network|  | Decision  | |
           |  +-----------+ |  Operations    |   +-----------+ |
           |                 \\\\        ////                  |
           |                     --------                      |
           | +-----------+                      +------------+ |
           | |  Awareness|                      |  Execution | |
           | +-----------+                      +------------+ |
           |                                                   |
           +---------------------------------------------------+

        Figure 1: Six Key Elements to Characterize AI driven network
                                 operation

   *  Intent:  Intent is defined as a set of operational goals and
         outcomes defined in a declarative manner without specifying how
         to achieve or implement them in [RFC9315].  The Network AI
         Agent must accurately interpret and understand the user's high-
         level business or operational objectives, this involves
         translating declarative requirements into specific network
         instructions, e.g., configurations.

   *  Knowledge:  The Network AI agent relies on a knowledge base that
         includes network policies, historical data, expert experience,
         extra-system experience (updates to LLMs/their implied
         ‘knowledge bases’) and Manually or semi-manually entered
         knowledge, e.g., new equipment spec sheets,best practices in
         product manual.  The knowledge is used to inform its analysis,
         decision-making, and execution processes.  Over time, the
         Network AI agent can expand its knowledge through machine
         learning, incorporating new data and experiences to improve its
         performance.  For example, it learns which configurations are
         optimal for specific scenarios or how to respond most
         effectively to particular types of network incidents
         [I-D.ietf-nmop-network-incident-yang].

   *  Analysis:  The Network AI agent continuously analyzes vast amounts
         of network data from various sources, including network
         telemetry [RFC9232] and external feeds, and identify the gap
         between user intent and the existing network status.  By
         integrating Network digital twin
         [I-D.irtf-nmrg-network-digital-twin-arch] with Network AI agent
         and leveraging machine learning and other data analytics
         techniques, it also identifies network fault, problem,
         incident, anomaly and perform data driven intelligent analysis
         such as service impact analysis, and so on.  Their distinction
         is further discussed in [I-D.ietf-nmop-terminology].

   *  Decision:  Based on the intent and network analysis, AI makes
         informed decisions.  By integrating network digital twin
         [I-D.irtf-nmrg-network-digital-twin-arch] and AI, the
         intelligence decisions making can be realized.  These decisions
         could involve dynamically adjusting network parameters, e.g.,
         rerouting traffic to avoid congestion.  The decision-making
         process is driven by predefined policies, real-time data
         analysis, and AI models (e.g., LLMs) that enable the Network AI
         agent to choose the best course of action to meet the specified
         intent.  Network AI agent may also verify the correctness of
         the decision outcome by performing some network simulation or
         validation process.

   *  Awareness:  Awareness is achieved through real-time monitoring and
         data collection.  The Network AI agent maintains a
         comprehensive visibility of the network, enabling it to make
         context-aware decisions.  Network operators can also use the
         awareness understand the exact cause of specific network issues
         and achieve closed-loop decision-making.

   *  Execution:  Once a decision is made, the Network AI agent executes
         the necessary actions to implement it.  This could involve,
         e.g., sending configuration to network controllers or network
         devices through NETCONF/RESTCONF protocols.  The execution is
         carried out in a controlled and precise manner to ensure that
         the network behaves as intended without causing disruptions.
         The Network AI agent also verifies that the executed actions
         have the desired effect and makes the proper adjustments if
         needed.

5.  Architecture Design

5.1.  Overall Architecture

   Figure 2 provides the overall architecture for integrating Network
   Digital Twin and Network AI Agent System.  The components and
   functional interfaces are discussed in Section 5.2 and Section 5.3,
   respectively.  The use cases described in Section 9 show how
   different components are used selectively to provide different
   services.  It is important to understand that the relationships and
   interfaces shown between components in this figure are illustrative
   of some of the common or likely interactions; however, this figure
   does not preclude other interfaces and relationships as necessary to
   realize specific functionality.

+------------------------------------------------------------------------+
|Network Applications                                                    |
|            +-------+   +-------+          +-------+                    |
|            | App 1 |   | App 2 |   ...    | App n |                    |
|            +-------+   +-------+          +-------+                    |
+-------------------------^----------------------------------------------+
                          | Intent
+-------------------------+----------------------------------------------+
|Autonomous Domain        |                                              |
|+------------------------+-----------------------+   +----------------+ |
||Hybrid Agent System     |                       |   |  Agent Fabric  | |
||                +-------v--------+              |   |                | |
||                |Network AI Agent|              |   |+--------------+| |
||                +-------^--------+              |   || Registration || |
||                        |                       |   |+--------------+| |
||       +------------+---+------------+--------+ <--->+--------------+| |
||       |            |                |        | |   ||Security&Trust|| |
||+------v---+  +-----v----+  +--------v------+ v |   |+--------------+| |
|||Task Agent<->|Task Agent<->|Function Module|...|   |+--------------+| |
||+----------+  +----------+  +---------------+   |   ||Observability || |
|+-----------^-----------------------^------------+   |+--------------+| |
|            |                       |                |+--------------+| |
|+-----------v----------+            |                ||Knowledge Base|| |
|| Network Digital Twin |            |                |+--------------+| |
|+-----------^----------+            |                +----------------+ |
|            |                       |                                   |
|+-----------------------------------v---------------------------------+ |
||Physical Network                                                     | |
||      +---------+    +------------------+             +---------+    | |
||      |         |    |                  |             |         |    | |
||      |   NE    |    |NE(lightweight AI)|     ...     |   NE    |    | |
||      |         |    |                  |             |         |    | |
||      +---------+    +------------------+             +---------+    | |
|+----------------+----------------------------------------------------+ |
+------------------------------------------------------------------------+

   Figure 2: An Architecture for Integrating Network AI Agent with
                         Network Digital Twin

5.2.  Functional Components

   This section describes the functional components shown as boxes in
   Figure 2.  The interactions between those components, the functional
   interfaces, are described in Section 5.3.

5.2.1.  Network Applications

   Various network applications at the service level can effectively run
   over a AI driven Network operation platform to implement either
   conventional or innovative network operations, with low cost and less
   service impact on real networks.

   A network application may be a software tool that a user uses to make
   requests to the network to set up specific services such as end-to-
   end connections or scheduled bandwidth reservations or NOC
   Application /Service AI Agent Application that is responsible for
   monitoring, managing, and maintaining the health, performance, and
   availability of complex networks.

   Network applications make requests that need to be addressed by the
   AI driven network.  Such requests are exchanged through a northbound
   intent interface (e.g., Restful API, Natural Language Programming
   Interface(NLPI),A2A, A2A-T), so that they can be applied by Hybrid
   Agent System at the appropriate twin instance(s).

5.2.2.  Autonomous Domain

   An autonomous domain is a self-governing network that integrates NDT
   and AI driven capabilities to achieve autonomous network management.
   It comprises the following sub-components.

5.2.2.1.  Hybrid Agent System

   Hybrid agent system acts as the smart brain of the Autonomous Domain,
   which is responsible for conducting AI-based analysis and making
   decisions regarding network management operations.  It usually
   comprises a Network AI Agent and multiple task agents or function
   modules (e.g., Agent skills, network management functional APIs or
   tools).  It is worth mentioning that function modules could be the
   functions and services of existing network management systems, it
   offers a set of APIs and tools that can be consumed by agents.  In
   some simplified scenarios, the system may consist of only one single
   AI Agent equipped with multiple function modules, e.g., skills and
   toolsets that integrates both network-level coordination and task-
   specific execution.

   The Network AI Agent may coordinate cross-task-agent collaboration,
   aligns tasks with user intent, and supervises the execution of each
   task agent or function module.  And task agents and function modules
   are designed to perform specific functionalities, they could be
   scenario-oriented and classified according to the function they
   perform.  The hybrid agent system can adapt to new circumstances
   through access to evolving knowledge and reasoning, planning.  It
   leverages the inference of LLM, the simulation of Network Digital
   Twin, and the contextual and domain-specific knowledge provided by
   Knowledge Base to accomplish specific network operation task.

5.2.2.2.  Agent Fabric

   The Agent fabric, which serves as a central management hub, provides
   essential services for the Hybrid Agent System, including agent
   registration/discovery, authentication, observability, and knowledge
   base.

5.2.2.2.1.  Registration

   AI Agents need to first discover each other and understand their
   capabilities to collaborate.  Agent Registration manages the process
   by which new agents could join the system, making them discoverable
   and available.  It supports the unified registration of all AI agents
   within the autonomous domain, including those residing on network
   devices.  Each Agent instance submits its own metadata information
   including URI, supported authentication methods, and capabilities to
   the Agent Registry.  And the consumer Agent (e.g., the Network AI
   Agent or task agent) could query or subscribe to the Agent Registry
   to find appropriate Agents for task execution.

   Agent skills [Agent-skills], introduced by Anthropic, provides a new
   way for Agents to improve how they perform specific tasks through
   folds that include instructions, scripts, and resources that are only
   loaded when needed.  Skills are folds containing a "skill.md" file,
   Registration component may also work as a skill hub that enables
   Agents to query directories for required skills.

   [A2A] implements Agent Registration by providing the Agent Card
   mechanism to ensure Agents from different vendors can register and
   discover other Agents they need.

5.2.2.2.2.  Security & Trust

   The security component enforces trusted inter-Agent communication by
   verifying the identity of AI Agents and enforcing security policies
   throughout their interaction.  It provides unified security
   functionalities for all AI Agents within the autonomous domain,
   including those residing on network devices.  Some existing
   authentication methods such as OAuth 2.0, allow to issue each AI
   Agent its own authentication credentials to establish trusted
   communication.

   Standardized protocols like TLS (Transport Level Security) could be
   leveraged to protect sensitive data exchanged between AI Agents.

   It is also worth noting that once authenticated, authorization
   defines the specific tools and data an agent can access, which often
   using a Least Privilege access control method.  It is also
   recommended to log every Agent decision and tooling call to maintain
   audit trail.

5.2.2.3.  Observability

   Observability component provides unified monitoring capabilities for
   all AI agents within the autonomous domain and enables network
   operators to gain deep insights into Agent behaviors.

   OpenTelemetry(OTel) has emerged as a vendor-neutral approach for AI
   Agent observability.  It provides a consistent instrumentation layer,
   semantic conventions, and telemetry export mechanisms, ensuring
   end-to-end visibility across network AI agents, task AI agents, LLM
   inference, and tool/skill executions.

   By leveraging Semantic Conventions for GenAI [OTel-gen-ai], the
   system unifies the collection of LLM inference metadata, token
   consumption, reasoning processes, and skill invocation events, etc.
   This allows comprehensive tracking of agent runtime status, cross-
   component interaction flows, and abnormal behaviors in both multi-
   agent collaboration scenarios and simplified single-agent deployments
   with integrated multiple skills and tools.

5.2.2.3.1.  Knowledge Base

   The Knowledge Base serves as a crucial repository of information
   within the architecture.  It enables the injection of expert
   knowledge and chain of thoughts, provides the necessary knowledge and
   memory that helps Agents make more accurate and practive context-
   aware decisions.  It also helps mitigate the hallucination problems
   that can arise in large-scale models, which enhances the accuracy of
   task execution.  Additionally, the Knowledge Base plays a key role in
   providing the data needed for techniques like Retrieval-Augmented
   Generation (RAG), which further boosts the system's ability to
   generate reliable and relevant outputs.

   In case of coupling MCP [MCP] with the network management system, the
   new knowledge also can be used to support modification of the
   currently operating automation Closed Loop, such as: - Choice of
   tools (data, analytics, algorithms/decision processes, closed loops)
   - Orchestration of tools

5.2.2.4.  Network Digital Twin

   A Network Digital Twin provides an enhanced and optimized solution in
   the face of increasing network and business types, scale, and
   complexity.  It simulates the behavior, performance, and
   characteristics of the actual network, which could help in validation
   and testing scenarios, analyzing and predicting network behavior
   without affecting the real physical network.

   As described in Section 7 of
   [I-D.irtf-nmrg-network-digital-twin-arch], the core functional
   components of an Network Digital Twin includes Data Repository,
   Service Mapping Models, and a Network Digital Twin Management
   component.  The Network Digital Twin collects the real-time
   operational and instrumentation data from network through the
   appropriate real network-facing input interfaces, and it delivers NDT
   services through appropriate application-facing output interfaces,
   which is the interfaces to Network AI Agent(s) in Figure 2.

5.2.2.5.  Physical Network

   This is the actual hardware and infrastructure that makes up the
   network, which includes a set of network devices and wiring.  In a
   physical network, Network Elements (NEs) with Lightweight AI
   [I-D.irtf-nmrg-ai-challenges] or AI Agent may also achieve some local
   close loop without relying on human intervention.  It is also
   possible for Lightweight AI or AI Agent to coordinate with other AI
   Agent(s) to enhance the automation and efficiency of network
   operations.  The Network Lightweight AI models could be trained,
   validated, deployed, and executed on Network Elements, and further
   refined (e.g., model re-training) through monitoring and continuous
   optimization based on feedback from LLM.

5.3.  Functional Interfaces

   This section describes the interfaces between functional components
   that might be externalized in an implementation allowing the
   components to be distributed across platforms.  Where existing
   protocols might provide all or most of the necessary capabilities,
   they are noted.

   As noted in Section 5.1, it is important to understand that the
   relationships and interfaces shown between components in Figure 2 are
   illustrative of some of the common or likely interactions; however,
   this figure and the descriptions in the subsections below do not
   preclude other interfaces and relationships as necessary to realize
   specific functionality.  Thus, some of the interfaces described below
   might not be visible as specific relationships in Figure 2, but they
   can nevertheless exist.

5.3.1.  Human Oversight (Human on the Loop)

   The architecture allows human experts to monitor, guide, approve, or
   intervene in the AI driven network operations.  Human may provide
   guidance and make critical decisions when necessary.  By involving
   human in the process, the architecture can leverage their insights
   and experience, ensuring AI actions align with organizational goals.

   Human on the loop is also helpful to provide a safeguard for complex
   or sensitive decisions, where human judgement is essential to avoid
   potential errors or ethical dilemmas.

   This typically uses natural language as the primary mode of
   interaction, a chat platform that allows for conversational
   interaction with AI Agents can be leveraged.  In some scenarios,
   operators may use structured format for strategy injection via
   workflows.  Protocols like A2A [A2A], and RESTful API can be
   leveraged.

5.3.2.  Application to Network AI Agent Interface (Intent based
        Management)

   Intent based Network Management helps in delivering application
   requests to the AI Driven network operation platform and exposing the
   various platform capabilities to network applications.

   Standardized protocols and interfaces facilitate smooth communication
   between applications and AI driven network operation platform and
   ensures different systems from various vendors can work together
   seamlessly.  The interfaces between Network applications and Network
   AI Agent can adopt IG1453 Agent to Agent Protocol for Telecoms (A2A-
   T) [A2A-T] specified by TM Forum.

5.3.3.  Network AI Agent to Task AI Agent Interface (Single Autonomous
        Domain)

   This interface governs the coordination and task delegation within
   the Hybrid Agent System of a single Autonomous Domain.  The Network
   AI Agent, acting as the principal coordinator, uses this interface to
   decompose high-level goals into specific tasks and assign them to
   specialized Task Agents (e.g., for configuration generation or fault
   diagnosis).  It facilitates communication for task assignment,
   progress monitoring, and result aggregation.  This coordination can
   be implemented using protocols like [A2A-T].

5.3.4.  Network AI Agent to Network AI Agent Interface (Cross Autonomous
        Domain)

   This interface enables collaboration and information exchange between
   Network AI Agents residing in different Autonomous Domains.  It is
   essential for scenarios requiring end-to-end service assurance or
   coordinated optimization across multi-domain networks.  Through this
   interface, Network AI Agents can negotiate resource allocation, share
   summarized domain-specific insights (while preserving detail
   isolation for privacy and scalability), and coordinate actions to
   fulfill cross-domain objectives.  Standardized protocols like A2A-T
   [A2A-T], designed for agent interoperability in telecommunication
   area, are candidate technologies for implementing this cross-domain
   interface, ensuring secure and reliable interaction between
   autonomous systems from different administrative domains.

5.3.5.  Hybrid Agent System to Agent Fabric Interface (Agent Lifecycle
        Management)

   The interface between Hybrid Agent System and Agent Fabric serves as
   the management bridge which encompasses a set of services designed to
   manage the lifecycle of AI agents, and collaborative capabilities of
   AI Agents such as Registration, Security and Trust, Observability.

   *  Registration Registration handles Agent onboarding, lifecycle
      tracking (e.g., heartbeat monitoring, status updates), and
      capability-based Agent discovery.  Interfaces like RESTful APIs
      with structural schema for AI Agents metadata description could be
      leveraged.  Protocols like A2A [A2A] Agent card mechanism may also
      be used to ensure interoperability among different Agent vendors.
      It is also worth noting that message queue mechanisms such as
      Kafka could also be a candidate interface for asynchronous
      communications for agent registration and discovery.  This
      interface also supports skill management, including skill query,
      loading, unloading, and version control, enabling agents to
      dynamically obtain and use skills provided by the Registration
      component.

   *  Security and Trust Security and Trust ensures trusted inter-Agent
      communication by verifying the identity of AI Agents and enforcing
      security policies throughout their interaction.  Protocols like
      Transport Layer Security (TLS) could be leveraged for in-transit
      data Protection.  While OAuth 2.0 and OpenID Connect are
      increasingly used to authenticate AI Agents.

   *  Observability The observability component enables the network
      operators to gain deep insights of agent behaviors.  It collects
      audit logs, metrics and traces of each agent and provides
      visibility of task process.  In addition, to make the behavior of
      Hybrid Agent System declarative and traceable, it is essential to
      have the capability to distinguish whether an operation action is
      taken by agents automatically or by human operators.  For example,
      in the fault management scenarios that need on-site repairing, the
      demarcation and locating task could be accomplished by agents, and
      the repair solution implementation and verification task is
      accomplished by human operators, maybe with the help of
      intelligent copilot.  The traceable logs are essential for both
      clarifying the duty boundary between agents and human operators,
      and also for evaluating the effectiveness of the agentic system.

   The interface between AI Agent and Knowledge Base is specified in
   Section 5.3.7.

5.3.6.  Hybrid Agent System to Network Digital Twin Interface

   The interface between Hybrid Agent System and Network Digital Twin
   are the application-facing interface as defined in
   [I-D.irtf-nmrg-network-digital-twin-arch].  Furthermore, the Model
   Context Protocol (MCP) [MCP] can be leveraged to standardize this
   interaction, enabling the NDT to expose its simulation and analysis
   capabilities as a set of discoverable "tools" that the AI Agent can
   dynamically invoke.  This MCP-based approach facilitates seamless
   integration and richer contextual exchange between the Agent and the
   NDT.

5.3.7.  Hybrid Agent System to Knowledge Base Interface

   Knowledge Base service provides contextual data and insights to
   enhance the decision-making accuracy of the Hybrid Agent System.

   Interfaces such as Cypher or SPARQL with schema-defined data models
   (e.g., LPG or RDF for knowledge representation) allow efficient
   retrieval and updates.  Other high-throughput interfaces such as gRPC
   or RESTful API can be the candidate for synchronous semantic search
   queries.  For large-scale knowledge operations, asynchronous data
   message systems (e.g., Kafka) can also be employed for data ingestion
   and real-time knowledge synchronization across distributed Agents.

   Additionally, the Model Context Protocol (MCP) [MCP] could also serve
   as a standardized interface for AI Agents to dynamically access and
   utilize a wide range of tools and data sources provided by the
   Knowledge Base.  It enables the Knowledge Base to expose contextual
   information, expert rules, and external data as "tools" that Agents
   can invoke, significantly enhancing their reasoning and problem-
   solving capabilities.

5.3.8.  Hybrid Agent System to Physical Network Interface

   The interface between the hybrid agent system and the physical
   network is established via its internal function modules interacting
   with network elements (NEs).  AI agent can invoke function modules to
   perform specific network operations.

5.3.8.1.  Data Collection

   Data Collection interface is responsible for gathering data from the
   physical network through various different tools and methods (e.g.,
   IPFIX [RFC7011], YANG-push [RFC8639],[RFC8641], and BMP [RFC7854]).
   It collects various types of network data including configuration
   data, operational data, network topology, routing data, logs, and
   trace on management plane, control plane, and forwarding plane as
   needed.  The collected data is fed into the Network Digital Twin and
   hybrid agent system to provide with up-to-date information about the
   current state of the physical network.

5.3.8.2.  Configuration

   Once network decisions are made and confirmed, the Hybrid Agent
   System performs specific actions to the physical network, e.g.,
   modify specific configuration through protocols like NETCONF
   [RFC6241] and RESTCONF [RFC8040], which utilize YANG [RFC7950] as the
   data modeling language.  It is the component that makes the planned
   control and management changes a reality in the real physical
   network.

5.3.8.3.  Lightweight AI and Large AI Model Collaboration Interface

   Collaboration between small AI model and large AI model is also
   designed to be supported by this interface.

   In the past, we only support AI and machine learning technologies at
   the network level, e.g., we can use collected various different
   network data to provide network analysis and generate network
   insight.  With more intelligence introduced into the network element,
   more GPU/NPU resource can be allocated for AI inference, this make
   collaboration between large AI model and small AI model possible.

   Large AI models can provide basic logical reasoning and generalized
   analytical decision-making capabilities While specialized small AI
   models can provide efficient problem-solving capabilities in
   specialized areas.  The synergy between the two allows the AI agent
   to combine both multitasking generalization capabilities and domain
   expertise, thus minimizing the reliance on human intervention in the
   network management process.

   On one hand, we can use accumulated field engineering expertise to
   train large AI model into one foundation model for fault management
   AI agent, On the other hand, we can deploy small AI model, leverage
   hardware resource or chipset resource in the intelligent network
   element to collect more fine granularity data or provide local
   processing for Collected data and summary report generation, Trend
   prediction, etc.  When small AI model is outdated and unable to
   detect specific applications or security risk, these specific
   applications and security risk information can be collected by
   network analytics platform to retrain this small AI model and re-
   deploy it in the same network element when this small AI model has
   been trained to work correctly to detect applications or security
   risk.

   With collaboration between large AI model and small AI model, we can
   allow Network AI Agent within the Network controller interact with
   network element and has more quick response to network change.

   This collaboration, facilitated by APIs or agent communication
   protocols like A2A [A2A], combines the generalization power of large
   models with the efficiency and low-latency of specialized small
   models, leading to quicker and more context-aware responses to
   network change.

5.3.9.  Feedback-driven Improvement Interface

   The architecture should incorporate mechanism for continuous
   improvement based on feedback.  This includes collecting data on AI
   decisions, network performance, and user feedback to identify areas
   for enhancement.  By analyzing the feedback, the system can adapt and
   optimize its operations over time, leading to better performance and
   more accurate decision-making.  For example, if a Network AI Agent
   fails to accurately identify the exact cause of a network incident,
   the relevant records can be submitted as negative samples to the LLM
   which provides inference services, this allows the LLM to be trained
   on these negative samples for optimization.

   This interface is implemented through a combination of system
   interfaces that collect, process, and apply feedback.  Operational
   feedback—including the outcomes of AI decisions, network state
   metrics—is collected as structured data via system logging streams
   (e.g., in JSON format) and message queues (e.g., Kafka).  This data
   is then consumed by analytics components and machine learning
   platforms through APIs (e.g., RESTful, gRPC) to refine AI models, for
   instance, by using failure records as negative samples for fine-
   tuning.  Subsequently, optimized models and updated knowledge are
   deployed back into the runtime system via model serving and
   configuration management interfaces, closing the improvement loop.

6.  AI Driven Network Operations: Relationship Between Characteristics
    and Functional Components

   The architecture in Figure 2 provides a concrete implementation
   framework to realize the six key characteristics of AI-driven network
   operations described in Section 4.  Each characteristic is directly
   supported by specific functional components within the Autonomous
   Domain.  The following clarifies how the architecture operationalizes
   these characteristics:

   *  Intent:  The Network Applications Layer conveys a high-level user
         intent via northbound interfaces.  The Network AI Agent
         interprets this intent and translates it into actionable
         network operation tasks to each task Agent.

   *  Knowledge:  The Knowledge Base in Agent Fabric serves as the
         central repository for domain-specific knowledge, expert rules,
         and historical data.  It provides the necessary context and
         long/short memory to support accurate decision-making by task
         Agents.

   *  Analysis:  The AI Agent in Hybrid Agent System performs
         intelligent analysis using data and tools.  It leverages the
         Network Digital Twin to simulate and validate scenarios,
         enabling data-driven insights and gap analysis between intent
         and current network state.

   *  Decision:  The AI Agent in Hybrid Agent System makes informed
         decisions based on its analysis results.  It utilizes the
         Network Digital Twin for risk-free validation before finalizing
         decisions.  The decision may be sent to human operators for
         confirmation before actions are taken.

   *  Awareness:  The AI Agent in Hybrid Agent System gathers data from
         the Physical Network, it may also fetch data from the Network
         Digital Twin which maintains a dynamic, virtual representation
         of Physical Network.  Together, they provide comprehensive
         network visibility and context-aware awareness.

   *  Execution: ：The AI Agent in Hybrid Agent System implements
      validated decisions by applying configurations or control actions
      to the Physical Network via southbound interfaces such as NETCONF,
      RESTCONF, or Model Context Protocol [MCP].

7.  AI Agent Registration and Team formation

                                                     +----------------+
                                                     |Agent Fabric    |
                    +-----------+                    | +------------+ |
                    |  Network  |      Register      | |            | |
                    | AI Agent  +--------------------+->            | |
                    +-----------+                    | |Registration| |
          |---------------+----------------+---------+->            | |
         Register       Register         Register    | |  Center    | |
          |               |                |         | +------------+ |
   +------+----+   +------+-----+    +-----+----+    |                |
   |   Fault   |   |Optimization|    | Change   |    |                |
   | Task Agent|   |Task Agent  |    |Task Agent| ...|                |
   +-----------+   +------------+    +----------+    +----------------+


                    +-----------+
                    |  Incident |
                    |   Center  |                    +----------------+
                    +-----+-----+                    |Agent Fabric    |
                     Incident Notif                  | +------------+ |
                          |        Registry based    | |   Agent    | |
           Intent   +-----V-----+    Discovery       | |            | |
          Objective |  Network  +--------------------+->Registration| |
     Human ---------> AI Agent  |                    | |            | |
    Operator        |           |                    | |  Center    | |
                    +-----+-----+                    | +------------+ |
             Policy Definition(Obj+Incident+Action)  |                |
                    Task Planning                    |                |
                   Agent Team Formation              |                |
          +---------------+----------------+         +----------------+
          |               |                |
    +-----+-----+    +----+------+    +----+------+
    |           |    |           |    |           |
    |Task Agent1|    |Task Agent1|    |Task Agent1|
    +-----------+    +-----------+    +-----------+

      Figure 3: AI Agent Registration and Team formation Usage Example

   The Agent Fabric at the network level provides agent registration for
   both embedded AI agent in each network element and network AI agent
   and associated task agents.  The following steps are performed to
   provide Event driven AI Agent Team formation within the Agentic AI
   network management architecture:

   Step 1:  Human Operator pre-provision user intent which comprises
      objective, incident list and corresponding action list.

   Step 2:  Network AI Agent receieves user intent and generate
      corresponding policies which comprise objective, incident list and
      corresponding action list.  In addition, Network AI Agent
      subscribe corresponding incidents from incident center.

   Step3:  Network AI agent generate task planning based on objective
      and then discover matched task agent lists based on planned task
      from registration center within the agent fabric.

   Step4:  Upon receiving incident from the incident center, network AI
      agent assign the tasks to task agents corresponding to specific
      incident received.

   Step 5:  Network AI Agent distribute task to corresponding task
      agents and complete task agent team formation.

8.  Agent to Agent Communication Security

      Artifacts Delivery
    +-----------+
    |           |                                    +----------------+
    |           |                                     Agent Fabric    |
    |        +--V--------+  Agent Identity Discovery | +------------+ |
    |        |    Task   <---------------------------+->   Agent    | |
    |        |  AI Agent <---------------------+     | |            | |
    |        +-^------^--+  Authentication     |  +--+->Registration| |
    |          |      |       Handshake        |  |  | |            | |
    |   Secure |  Stateful                     |  |  | |  Center    | |
    |   Task   |   Monitoring                  |  |  | +------------+ |
    Negotiation|      |                        |  |  |                |
    |          |      |     Agent Identity     |  |  | +-------------+|
    |        +-V------V--+     Discoverey      +--+--+->   Agent     ||
    |        |    Task   +------------------------+  | |Authorization||
    +-------->  AI Agent <---------------------------+->   Server    ||
             +-----------+  Authentication           | |             ||
                             Handshake               | +-------------+|
                                                     +----------------+

       Figure 4: Agent to Agent Communication Security Usage Example

   The following steps are performed to provide Agent to Agent
   Communication Security within the Agentic AI network management
   architecture:

   Step 1.  Discovery via Agent Card:  The workflow begins when a client
      agent requests the Agent Card from the remote agent's /.well-
      known/agent.json endpoint.  This JSON file acts as a secure
      manifest, declaring the agent’s identity, capabilities, and
      required security schemes.

   Step 2.  Authentication Handshake:  Before any task is sent, the
      client must fulfill the authentication requirements listed in the
      Agent Card.  This typically involves an OAuth 2.0 flow where the
      client obtains a JSON Web Token (JWT) to prove its identity and
      permissions.

   Step 3.  Secure Task Initiation: :Communication is established over
   HTTPS/TLS.  The client sends a tasks/send request using JSON-RPC 2.0.
   The server validates the token and authorizes the specific task based
   on the client's role.

   Step 4.  Stateful Monitoring & Feedback: :The task moves through a
   strictly defined lifecycle (submitted → working → completed).
   Security is maintained throughout as updates are streamed via Server-
   Sent Events (SSE) or webhooks, each tied to the unique, authorized
   Task ID.

   Step 5.  Artifact Delivery:  Final results (Artifacts) are delivered
      only after the task reaches a completed state.  These are
      structured objects (text, files, or data) returned to the verified
      requester, ensuring data integrity and preventing unauthorized
      access to output.

9.  AI Driven Network Operations: A collection of Use Cases

   Network AI Agent could help in the following phases which are usually
   mentioned in network management:

   *  Network Planning and Design: includes the understanding of user
      intent, generation of solutions, and simulation for decision-
      making.

   *  Service Deployment: includes the construction of the physical
      network, as well as intent understanding, pre-deployment
      simulation, automated configuration, post-deployment validation,
      and other capabilities to enhance the efficiency and accuracy of
      network configuration for service deployment.

   *  Network Monitoring and Troubleshooting: includes intent
      monitoring, issues identification, solution generation, evaluation
      and decision-making, solution implementation, and service
      validation.

   *  Network Change and Optimization: involves the design, evaluation,
      decision-making, implementation, and validation of network
      configuration changes or optimizations to improve network
      operation efficiency.

   In all phases and use cases, after the Agent performs specific
   action, it always continuously monitors the network by data
   collection.  Based on the result of network running analysis and user
   explicit feedback, it may adjust and optimize the management strategy
   if necessary.

9.1.  Multi-Agent Collaboration on Network Configuration Change

   Network configuration changes are needed in scenarios such as
   optimizing network or service performance, provisioning new network
   services, or resolving network incidents/faults.

                       +------------+
         Network       |  Service   |
         Operator ----->  AI Agent  |
                       |            |
                       +-----+------+
                             |Configuration               +---------+
                             |Change Intent               |         |
                     +-------V---------+                  |         |
                     |Network AI Agent |                  |         |
                     |     Goal        |                  |         |
             +-------| Task-1,Task-2   +------------------>         |
             |       |   ... Task-n    +-------+          | Network |
             |       +-------+---------+       |          | Digital |
             |               |                 |          |  Twin   |
             |               |                 |          |  Task   |
       +-----V------+  +-----V-------+  +------V----+     |  Agent  |
       |   Config   |  |  Config     |  |  Config   <----->         |
       | Generation |  | Distribution|  |Validation |     |         |
       | Task Agent |  |  Task Agent |  | Task Agent|     |         |
       +------------+  +-------------+  +-----^-----+     +---------+
                                              |
                                       +------V------+
                                       |  Resource   |
                                       | Allocation  |
                                       | Task Agent  |
                                       +-------------+

     Figure 5: Intent Based Network Configuration Change Usage Example

   Network configuration change leveraging Network AI Agent and Network
   Digital Twin may experience the following typical steps:

   Step 1:  The network operator inputs the intent of network
      configuration change into the Network AI Agent using natural
      language.  The network operator may simply explain the objectives
      and requirements of the changes.

   Step 2:  Network AI Agent first verifies the identity of the user
      requesting the change and checks the user's permissions to make
      certain types of network changes against predefined rules or
      policies.  It then understands and parses the initial intent of
      the request, by leveraging the powerful knowledge and reasoning
      capabilities of LLM and decompose the tasks into configuration
      generation task, configuration distribution task, configuration
      validation task and assign to corresponding task agents.
      Configuration generation Task Agent first generates initial
      suggestions for specific network configuration update, which may
      include multiple possible network configuration change plans if
      possible.

   Step 3:  Network AI Agent further communicates with the Configuration
      Validation task agent and Network Digital Twin task agent to
      validate the suggested configuration change, including the syntax
      and semantics of the configuration, verification of effected
      application and resources.  The network digital Twin task agent
      may generate a report indicating the validation result, and
      suggested configuration fix when the validation fails after
      network simulation leveraging the current physical network
      operational state.

   Step 4:  Network AI Agent may generate a configuration change plan
      and submit to the network operator for approval.  Based on the
      feedback from the operator, Network AI Agent then further decides
      whether to optimize the change plan or deliver the plan to the
      Configuration Distribution task agent to conduct the physical
      network configuration change.  The configuration distribution task
      agent may further communicate with resource allocation task agent
      to obtain network resource (e.g.,vlan, IP subnet) allocated by
      resource allocation task agent.

9.2.  Multi-Agent Collaboration on Network Troubleshooting

                                  |
                          Human   |Network Troubeshooting
                         Operator |Intent
                                  |                    +---------+
                         +--------V--------+           | Network |
                         | Network AI Agent|           | Digital |
                         |     Goal        |           |  Twin   |
                         | Task-1,Task-2   +----------->  Task   |
                         | ....Task-n      |           |  Agent  |
                         +--------+--------+           +---------+
                                  |
            |------------------+--+-----------+------------+
        +---+----------+ +-----+-----+  +-----+----+ +-----+------+
        |    Fault     | |   Fault   |  |   Fault  | |   Fault    |
        |Identification| | Diagnosis |  |  Repair  | | Prediction |
        | Task Agent   | |Task Agent |  |Task Agent| | Task Agent |
        +--------------+ +-----------+  +----------+ +------------+

        Figure 6: Intent based Network Troubleshooting Usage Example

   The network operator inputs the intent of network configuration
   change into the Network AI Agent using natural language.  Network AI
   Agent could plan and decompose network troubleshooting tasks and
   coordinate with fault identification task agent, fault diagnosis task
   agent, fault repair task agent and fault prediction task agent to
   assist in network troubleshooting in the following significant
   aspects:

   *  Fault Identification:  Network AI Agent coordinates with fault
         identification task agent continuously monitors and aggregates
         data from various sources, the comprehensive data collection
         provides a holistic view of the network operational state.  By
         analyzing the real-time data, fault identification task Agent
         could detect network anomalies swiftly, which enables the
         prompt identification of potential issues before they escalate
         into major faults, minimizing downtime or service disruptions.
         In some cases, the Lightweight AI located in the Network
         Element may handle some simple fault identification tasks
         (e.g., optical module fault automatic identification) to
         enhance the awareness, while the fault identification task
         agent and LLM could leverage their powerful processing
         capabilities to analyze the time-domain data collected from the
         optical module.

   *  Fault Diagnosis:  Once a fault is identified, Network AI Agent
         coordinate with fault diagnosis task agent to delve into
         diagnosing the exact cause, fault diagnosis task agent may also
         invoke some existing operations such as "incident-diagnose" RPC
         defined in [I-D.ietf-nmop-network-incident-yang].  By
         correlating symptoms and/or applying AI models trained on
         historical data, fault diagnosis task agent can narrow down the
         potential causes and pinpoint the exact cause, which
         accelerates the diagnosis process and reduces the time needed
         to address the issue.

   *  Fault Repair:  After diagnosing the fault, Network AI Agent can
         coordinate with fault repair task agent to generate targeted
         repair solutions.  These solutions range from specific
         configuration adjustments to more complex fixes (e.g., hardware
         replacement).  Fault Repair task Agent would also communicate
         with the Network Digital Twin task agent to simulate the
         proposed repair solutions and get feedback from the Network
         Digital Twin task agent.  In advanced setups, fault repair task
         agent may automatically execute these repairs, ensuring quick
         restoration of normal operations and enhancing the overall
         reliability and efficiency of network management.  But the
         fault repair task agent may also first present the fault
         details and repair advice to the network operator for review,
         and proceed to carry out the repair task once it is confirmed.

   *  Fault Prediction  As an advanced enhancement of fault management
         capabilities, fault prediction aims to reduce network risks
         through proactive management that prevents problems before they
         occur.  Before a fault actually occurs, the fault prediction
         task agent can coordinate with network digital twin task agent
         to construct a dynamic simulation model by collecting real-time
         multi-dimensional operational state data, including network
         topology, traffic load, and device performance indicators.
         Based on the network data, the fault predication task agent
         uses large models and machine learning algorithms (such as
         time-series prediction models and anomaly detection models) to
         reason and analyze potential faults—for example, predicting the
         risk of physical link interruption based on optical cable
         signal attenuation data.  Furthermore, the fault prediction
         task Agent generates recommended operations to avoid faults and
         validates them through simulation in the network digital twin
         task agent, thereby achieving predictive maintenance of the
         network.

9.3.  Multi-Agent Collaboration on Network Optimization

                                |
                                |
                        Human   |Network Optimization
                       Operator |Intent
                                |                    +---------+
                       +--------V--------+           | Network |
                       | Network AI Agent|           | Digital |
                       |     Goal        |           |  Twin   |
                       | Task-1,Task-2   +----------->  Task   |
                       | ....Task-n      |           |  Agent  |
                       +--------+--------+           +---------+
                                |
                     +----------+----------------+
                     |                           |
             +-------+------+            +-------+-------+
             | Optimization |            |  Optimization |
             |  Generation  |            |  Distribution |
             | Task Agent   |            | Task Agent    |
             +--------------+            +---------------+

         Figure 7: Intent based Network Optimization Usage Example

   Network optimization is often introduced due to the Network AI
   Agent's awareness of some potential network faults or anomalies
   through continuously monitoring of network operational state, e.g.,
   AI models may predicts network congestion by analyzing historical and
   real-time network traffic data.  It may also be triggered by the
   network operator actively inputting the network optimization intent.

   Based on the analysis of network data and user's intent (if any),
   Network AI Agent collaborate with Optimization Solution Generation
   Task Agent to propose network optimization strategies.  For instance,
   once the network congestion sometime in the future is predicted, it
   may proactively optimize the network configuration, or suggest
   scaling up to meet specific demands.

   Before the network optimization is conducted, Network AI Agent
   coordinates with the network digital twin task agent to implement and
   evaluate the optimization solution using the Network Digital Twin
   platform.  This may need repeated trials and validations based on
   specific evaluation criteria, before the optimal strategy could be
   selected.  Network AI Agent may also first present the suggested
   network optimization solution to the network operator for review, and
   apply it to the physical network through optimization solution
   distribution task agent after obtaining approval from the network
   operator.

9.4.  Network level Energy Efficiency Management in the IP+Optical
      network

                            +-----------------+
                            |  Multi-Domain   |
                            |    AI Agent     |
                            |  GREEN Goal     |
                            | Task-1,Task-2   |
                            | ....Task-n      |
                            +--------+--------+
                                     |
                       +-------------+--------------+
                       |                            |
                 +-----+------+              +------+-----+
                 |    IP      |              |   Optical  |
                 |   Domain   |              |   Domain   |
                 |   Network  |              |   Network  |
                 |   AI Agent |              |   AI Agent |
                 +------------+              +------------+

     Figure 8: Intent based Network level Energy Efficiency Management
                               Usage Example

   Network level Energy Efficiency refer to a set of processes used to
   discover a inventory of capabilities, use specific metrics to monitor
   and assess energy consumption of the entire IP+Optical network ,
   operate, and control the use of available energy in an optimized
   manner while achieving the network’s functional and performance
   requirements by improving overall network utilization.

   Multi-Domain AI Agent can work together with network AI Agent in each
   autonomous domain to allow network operators not only see real time
   energy consumption in the network devices of large scale network
   through interaction with the GREEN Network AI Agent, but also allow
   them see

   o which network devices enable energy saving, which devices not,
   which are legacy ones,

   o The total energy consumption changing trend over the time of the
   day, for all network devices,

   o Energy efficiency changing trend over the time of the day for the
   whole network.

   On the other hand, With the end to end observability to energy
   consumption statistics data and energy efficiency statistics data,
   the Network AI Agent in each autonomous domain can collaborate with
   network digital twin to know which part of the network need to be
   adjusted or optimized based on network status change.

9.5.  Network Security Drills (Human on the Loop)

                                 +----------------+   Human
                                 | Agent Fabric   |  Operator
            +-----------------+  |                | /---------\
            | Network AI Agent|  |+--------------+| | Analyze |
            |     Goal        |  ||              || | Define  |
            | Task-1,Task-2   <-->|Observability |<-> Inject  |
            | ....Task-n      |  ||              || \---------/
            +-------+---------+  |+--------------+|
                    |            |                |
                    |            +---------^------+
                    |                      |
            +-------V------ ---------------V----------------+
            | Dynamic Attack and Defense Verification System|
            |        Based on Network Digital Twin          |
            |      +---------+      +---------+             |
            |      | Dynamic |      | Dynamic |             |
            |      | Security|      |Security |             |
            |      | Attack  |      |Defense  |             |
            |      |  Task   |      |  Task   |             |
            |      |  Agent  |      |  Agent  |             |
            |      +---------+      +---------+             |
            +--------------------^--------------------------+
                                 |Data Collection
            +--------------------+--------------------------+
            |            Network Infrastructure             |
            +-----------------------------------------------+

        Figure 9: Intent based Network Security Drill Usage Example

   The human operator can work together with the Network AI Agent to
   conduct Network security Drill.  The human operator can instruct the
   Network AI Agent with specific injection policy to work with network
   digital twin help construct a dynamic attack-defense verification
   system in network security drills through NDT and AI reasoning
   capabilities.  The dynamic attack-defense verification system
   comprise dynamic security attack task agent and dynamic security
   defense task agent which are responsible security risk attack task
   and security risk defense task respectively assigned by the network
   AI agent.  The dynamic security attack task agent uses generative AI
   to automatically generate diversified attack paths, models network
   topologies with graph neural networks, covers attack stages such as
   reconnaissance and penetration, and dynamically adjusts strategies
   via reinforcement learning to simulate the adaptive characteristics
   of network attacks.  The virtual range built based on the NDT can 1:1
   map the production environment, supporting simulations of composite
   scenarios like ransomware chain attacks and supply chain attacks —
   such as simulating the entire process of Contivirus laterally
   penetrating to domain controllers through weak passwords.

   During drills, Human operator can instruct the Network AI Agent to
   work with the dynamic security defense task agent to automatically
   deploy virtual environments with vulnerabilities, collect defense
   response data in real time through NDT, and generate attack path
   heatmaps and repair suggestions.  This capability can further verify
   emergency response processes, inject real-time threat intelligence to
   dynamically update drill scenarios, and simulate end-to-end automated
   deployment, vulnerability injection, and real-time analysis of
   security drills, enhancing the proactive verification ability of
   defense systems against real-world threats.

10.  Challenges of Integrating Network Digital Twin and Agentic AI into
     Network Management

   In addition to the research challenges in coupling AI and network
   management specified in [I-D.irtf-nmrg-ai-challenges], this document
   also identifies some challenges that need to be considered when
   integrating agentic AI together with network digital twin for network
   operations.

10.1.  Make IETF developed YANG models AI ready

   Network management has evolved towards model-driven programmability
   using YANG (RFC 7950).  While YANG provides a rigorous structure for
   data, it was designed for deterministic software clients rather than
   the probabilistic reasoning of Large Language Models (LLMs).
   Consequently, AI agents often struggle to navigate complex YANG
   hierarchies, leading to "hallucinations" or incorrect configuration
   logic.

   The emerging Model Context Protocol (MCP) provides a framework for
   connecting AI models to external tools and data.  However, there is
   currently no standardized way to map YANG primitives to MCP
   components (Resources, Tools, and Prompts).  This gap results in:

   o Semantic Ambiguity: LLMs lack the contextual hints needed to
   understand the operational impact of specific YANG leaves.

   o Interoperability Barriers: AI agents require custom "glue code" for
   every vendor's unique interpretation of how a YANG model should be
   exposed to an AI.

   o Scaling Issues: Massive YANG schemas exceed LLM context windows,
   requiring standardized methods for sub-schema discovery and pruning.

10.2.  Trust and Security

   Multi-Agent Collaborations and interactions can be break down into 4
   typical scenarios:

   o Human operator-> AI Agent-> APIs/Tools/APIs/LLMs

   In a single-agent scenario, Human operators access services through
   the network management AI agent.  The network management AI agent has
   multiple functions (fault and optimization), and authentication is
   required to prevent users from Performing unauthorized actions by
   exploiting privilege vulnerabilities,e.g., accessing the optimization
   function interface when they only have fault agent permissions.

   o AI Agent -> API Services

   In a single-agent scenario, the network management AI agent triggers
   tasks automatically based on trace and log information.  In some
   cases, the logging or decision-making process cannot be traced.

   o Human operator-> AI Agent -> multiple AI Agent

   Multiple agents may call each other.  For example, if a faulty agent
   A calls an optimization agent B, authentication is required to
   prevent Manipulating communication channels of agents to influence
   decision-making processes.

   o External AI Agent ->AI Agent-> APIs/Tools/APIs/LLMs

   External AI agents can directly access the network management AI
   Agent by simulating human operation through interface protocols.  For
   example, a customer AI agent can access the Network AI Agent through
   a northbound interface protocol such as A2A, MCP.  In some case,
   there might have target flaws in protocols like MCP or A2A;
   e.g.,consent bypass, context hijacking, etc.

   Ensuring robust security throughout the entire AI-based network
   management architecture is essential to prevent unauthorized access
   and maintain the security of the network infrastructure.  The
   security risk can be break down into the following cases:

   o External system interacts with AI agent

   Human Operators or external systems bypass their privileges to
   operate the Network management AI agent

   *  Human operators operate through a network management AI agent, but
      the service scope that the AI agent can handle may exceed the
      user's authorized privileges, leading to unauthorized access.

   *  Human operators manipulate business operations using agent
      delegation and authorization.

   *  Human operators Coerce intelligent agents to manipulate users into
      performing covert operations.

   o AI Agent Interact with Tools/APIs/LLMs

   *  Interact with Tools/APIs/LLMs with Privilege Escalation

      Since network management AI agents rely on LLM for inference when
      accessing APIs and tools, there is a possibility of malicious
      injection scenarios where the APIs accessed by the agent exceed
      the expected scope,e.g.,Using AI to generate execution
      environments and inject malicious code.

   *  Interact with Tools/APIs/LLMs without audit

      When an agent interacts with an API or tool, the logs are
      currently recorded as system logs, which cannot distinguish
      between different agents, the logging or decision-making process
      for specific agent cannot be traced and pose a risk of
      repudiation.

   o Multiple Agent Collaboration and Communication

   *  Multi-Agent Communication with Privilege Escalation

      When agents communicate with each other via intent communication
      and understanding, there is a cascading permission amplification
      problem,e.g.,AI-generated false information disrupts the reasoning
      process , leading to Privilege Escalation.

   *  Interaction with internal AI Agent

      As a microservice, a network management AI agent can be accessed
      by other services or AI agents,e.g., Performing unauthorized
      operations by exploiting authentication vulnerabilities,which
      poses a risk of Privilege Escalation.

10.3.  Protocols between Agent and Agent/Human operator/Tools

10.3.1.  High Risk Operations

   "network change" in the network management field refers to the
   modification, adjustment, or configuration of physical or logical
   resources in the existing network.  Because these operations directly
   affect the continuity and stability of existing network services,
   even minor unauthorized access or errors can lead to large-scale
   network outages.  In case of high risk operation, the following
   measures should be taken into account.

   o Regardless of whether the initiator of the operation is a human or
   an agent, each network change request must be re-authenticated.

   o For operations involving "bulk deletion" or "core route changes",
   the system must require approval from a second high-privilege account
   before issuing the command or enforcing the policy.

   o High-risk configurations should be tested on a very small scale
   first.  The system should automatically monitor the indicators, and
   if any abnormalities are detected, the configuration should be
   automatically rolled back within milliseconds.

10.3.2.  The Timeliness Requirements of Collaboration

   For real-time network operation and maintenance scenarios with high
   real-time requirements, such as scheduling strategy optimization and
   critical fault repair, the rapid generation of network optimization
   decisions is crucial.  However AI Agents based on large models adopt
   a "Token-based" generation and reasoning approach, which is limited
   by computing power and algorithms, resulting in generally slow
   reasoning speeds.  In addition, the simulation and verification
   process of Network Digital Twin (NDT) further increases decision
   latency, which leads to long end-to-end decision-making time in
   complex scenarios and is difficult to meet the real-time requirements
   of services.

   Also tasks such as fault diagnosis, complaint handling, and user
   experience improvement often have strict time constraints.  For
   example, if a fault is not resolved within a set time, it can trigger
   an escalation of the complaint, requiring efficient collaboration
   among multiple agents.  In addition, In a network management
   environment, you might need agents to subscribe to real-time network
   alarms or telemetry events.  However Google-initiated A2A protocol
   primarily follows a task-oriented "request-response" model and
   doesn't support a native pub/sub or event-driven architecture.

   To improve decision efficiency, continuous efforts are needed in
   lightweight NDT modeling algorithms, optimizing large model reasoning
   frameworks (such as quantization technology and parallel computing),
   and deploying high-performance AI acceleration hardware.

10.3.3.  Collaboration Reliability

   Fault diagnosis and complaint handling in the network management
   field are complex tasks, typically involving 10 to 20+ fields for one
   single message exchange between two AI Agents and requiring a high
   level of expertise.  In addition, Reliable task collaboration is
   incomplete and not sufficient for network management field, e.g.,
   Google-initiated A2A protocol doesn't defined handling strategies for
   task rejection, missing information during task collaboration, and
   failure to achieve task objectives.

   also In network management area, data sources can be diverse and
   heterogeneous, leading to potential issues such as data
   inconsistencies, missing, or outdated data.  Poor-quality data may
   result in inaccurate AI predictions and decisions.  For example, if
   incorrect or outdated network configuration data is provided, the
   model may provide incorrect repair advice when diagnosing network
   incidents or faults, it may suggest checking an non-existing
   interface.  Ensuring that data is properly cleaned, validated, and
   maintained is a significant challenge in providing reliable inputs
   for AI-driven network management.

10.4.  Agent Observability,Intervention and Control

   Network operations are increasingly autonomous with the growth of
   network management Agent applications at the network level and
   service level.  Since AI native operations may be non-deterministic,
   when network management agents misbehave or deviate from what Agents
   are expected to do, Current AI safety technologies, often referred to
   as "AI guardrails" are introduced to constrain the behavior of AI
   agents within operational and compliance boundaries, prevent AI from
   producing harmful results or taking wrong actions, e.g.,escalate a
   decision to a human for a high-risk network operation, defend against
   malicious attacks,e.g., prompt injection.  These guardrails typically
   operate at the input/output/pre-action filter level or through static
   boundary alignment.

   However as AI systems are increasingly integrated into autonomous
   workflows and critical infrastructure, these static measures are
   proving insufficient for the full operational lifecycle, they often
   cannot detect, interrupt, and rollover from unanticipated behaviors.
   Network operators usually lack an equivalent infrastructure for human
   oversight or to provide continuous, monitoring of an AI system’s
   internal logic or its long-running execution paths that match the
   speed and scale of the network management Agent applications, e.g.,
   network failure or security risk is hard to detect and control,
   occurring at machine speed.  When a violation is suspected, there are
   currently no standardized protocols for intervention (e.g., immediate
   task suspension) and recovery (e.g., reverting to a last known safe
   state or undoing a series of autonomous actions that introduce
   substantial operational risk) mechanisms.  In non-deterministic
   environments, the lack of human oversight and human-AI semantic
   intent exchange hinder timely risk mitigation and state recovery
   during boundary violations by agents.

10.5.  Agent Benchmarking vs Model Benchmarking

   Current AI evaluation frameworks are primarily designed for "one-
   shot" Large Language Models (LLMs) that provide static responses to
   isolated prompts.  However, the industry is shifting toward AI
   Agents: systems capable of reasoning, planning, and using tools to
   execute complex, multi-step workflows.

   Traditional benchmarks (e.g., HumanEval) focus on outcome-only
   scoring and static knowledge.  These fail to measure the core
   competencies of agents, such as long-horizon planning, tool-use
   proficiency, and the ability to adapt to environmental feedback.

   As agents increasingly manage production-level tasks—often involving
   millions of tokens and hundreds of tool calls—there is a critical
   need for standardized metrics that evaluate the process of task
   completion, not just the final result.

   o Sequential Complexity: Unlike single-turn benchmarks that check one
   answer, agents require multi-step task completion where each step
   relies on the success of the previous one.  This involves massive
   workflows (e.g., ~90 tool calls) that traditional benchmarks don't
   measure.

   o Environmental Interaction: Agents must interact with live
   environments.  Evaluation shifts from checking if code passes a test
   to whether a system can select, sequence, and execute tools in
   realistic, dynamic settings.

   o Adaptive Planning: Benchmarking must move beyond static knowledge
   retrieval to evaluate how systems build and adapt plans over long
   horizons, especially when faced with resource constraints or
   unexpected disruptions.

   o Process vs. Outcome: Traditional "right or wrong" scoring is
   insufficient.  Agentic evaluation requires process metrics such as
   tool success rates, context retention, and multi-turn coherence to
   determine how an agent reached its conclusion.

11.  Security Considerations

   The security consideration from
   [I-D.irtf-nmrg-network-digital-twin-arch] apply here.  In addition,
   the following architectural risks need to be considered:

   o Memory Poisoning：If the AI/ML models used by the network AI Agent
   or Network digital twin are compromised or poisoned with malicious/
   fake data,they could begin making incorrect or malicious decisions.
   Robust checks and validation are necessary to ensure the integrity of
   these models.  Session isolation or memory access authentication is
   also required to mitigate such risk.

   o Misuse of Tools： When network AI Agent interacts with tools,
   Deceptive prompts or commands can be introduced.  Tool access
   verification, tool monitoring, log tracking of AI tool usage are
   required to mitigate such risk.

   o Privilege Compromise：When human operators interact with Network AI
   Agent or Network AI Agent interact with tools/APIs/LLM, unauthorized
   actions might be performed by exploiting privilege vulnerabilities.
   Fine-grained permission control, dynamic access verification, and
   role change monitoring are required to mitigate such risk.

   o Resource Overload：When Network AI Agent interact with tools/APIs/
   LLMs, system failures might be caused by exploiting resource-
   intensive features.  Deployment of resource management controls to
   limit high-frequency task requests from agents is required to
   mitigate such risk.

   o Cascading Hallucinations：In case of multi-agent collaboration or
   communication, AI-generated false information might disrupt the
   reasoning process.Output verification, secondary verification of AI-
   generated knowledge are required to mitigate such risk.

   o Intent Breaking &Goal Manipulation：When external AI Agent or human
   operators interact with the network AI Agent, reasoning through agent
   planning capabilities might be manipulated.  Planning verification,
   managing reflection processes, goal consistency protection are
   required to mitigate such risk.

   o Misaligned & Deceptive Behaviors：When Network AI Agent interact
   with tools/APIs/LLMs, harmful operations might be performed by
   exploiting reasoning vulnerabilities.  Manual confirmation of high-
   risk operations, logging, monitoring, and deception detection are
   required to mitigate such risk.

   o Repudiation & Untraceability：In case of multi-agent collaboration
   or communication, the logging or decision-making process might not be
   traced.  Logging & cryptographic signature, cryptographic
   verification are required to mitigate such risk.

   o Identity Spoofing & Impersonation：When Human operators interact
   with the network AI Agent or in case of multi-agent collaboration or
   communication,unauthorized operations might be performed by
   exploiting authentication vulnerabilities.  Comprehensive identity
   verification, trust boundary control, and continuous monitoring are
   required to mitigate such risk.

   o Overwhelming HITL(Human In The Loop)：In case of multi-agent
   collaboration or communication,fatigue auditors attack might take
   place.  Developing advanced human-machine interaction frameworks and
   adaptive trust mechanisms are required to mitigate such risk.

   o Unexpected RCE & Code Attacks：When the network AI Agent interacts
   with tools/APIs/LLMs, Using AI to generate execution environments and
   inject malicious code might take place.  Restrict AI code generation
   permissions, sandbox isolation, and manual review of generated code
   are required to mitigate such risk.

   o Agent Communication Poisoning：In case of multi-agent collaboration
   or communication, communication channels of agents might be
   manipulated to influence decision-making processes.  Message
   authentication, communication verification, implementing multi-agent
   authentication mechanisms.

   o Rogue Agents in MAS：In case of multi-agent collaboration, there
   might be malicious or compromised agents.  Restricting the autonomy
   of agents and conducting regular AI testing are required to mitigate
   such risk.

   o Humans Attacks on MAS：In case of multi-agent collaboration,
   business operations might be manipulated using agent delegation and
   authorization.  Restricting the delegation mechanism, implementing AI
   agent identity authentication, and isolate tasks in segments are
   required to mitigate such risk.

   o Human Manipulation：When Human operators interact with the network
   AI Agent, AI agents might be coerced to manipulate users into
   performing covert operations.  Safety guardrails, content moderation,
   output content detection are required to mitigate such risk.

   o Insecure Inter-Agent Protocol Abuse：In case of multi-agent
   collaboration, there might be target flaws in protocols like MCP or
   A2A; e.g.,consent bypass, context hijacking, etc.  Strong
   authentication,data validation, restricting delegation to scoped
   function,logging agent and tool invocations and encrypting
   communications are required to mitigate such risk.

   o Supply Chain Compromise：In case of multi-agent collaboration,
   Vulnerable, malicious, outdated, harmful components might be included
   into the agent.  Digital signatures of SBOMs (AI_, Agent_),applying
   version control, chaining authentication, environment isolation are
   required to mitigate such risk.

   o Lifecycle security: The entire management lifecycle of the network
   AI agents and the network digital twin—from initial deployment and
   configuration to updates and decommissioning—must be secured against
   unauthorized access and manipulation.

12.  IANA Considerations

   This document has no requests to IANA.

13.  Conclusion

   The following items were felt to be good starting points for IETF
   work:

   *  Nature Language Interaction protocol to ensure both The accuracy
      and efficiency of structured data for deterministic tasks and
      natural language interactions for understanding and handling
      uncertain or ambiguous tasks.

   *  Translate Service Level YANG Data model and Network Level YANG
      Data model into DSL payloads and APIs which can be consumed by
      Agents and Models.

   *  Human and Agent Interaction to support explainable, observability
      and controllable capabilities.

   *  Define semantic information transfer or agent prompt language
      template for Agent to Agent Communication.

   *  Agent to Agent Protocol extensions for IP Network Agent and
      Network Element Agent Collaboration.

14.  References

14.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

14.2.  Informative References

   [A2A]      "Agent2Agent (A2A) protocol", April 2025, <https://google-
              a2a.github.io/A2A/#/documentation?id=agent2agent-protocol-
              a2a>.

   [A2A-T]    "Agent to Agent Protocol for Telecoms (A2A-T)", 2025,
              <https://www.tmforum.org/resources/introductory-guide/
              ig1453-agent-to-agent-protocol-for-telecoms-a2a-
              t-v1-0-0/>.

   [Agent-skills]
              "Agent Skills", 2025, <https://agentskills.io/home>.

   [Google-Agents-Whitepaper]
              "Agents", 2024,
              <https://www.kaggle.com/whitepaper-agents>.

   [I-D.hong-nmrg-agenticai-ps]
              Hong, Y., Youn, J., Wu, Q., and B. Claise, "Motivations
              and Problem Statement of Agentic AI for network
              management", Work in Progress, Internet-Draft, draft-hong-
              nmrg-agenticai-ps-01, 1 March 2026,
              <https://datatracker.ietf.org/doc/html/draft-hong-nmrg-
              agenticai-ps-01>.

   [I-D.ietf-nmop-network-incident-yang]
              Hu, T., Contreras, L. M., Wu, Q., Davis, N., and C. Feng,
              "A YANG Data Model for Network Incident Management", Work
              in Progress, Internet-Draft, draft-ietf-nmop-network-
              incident-yang-08, 13 February 2026,
              <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-
              network-incident-yang-08>.

   [I-D.ietf-nmop-terminology]
              Davis, N., Farrel, A., Graf, T., Wu, Q., and C. Yu, "Some
              Key Terms for Network Fault and Problem Management", Work
              in Progress, Internet-Draft, draft-ietf-nmop-terminology-
              23, 18 August 2025,
              <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-
              terminology-23>.

   [I-D.irtf-nmrg-ai-challenges]
              François, J., Clemm, A., Papadimitriou, D., Fernandes, S.,
              and S. Schneider, "Research Challenges in Coupling
              Artificial Intelligence and Network Management", Work in
              Progress, Internet-Draft, draft-irtf-nmrg-ai-challenges-
              05, 18 March 2025, <https://datatracker.ietf.org/doc/html/
              draft-irtf-nmrg-ai-challenges-05>.

   [I-D.irtf-nmrg-network-digital-twin-arch]
              Zhou, C., Yang, H., Duan, X., Lopez, D., Pastor, A., Wu,
              Q., Boucadair, M., and C. Jacquenet, "Network Digital
              Twin: Concepts and Reference Architecture", Work in
              Progress, Internet-Draft, draft-irtf-nmrg-network-digital-
              twin-arch-12, 27 February 2026,
              <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-
              network-digital-twin-arch-12>.

   [MCP]      "Model Context Protocol", November 2024,
              <https://modelcontextprotocol.io/>.

   [OTel-gen-ai]
              "AI Agent Observability", 2025,
              <https://opentelemetry.io/docs/specs/semconv/gen-ai/>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
              and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
              <https://www.rfc-editor.org/rfc/rfc6241>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/rfc/rfc7011>.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016,
              <https://www.rfc-editor.org/rfc/rfc7854>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
              RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/rfc/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/rfc/rfc8040>.

   [RFC8639]  Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard,
              E., and A. Tripathy, "Subscription to YANG Notifications",
              RFC 8639, DOI 10.17487/RFC8639, September 2019,
              <https://www.rfc-editor.org/rfc/rfc8639>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG Notifications
              for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
              September 2019, <https://www.rfc-editor.org/rfc/rfc8641>.

   [RFC9232]  Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and
              A. Wang, "Network Telemetry Framework", RFC 9232,
              DOI 10.17487/RFC9232, May 2022,
              <https://www.rfc-editor.org/rfc/rfc9232>.

   [RFC9315]  Clemm, A., Ciavaglia, L., Granville, L. Z., and J.
              Tantsura, "Intent-Based Networking - Concepts and
              Definitions", RFC 9315, DOI 10.17487/RFC9315, October
              2022, <https://www.rfc-editor.org/rfc/rfc9315>.

   [TMF-1251D]
              "AN Agent Architecture v1.0.0", May 2025,
              <https://www.tmforum.org/resources/introductory-guide/
              ig1251d-an-agent-architecture-v1-0-0/>.

   [TMF-1258] "Autonomous Networks Glossary v1.2.0", May 2025,
              <https://projects.tmforum.org/wiki/display/PUB/
              IG1258+Autonomous+Networks+Glossary+v1.2.0>.

Appendix A.  Network Element AI Agent and Network AI Agent Collaboration

   As a supplementary deployment case, this section explores the
   collaboration mechanism between network element AI Agent and Network
   AI Agent.

   Network devices collect information from multiple dimensions,
   including flow information, configuration, events, alarms, logs,
   dynamic topology and routes, and device status (including CPU,
   memory, and hardware health).  With large amount of data collected to
   the domain controller for analysis and processing, the data accuracy
   is very limited and therefore it is hard to determine the service
   impact within 1 minute.  In addition, it usually require multiple
   step interaction, complex task management with various different data
   types or data sources.

   To address those challenges, the network AI Agent can delegate
   massive data analysis and processing to distributed AI Agent in each
   network element, e.g., a) only allow distributed AI Agent export
   processed analytic data to help establish global view of network
   observability. b) or export key network fault information for Network
   AI Agent for further investigation the root cause of the problem.

   For the former case, routing protocol specific fault data such as BGP
   Status Changed, OSPF Neighbor state changes, IS-IS Adjacency Changed
   data or hardware related fault data such as Optical fail, Physical
   Port down can be collected and using pre-trained LLM model with
   expert experience to match fault pattern and invoke corresponding
   routing protocol troubleshooting MCP tools and finally root cause.
   In addition, it allows network maintenance engineer using nature
   language interface to look up troubleshooting information or it
   allows Network AI Agent or Task Agent at the network element using
   MCP interface to invoke tools from MCP server within the network
   element.

                               +------------+     +------------+
                               |  Network   |     |            |
              Human Operator   |  AI Agent  |     | Task Agent |
                   |           |            |     |            |
                   |           |+----------+|     |+----------+|
                   |           ||MCP Client||     ||MCP Client||
                   |           |+----------+|     |+----+-----+|
                 Intent        +------+-----+     +-----+------+
                   |                  |                 |
           +-------+------------------+-----------------+-------+
           |       |                  +------------+    |       |
           | +-----V-----------------+    +--------+----+------+|
           | |Protocol Fault Agent   |    |Protocol|Fault Agent||
           | |  +---------+ +------+ |    |  +-----V----V-+    ||
           | |  |  ONNOX  | | MCP  +-+----+-->    MCP     |    ||
           | |  |  Model  | |Client| |    |  |   Server   |    ||
           | |  +---------+ +------+ |    |  +------------+    ||
           | +-----------------------+    +--------------------+|
           |                                                    |
           +----------------------------------------------------+
                         Smart Network Element

          Figure 10: Network Element AI Agent and Network AI Agent
                        Collaboration Usage Example

Appendix B.  Acknowledgements

   This work has benefited from the discussions of NMRG interim meeting
   on Agentic AI.  Thanks Laurent Ciavaglia, Chris Janz,Pedro Garcia
   Parra,Brad Peters,Kevin McDonnell,Giuseppe Fioccola,Richard
   Kilmurray,Dmytro Gassanov for wonderful comments and discussion on
   Agentic AI for Network Operation.

Appendix C.  Changes between Revisions

   v03 - v04

* Add some background information for this work in the introduction;
* Agent Governance Consideration in the context of network management;
* Clarify the value of Agent added to the network management architecture;
* Add one new definition for Autonomy;
* Rewrite Challenges for Agent Benchmarking;
* Rewrite the 1st challange on making YANG AI ready;
* Add a new challenge on Agent Observability,Intervention and Control;
* Add OTel & agent skill alignment.
* Update Architecture into hybrid architecture to support hybrid agent system.

   v02 - v03

* Agentic AI Architecture Update.
* Rewrite Functional Interfaces section for user to agent, agent to agent, agent to tools communication.
* Explore Relationship Between      Management      Characteristics and Functional Components in section 6.
* Rewrite a collection of use cases to support multi-agent collaborations.
* Rewrite Challenges section to cover Trust and Security, protocol and benchmarking.
* Add workflows for Agent Registration, Discovery, team forming.
* Add workflow for Agent to Agent Communciation Security.
* Rewrite Security Consideration Section.
* Add Conclusion Section.

   v00 - v01

* Add Security Consideration Section;
* Add Acknowledge Section;
* Clarify the relation between knowlege and tools;
* Clarify the souce of knowlege;
* Clarify the key characteristics of Network AI Agent to adpat to the environment change.

Contributors

   Qiufang Ma
   Huawei
   Email: maqiufang1@huawei.com


   Zhenqiang Li
   CMCC
   Email: lizhenqiang@chinamobile.com


   Lionel Tailhardat
   Orange Research
   Email: lionel.tailhardat@orange.com


Authors' Addresses

   Qin Wu
   Huawei
   China
   Email: bill.wu@huawei.com


   Cheng Zhou
   China Mobile
   China
   Email: zhouchengyjy@chinamobile.com


   Luis M. Contreras
   Telefonica
   Email: luismiguel.contrerasmurillo@telefonica.com


   Sai Han
   China Unicom
   China
   Email: hans29@chinaunicom.cn


   Yong-Geun Hong
   Daejeon University
   Email: yonggeun.hong@gmail.com
