RTGWG Working Group                                            W. Cheng
Internet Draft                                             China Mobile
Intended status: Standards Track                                 C. Lin
Expires: 05 January 2026                           New H3C Technologies
                                                           July 4, 2025




                        Enhanced ECMP for AI Cluster


                    draft-cheng-rtgwg-enhanced-ecmp-00


Abstract

   In AI training scenarios, the current mainstream load balancing
   technology is per-flow ECMP. However, hash collision issues lead to
   imbalanced traffic distribution, adversely affecting application
   performance.

   To address this problem, this document proposes an enhanced ECMP
   method that resolves load imbalance caused by hash collisions. The
   proposed solution effectively improves load balancing efficiency,
   reduces network congestion, and enhances overall network
   performance.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 05, 2026.






Cheng, et al.         Expires January 05, 2026                [Page 1]

Internet-Draft      Enhanced ECMP for AI Network             July 2025


Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents


   1. Introduction
      1.1. Requirements Language
   2. Motivation
   3. Solution
      3.1. ECMP based on source ingress interface
      3.2. ECMP based on egress Grouping
   4. Protocol Extension
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informational References
   Authors' Addresses



1. Introduction

   Currently, there are two granularities for network load
   balancing: per-flow ECMP and per-packet forwarding.

   As illustrated in Figure 1, the per-flow ECMP method hashes flow
   characteristics (typically the five-tuple) to distribute traffic
   across multiple ECMP paths. This approach works well in environments
   with numerous small flows and no elephant flows. Its primary
   advantage is that all packets of a flow follow the same path, so
   packet reordering is avoided.

   However, this method has limitations when dealing with either:

    *  a limited number of flows, or

    *  the presence of elephant flows.

   In such cases, five-tuple-based hashing may lead to hash collisions,
   causing disproportionate mapping of oversized flows to the same
   path. This results in suboptimal load balancing performance.


                               ECMP
           Flow 1    Hash 1   +------+
           Flow 4  +--------- |if 1  |-----
           Flow 7             +------+
                              |      |-----
           Flow 2    Hash 2   +------+
           Flow 5  +--------  |if 2  |-----
           Flow 8             +------+
                              |      |-----
           Flow 3    Hash 3   +------+
           Flow 6  +-------   |if 3  |-----
           ...                +------+
                              |...   |-----
                              +------+
                      Figure 1 Per-flow ECMP
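
   The collision effect described above can be illustrated with a short
   sketch. It is not taken from any implementation: the CRC32 hash, the
   function names, and the sample flows are assumptions chosen only for
   illustration, and real devices use vendor-specific hash functions.

```python
import zlib
from math import factorial

def per_flow_path(five_tuple, num_paths):
    """Pick an ECMP path index by hashing the five-tuple (CRC32 assumed)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % num_paths

def p_no_collision(num_flows, num_paths):
    """Probability that uniformly hashed flows all land on distinct paths."""
    if num_flows > num_paths:
        return 0.0
    ways = factorial(num_paths) // factorial(num_paths - num_flows)
    return ways / num_paths ** num_flows

# Eight hypothetical flows: (src, dst, protocol, src port, dst port).
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 17, 49152 + i, 4791)
         for i in range(1, 9)]
paths_used = {per_flow_path(f, 8) for f in flows}
print("distinct paths used by 8 flows:", len(paths_used))
print("P(no collision), 8 flows on 8 paths: %.4f" % p_no_collision(8, 8))
```

   Under the uniform-hash assumption, eight flows over eight paths avoid
   every collision with probability 8!/8^8, roughly 0.24%, which is why
   a small number of large flows rarely spreads evenly.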

   The other approach is per-packet forwarding. This method applies
   hashing to each individual packet, distributing traffic across
   different ECMP paths, as illustrated in Figure 2. Theoretically, it
   achieves optimal load-balancing granularity. However, it introduces
   severe packet reordering within the same flow, necessitating
   additional mechanisms (e.g., reordering buffers or sequence
   tracking) to handle out-of-order delivery. This imposes higher
   demands on network infrastructure.

           Packet 1   Hash 1   +------+
           Packet 4 +--------- |if 1  |-----
           Packet 7            +------+
                               |      |-----
           Packet 2   Hash 2   +------+
           Packet 5 +--------- |if 2  |-----
           Packet 8            +------+
                               |      |-----
           Packet 3   Hash 3   +------+
           Packet 6 +--------- |if 3  |-----
           ...                 +------+
                               |...   |-----
                               +------+
                      Figure 2 Per-packet forwarding
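
   The reordering problem can be seen in a minimal sketch. The
   round-robin spraying used here is an assumed stand-in for per-packet
   distribution, and the function name and per-path delays are
   hypothetical values picked for illustration.

```python
def arrival_order(num_packets, path_delays, send_interval=1.0):
    """Return packet sequence numbers in the order they reach the receiver.

    Packets are sprayed round-robin over the paths (an assumption), and
    each path adds its own fixed one-way delay.
    """
    arrivals = []
    for seq in range(num_packets):
        path = seq % len(path_delays)
        arrivals.append((seq * send_interval + path_delays[path], seq))
    # Sort by arrival time only; the sort is stable, so ties keep send order.
    arrivals.sort(key=lambda item: item[0])
    return [seq for _, seq in arrivals]

# Hypothetical one-way delays (arbitrary time units) for three paths.
order = arrival_order(6, [1.0, 5.0, 2.0])
print(order)  # [0, 2, 3, 1, 5, 4]
```

   With equal path delays the packets arrive in order; any delay
   difference between paths is enough to reorder packets of one flow.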



   In AI training scenarios, the current mainstream load balancing
   technology is per-flow ECMP. However, hash collision issues lead to
   imbalanced traffic distribution, adversely affecting application
   performance.

   To address this problem, this document proposes an enhanced ECMP
   method that resolves load imbalance caused by hash collisions. The
   proposed solution effectively improves load balancing efficiency,
   reduces network congestion, and enhances overall network
   performance.

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2. Motivation


             +---------+                         +---------+
             |   R11   |                         |   R12   |
             +-#--#-#--+                         +#---#--#-+
               |  | |                             |   |  |
               |  | |                             |   |  |
               |  | +-----------------------------)-+ |  |
               |  |                               | | |  |
               |  |   +---------------------------+ | |  |
               |  |   |                             | |  |
               |  +---)----------+     +------------)-+  |
               |      |          |     |            |    |
             +-#------#+       +-#-----#-+       +--#----#-+
             |  R21    |       |  R22    |       |   R23   |
             +-#------#+       +-#------#+       +-#------#+
               |      |          |      |          |      |
             +-#+   +-#+       +-#+   +-#+       +-#+   +-#+
             |H1|   |H2|       |H3|   |H4|       |H5|   |H6|
             +--+   +--+       +--+   +--+       +--+   +--+

                      Figure 3 AI Network


   Due to the unique traffic patterns in AI training networks, which
   are characterized by a limited number of flows, achieving balanced
   load distribution is challenging. Traditional flow-based load
   balancing strategies often result in uneven traffic distribution,
   potentially leading to network congestion. While packet-based
   approaches can mitigate this imbalance to some degree, they
   introduce packet reordering because packets of the same flow may
   traverse different paths, which requires additional reordering
   mechanisms in the network.

   This document proposes two enhanced ECMP methods to address the load
   imbalance issue in AI training networks and improve the overall
   network performance.

3. Solution

3.1. ECMP based on source ingress interface

   In this method, the ingress interfaces that receive traffic are
   divided into groups, each interface within a group is assigned a
   distinct ECMP number (the ECMP Index in Figure 4), and ECMP hashing
   is then performed on this ECMP number. This method is suitable when
   each ingress interface carries roughly the same volume of traffic.

                                                            +-------+
      ingress-interface 1 -> group-id 1 -> ECMP Index 1---> |if 1   |
                                                            +-------+
      ingress-interface 2 -> group-id 1 -> ECMP Index 2---> |if 2   |
                                                            +-------+
      ingress-interface 3 -> group-id 1 -> ECMP Index 3---> |if 3   |
                                                            +-------+
      ingress-interface 4 -> group-id 1 -> ECMP Index 4---> |if 4   |
                                                            +-------+
      ingress-interface 5 -> group-id 2 -> ECMP Index 1---> |if 1   |
                                                            +-------+
      ingress-interface 6 -> group-id 2 -> ECMP Index 2---> |if 2   |
                                                            +-------+
      ingress-interface 7 -> group-id 2 -> ECMP Index 3---> |if 3   |
                                                            +-------+
      ingress-interface 8 -> group-id 2 -> ECMP Index 4---> |if 4   |
                                                            +-------+


           Figure 4 ECMP based on source ingress interface

   As shown in Figure 4 above, the eight ingress interfaces are divided
   into two groups, with four interfaces in each group.

   Within each group:

    *  The ingress interfaces are assigned ECMP numbers 1, 2, 3, and 4,
       respectively.

    *  ECMP hashing is performed on these assigned ECMP numbers to
       select the corresponding egress interface for forwarding.

   As a result, traffic entering through the four ingress interfaces of
   Group 1 is forwarded over four distinct egress interfaces, and the
   same holds for traffic entering through the four ingress interfaces
   of Group 2.
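
   The selection in Figure 4 can be sketched as follows. This is only an
   illustration under stated assumptions: the function names are
   hypothetical, and a plain modulo stands in for the ECMP hash, which
   this document does not specify.

```python
def build_ingress_table(ingress_ifs, group_size):
    """Map each ingress interface to (group-id, ECMP number), both from 1."""
    table = {}
    for pos, ifname in enumerate(ingress_ifs):
        table[ifname] = (pos // group_size + 1, pos % group_size + 1)
    return table

def select_egress(ingress_if, table, egress_ifs):
    """Select the egress interface from the ingress interface's ECMP number.

    A plain modulo stands in for the ECMP hash (an assumption); it keeps
    the ECMP numbers of one group on distinct egress interfaces.
    """
    _group_id, ecmp_number = table[ingress_if]
    return egress_ifs[(ecmp_number - 1) % len(egress_ifs)]

ingress_ifs = ["ingress-interface %d" % n for n in range(1, 9)]
egress_ifs = ["if 1", "if 2", "if 3", "if 4"]
table = build_ingress_table(ingress_ifs, group_size=4)
for name in ingress_ifs:
    print(name, table[name], "->", select_egress(name, table, egress_ifs))
```

   Because the choice depends only on the ingress interface, the
   distribution is even only when each ingress interface carries a
   similar volume of traffic, as noted at the start of this section.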


3.2. ECMP based on egress Grouping



                          +-----+       +----+ Egress
                          |Spine|-------|Leaf|----Host
            Ingress     / +-----+  / \   +----+
                +----+ /   +-----+/   \+----+  Egress
       Host ----|Leaf|/----|Spine|------|Leaf|----Host
                +----+\    +-----+\   / +----+
                       \  +-----+  \ /  +----+ Egress
                        \ |Spine|------|Leaf|----Host
                          +-----+      +----+
                      Figure 5 ECMP based on egress Grouping

   As shown in Figure 5, the source host connects to the source Leaf
   via an ingress interface, while the destination host connects to the
   destination Leaf through an egress interface. Multiple Equal-Cost
   Multi-Path (ECMP) links exist between the Leaf switches and the
   Spine devices.

   To distribute load more uniformly, the ECMP interfaces that connect
   the source Leaf to the Spine devices are grouped on the source Leaf.
   This grouping can be configured on the Leaf device. For example, if
   there are 128 equal-cost links between the source Leaf and the Spine
   devices, they can be divided into 32 groups of 4 interfaces each
   (Group 1: interfaces 1-4; Group 2: interfaces 5-8; and so on).

   For traffic load balancing, flows are first mapped to specific
   groups based on their location information (flows with the same
   location information are assigned to the same group), and then hash-
   based load balancing is performed within each group.


   The location information used for group mapping can be either the
   source's Ingress interface or the destination's Egress interface.
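
   The two-stage selection can be sketched as follows. The sketch rests
   on assumptions not stated in this document: the location information
   is reduced to an integer, a modulo maps it to a subgroup, and CRC32
   over the five-tuple is used as the hash within the subgroup. All
   names are hypothetical. Subgroups are indexed from 0 in the code, so
   Group 1 in the text is subgroups[0].

```python
import zlib

def make_subgroups(ecmp_ifs, group_size):
    """Split the ECMP interface list into consecutive subgroups."""
    return [ecmp_ifs[i:i + group_size]
            for i in range(0, len(ecmp_ifs), group_size)]

def select_interface(location_id, five_tuple, subgroups):
    """Stage 1: location information picks the subgroup (modulo assumed).
    Stage 2: a five-tuple hash (CRC32 assumed) picks a member interface."""
    members = subgroups[location_id % len(subgroups)]
    key = "|".join(str(field) for field in five_tuple).encode()
    return members[zlib.crc32(key) % len(members)]

ecmp_ifs = ["if-%d" % n for n in range(1, 129)]   # 128 equal-cost links
subgroups = make_subgroups(ecmp_ifs, 4)           # 32 subgroups of 4
flow = ("10.0.0.1", "10.0.1.1", 17, 49152, 4791)  # hypothetical flow
print(len(subgroups), subgroups[0], subgroups[1])
print(select_interface(1, flow, subgroups) in subgroups[1])  # True
```

   Flows that share the same location information always stay within
   one subgroup, so a hash collision can only occur among the flows of
   that subgroup rather than across all 128 links.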

   By implementing fine-grained grouping of ECMP interfaces, this
   solution achieves more uniform traffic load distribution, thereby
   addressing current issues of imbalanced load sharing and flow
   collisions.

                Mapping to        <------Route(with egress group-id)
        +-------+Group ID     +-----+     +------+
        |Ingress|-------------|Spine|-----|Egress|------egress
        |Leaf   |             |     |     |Leaf  |
        +-----=-+             +-----+     +-----=+
          Figure 6 Route carries remote egress group-id attribute



       ECMP Group--
                   |--Sub Group-id 1 ---------(if-1,if-2, if-3,if-4)
                   |--Sub Group-id 2 ---------(if-5,if-6, if-7,if-8)
                    ...
          Figure 7 Grouping interfaces within an ECMP Group



   First, as shown in Figure 6, when the Egress Leaf advertises a
   route, it carries the egress group index, which can be composed of
   the local device's router-id and group-id (see Section 4 for
   details).

   On the Ingress Leaf, the ECMP egress interfaces toward the Spine
   devices are grouped (as illustrated in Figure 7).

   When the Ingress Leaf receives a route, it extracts the remote
   egress group index carried in the route. It then maps this remote
   egress group index to a local ECMP subgroup index, effectively
   directing traffic to the corresponding subset of interfaces for
   forwarding.

   This ensures that flows destined for different remote addresses are
   distributed across different ECMP subgroups, which improves the
   granularity of load distribution. Figure 8 shows the resulting
   mapping.

    +=========+=============+=================+======================+
    |  Dest   |Remote       |  Local Index    | ECMP interfaces      |
    |         | Attribute   |                 |                      |
    +=========+=============+=================+======================+
    | route-1 |Egress Group |Local ECMP       |(if-1,if-2, if-3,if-4)|
    |         |Index 1      |Sub Group Index 1|                      |
    +=========+=============+=================+======================+
    | route-2 |Egress Group |Local ECMP       |(if-5,if-6, if-7,if-8)|
    |         |Index 2      |Sub Group Index 2|                      |
    +=========+=============+=================+======================+
    |    ...  |     ...     |       ...       |          ...         |
    +=========+=============+=================+======================+
       Figure 8 Mapping of remote egress group index to local subgroup
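
   A table like the one in Figure 8 could be built as sketched below.
   This document does not define how a remote egress group index is
   mapped to a local subgroup index, so the policy shown (assign
   subgroups in order of first appearance, wrapping around) is purely
   an assumption, as are the class name, route names, and router IDs.

```python
class SubgroupMapper:
    """Assign each remote egress group index a local ECMP subgroup index."""

    def __init__(self, subgroups):
        self.subgroups = subgroups  # list of interface lists, as in Figure 7
        self.local_index = {}       # (router_id, group_id) -> index from 1

    def map_route(self, prefix, router_id, group_id):
        """Return one table row for a route carrying (router_id, group_id)."""
        key = (router_id, group_id)
        if key not in self.local_index:
            # Assumed policy: hand out subgroups in order, wrapping around.
            count = len(self.local_index)
            self.local_index[key] = count % len(self.subgroups) + 1
        index = self.local_index[key]
        return prefix, key, index, self.subgroups[index - 1]

mapper = SubgroupMapper([["if-1", "if-2", "if-3", "if-4"],
                         ["if-5", "if-6", "if-7", "if-8"]])
# Hypothetical routes, router IDs, and group IDs.
print(mapper.map_route("route-1", "192.0.2.1", 1))
print(mapper.map_route("route-2", "192.0.2.1", 2))
```

   Routes that carry the same remote egress group index reuse the same
   local subgroup, so the table grows with the number of remote groups
   rather than with the number of routes.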

4. Protocol Extension

   This document defines a new extended community attribute type to
   carry the ECMP ID associated with a route. The ID comprises a 4-byte
   Router ID and a 2-byte Group-ID. The format is as follows:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    0x83       | Sub-Type(TBD) |           Router ID           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Router ID                   |        Group-ID               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      Figure 9 ECMP GroupID Extended Community

   Sub-Type: BGP_EXT-COMM-ECMP-GROUPID (TBD)

   Value Structure:

   Local BGP RouterID (4 bytes)

   ECMP-Group-ID Value (2 bytes)
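
   The 8-octet layout above can be encoded and decoded as in the
   following sketch. The sub-type is still TBD, so the value 0x00 used
   here is only a placeholder, and the function names and sample Router
   ID are hypothetical.

```python
import socket
import struct

EXT_COMM_TYPE = 0x83     # type octet shown in the format above
EXT_COMM_SUBTYPE = 0x00  # placeholder only: the real sub-type is TBD

def encode_ecmp_group_id(router_id, group_id, subtype=EXT_COMM_SUBTYPE):
    """Pack type, sub-type, 4-byte Router ID, and 2-byte Group-ID."""
    return struct.pack("!BB4sH", EXT_COMM_TYPE, subtype,
                       socket.inet_aton(router_id), group_id)

def decode_ecmp_group_id(data):
    """Unpack 8 octets into (type, sub-type, router_id, group_id)."""
    ctype, subtype, rid, group_id = struct.unpack("!BB4sH", data)
    return ctype, subtype, socket.inet_ntoa(rid), group_id

raw = encode_ecmp_group_id("192.0.2.1", 2)
print(raw.hex())                  # 8300c00002010002
print(decode_ecmp_group_id(raw))  # (131, 0, '192.0.2.1', 2)
```

   All fields are in network byte order, and the 2-byte Group-ID allows
   up to 65536 group values per Router ID.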

5. Security Considerations

   TBD.

6. IANA Considerations

   Registry Name: Transitive BGP_EXT-COMM-ECMP-GROUPID
   Community Sub-Types

   TBD: BGP_EXT-COMM-ECMP-GROUPID


7. References

7.1. Normative References

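   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119,
             DOI 10.17487/RFC2119, March 1997.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
             May 2017.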

7.2. Informational References

   TBD

Authors' Addresses



   Weiqiang Cheng
   China Mobile
   Beijing
   China
   Email: chengweiqiang@chinamobile.com


   Changwang Lin
   New H3C Technologies
   Beijing
   China
   Email: linchangwang.04414@h3c.com

















