Designing the Modern Data Center Network for AI Workloads

The fastest GPU is only as fast as the slowest packet.

For years, data center design has been driven by compute density, virtualization efficiency, and east-west traffic patterns dominated by many small flows. That model breaks down entirely when you step into AI training environments. What you’re building now isn’t just a traditional data center. It’s a distributed, synchronized compute system where the network is a first-class component of performance.

The Network Is the System

In large-scale AI training, thousands of GPUs operate in tightly coordinated steps. They compute locally, then exchange gradients, weights, and parameters in synchronized bursts. This is the core of the workload. 

That means traffic is periodic, synchronized, and massive. Flows are not small and random (they’re elephant flows by design), and the network directly impacts job completion time (JCT). And, especially for AI training, JCT is the metric that matters.

In an AI training data center, we’re talking about collective patterns like all-reduce, all-gather, and parameter sync. Because training is synchronized, performance is dictated by tail latency and not averages. A single straggling flow or congested path can delay the entire training run across thousands of GPUs.
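
To see why tail latency, not the average, sets the pace, consider a toy model of one synchronized step. The cluster size and timings here are made-up illustrative numbers:

```python
# A synchronized collective finishes only when the slowest participant
# does, so each training step is gated by the max, not the mean.
def step_time(per_gpu_comm_ms):
    return max(per_gpu_comm_ms)

# Hypothetical cluster: 4096 GPUs, each normally finishing its
# all-reduce phase in ~10 ms.
comm = [10.0] * 4096
print(step_time(comm))                   # 10.0 -> healthy step

# One straggler on a congested path takes 50 ms.
comm[1234] = 50.0
print(round(sum(comm) / len(comm), 2))   # 10.01 -> the average barely moves
print(step_time(comm))                   # 50.0 -> the whole step is 5x slower
```

Average-based monitoring would call this cluster healthy while every step runs five times slower.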

These fabrics typically rely on simple, scalable control planes, usually BGP with ECMP, to provide predictable multipath forwarding, though convergence behavior has to be carefully considered at this scale.

Think about the incredible cost of GPUs today. At scale, where individual GPUs can cost tens of thousands of dollars, even small inefficiencies translate into millions in lost utilization. The key is making sure GPUs never have to wait on the network; otherwise you’re losing performance, extending training time, and burning money. 
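
A quick back-of-the-envelope sketch makes the point. Every number here (cluster size, hourly cost, stall fraction) is an assumption for illustration, not a vendor figure:

```python
# Rough cost of GPUs sitting idle while they wait on the network.
gpus = 8192
cost_per_gpu_hour = 3.00      # assumed amortized $/GPU-hour
network_wait = 0.05           # assume 5% of cycles stalled on communication
hours_per_year = 24 * 365

wasted = gpus * cost_per_gpu_hour * network_wait * hours_per_year
print(f"${wasted:,.0f} per year lost to a 5% network stall")
```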

Switching Architecture at Scale

In data centers built for AI training, the dominant architecture is still leaf-spine, but the design principles are different.

First, symmetry is non-negotiable. Equal-cost paths must truly be equal, which means identical bandwidth from leaf to spine, consistent link counts across tiers, and minimal variance in latency.

Second, oversubscription is effectively eliminated in the training fabric. In more traditional data centers, and even in some high performance computing environments, we could get away with 3:1 or 5:1 oversubscription. AI clusters trend toward 1:1, because when thousands of GPUs synchronize and transmit simultaneously, even slight oversubscription creates congestion, head-of-line blocking, and stalled computation. And though non-blocking 1:1 is the ideal for AI training fabrics, in practice some architectures introduce carefully controlled oversubscription beyond a single pod or cluster boundary.
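
Oversubscription is just the ratio of downlink to uplink capacity at the leaf, which is easy to sanity-check. The port counts and speeds below are assumed examples:

```python
# Oversubscription ratio = total downlink capacity / total uplink
# capacity at the leaf.
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# A traditional enterprise leaf: 48 x 25G down, 6 x 100G up.
print(oversubscription(48, 25, 6, 100))    # 2.0 -> 2:1

# An AI training leaf: 32 x 400G to GPUs, 32 x 400G to spines.
print(oversubscription(32, 400, 32, 400))  # 1.0 -> non-blocking 1:1
```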

Third, radix matters more than ever. High-radix switches reduce the number of tiers, which reduces hop count, latency, power consumption, and of course, operational complexity. In other words, the more ports per switch, the flatter and faster the network.
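
The effect of radix on scale follows directly from textbook folded-Clos (fat-tree) sizing, sketched here:

```python
# Maximum non-blocking host counts for a fabric built from switches
# of radix k, using classic fat-tree sizing.
def two_tier_hosts(k):
    # Up to k leaves (one spine port each), k/2 host ports per leaf.
    return k * (k // 2)

def three_tier_hosts(k):
    # A classic three-tier folded Clos supports k^3 / 4 hosts.
    return k ** 3 // 4

for k in (32, 64, 128):
    print(f"radix {k:>3}: {two_tier_hosts(k):>6} hosts in 2 tiers, "
          f"{three_tier_hosts(k):>7} in 3 tiers")
```

Doubling the radix roughly quadruples what fits in two tiers, which is exactly why high-radix silicon keeps fabrics flat.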

There are inherent tradeoffs between fixed form factor and chassis switches, Clos and butterfly designs, and so on, but the underlying goal is always the same: make the network as flat as possible and as fast as possible.

Now, keep in mind this is conceptual because at scale, these fabrics are often built as modular units (pods or clusters) with well-defined boundaries to contain failure domains and maintain predictable performance.

Elephant Flows and Microbursts

Traditional data centers deal with many short-lived flows; AI data centers do not. Typical traffic during a training run consists of long-lived, high-bandwidth elephant flows sent in synchronized bursts at line rate. On top of that, you still have to deal with incast and microburst conditions.

TCP incast occurs when many synchronized senders transmit to a single receiver at the same time, overflowing switch buffers and causing drops. In an AI data center, much of the traffic is RDMA (RoCEv2), so incast still happens, but it looks different under PFC and ECN: instead of pure packet drops, you’ll see an increase in pause events and queue buildup.

This creates two major problems. 

First, hash polarization, where large elephant flows are pinned to a subset of ECMP paths due to limited hashing entropy, can lead to uneven link utilization. 

Second, buffer pressure and microburst loss can happen when queues fill in microseconds. 
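
How fast “microseconds” really is falls out of simple arithmetic. The buffer size and port counts below are assumed for illustration, not a specific switch:

```python
# How fast a shared buffer fills during incast:
# fill_time = buffer / (aggregate arrival rate - drain rate).
buffer_bytes = 16 * 1024 * 1024          # assumed 16 MB shared packet buffer
senders = 8
link_gbps = 400
arrival = senders * link_gbps * 1e9 / 8  # bytes/s converging on one egress
drain = link_gbps * 1e9 / 8              # one 400G egress port draining

fill_us = buffer_bytes / (arrival - drain) * 1e6
print(f"buffer fills in {fill_us:.0f} microseconds")
```

Eight 400G senders converging on one 400G port exhaust 16 MB of buffer in under 50 microseconds, far faster than any polling-based monitor can see.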

Even a single congested path can stall an entire training run, which is why modern fabrics increasingly rely on advanced load balancing, congestion-aware routing, very precise queue management, and in some designs packet spraying, which can be controversial because the resulting out-of-order delivery is handled poorly by some RDMA implementations.
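
Hash polarization is easy to demonstrate with a handful of flows. This sketch uses CRC32 as a stand-in for a switch’s ECMP hash, and the flow tuples are made up for illustration (4791 is the real RoCEv2 UDP port):

```python
import zlib

# ECMP picks a path by hashing the flow's header tuple. With only a
# few elephant flows, the spread across links can be very uneven.
def ecmp_path(flow, n_paths):
    key = ",".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_paths

# 8 elephant flows as (src, dst, sport, dport) tuples.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791, 4791) for i in range(8)]
link_load = [0] * 4
for f in flows:
    link_load[ecmp_path(f, 4)] += 1

print(link_load)   # flows pinned to each of the 4 equal-cost links
```

With millions of mice flows the law of large numbers evens this out; with eight elephants it usually does not, and one hot link sets the pace for the whole collective.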

Lossless Ethernet and Congestion Control

AI fabrics are pushing Ethernet into territory historically (but decreasingly) owned by InfiniBand. To support GPU-to-GPU communication, networks are designed to be effectively lossless, using technologies like: 

  • RoCE (RDMA over Converged Ethernet)
  • Priority Flow Control (PFC)
  • Explicit Congestion Notification (ECN)

These mechanisms shift the behavior of the network so that drops are minimized, congestion shows up as queue buildup and pause frames, and latency becomes the dominant signal. But they also introduce new challenges.

While no Ethernet network is truly lossless, these fabrics are engineered to minimize drops under expected conditions using mechanisms like PFC. However, this shifts the problem. Poorly tuned PFC can actually propagate congestion, introduce head-of-line blocking, and in worst cases lead to deadlock conditions.
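
To see where PFC buffer budgets come from, here is a rough sketch of the classic headroom calculation. Cable length, reaction time, and MTU are assumed values, not a vendor recommendation:

```python
# Rough PFC headroom estimate: bytes that can still arrive after a
# PAUSE is sent (round-trip propagation plus reaction delay, plus a
# maximum-size frame in flight each way).
link_gbps = 400
cable_m = 100
prop_ns_per_m = 5              # ~5 ns per meter in fiber
response_ns = 1000             # assumed pause reaction time at the far end
mtu = 9216                     # jumbo frame

rate_Bps = link_gbps * 1e9 / 8
rtt_s = (2 * cable_m * prop_ns_per_m + response_ns) * 1e-9
headroom = rate_Bps * rtt_s + 2 * mtu
print(f"~{headroom / 1024:.0f} KiB of headroom per lossless queue")
```

Underprovision that headroom and you drop packets despite PFC; overprovision it across every port and priority, and buffer, one of the scarcest resources on the switch, disappears fast.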

Observability

At this scale, traditional monitoring is almost useless. Polling every 30–60 seconds won’t catch microbursts, short-lived congestion, or transient link issues. Instead, AI data center networks rely on high-resolution, streaming telemetry focused on signals like queue depth and buffer utilization, PFC time-in-pause, and ECN marking rates.
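
Here is a small illustration of why resolution matters, using a synthetic queue-depth trace:

```python
# Synthetic queue-depth samples taken every 10 microseconds. A burst
# lasting ~50 us is obvious at this resolution but vanishes into the
# average that a slow poller would report.
samples_us = 10
queue_kb = [2] * 1000                              # 10 ms window, mostly idle
queue_kb[500:505] = [900, 1500, 2100, 1400, 600]   # short incast burst

threshold_kb = 800
bursts = sum(1 for q in queue_kb if q > threshold_kb)
print(f"{bursts} samples over threshold "
      f"({bursts * samples_us} us of real congestion)")

avg = sum(queue_kb) / len(queue_kb)
print(f"window average: {avg:.2f} KB -- looks idle")
```

The burst filled queues to over 2 MB for tens of microseconds, yet the window average reads single-digit kilobytes. Averaged counters will swear the fabric is healthy while training steps stall.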

Remember that the real failure mode in AI networks isn’t obvious outages. It’s silent performance degradation. A slightly degraded link can increase JCT by minutes or hours. And even worse, link failures can force checkpoint rollbacks, undoing large portions of a training run. 

Some of the important metrics we need for an AI data center include:

  • Queue depth
  • ECN events
  • PFC events
  • FEC correction rates
  • Link-level retransmits
  • NIC-level telemetry 

Designing for Job Completion Time

Everything in an AI data center ultimately ties back to JCT. You’re not optimizing for link utilization, average latency, or throughput in isolation. Rather, you’re maximizing GPU utilization and minimizing communication overhead.

That leads to a different set of priorities. For an AI data center, the goals are to eliminate unnecessary hops, minimize retransmissions, and treat observability as a first-class citizen so we can prevent congestion before it occurs and remediate issues immediately.

Even a 10–20% reduction in JCT translates directly into lower GPU cost, faster model iteration, and in business terms, a real competitive advantage. 
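
As a rough illustration with assumed numbers (run size, duration, and GPU cost are not from any specific deployment):

```python
# What a JCT reduction is worth on a single large training run.
gpus = 8192
run_days = 30
cost_per_gpu_hour = 3.00       # assumed amortized $/GPU-hour

baseline = gpus * run_days * 24 * cost_per_gpu_hour
for reduction in (0.10, 0.15, 0.20):
    saved = baseline * reduction
    print(f"{int(reduction * 100)}% faster JCT saves ${saved:,.0f}")
```

And that is per run; the same fabric improvement pays out on every job the cluster ever executes.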

Beyond Networking

One of the biggest mistakes engineers make is isolating the network from the rest of the system. AI data centers are tightly integrated systems involving compute (GPUs, CPUs, accelerators), storage and data pipelines, the network fabric, and of course power and cooling.

The industry is actively evolving Ethernet to better support these workloads. Initiatives such as the Ultra Ethernet Consortium are focused on improving congestion management, transport behavior, and coordination between NICs and the network fabric, bringing Ethernet closer to the performance characteristics historically associated with HPC interconnects.

Remember that for AI data centers, you’re not simply designing a network. You’re building a system where:

  • the network is part of the compute
  • performance is measured in job completion time
  • and small inefficiencies compound into massive cost

At this scale, architecture decisions really are business decisions and not purely technical. And the difference between a good design and a great one is measured in how fast you can turn GPUs into results.

Thanks,

Phil
