

Could a Data Center Rewiring Lead to 6x Faster AI?

Cornelis Networks’ congestion-free architecture takes on Ethernet and InfiniBand


Rachel Berkowitz is a freelance science writer and editor with a Ph.D. in geophysics from the University of Cambridge.


Cornelis Networks’ card plugs into servers to expand networks.

Original imagery: Cornelis

In the good old days, networks were all about connecting a small number of local computers. But times have changed. In an AI-dominated world, the trick is coordinating the activity of tens of thousands of servers to train a large language model, without any delay in communication. Now there’s an architecture optimized to do just that. Cornelis Networks says its CN5000 networking fabric maximizes AI performance, supporting deployments of up to 500,000 computers or processors (an order of magnitude more than today’s networks) with no added latency.

The new technology gives the networking world a third major option alongside Ethernet and InfiniBand. It’s designed to let AI and high-performance computing (HPC, or supercomputing) systems achieve faster and more predictable completion times with greater efficiency. For HPC, Cornelis claims its technology outperforms InfiniBand NDR (the version introduced in 2022), passing twice as many messages per second with 35 percent lower latency. For AI applications, it delivers sixfold faster communication compared with Ethernet-based protocols.

Ethernet has long been synonymous with local area networking, or LAN. Software patches have allowed its communication protocols to stand the test of time. InfiniBand was an improvement, but it was designed with the same goal in mind: connecting a small number of local devices. “When these technologies were invented, they had nothing to do with parallel computing,” says Philip Murphy, co-founder, president, and chief operating officer at Pennsylvania-based Cornelis.

When data centers started to spring up, engineers needed a new networking solution. Because different systems used different software, they couldn’t share resources, so scaling the likes of Ethernet and InfiniBand to accommodate the busiest periods of operation proved challenging. “That sparked the whole cloud evolution,” says Murphy. Sharing a cloud-based CPU among different computers, or even different organizations, became the solution du jour.

But while data center pioneers tried to maximize the number of applications running on one server, Murphy and his colleagues saw value in the opposite approach: maximizing the number of processors working on one application. “That requires a totally different networking solution,” he says, and that is what Cornelis now offers. The company’s Omni-Path architecture, originally developed by Intel for supercomputing applications like simulating climate models or molecular interactions for drug design, delivers maximum throughput with zero data packet loss.

Congestion-free data highway

Coordinating processors to train AI models requires exchanging many messages (data packets) at very high bandwidth. The rate at which messages are passed matters, and so does the latency: how long a recipient takes to respond.

One major challenge with moving so many data packets through a network is traffic congestion. As Murphy explains, you need a way to reliably route packets around congestion points without creating new problems. For example, if packets take different routes to the same destination, they may arrive out of order.

Cornelis’s dynamic adaptive routing algorithm steers packets around short-lived congestion events, while its congestion-control architecture handles traffic bound for “popular” destinations. “If there’s an event at a stadium that we all want to go to, you don’t want the traffic that’s going past the stadium to get caught there too,” says Murphy. A technique called central pacing makes this congestion control work: switches see where traffic is building up, then tell senders to slow down until the congestion dissipates. “Think of mitigating traffic as it comes onto a highway on-ramp,” Murphy explains.
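Cornelis hasn’t published its pacing algorithm, but the general idea of switch-signaled pacing is easy to sketch. In the toy Python simulation below, every class name and number is our own illustrative assumption, not Cornelis’s design: a switch watches its queue depth and signals senders to back off, much like metering lights on a highway on-ramp.

# Toy model of switch-signaled sender pacing (illustrative only).
class Switch:
    def __init__(self, capacity_per_tick=10, threshold=20):
        self.queue = 0                       # packets waiting in the switch
        self.capacity = capacity_per_tick    # packets drained per tick
        self.threshold = threshold           # queue depth that signals congestion

    def tick(self, arrivals):
        self.queue = max(0, self.queue + arrivals - self.capacity)
        return self.queue > self.threshold   # True tells senders to slow down

class PacedSender:
    def __init__(self, rate=8):
        self.rate = rate                     # packets injected per tick

    def adjust(self, congested):
        if congested:
            self.rate = max(1, self.rate // 2)  # back off while the queue is deep
        else:
            self.rate += 1                      # probe upward once traffic clears

switch, senders = Switch(), [PacedSender(), PacedSender()]
for _ in range(50):
    congested = switch.tick(sum(s.rate for s in senders))
    for s in senders:
        s.adjust(congested)
print("final rates:", [s.rate for s in senders], "queue depth:", switch.queue)

The halve-on-congestion, creep-up-otherwise control law here is just one plausible choice; the point is that the switch, not the endpoint, detects the building queue and paces the sources before packets are lost.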

The other challenge is avoiding latency. In traditional Ethernet architectures, sending a packet requires sufficient memory at the endpoint. “If I send to you and you run out of memory, you have to come back and tell me that,” says Murphy. That round trip is long, and the large buffers it demands don’t scale. Instead, Cornelis uses a scheme called credit-based flow control that allocates memory in advance. “You don’t have to tell me anything, and I’ll know how much more I can send,” says Murphy.
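The mechanics of credit-based flow control fit in a short example. In this Python sketch, the classes and buffer sizes are illustrative assumptions, not Cornelis’s implementation: the receiver hands out one credit per free buffer slot in advance, and the sender transmits only while it holds credits, so the receiver can never be caught without memory.

# Toy model of credit-based flow control (illustrative only).
from collections import deque

class Receiver:
    def __init__(self, buffer_slots=2):
        self.buffer = deque()
        self.buffer_slots = buffer_slots

    def initial_credits(self):
        return self.buffer_slots        # advertise every free slot up front

    def deliver(self, packet):
        self.buffer.append(packet)      # room is guaranteed by the credit

    def consume(self):
        self.buffer.popleft()           # process one packet...
        return 1                        # ...and return a credit to the sender

class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.initial_credits()

    def send(self, packet):
        if self.credits == 0:
            return False                # out of credits: wait, don't transmit
        self.credits -= 1
        self.receiver.deliver(packet)
        return True

rx = Receiver(buffer_slots=2)
tx = Sender(rx)
print([tx.send(p) for p in "abc"])      # [True, True, False]: no credit for 'c'
tx.credits += rx.consume()              # receiver frees a slot; credit flows back
print(tx.send("c"))                     # True: the sender never had to ask

Because the sender’s credit counter already reflects the receiver’s free memory, no “I’m full” message ever has to make the long trip back, which is exactly the loop Murphy describes eliminating.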

Finally, the system avoids grinding to a halt if a GPU or link fails. In traditional architectures, if a server goes down, so does the application. Fixing it requires rebooting from the most recent checkpoint, which itself took extensive computing power to create. “Imagine if every time you hit ‘save’ on your document, you had to wait 20 minutes,” says Murphy. Instead, because the workload is spread across multiple computers, Cornelis’s network keeps an application running, albeit at slightly lower bandwidth, until the faulty link can be replaced. No checkpoints needed.
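That degraded-but-running behavior can be shown with a toy model. In the sketch below, the link counts and bandwidths are made up for illustration, not taken from Cornelis: traffic to a destination is spread across parallel links, so losing one link trims aggregate bandwidth instead of killing the job.

# Toy model of graceful degradation across parallel links (illustrative only).
class LinkGroup:
    def __init__(self, n_links, gbps_each):
        self.links = [gbps_each] * n_links   # per-link bandwidth in Gb/s

    def fail(self, index):
        del self.links[index]                # drop the faulty link from service

    def bandwidth(self):
        return sum(self.links)               # what the application still gets

group = LinkGroup(n_links=4, gbps_each=100)
print(group.bandwidth())    # 400 Gb/s with all links healthy
group.fail(0)               # a link dies; the application keeps running
print(group.bandwidth())    # 300 Gb/s until the link is replaced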

Efficient AI

Physically, the CN5000 product is a network card built around a custom chip. The cards plug into every server, “like you plug an Ethernet card into your PC at home,” explains Murphy. A top-of-rack switch is cabled to each server and to other switches, and a director-class switch, with 48 or 576 ports, links the rack switches together. “Each server has cards plugged in, so you can build multi-thousand endpoint clusters,” says Murphy.
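The article doesn’t spell out a reference topology, but a back-of-the-envelope calculation shows how those port counts compound. In the Python sketch below, the two-tier layout and the even split of rack-switch ports between servers and uplinks are our assumptions, not a published Cornelis design.

# Rough, illustrative cluster sizing under assumed topology choices.
RACK_PORTS = 48         # ports on each top-of-rack switch
DIRECTOR_PORTS = 576    # ports on each director-class switch

down = RACK_PORTS // 2          # 24 server-facing ports per rack switch
up = RACK_PORTS - down          # 24 uplinks per rack switch (non-blocking split)

# With one uplink from each rack switch to each of `up` directors,
# the director radix caps the number of rack switches.
max_rack_switches = DIRECTOR_PORTS
max_endpoints = max_rack_switches * down

print(f"{max_rack_switches} rack switches x {down} servers each "
      f"= {max_endpoints:,} endpoints")   # 576 x 24 = 13,824

Even under these conservative assumptions the fabric reaches past ten thousand endpoints, consistent with the “multi-thousand endpoint clusters” Murphy describes.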

The company’s main market is organizations that want to upgrade to a new cluster for AI or for faster HPC simulations. Sales go through one of three original equipment manufacturers (OEMs) that Cornelis works with, which build the servers and network switches. The OEM purchases the physical cards from Cornelis and plugs them into servers before fulfilling an order.

Until recently, training a neural network model was a one-time deal. But now, training a multitrillion-parameter AI model means repeatedly fine-tuning and updating it. Cornelis expects to take advantage of that. “If you don’t adopt AI, you’re going out of business. If you use AI inefficiently, you’ll still go out of business,” Murphy says. “Our customers want to adopt AI in the most efficient way possible.”
