

Could a Data Center Rewiring Lead to 6x Faster AI?

Cornelis Networks’ congestion-free architecture takes on Ethernet and InfiniBand


Rachel Berkowitz is a freelance science writer and editor with a Ph.D. in geophysics from the University of Cambridge.


Cornelis Networks’ card plugs into servers to expand networks.

Original imagery: Cornelis

In the good old days, networks were all about connecting a small number of local computers. But times have changed. In an AI-dominated world, the trick is coordinating the activity of tens of thousands of servers to train a large language model, without any delay in communication. Now there’s an architecture optimized to do just that. Cornelis Networks says its CN5000 networking fabric maximizes AI performance, supporting deployments of up to 500,000 computers or processors (an order of magnitude more than today’s networks) with no added latency.

The new technology gives the networking world a third major option alongside Ethernet and InfiniBand. It’s designed to let AI and high-performance computing (HPC, or supercomputing) systems achieve faster and more predictable completion times with greater efficiency. For HPC, Cornelis claims its technology outperforms InfiniBand NDR (the version introduced in 2022), passing twice as many messages per second with 35 percent lower latency. For AI applications, it delivers sixfold faster communication compared with Ethernet-based protocols.

Ethernet has long been synonymous with local area networking, or LAN. Software patches have allowed its communication protocols to stand the test of time. InfiniBand was an improvement, but it was designed with the same goal in mind: connecting a small number of local devices. “When these technologies were invented, they had nothing to do with parallel computing,” says Philip Murphy, co-founder, president, and chief operating officer at Pennsylvania-based Cornelis.

When data centers started to spring up, engineers needed a new networking solution. Because different systems used different software, they couldn’t share resources, so scaling the likes of Ethernet and InfiniBand to accommodate the busiest periods of operation proved challenging. “That sparked the whole cloud evolution,” says Murphy. Sharing a cloud-based CPU among different computers, or even different organizations, became the solution du jour.

But while data center pioneers tried to maximize the number of applications running on one server, Murphy and his colleagues saw value in the opposite approach: maximizing the number of processors working on one application. “That requires a totally different networking solution,” he says, and that is what Cornelis now offers. The company’s Omni-Path architecture, originally developed by Intel for supercomputing applications like simulating climate models or molecular interactions for drug design, delivers maximum throughput with zero data packet loss.

Congestion-free data highway

Coordinating processors to train AI models requires exchanging many messages (data packets) at very high bandwidth. The rate at which messages are passed matters, and so does the latency: how long a recipient takes to respond.

One major challenge with moving so many data packets through a network is traffic congestion. As Murphy explains, you need a way to reliably route packets around congestion points without creating new problems. For example, if packets take different routes to the same destination, they may arrive out of order.

Cornelis’s dynamic adaptive routing algorithm steers packets around short-lived congestion events, while its congestion-control architecture handles traffic bound for “popular” destinations. “If there’s an event at a stadium that we all want to go to, you don’t want the traffic that’s going past the stadium to get caught there too,” says Murphy. A technique called central pacing makes this congestion control work: switches see where traffic is building up, then tell senders to slow down until the congestion dissipates. “Think of mitigating traffic as it comes onto a highway on-ramp,” Murphy explains.
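Cornelis hasn’t published its pacing algorithm, but the general idea of switch-signaled pacing is easy to sketch. In the toy Python simulation below, every class name and number is our own illustrative assumption, not Cornelis’s design: a switch watches its queue depth and signals senders to back off, much like metering lights on a highway on-ramp.

# Toy model of switch-signaled sender pacing (illustrative only).
class Switch:
    def __init__(self, capacity_per_tick=10, threshold=20):
        self.queue = 0                       # packets waiting in the switch
        self.capacity = capacity_per_tick    # packets drained per tick
        self.threshold = threshold           # queue depth that signals congestion

    def tick(self, arrivals):
        self.queue = max(0, self.queue + arrivals - self.capacity)
        return self.queue > self.threshold   # True tells senders to slow down

class PacedSender:
    def __init__(self, rate=8):
        self.rate = rate                     # packets injected per tick

    def adjust(self, congested):
        if congested:
            self.rate = max(1, self.rate // 2)  # back off while the queue is deep
        else:
            self.rate += 1                      # probe upward once traffic clears

switch, senders = Switch(), [PacedSender(), PacedSender()]
for _ in range(50):
    congested = switch.tick(sum(s.rate for s in senders))
    for s in senders:
        s.adjust(congested)
print("final rates:", [s.rate for s in senders], "queue depth:", switch.queue)

The halve-on-congestion, creep-up-otherwise control law here is just one plausible choice; the point is that the switch, not the endpoint, detects the building queue and paces the sources before packets are lost.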

The other challenge is avoiding latency. In traditional Ethernet architectures, sending a packet requires sufficient memory at the endpoint. “If I send to you and you run out of memory, you have to come back and tell me that,” says Murphy. That round trip is long, and the large buffers it demands don’t scale. Instead, Cornelis uses a scheme called credit-based flow control that allocates memory in advance. “You don’t have to tell me anything, and I’ll know how much more I can send,” says Murphy.
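The mechanics of credit-based flow control fit in a short example. In this Python sketch, the classes and buffer sizes are illustrative assumptions, not Cornelis’s implementation: the receiver hands out one credit per free buffer slot in advance, and the sender transmits only while it holds credits, so the receiver can never be caught without memory.

# Toy model of credit-based flow control (illustrative only).
from collections import deque

class Receiver:
    def __init__(self, buffer_slots=2):
        self.buffer = deque()
        self.buffer_slots = buffer_slots

    def initial_credits(self):
        return self.buffer_slots        # advertise every free slot up front

    def deliver(self, packet):
        self.buffer.append(packet)      # room is guaranteed by the credit

    def consume(self):
        self.buffer.popleft()           # process one packet...
        return 1                        # ...and return a credit to the sender

class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.initial_credits()

    def send(self, packet):
        if self.credits == 0:
            return False                # out of credits: wait, don't transmit
        self.credits -= 1
        self.receiver.deliver(packet)
        return True

rx = Receiver(buffer_slots=2)
tx = Sender(rx)
print([tx.send(p) for p in "abc"])      # [True, True, False]: no credit for 'c'
tx.credits += rx.consume()              # receiver frees a slot; credit flows back
print(tx.send("c"))                     # True: the sender never had to ask

Because the sender’s credit counter already reflects the receiver’s free memory, no “I’m full” message ever has to make the long trip back, which is exactly the loop Murphy describes eliminating.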

Finally, the system avoids grinding to a halt if a GPU or link fails. In traditional architectures, if a server goes down, so does the application. Fixing it requires rebooting from the most recent checkpoint, which itself took extensive computing power to create. “Imagine if every time you hit ‘save’ on your document, you had to wait 20 minutes,” says Murphy. Instead, because the workload is spread across multiple computers, Cornelis’s network keeps an application running, albeit at slightly lower bandwidth, until the faulty link can be replaced. No checkpoints needed.
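That degraded-but-running behavior can be shown with a toy model. In the sketch below, the link counts and bandwidths are made up for illustration, not taken from Cornelis: traffic to a destination is spread across parallel links, so losing one link trims aggregate bandwidth instead of killing the job.

# Toy model of graceful degradation across parallel links (illustrative only).
class LinkGroup:
    def __init__(self, n_links, gbps_each):
        self.links = [gbps_each] * n_links   # per-link bandwidth in Gb/s

    def fail(self, index):
        del self.links[index]                # drop the faulty link from service

    def bandwidth(self):
        return sum(self.links)               # what the application still gets

group = LinkGroup(n_links=4, gbps_each=100)
print(group.bandwidth())    # 400 Gb/s with all links healthy
group.fail(0)               # a link dies; the application keeps running
print(group.bandwidth())    # 300 Gb/s until the link is replaced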

Efficient AI

Physically, the CN5000 product is a network card built around a custom chip. The cards plug into every server, “like you plug an Ethernet card into your PC at home,” explains Murphy. A top-of-rack switch is cabled to each server and to other switches, and a director-class switch, with 48 or 576 ports, links the rack switches together. “Each server has cards plugged in, so you can build multi-thousand endpoint clusters,” says Murphy.
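The article doesn’t spell out a reference topology, but a back-of-the-envelope calculation shows how those port counts compound. In the Python sketch below, the two-tier layout and the even split of rack-switch ports between servers and uplinks are our assumptions, not a published Cornelis design.

# Rough, illustrative cluster sizing under assumed topology choices.
RACK_PORTS = 48         # ports on each top-of-rack switch
DIRECTOR_PORTS = 576    # ports on each director-class switch

down = RACK_PORTS // 2          # 24 server-facing ports per rack switch
up = RACK_PORTS - down          # 24 uplinks per rack switch (non-blocking split)

# With one uplink from each rack switch to each of `up` directors,
# the director radix caps the number of rack switches.
max_rack_switches = DIRECTOR_PORTS
max_endpoints = max_rack_switches * down

print(f"{max_rack_switches} rack switches x {down} servers each "
      f"= {max_endpoints:,} endpoints")   # 576 x 24 = 13,824

Even under these conservative assumptions the fabric reaches past ten thousand endpoints, consistent with the “multi-thousand endpoint clusters” Murphy describes.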

The company’s main market is organizations that want to upgrade to a new cluster for AI or for faster HPC simulations. Sales go through one of three original equipment manufacturers (OEMs) that Cornelis works with, which build the servers and network switches. The OEM purchases the physical cards from Cornelis and plugs them into servers before fulfilling an order.

Until recently, training a neural network model was a one-time deal. But now, training a multitrillion-parameter AI model means repeatedly fine-tuning and updating it. Cornelis expects to take advantage of that. “If you don’t adopt AI, you’re going out of business. If you use AI inefficiently, you’ll still go out of business,” Murphy says. “Our customers want to adopt AI in the most efficient way possible.”
