AI cloud startup TensorWave bets AMD can beat Nvidia

Cloud operators specializing in running hot and power-hungry GPUs and other AI infrastructure are emerging, and while some of these players – like CoreWeave, Lambda, or Voltage Park – have built their clusters using tens of thousands of Nvidia GPUs, others are turning to AMD instead.

One example of the latter is bit barn startup TensorWave, which earlier this month began racking systems powered by AMD’s Instinct MI300X. It intends to lease the chips at a fraction of the cost charged to access Nvidia’s accelerators.

Jeff Tatarchuk, co-founder of TensorWave, believes AMD’s latest accelerator has plenty going for it. For starters, you can actually buy it: TensorWave has secured a large allocation of the parts.

By the end of 2024, TensorWave aims to deploy 20,000 MI300X accelerators across two facilities, and plans to bring an additional liquid-cooled system online next year.

Tatarchuk also believes AMD’s latest AI silicon is faster than Nvidia’s highly coveted H100. “Just in raw specs, the MI300X dominates the H100,” he said.

Launched at AMD’s Advancing AI event in December, the MI300X is the firm’s most advanced accelerator to date. The 750W chip uses advanced packaging to stitch together 12 chiplets – 20 if you count the HBM3 modules – into a single GPU that’s claimed to be 32 percent faster than Nvidia’s H100.

In addition to superior floating-point performance, the chip also boasts more memory: 192GB of HBM3 capable of delivering 5.3TB/s of bandwidth, compared to the H100’s 80GB and 3.35TB/s.

As we’ve seen with Nvidia’s H200 – an HBM3e-enhanced version of the H100 – memory bandwidth is a key contributor to AI performance, particularly for inference on large language models.
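
To see why bandwidth matters so much here: generating each token from a dense model means streaming every weight out of memory at least once, so bandwidth sets a hard ceiling on decode speed. A rough back-of-the-envelope sketch using the spec-sheet figures above; the 70B-parameter FP16 model is our illustrative assumption, not anything TensorWave has said it will run:

```python
# Back-of-the-envelope: memory-bandwidth-bound LLM decode throughput.
# At batch size 1, each generated token requires reading every model
# weight from HBM once, so tokens/sec <= bandwidth / model size in bytes.

def decode_tokens_per_sec(bandwidth_tb_s: float, params_b: float,
                          bytes_per_param: int = 2) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# Illustrative: a hypothetical 70B-parameter model in FP16 (2 bytes/param).
for name, bw_tb_s in [("MI300X", 5.3), ("H100", 3.35)]:
    print(f"{name}: ~{decode_tokens_per_sec(bw_tb_s, 70):.0f} tokens/s ceiling")
```

On those assumptions the MI300X’s ceiling works out to roughly 38 tokens/s per stream versus about 24 for the H100 – the same ~1.6x ratio as the raw bandwidth figures.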

As with Nvidia’s HGX and Intel’s OAM designs, the standard configuration for AMD’s latest GPU calls for eight accelerators per node.

That’s the configuration the folks at TensorWave have been racking and stacking.

“We have hundreds going now and thousands going in the coming months,” Tatarchuk said.

Racking them up

In a picture posted to social media, the TensorWave crew showed off three 8U Supermicro AS-8125GS-TNMR2 systems racked up. This led us to question whether TensorWave’s racks were power or thermally limited; it’s not unusual for these systems to draw in excess of 10kW each at full load.

It turns out the folks at TensorWave hadn’t finished installing the machines, and that the firm is targeting four nodes with a total capacity of about 40kW per rack. These systems will be cooled using rear door heat exchangers (RDHx). As we’ve discussed in the past, these are essentially rack-sized radiators through which cool water flows. As hot air exits a conventional server, it passes through the radiator, which cools it to an acceptable temperature.
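
Those numbers hang together: the GPUs alone account for the bulk of the rack budget. A quick sanity check; the per-node overhead figure is our assumption, not TensorWave’s:

```python
# Sanity check on the ~40kW-per-rack target: four 8-way MI300X nodes.
GPUS_PER_NODE = 8
GPU_TDP_W = 750          # MI300X board power, per AMD
NODES_PER_RACK = 4
NODE_OVERHEAD_W = 2_500  # assumed: CPUs, NICs, fans, PSU losses per node

gpus_w = GPUS_PER_NODE * GPU_TDP_W * NODES_PER_RACK   # 24,000W of GPUs
rack_w = gpus_w + NODE_OVERHEAD_W * NODES_PER_RACK    # ~34,000W all-in
print(f"GPUs alone: {gpus_w/1000:.0f}kW/rack; "
      f"with node overhead: {rack_w/1000:.0f}kW/rack")
```

That lands around 34kW, leaving a few kilowatts of headroom under the 40kW target.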

This cooling tech has become a hot topic among data center operators looking to support GPU clusters and has led to some supply chain challenges, said TensorWave COO Piotr Tomasik.

“At the moment there are also a lot of capacity issues in the ancillary equipment around the data centers,” he said, specifically citing RDHx as a pain point. “We’ve been successful so far and we were very pleased with our ability to deploy them.”

Longer term, however, TensorWave has its sights set on direct-to-chip cooling, which can be difficult to deploy in data centers that weren’t designed to house GPUs, Tomasik said. “We’re excited to deploy direct-to-chip cooling in the second half of the year. We think it will be much better and easier with density.”

Performance anxiety

Another challenge is confidence in AMD’s performance. According to Tatarchuk, while there’s a lot of excitement around AMD as an alternative to Nvidia, customers aren’t sure they’ll enjoy the same performance. “There’s a lot of ‘we’re not 100 percent sure if it’s going to be as good as what we’re currently using on Nvidia,’” he said.

In the interest of getting its systems up and running as quickly as possible, TensorWave will launch its MI300X nodes using RDMA over Converged Ethernet (RoCE). These bare-metal systems will be available for fixed lease periods, apparently for as little as $1/hr/GPU.
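
For a sense of scale at that floor price, here’s what a full eight-GPU node works out to; a minimal sketch, assuming the whole node is billed at the quoted per-GPU rate:

```python
# What an 8-way MI300X node costs at the quoted $1/hr/GPU floor price.
# Lease terms are illustrative; TensorWave's actual tiers aren't public here.
PRICE_PER_GPU_HR = 1.00
GPUS_PER_NODE = 8
HOURS_PER_MONTH = 730  # average hours in a month

node_per_hr = PRICE_PER_GPU_HR * GPUS_PER_NODE
print(f"Per node: ${node_per_hr:.2f}/hr, "
      f"~${node_per_hr * HOURS_PER_MONTH:,.0f}/month")
```

That’s $8/hr, or roughly $5,840 a month, for a node with 1.5TB of HBM3 on board.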

Scale up

Over time, the startup aims to introduce a more cloud-like orchestration layer for provisioning resources. Also on the agenda is deploying GigaIO’s PCIe 5.0-based FabreX technology to stitch together as many as 5,750 GPUs in a single domain with more than a petabyte of high-bandwidth memory.
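
The petabyte claim is easy to verify from the MI300X’s memory capacity quoted earlier; a quick check, counting in decimal units as vendors do:

```python
# Does 5,750 MI300X GPUs really add up to "more than a petabyte" of HBM?
GPUS = 5_750
HBM_GB_PER_GPU = 192   # MI300X HBM3 capacity

total_gb = GPUS * HBM_GB_PER_GPU
print(f"{total_gb:,}GB = {total_gb / 1e6:.2f}PB of HBM3")  # ~1.10PB
```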

These so-called TensorNODEs are based on GigaIO’s SuperNODE architecture that it showed off last year, which used a pair of PCIe switch devices to connect 32 AMD MI210 GPUs together. In theory, this allows a single CPU head node to address far more than the eight accelerators typically seen in GPU nodes today.

This approach differs from Nvidia’s preferred design, which uses NVLink to stitch multiple superchips together into one big GPU. While NVLink is considerably quicker – topping out at 1.8TB/s of bandwidth in its latest iteration, compared to just 128GB/s for PCIe 5.0 – it only supports configurations of up to 576 GPUs.
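
The trade-off boils down to bandwidth per link versus how far the fabric scales. Tabulating the figures discussed above (domain sizes are the respective vendors’ claims, not numbers we’ve tested):

```python
# Per-GPU fabric bandwidth vs. maximum domain size, per the figures above.
# The PCIe 5.0 number is a single x16 link; domain sizes are vendor claims.
fabrics = {
    "NVLink (latest gen)":   {"gb_per_s": 1_800, "max_gpus": 576},
    "PCIe 5.0 x16 (FabreX)": {"gb_per_s": 128,   "max_gpus": 5_750},
}
for name, f in fabrics.items():
    print(f"{name}: {f['gb_per_s']:,}GB/s per link, "
          f"up to {f['max_gpus']:,} GPUs per domain")
```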

TensorWave will fund the build-out of its bit barn by using its GPUs as collateral for a large round of debt financing, an approach used by other data center operators. Just last week, Lambda revealed it had secured a $500 million loan to fund the deployment of “tens of thousands” of Nvidia’s fastest accelerators.

Meanwhile, CoreWeave, one of the largest providers of GPUs for rent, secured a massive $2.3 billion loan to expand its data center footprint.

“You should expect us to make a similar announcement later this year,” Tomasik said. ®
