Nvidia fixes Blackwell chip flaw with help from TSMC, mass production back on schedule

TechSpot means tech analysis and advice you can trust.

What just happened? Nvidia has successfully fixed a design flaw in its latest Blackwell AI chips, according to CEO Jensen Huang. The issue, which caused production delays, has been solved with the assistance of TSMC, Nvidia’s long-standing manufacturing partner. In fact, it was TSMC that originally spotted the problem.

Overcoming this issue was crucial for Nvidia, as it aims to maintain its dominant position in the AI chip market. As demand for high-performance AI computing solutions continues to surge, the successful launch of Blackwell will play a pivotal role in providing the necessary hardware.

Huang candidly admitted the company’s responsibility for the setback. “We had a design flaw in Blackwell,” he said. “It was functional, but the design flaw caused the yield to be low. It was 100 percent Nvidia’s fault.”

The Blackwell chips, unveiled in March, were originally slated for second-quarter shipping. However, the design flaw led to delays, potentially affecting major customers such as Meta, Google, and Microsoft.

The Blackwell project was unusually complex, Huang said, which may have been a factor in the flaw. “In order to make a Blackwell computer work, seven different types of chips were designed from scratch and had to be ramped into production at the same time.”

The technical issue stemmed from the intricate packaging technology used in the Blackwell B100 and B200 GPUs. These chips use TSMC’s CoWoS-L packaging, which relies on a redistribution layer (RDL) interposer with local silicon interconnect (LSI) bridges to achieve data transfer rates of about 10 TB/s. The problem arose from a mismatch in thermal expansion properties between the package’s components, which caused warping and failures.

To address this, Nvidia modified the top metal layers and bumps of the GPU silicon, enhancing production yields. While specific details of the fix remain undisclosed, the company confirmed that new masks were required.
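
For a rough sense of why a thermal expansion mismatch matters at package scale, the linear expansion relation ΔL = α · L · ΔT can be sketched in a few lines of Python. Every number below (the expansion coefficients, the 50 mm span, and the 100 K temperature swing) is an illustrative assumption, not a figure disclosed by Nvidia or TSMC, and the function name is hypothetical.

```python
# Illustrative sketch only: all values are assumptions chosen for scale,
# not data from Nvidia or TSMC.

def thermal_expansion_um(cte_ppm_per_k: float, length_mm: float, delta_t_k: float) -> float:
    """Linear expansion dL = alpha * L * dT, returned in micrometres."""
    alpha = cte_ppm_per_k * 1e-6   # ppm/K -> 1/K
    length_um = length_mm * 1e3    # mm -> micrometres
    return alpha * length_um * delta_t_k

SILICON_CTE = 2.6      # ppm/K, typical textbook value for silicon
SUBSTRATE_CTE = 15.0   # ppm/K, assumed value for an organic package substrate
SPAN_MM = 50.0         # assumed span across the package
DELTA_T_K = 100.0      # assumed temperature swing

silicon_um = thermal_expansion_um(SILICON_CTE, SPAN_MM, DELTA_T_K)
substrate_um = thermal_expansion_um(SUBSTRATE_CTE, SPAN_MM, DELTA_T_K)
mismatch_um = substrate_um - silicon_um

print(f"silicon grows   {silicon_um:.1f} um")
print(f"substrate grows {substrate_um:.1f} um")
print(f"mismatch        {mismatch_um:.1f} um")
```

Under these assumed inputs the two materials end up tens of micrometres apart over the same span, which is large relative to the fine bumps that join the layers. That gap is the kind of stress that can warp a package, although the real geometry and materials in Blackwell are more complicated than this one-dimensional estimate.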

The speed of the resolution is noteworthy. Typically, addressing such issues in the semiconductor industry involves modifying metal layers and creating new steppings, a process that can take around three months. “What TSMC did was to help us recover from that yield difficulty and resume the manufacturing of Blackwell at an incredible pace,” Huang said.

With the design flaw now resolved, mass production of the fixed Blackwell GPUs is set to begin in late October. Shipments are expected to start in early 2025, aligning with Nvidia’s fiscal year.

Despite the setback, demand for Blackwell chips remains high. Huang had previously described the demand as “insane,” with customers eager to be first in line for the new technology.

Google has ordered over 400,000 GB200 chips in a deal exceeding $10 billion. Similarly, Meta has placed a $10 billion order, while Microsoft is set to have 55,000 to 65,000 GB200 GPUs ready for OpenAI by the first quarter of 2025.
