Meta’s Llama 3 Training Hampered by Faulty Nvidia H100 GPUs: A Glitch Every Three Hours

Over a 54-day period, the cluster encountered 419 unexpected component failures, averaging one failure every three hours.

Introduction

Meta recently released a detailed study of its Llama 3 405B model training run, which used a cluster of 16,384 Nvidia H100 80GB GPUs. Over a 54-day period, the cluster encountered 419 unexpected component failures, averaging one failure every three hours. Faulty GPUs and their onboard HBM3 memory were responsible for roughly half of these failures.

This article delves into the findings of Meta’s study and the implications for large-scale AI training clusters.


The Scale and Complexity of Supercomputing

Supercomputers are incredibly complex systems, comprising tens of thousands of processors, hundreds of thousands of other chips, and extensive cabling.

The sheer scale and synchronous nature of these systems make them prone to failures. As the saying goes in supercomputing, “the only certainty with large-scale systems is failure.”

For developers, the key challenge is ensuring that the system remains operational despite frequent local breakdowns.

Meta’s Llama 3 Training Cluster

Meta’s Llama 3 training cluster is no exception to this rule. The 16,384-GPU setup faced 466 job interruptions during the 54-day pre-training snapshot: 47 planned and 419 unexpected.

Planned interruptions were mainly due to automated maintenance, while unexpected ones stemmed mostly from hardware issues.


Breakdown of Failures

GPU-related problems were the largest category of unexpected interruptions, accounting for 58.7% of the total.

Specifically, 148 failures (30.1%) were caused by various GPU issues, including NVLink failures, while 72 failures (17.2%) were due to HBM3 memory failures.

Given that Nvidia’s H100 GPUs consume around 700W and endure significant thermal stress, these findings are not entirely surprising. Interestingly, only two CPUs failed over the 54-day period.

Mitigating Failures

Meta’s Llama 3 team managed to maintain over 90% effective training time despite the frequent failures. They employed several strategies to mitigate the impact of hardware issues:

  1. Faster Job Startup and Checkpointing: The team reduced job startup and checkpointing times so that restarts cost less downtime and less work was lost per failure (a minimal checkpointing sketch follows this list).
  2. Proprietary Diagnostic Tools: Meta developed proprietary diagnostic tools to quickly identify and resolve issues.
  3. PyTorch’s NCCL Flight Recorder: This tool was used extensively to diagnose hangs and performance issues, particularly those related to NCCLX. It captures collective metadata and stack traces, aiding swift problem resolution.
  4. Straggler Detection: Specialized tools identified and prioritized problematic communications, enabling timely detection and resolution of straggling GPUs.
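
Meta’s internal tooling is proprietary, but the checkpointing idea in item 1 can be illustrated with a minimal PyTorch sketch. Everything below (the checkpoint path, save interval, and training-loop shape) is an illustrative assumption, not Meta’s implementation.

```python
# Minimal periodic-checkpointing sketch (illustrative; not Meta's code).
import os
import torch

CKPT_PATH = "checkpoint.pt"   # hypothetical checkpoint location
SAVE_EVERY = 500              # hypothetical save interval, in steps

def save_checkpoint(step, model, optimizer):
    # Write to a temp file, then rename, so a crash mid-save
    # never corrupts the most recent good checkpoint.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if present; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, data_loader, total_steps):
    # Data-order bookkeeping on resume is omitted for brevity.
    step = load_checkpoint(model, optimizer)
    for batch in data_loader:
        if step >= total_steps:
            break
        loss = model(batch).mean()        # placeholder loss for the sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % SAVE_EVERY == 0:
            save_checkpoint(step, model, optimizer)
        step += 1
```

The trade-off is simple: more frequent checkpoints add I/O overhead, but with a failure arriving roughly every three hours, the work at risk after any crash is bounded by the checkpoint interval.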


Environmental and Power Challenges

Environmental factors also played a role in training performance. Mid-day temperature fluctuations affected the GPUs’ dynamic voltage and frequency scaling, causing a 1-2% variation in throughput.

Additionally, the simultaneous power consumption changes of tens of thousands of GPUs stressed the data center’s power grid.

These fluctuations, sometimes in the tens of megawatts, stretched the grid’s limits and required Meta to ensure adequate power supply for its data centers.
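
A rough back-of-the-envelope estimate (ours, not a figure from Meta’s study) shows why synchronized load changes matter at this scale; it uses only the ~700W per H100 and 16,384-GPU figures cited earlier.

```python
# Back-of-the-envelope estimate (illustrative; not from Meta's study).
NUM_GPUS = 16_384      # cluster size reported by Meta
GPU_POWER_W = 700      # approximate H100 draw cited above

gpu_power_mw = NUM_GPUS * GPU_POWER_W / 1e6
print(f"GPU power alone: ~{gpu_power_mw:.1f} MW")   # ~11.5 MW
```

GPU power alone is already around 11.5 MW; adding host CPUs, networking, and cooling overhead pushes facility load higher, so cluster-wide stalls and ramps can plausibly swing demand by the tens of megawatts the study describes.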

Comparison with xAI’s Cluster

The study raises questions about the reliability of even larger AI training clusters. For instance, xAI’s cluster, which contains 100,000 H100 GPUs, is six times larger than Meta’s.

If it experiences the same per-GPU failure rate, it could face roughly 2,560 failures over a similar period.
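
The arithmetic behind that projection is straightforward: scale Meta’s observed failure count by the ratio of cluster sizes, under the (strong) assumption that the per-GPU failure rate stays constant.

```python
# Scaling Meta's observed failure count to a 100,000-GPU cluster
# (assumes the per-GPU failure rate stays the same).
META_GPUS, META_FAILURES, META_DAYS = 16_384, 419, 54
XAI_GPUS = 100_000

hours_between_failures = META_DAYS * 24 / META_FAILURES          # ~3.1 hours
projected_failures = META_FAILURES * XAI_GPUS / META_GPUS        # ~2,557

print(f"Meta: one failure every ~{hours_between_failures:.1f} hours")
print(f"Projected failures at xAI scale over {META_DAYS} days: ~{projected_failures:.0f}")
```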

This emphasizes the importance of robust failure mitigation strategies in large-scale AI training.

Conclusion

Meta’s experience with its Llama 3 training cluster highlights the challenges of maintaining large-scale AI systems.

Despite encountering frequent failures, the Llama 3 team successfully maintained a high effective training time through proactive failure mitigation strategies.

As AI models and their training clusters continue to grow in size, these lessons will be crucial for the industry.

Kumar Priyadarshi

Kumar joined IISER Pune after qualifying IIT-JEE in 2012. In his fifth year, he travelled to Singapore for his master’s thesis, which yielded a research paper in ACS Nano. He then joined Global Foundries in Singapore as a process engineer working at the 40 nm node. Later, as a senior scientist at IIT Bombay, he led the team that built India’s first memory chip with the Semiconductor Lab (SCL).
