AI Chips Depreciate Quickly Due To Heat Stress During Model Training Runs
Michael Burry's view is that companies are depreciating their GPUs over five-year periods even though the chips become technologically obsolete in 18 months or less. If he is right, true profits and rates of return are negative, despite corporate claims to the contrary.
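To make the size of that gap concrete, here is a minimal sketch of the arithmetic. It assumes straight-line depreciation with zero salvage value, and the purchase price `GPU_COST` is a hypothetical round number, not a figure from any company's books. Only the 5-year and 18-month lives come from the argument above.

```python
def straight_line_annual_expense(cost: float, useful_life_years: float) -> float:
    """Annual straight-line depreciation expense, assuming zero salvage value."""
    if useful_life_years <= 0:
        raise ValueError("useful_life_years must be positive")
    return cost / useful_life_years


GPU_COST = 30_000.0        # hypothetical purchase price per GPU, in dollars
BOOK_LIFE_YEARS = 5.0      # schedule companies are said to use
ECONOMIC_LIFE_YEARS = 1.5  # 18-month obsolescence in the Burry view

book_expense = straight_line_annual_expense(GPU_COST, BOOK_LIFE_YEARS)
economic_expense = straight_line_annual_expense(GPU_COST, ECONOMIC_LIFE_YEARS)

# The ratio does not depend on the assumed price: it is 5 / 1.5.
understatement_ratio = economic_expense / book_expense

print(f"Booked expense per GPU-year:   ${book_expense:,.0f}")
print(f"Economic expense per GPU-year: ${economic_expense:,.0f}")
print(f"Economic / booked ratio:       {understatement_ratio:.2f}x")
```

Under these assumptions the annual depreciation charge would need to be a bit more than three times what a five-year schedule reports, whatever the actual price of the chip.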
But heat stress also depreciates GPUs. AI chips run marathon sessions during model estimation (training), often exceeding a month, and GPUs often fail under the heat stress: "Imagine you had a 10,000 or even a 20,000 GPU data center. You should expect on the statistics a chip to fail about every 3 or 4 hours. So long before I get to the point where I’m rapidly turning these over because there’s a new generation of chips, I’m turning over a vast chunk of my chips just because they’re failing under thermal stress." Source: https://paulkrugman.substack.com/p/talking-with-paul-kedrosky. GPUs used to estimate large language models are like ultramarathon runners (50, 100, and 166 kilometers) on a 38-degree day.
It took Facebook 54 days to estimate the Llama 3 open source large language model using 16,384 Nvidia H100 80GB GPUs. There was a hardware failure roughly every three hours, with GPU heat deaths making up the majority of them.
Full details here: https://www.datacenterdynamics.com/en/news/meta-report-details-hundreds-of-gpu-and-hbm3-related-interruptions-to-llama-3-training-run/
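The failure intervals quoted above can be turned into a rough per-GPU rate. The sketch below is a back-of-the-envelope calculation, not a reliability model: it assumes failures arrive at a steady pace, that the cluster runs flat out all year, and that every failure takes out one GPU for good. Many interruptions are recoverable, so the result is best read as an upper bound on replacement. The function names are illustrative only.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365


def failures_in_run(run_days: float, hours_between_failures: float) -> float:
    """Expected number of failures over a run at a steady failure interval."""
    return run_days * HOURS_PER_DAY / hours_between_failures


def annual_failures_per_gpu(num_gpus: int, hours_between_failures: float) -> float:
    """Failures per GPU per year if the cluster ran year-round at this pace."""
    return HOURS_PER_YEAR / hours_between_failures / num_gpus


# Llama 3 run as described above: 54 days, 16,384 GPUs, one failure per ~3 hours.
llama_failures = failures_in_run(run_days=54, hours_between_failures=3)
llama_rate = annual_failures_per_gpu(num_gpus=16_384, hours_between_failures=3)

print(f"Llama 3 run: about {llama_failures:.0f} failures in 54 days")
print(f"Llama 3 pace, annualized: {llama_rate:.1%} of GPUs per year")

# The Kedrosky quote: 10,000 to 20,000 GPUs, a failure every 3 to 4 hours.
for num_gpus in (10_000, 20_000):
    for interval_hours in (3, 4):
        rate = annual_failures_per_gpu(num_gpus, interval_hours)
        print(f"{num_gpus:>6} GPUs, one failure per {interval_hours} h: {rate:.1%} per year")
```

On these assumptions both sources point to the same order of magnitude: somewhere between a tenth and three tenths of the fleet failing per year of continuous training.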