Thermal Fail

Why cold is good for computing systems.

The discussion that follows assumes well-designed systems, in particular a proper choice of materials and mechanical design, so that differences in the coefficients of thermal expansion of different materials do not cause large mechanical stresses. This is partly a problem of material cost and system complexity, but careful design and cleverness can work around many of these issues; we have built systems that tolerate wide thermal variations, mostly by decreasing stiffness where it is not needed.

Heat adds thermal noise to signals, and supplies activation energy for atom displacement. Most digital systems are designed with large signal swings to overcome tolerances, low gain, and thermal noise. A large "noise" margin is costly in power and speed, but it avoids the need for careful analog-level design and for error detection and correction. As gates scale down and systems scale up, error detection becomes mandatory anyway. As design systems become more powerful, analog-level design becomes easier to automate.

Logical Failure

There are two minimum limits on logic:

Shannon bit energy is not practically achievable for complex logic systems, but has been approached for communication and memory systems. Neural operations use about 1e6 times the Shannon bit energy, and deep-sub-micron CMOS is dropping below the neural level. Note that brains rewire and reprogram themselves, while reprogramming the physical wiring of a chip requires manufacturing a new chip.
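To put rough numbers on this, here is a minimal sketch that assumes "Shannon bit energy" means kT ln(2), the usual thermodynamic minimum energy per bit; the text above does not spell out the formula, so treat that as an assumption. The function name and the choice of 293 kelvins are illustrative, and the 1e6 multiplier is simply the neural estimate quoted above.

```python
import math

K_EV_PER_KELVIN = 1.0 / 11605.0   # Boltzmann's constant in eV per kelvin
EV_TO_JOULE = 1.602176634e-19     # joules per electron volt


def bit_energy_ev(temp_k):
    """Assumed Shannon bit energy, k * T * ln(2), in electron volts."""
    return K_EV_PER_KELVIN * temp_k * math.log(2.0)


room_ev = bit_energy_ev(293.0)        # room temperature, 20 C
neural_ev = 1e6 * room_ev             # "about 1e6 times" figure from the text

print(f"bit energy at 293 K: {room_ev:.4f} eV = {room_ev * EV_TO_JOULE:.2e} J")
print(f"1e6 x bit energy   : {neural_ev * EV_TO_JOULE:.2e} J")
```

Under this assumption the bit energy scales linearly with absolute temperature, so halving the temperature halves the minimum energy per bit.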

"Field programmable gate arrays" are flexible, but they do not reprogram physical wiring; they just select a few of the potentially available paths, which requires far more area, cost, and power per logic operation. They are useful where logical versatility is more important than speed and efficiency.


Physical Failure

Real systems are limited by thermal noise, temperature-driven decay, and heat disposal. The biggest computers in 2016 are data centers; almost half the incoming power drives the cooling systems that dispose of all of the power.

Physical systems (as opposed to philosophical abstracts) are made of arranged atoms, which are kept in place by finite energy barriers. If an infrequent thermal excursion pushes an atom over an energy barrier, the system diffuses one step closer to randomness, and one step is all it takes to break an atomic scale system.

The size of the energy barriers in a typical solid object (like a transistor gate) varies, but for electronic systems we typically assume E = 1.0 electron volt.

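The rate at which these rare but damaging excursions happen is estimated by the Arrhenius equation:

$ \mathrm{rate} = A \, e^{-E/(kT)} $

where A is a process-dependent prefactor and T is the absolute temperature.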

Boltzmann's constant k is 1 electron volt per 11605 kelvins. If E is 1 eV and T is decreased from room temperature (20C, or 293 kelvins) to 277 kelvins (4C), the rate decreases by a factor of about 10. Increasing the temperature to 38C increases the rate by a factor of about 10, and increasing it to 82C increases decay rates by about 1000x. Hot systems fail fast.
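The factors quoted above can be checked with a short sketch. It assumes the standard Arrhenius form, rate proportional to exp(-E/kT), so the unknown prefactor cancels when two temperatures are compared. The function name rate_ratio and its defaults (20C reference, 1 eV barrier) are illustrative choices, not part of the original page.

```python
import math

K_PER_EV = 11605.0  # kelvins per electron volt (1/k), as quoted in the text


def rate_ratio(t_new_c, t_ref_c=20.0, barrier_ev=1.0):
    """Arrhenius rate at t_new_c divided by the rate at t_ref_c (both in Celsius).

    The prefactor cancels in the ratio, so only E/k and the two
    absolute temperatures matter.
    """
    t_new_k = t_new_c + 273.15
    t_ref_k = t_ref_c + 273.15
    return math.exp(barrier_ev * K_PER_EV * (1.0 / t_ref_k - 1.0 / t_new_k))


for t_c in (4.0, 38.0, 82.0):
    print(f"{t_c:5.1f} C: {rate_ratio(t_c):.3g} x the 20 C rate")
```

The ratios come out close to 0.1, 10, and 1000. Changing barrier_ev shows how strongly the result depends on the assumed 1 eV barrier: the exponent scales linearly with E.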

Going the other way, temperature reduction produces drastic decreases in damage rates. Cooling to -24C reduces failure rates by about 1000x, and cooling to -101C reduces them by a factor of about a trillion.
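The same relation can be inverted to ask how cold a system must be for a desired reduction in damage rate. This is a sketch under the same assumptions (Arrhenius behavior, a 1 eV barrier, a 20C reference); the function name temperature_for_reduction is hypothetical.

```python
import math

K_PER_EV = 11605.0  # kelvins per electron volt (1/k), as quoted in the text


def temperature_for_reduction(factor, t_ref_c=20.0, barrier_ev=1.0):
    """Temperature in Celsius at which the rate is 1/factor of the rate at t_ref_c.

    From exp((E/k) * (1/T_ref - 1/T_new)) = 1/factor it follows that
    1/T_new = 1/T_ref + ln(factor) * k / E.
    """
    t_ref_k = t_ref_c + 273.15
    inv_t_new = 1.0 / t_ref_k + math.log(factor) / (barrier_ev * K_PER_EV)
    return 1.0 / inv_t_new - 273.15


for factor in (1e3, 1e12):
    print(f"{factor:.0e}x slower at about {temperature_for_reduction(factor):.1f} C")
```

The results land within a degree of the -24C and -101C figures above. Each further factor of 1000 costs more degrees of cooling than the last, because the rate depends on 1/T rather than on T.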

ThermalFail (last edited 2016-11-22 22:12:33 by KeithLofstrom)