Aggressive CMOS scaling will make future chip multiprocessors (CMPs) increasingly susceptible to transient faults, hard errors, manufacturing defects, and process variations. Existing fault-tolerant CMP proposals that implement dual modular redundancy (DMR) do so by statically binding pairs of adjacent cores via dedicated communication channels and buffers. This can result in unnecessary power and performance losses in cases where one core is defective (in which case the entire DMR pair must be disabled), or when cores exhibit different frequency/leakage characteristics due to process variations (in which case the pair runs at the speed of the slowest core). Static DMR also hinders power density/ thermal management, as DMR pairs running code with similar power/thermal characteristics are necessarily placed next to each other on the die.
We present dynamic core coupling (DCC), an architectural technique that allows arbitrary CMP cores to verify each others' execution communication hardware. Our evaluation shows that the performance overhead of DCC over a CMP without fault tolerance is 3% on on SPEC2000 benchmarks, and is within 5% for a set of scalable parallel scientific and data mining applications with up to eight threads (16 processors). Our results also show that DCC has the potential to significantly outperform existing static DMR schemes.