3.00 Credits
Survey of current fault-tolerance and reliability issues in high-performance computing (HPC) systems. Topics include taxonomy of failures and errors, checkpoint-restart, fault injection techniques, soft error detection schemes, and lossy compression. May also be offered as ECE 6740. Students are expected to have completed coursework comparable to ECE 3220 or ECE 3290 before enrolling in this course. It is recommended that students also complete coursework comparable to ECE 4730/6730 before enrolling.