System 4: Autotuning for Improving the Fault Tolerance of Large-scale Simulations


Hiroyuki Takizawa

13:40:00 - 14:05:00

101, Mathematics Research Center Building (ori. New Math. Bldg.)

HPC systems are growing in scale, which raises the probability of encountering a fault during a simulation. Future-generation simulations therefore need to be more fault tolerant. Because the maximum CPU time for a single job is usually limited, today's simulation programs are already capable of checkpointing: by restarting from the checkpoint file, a simulation can run for longer than that maximum. To tolerate frequent faults, however, simulation programs need to be checkpointed more often. Since the time overhead of checkpointing is quite large, checkpointing too frequently causes severe performance degradation. In this talk, we will therefore discuss autotuning of the checkpoint interval and of the location to which checkpoint files are written. Our performance model indicates that the time overhead can be reduced significantly by exploiting the storage hierarchy with appropriate checkpoint intervals.
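The trade-off between checkpoint cost and lost work can be illustrated with a minimal sketch. It does not reproduce the performance model from the talk; it assumes the classic first-order approximation known as Young's formula, in which the interval that minimizes overhead is sqrt(2 * C * M) for a checkpoint cost C and a mean time between failures M. The function names, the two storage tiers, and all numbers below are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval (seconds)
    that minimizes overhead, assuming checkpoint cost << MTBF."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float,
                      mtbf_s: float) -> float:
    """Approximate fraction of run time lost to fault tolerance:
    time spent writing checkpoints plus expected recomputation
    (on average half an interval is lost per failure)."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    mtbf_s = 10_000.0  # hypothetical mean time between failures

    # Hypothetical checkpoint costs for two levels of a storage hierarchy.
    tiers = {"parallel file system": 50.0, "node-local storage": 2.0}

    for name, cost_s in tiers.items():
        interval_s = young_interval(cost_s, mtbf_s)
        overhead = overhead_fraction(interval_s, cost_s, mtbf_s)
        print(f"{name}: interval ~{interval_s:.0f} s, "
              f"overhead ~{overhead:.1%}")
```

Under these assumed numbers, the cheaper tier calls for a much shorter interval and yields a much lower overhead, which is the kind of effect an autotuner choosing both the interval and the checkpoint location would aim to exploit. The sketch ignores restart cost and the lower reliability of node-local storage, both of which a realistic model would need to include.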