ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

Recommended citation: Tenghui Ma, Jihu Guo, Wei Gao* (corresponding author), Sitian Lu, Zhisheng Ye, Hanjing Wang, Dahua Lin. "ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism." ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), July 2026. [CCF-A]