IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters
Published:
Recommended citation: Wei Gao, Ouyang Zhuoyuan, Peng Sun, Yonggang Wen, Tianwei Zhang. "IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters." IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 36, No. 6, pp. 1071-1086, 2025. [CCF-A]
