Publications

(* Equal contribution / † Corresponding author)


RL Post-Training & AI Infrastructure

OSDI 26   Wei Gao*, Yuheng Zhao*, Tianyuan Wu*, Shaopan Xiong*, Weixun Wang*, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure
USENIX Symposium on Operating Systems Design and Implementation (OSDI), Seattle, July 2026.

OSDI 26   Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, Wei Wang
Weave: Efficient Co-Scheduling for Disaggregated RL Post-Training
USENIX Symposium on Operating Systems Design and Implementation (OSDI), Seattle, July 2026.

NSDI 26   Wei Gao*, Yuheng Zhao*, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
RollPacker: Taming Long-Tail Rollouts for RL Post-Training with Tail Batching
USENIX Symposium on Networked Systems Design and Implementation (NSDI), Renton, May 2026.


LLM Training: Fault Tolerance & Scheduling

HPDC 26   Tenghui Ma, Jihu Guo, Wei Gao†, Sitian Lu, Zhisheng Ye, Hanjing Wang, Dahua Lin
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), July 2026.

TC 24   Wei Gao, Zhisheng Ye, Peng Sun, Tianwei Zhang, Yonggang Wen
UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands
IEEE Transactions on Computers (TC), Vol. 73, No. 6, pp. 1500–1515, June 2024.

TPDS 22   Zhisheng Ye, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, Yingwei Luo
Astraea: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters
IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 33, No. 11, pp. 2781–2793, November 2022.

ICS 24   Wei Gao, Xu Zhang, Shan Huang, Shangwei Guo, Peng Sun, Yonggang Wen, Tianwei Zhang
AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads
ACM International Conference on Supercomputing (ICS), June 2024.

SoCC 21   Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, Tianwei Zhang
Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
ACM Symposium on Cloud Computing (SoCC), November 2021.


LLM Serving & Fine-tuning

MLSys 25   Wei Gao*, Xinyu Zhou*, Peng Sun, Yonggang Wen, Tianwei Zhang
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Annual Conference on Machine Learning and Systems (MLSys), May 2025.

TPDS 25   Wei Gao, Ouyang Zhuoyuan, Peng Sun, Yonggang Wen, Tianwei Zhang
IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters
IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 36, No. 6, pp. 1071–1086, 2025.

ICS 24   Wei Gao, Weiming Zhuang, Minghao Li, Peng Sun, Yonggang Wen, Tianwei Zhang
Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters
ACM International Conference on Supercomputing (ICS), June 2024.

SoCC 22   Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang
Titan: A Scheduler for Foundation Model Fine-tuning Workloads
ACM Symposium on Cloud Computing (SoCC), November 2022.


Technical Reports

arXiv   Weixun Wang, Shaopan Xiong, Wei Gao, et al.
Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
Technical Report, arXiv preprint, June 2025.

arXiv   Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, et al.
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
Technical Report, arXiv preprint, October 2025.

arXiv   Weixun Wang, Wei Gao, et al.
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Technical Report, arXiv preprint, December 2025.


AI Security

CVPR 21 Oral   Wei Gao, Shangwei Guo, Tianwei Zhang, Han Qiu, Yonggang Wen, Yang Liu
Privacy-preserving Collaborative Learning with Automatic Transformation Search
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Oral, June 2021.

TPAMI 23   Wei Gao, Xu Zhang, Shangwei Guo, Tianwei Zhang, Tao Xiang, Han Qiu, Yonggang Wen, Yang Liu
Automatic Transformation Search Against Deep Leakage from Gradients
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 45, No. 9, September 2023.


Survey

CSUR 24   Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
Deep Learning Workload Scheduling in GPU Datacenters: A Survey
ACM Computing Surveys (CSUR), 2024.