Page Not Found
Page not found. Your pixels are in another canvas.
A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.
About me
This is a page not in the main menu
Short description of portfolio item number 1
Short description of portfolio item number 2 
Published in 2020
We achieve as many as 120 stylization effects in a single model and show results on long-term videos that consist of thousands of frames.
Recommended citation: Wei Gao, Yijun Li, Yihang Yin, Ming-Hsuan Yang; WACV 2020.
Published in 2021
We propose an automatic transformation search framework for privacy-preserving collaborative learning, which defends against gradient leakage attacks by automatically finding input transformations that protect private data while preserving model utility.
Recommended citation: Wei Gao, Shangwei Guo, Tianwei Zhang, Han Qiu, Yonggang Wen, Yang Liu; IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Oral, June 2021. [CCF-A]
Published in 2021
We design a novel search method that automatically discovers qualified policies, which can significantly strengthen the protection of collaborative learning.
Recommended citation: Wei Gao, Shangwei Guo, Tianwei Zhang, Han Qiu, Yonggang Wen, Yang Liu; CVPR 2021 (oral).
Published in 2021
We present Chronus, an end-to-end scheduling system that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs.
Recommended citation: Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM Symposium on Cloud Computing (SoCC), November 2021. [CCF-B]
Published in 2022
We present Astraea, a fair deep learning scheduler for multi-tenant GPU clusters that maintains long-term fairness across training jobs with heterogeneous resource demands and durations, while maximizing overall cluster efficiency.
Recommended citation: Zhisheng Ye, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, Yingwei Luo; IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 33, No. 11, pp. 2781-2793, November 2022. [CCF-A]
Published in 2022
We present Titan, an elastic end-to-end scheduling system for foundation model fine-tuning workloads in GPU datacenters.
Recommended citation: Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM Symposium on Cloud Computing (SoCC), November 2022. [CCF-B]
Published in 2023
We first design two new metrics to quantify the impacts of transformations on data privacy and model usability. With the two metrics, we design a novel search method to automatically discover qualified policies from a given data augmentation library.
Recommended citation: Wei Gao, Xu Zhang, Shangwei Guo, Tianwei Zhang, Tao Xiang, Han Qiu, Yonggang Wen, Yang Liu; TPAMI 2023.
Published in 2023
We present an automatic transformation search method to defend against deep leakage from gradients (DLG) attacks in federated learning, systematically discovering input transformations that prevent gradient-based reconstruction of private training data.
Recommended citation: Wei Gao, Xu Zhang, Shangwei Guo, Tianwei Zhang, Tao Xiang, Han Qiu, Yonggang Wen, Yang Liu; IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 45, No. 9, September 2023. [CCF-A]
Published in 2024
This article surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers support each type of workload in terms of scheduling objectives and resource utilization. Finally, we discuss several promising future research directions, including emerging DL workloads, advanced scheduling decision making, and underlying hardware resources.
Recommended citation: Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen; ACM Computing Surveys (CSUR), 2024. [Non-CCF Top Survey]
Published in 2024
In this work, we present UniSched, a unified scheduler to optimize different types of scheduling objectives (e.g., guaranteeing the deadlines of SLO jobs, minimizing the latency of best-effort jobs). Meanwhile, UniSched supports different job stopping criteria (e.g., iteration-based, performance-based).
Recommended citation: Wei Gao, Zhisheng Ye, Peng Sun, Tianwei Zhang, Yonggang Wen; IEEE Transactions on Computers (TC), Vol. 73, No. 6, pp. 1500-1515, June 2024. [CCF-A]
Published in 2024
We design AutoSched, a framework that automatically, efficiently, and dynamically adjusts the configuration parameters of deep learning training (DLT) schedulers.
Recommended citation: Wei Gao, Xu Zhang, Shan Huang, Shangwei Guo, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM International Conference on Supercomputing (ICS), June 2024. [CCF-B]
Published in 2024
We propose Ymir, a scheduler that leverages the shared foundation model (FM) backbone architecture to expedite FM fine-tuning (FMF) workloads in GPU datacenters.
Recommended citation: Wei Gao, Weiming Zhuang, Minghao Li, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM International Conference on Supercomputing (ICS), June 2024. [CCF-B]
Published in 2025
We provide a comprehensive end-to-end evaluation of KV cache compression techniques for LLM serving, redefining the benefit metric from offline compression ratio to real serving gains including throughput, latency distribution, and per-token cost. We offer actionable guidelines for deploying compression-aware serving in production environments.
Recommended citation: Wei Gao*, Xinyu Zhou* (Co-first Authors), Peng Sun, Yonggang Wen, Tianwei Zhang; Annual Conference on Machine Learning and Systems (MLSys), May 2025. [Top Conference in MLSys Area]
Published in 2025
We propose IceFrog, a layer-elastic scheduling system that reduces GPU usage by selectively freezing specific layers of deep learning models during training, enabling overlapping of computation and communication phases to enhance cluster throughput and GPU resource efficiency.
Recommended citation: Wei Gao, Ouyang Zhuoyuan, Peng Sun, Yonggang Wen, Tianwei Zhang; IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 36, No. 6, pp. 1071-1086, 2025. [CCF-A]
Published in 2026
We propose RollPacker, a system that unifies orchestration of RL post-training components (Actor, Reward, Environment) with fine-grained sample lifecycle control. By batching long-tail samples across training rounds, RollPacker eliminates generation stalls and GPU bubbles, achieving stable end-to-end acceleration and reduced per-step training cost.
Recommended citation: Wei Gao*, Yuheng Zhao* (Co-first Authors), Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang; USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2026. [CCF-A]
Published in 2026
We present ResiHP, a fault-tolerant LLM training system that dynamically reconfigures hybrid parallelism strategies in response to hardware failures and resource fluctuations, using fine-grained anomaly detection, intelligent checkpoint recovery, and incremental rollback to guarantee training continuity and predictable job delivery.
Recommended citation: Tenghui Ma, Jihu Guo, Wei Gao* (Corresponding Author), Sitian Lu, Zhisheng Ye, Hanjing Wang, Dahua Lin; ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), July 2026. [CCF-A]
Published in 2026
We propose Weave, a co-scheduling system for disaggregated RL post-training that separates generation and training onto heterogeneous resource pools. Through multi-tenant cross-cluster co-scheduling with execution groups and native async pipeline orchestration at sample granularity, Weave reduces structural wait bubbles and resource mismatch, improving end-to-end throughput and cluster-level cost efficiency.
Recommended citation: Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, Wei Wang; USENIX Symposium on Operating Systems Design and Implementation (OSDI), July 2026. [CCF-A]
Published in 2026
We present RollArt, a disaggregated infrastructure for scaling agentic RL training. RollArt decouples generation, environment interaction, and policy update across heterogeneous resource pools, enabling fine-grained async orchestration and resource-affinity mapping that improve throughput and cost efficiency at 3000+ GPU scale.
Recommended citation: Wei Gao*, Yuheng Zhao*, Tianyuan Wu*, Shaopan Xiong*, Weixun Wang* (Co-first Authors), Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang; USENIX Symposium on Operating Systems Design and Implementation (OSDI), July 2026. [CCF-A]