We propose an automatic transformation search framework for privacy-preserving collaborative learning, which defends against gradient leakage attacks by automatically finding input transformations that protect private data while preserving model utility.

Recommended citation: Wei Gao, Shangwei Guo, Tianwei Zhang, Han Qiu, Yonggang Wen, Yang Liu; IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Oral, June 2021. [CCF-A]

Privacy-preserving collaborative learning with automatic transformation search

Published in CVPR, 2021

We design a novel search method that automatically discovers qualified transformation policies for protecting collaborative learning.

Recommended citation: Wei Gao, Shangwei Guo, Tianwei Zhang, Han Qiu, Yonggang Wen, Yang Liu; CVPR 2021 (oral).

CHRONUS: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs

Published in SoCC, 2021

We present Chronus, an end-to-end scheduling system that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs.

Recommended citation: Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM Symposium on Cloud Computing (SoCC), November 2021. [CCF-B]

Astraea: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters

Published in TPDS, 2022

We present Astraea, a fair deep learning scheduler for multi-tenant GPU clusters that maintains long-term fairness across training jobs with heterogeneous resource demands and durations, while maximizing overall cluster efficiency.

Recommended citation: Zhisheng Ye, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, Yingwei Luo; IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 33, No. 11, pp. 2781-2793, November 2022. [CCF-A]

Titan: A Scheduler for Foundation Model Fine-tuning Workloads

Published in SoCC, 2022

We present Titan, an elastic end-to-end scheduling system for foundation model fine-tuning workloads in GPU datacenters.

Recommended citation: Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM Symposium on Cloud Computing (SoCC), November 2022. [CCF-B]

Automatic Transformation Search Against Deep Leakage From Gradients

Published in TPAMI, 2023

We first design two new metrics to quantify the impacts of transformations on data privacy and model usability. With the two metrics, we design a novel search method to automatically discover qualified policies from a given data augmentation library.

Recommended citation: Wei Gao, Xu Zhang, Shangwei Guo, Tianwei Zhang, Tao Xiang, Han Qiu, Yonggang Wen, Yang Liu; TPAMI 2023.

Automatic Transformation Search Against Deep Leakage from Gradients

Published in TPAMI, 2023

We present an automatic transformation search method to defend against deep leakage from gradients (DLG) attacks in federated learning, systematically discovering input transformations that prevent gradient-based reconstruction of private training data.

Recommended citation: Wei Gao*, Xu Zhang* (Co-first Authors), Shangwei Guo, Tianwei Zhang, Tao Xiang, Han Qiu, Yonggang Wen, Yang Liu; IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 45, No. 9, September 2023. [CCF-A]

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Published in ACM Computing Surveys, 2024

This article surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of their scheduling objectives and resource utilization manner. Finally, we discuss several promising future research directions, including emerging DL workloads, advanced scheduling decision making, and underlying hardware resources.

Recommended citation: Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen; ACM Computing Surveys (CSUR), 2024. [Non-CCF Top Survey]

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

Published in IEEE Transactions on Computers, 2024

We present UniSched, a unified scheduler that optimizes different types of scheduling objectives (e.g., guaranteeing the deadlines of SLO jobs, minimizing the latency of best-effort jobs). UniSched also supports different job stopping criteria (e.g., iteration-based, performance-based).

Recommended citation: Wei Gao, Zhisheng Ye, Peng Sun, Tianwei Zhang, Yonggang Wen; IEEE Transactions on Computers (TC), Vol. 73, No. 6, pp. 1500-1515, June 2024. [CCF-A]

AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads

Published in ICS, 2024

We design AutoSched, a framework that automatically, efficiently, and dynamically adjusts the configuration parameters of deep learning training (DLT) schedulers.

Recommended citation: Wei Gao, Xu Zhang, Shan Huang, Shangwei Guo, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM International Conference on Supercomputing (ICS), June 2024. [CCF-B]

Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters

Published in ICS, 2024

We propose Ymir, a scheduler that leverages the shared foundation model (FM) backbone architecture to expedite FM fine-tuning (FMF) workloads in GPU datacenters.

Recommended citation: Wei Gao, Weiming Zhuang, Minghao Li, Peng Sun, Yonggang Wen, Tianwei Zhang; ACM International Conference on Supercomputing (ICS), June 2024. [CCF-B]

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Published in MLSys, 2025

We provide a comprehensive end-to-end evaluation of KV cache compression techniques for LLM serving, redefining the benefit metric from offline compression ratio to real serving gains including throughput, latency distribution, and per-token cost. We offer actionable guidelines for deploying compression-aware serving in production environments.

Recommended citation: Wei Gao*, Xinyu Zhou* (Co-first Authors), Peng Sun, Yonggang Wen, Tianwei Zhang; Annual Conference on Machine Learning and Systems (MLSys), May 2025. [Top Conference in MLSys Area]

IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters

Published in TPDS, 2025

We propose IceFrog, a layer-elastic scheduling system that reduces GPU usage by selectively freezing specific layers of deep learning models during training, enabling overlapping of computation and communication phases to enhance cluster throughput and GPU resource efficiency.

Recommended citation: Wei Gao, Ouyang Zhuoyuan, Peng Sun, Yonggang Wen, Tianwei Zhang; IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 36, No. 6, pp. 1071-1086, 2025. [CCF-A]

RollPacker: Taming Long-Tail Rollouts for RL Post-Training with Tail Batching

Published in NSDI, 2026

We propose RollPacker, a system that unifies orchestration of RL post-training components (Actor, Reward, Environment) with fine-grained sample lifecycle control. By batching long-tail samples across training rounds, RollPacker eliminates generation stalls and GPU bubbles, achieving stable end-to-end acceleration and reduced per-step training cost.

Recommended citation: Wei Gao*, Yuheng Zhao* (Co-first Authors), Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang; USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2026. [CCF-A]

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

Published in HPDC, 2026

We present ResiHP, a fault-tolerant LLM training system that dynamically reconfigures hybrid parallelism strategies in response to hardware failures and resource fluctuations, using fine-grained anomaly detection, intelligent checkpoint recovery, and incremental rollback to guarantee training continuity and predictable job delivery.

Recommended citation: Tenghui Ma, Jihu Guo, Wei Gao* (Corresponding Author), Sitian Lu, Zhisheng Ye, Hanjing Wang, Dahua Lin; ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), July 2026. [CCF-A]

Weave: Efficient Co-Scheduling for Disaggregated RL Post-Training

Published in OSDI, 2026

We propose Weave, a co-scheduling system for disaggregated RL post-training that separates generation and training onto heterogeneous resource pools. Through multi-tenant cross-cluster co-scheduling with execution groups and native async pipeline orchestration at sample granularity, Weave reduces structural wait bubbles and resource mismatch, improving end-to-end throughput and cluster-level cost efficiency.

Recommended citation: Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, Wei Wang; USENIX Symposium on Operating Systems Design and Implementation (OSDI), July 2026. [CCF-A]

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure

Published in OSDI, 2026

We present RollArt, a disaggregated infrastructure for scaling agentic RL training. RollArt decouples generation, environment interaction, and policy update across heterogeneous resource pools, enabling fine-grained async orchestration and resource-affinity mapping that improve throughput and cost efficiency at 3000+ GPU scale.

Recommended citation: Wei Gao*, Yuheng Zhao*, Tianyuan Wu*, Shaopan Xiong*, Weixun Wang* (Co-first Authors), Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang; USENIX Symposium on Operating Systems Design and Implementation (OSDI), July 2026. [CCF-A]
