To meet the ever-growing demand for problem-solving capability and generalizability in artificial intelligence, modern deep learning models are becoming larger and more sophisticated, at the cost of enormous computing resources (e.g., GPUs) and prolonged training time. It has thus become common practice to leverage large-scale GPU data centers (i.e., AI data centers) to optimize and accelerate model training and inference. However, managing and scheduling these deep learning workloads in GPU data centers presents numerous challenges, due to their high computational requirements, distinct and diverse runtime characteristics, and the heterogeneous nature of the underlying hardware.
In this talk, we will investigate deep learning workload scheduling and training acceleration over GPU data centers, with the multifold objectives of improving resource utilization, enhancing users' experience, and easing operators' management. Specifically, we will introduce novel and practical methodologies and system designs to achieve these goals. These solutions are highly integrated to tackle different challenges, paving the way for optimal utilization of GPU resources and accelerated progress in deep learning applications.