Data center scheduling for management of large loads
Constance Crozier – Georgia Tech
AI training is energy intensive due to its use of many GPUs in parallel. These training jobs have inherent flexibility: they sit in a queue for hours, or sometimes days, at a time. Classically, jobs are executed first-in-first-out with backfilling, where smaller jobs are slotted into the gaps that naturally arise in the schedule. Two jobs that use the same number of GPUs can have very different energy consumption, due to bottlenecks in memory access or in arithmetic operations. This talk will discuss the extent to which jobs could be rescheduled to reduce the data center's negative impact on the grid.
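
For readers unfamiliar with the baseline mentioned above, the sketch below illustrates first-in-first-out scheduling with (EASY-style) backfilling: the head of the queue is given a reservation, and later jobs may start early only if they fit in the currently free GPUs and finish before that reservation. This is an illustrative sketch only, not code from the talk; the Job fields, fifo_backfill function, and the use of estimated runtimes are assumptions for the example.

    from dataclasses import dataclass
    import heapq

    @dataclass
    class Job:
        name: str
        gpus: int        # GPUs required (assumed known)
        runtime: float   # estimated runtime in hours (assumed known)

    def fifo_backfill(jobs, total_gpus):
        """Return a list of (job name, start time) pairs under FIFO + backfilling."""
        queue = list(jobs)
        running = []          # min-heap of (finish_time, gpus_held)
        free = total_gpus
        t = 0.0
        schedule = []

        while queue:
            # Release GPUs from jobs that have finished by time t.
            while running and running[0][0] <= t:
                _, g = heapq.heappop(running)
                free += g

            head = queue[0]
            if head.gpus <= free:
                # Plain FIFO step: the head job fits, so start it now.
                queue.pop(0)
                free -= head.gpus
                heapq.heappush(running, (t + head.runtime, head.gpus))
                schedule.append((head.name, t))
                continue

            # Head does not fit: its reservation is the earliest time at
            # which enough running jobs will have released their GPUs.
            avail, reservation = free, t
            for finish, g in sorted(running):
                avail += g
                reservation = finish
                if avail >= head.gpus:
                    break

            # Backfill: start a later job if it fits now and will finish
            # before the reservation, so the head job is not delayed.
            backfilled = False
            for j in queue[1:]:
                if j.gpus <= free and t + j.runtime <= reservation:
                    queue.remove(j)
                    free -= j.gpus
                    heapq.heappush(running, (t + j.runtime, j.gpus))
                    schedule.append((j.name, t))
                    backfilled = True
                    break
            if not backfilled:
                # Nothing can be backfilled; advance to the next completion.
                t = running[0][0]

        return schedule

For example, with total_gpus = 8 and jobs A (4 GPUs, 4 h), B (8 GPUs, 2 h), C (2 GPUs, 3 h), job C is backfilled alongside A at t = 0 because it finishes before B's reservation at t = 4, so B still starts as early as it could have under plain FIFO. Note that the sketch schedules purely by GPU count and runtime; it does not model the per-job energy differences or grid impact that the talk addresses.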
