Large-scale HPC systems are shared by many users. Besides system efficiency, the goal of the scheduler is to serve users according to a scheduling policy. The fair-share algorithm strives to build schedules in which each user achieves a similar utilization rate. This method worked well when each user had just a few jobs. However, modern workloads are often composed of many jobs submitted at the same time (e.g. bag-of-tasks or job arrays in SLURM). For such workloads, fair-share is not optimal: users frequently have similar utilization metrics, and in such situations the schedule keeps switching between users, executing just a few jobs of each. It would be more efficient to assign the maximum number of resources to one user at a time.
OStrich is optimized for campaigns of jobs. OStrich maintains a virtual schedule that partitions resources between users' workloads according to pre-defined shares. The virtual schedule drives the real schedule, maintaining a trade-off between efficiency and fairness.
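The idea of a virtual schedule driving the real one can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the code of the SLURM module: it treats each campaign as a divisible amount of work, splits the processors between active campaigns in proportion to their shares, and then orders campaigns by the time they would finish in that virtual schedule. The names `Campaign`, `virtual_completion_times` and `real_order` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    user: str     # owner of the campaign
    work: float   # total work of the campaign, in processor-seconds
    share: float  # pre-defined share (relative weight) of the user


def virtual_completion_times(campaigns, processors):
    """Return {user: finish time} in a share-proportional virtual schedule.

    Assumption of this sketch: work is perfectly divisible, so each active
    campaign progresses at rate processors * share / (sum of active shares).
    """
    remaining = {c.user: c.work for c in campaigns}
    shares = {c.user: c.share for c in campaigns}
    now = 0.0
    finish = {}
    while remaining:
        total_share = sum(shares[u] for u in remaining)
        rates = {u: processors * shares[u] / total_share for u in remaining}
        # Time each active campaign still needs at its current rate.
        time_left = {u: remaining[u] / rates[u] for u in remaining}
        dt = min(time_left.values())
        now += dt
        for u in list(remaining):
            if time_left[u] <= dt * (1 + 1e-12):
                # This campaign completes at the current virtual time.
                finish[u] = now
                del remaining[u]
            else:
                remaining[u] -= rates[u] * dt
    return finish


def real_order(campaigns, processors):
    """Order users by virtual completion time, earliest first.

    A real scheduler following this order would give the machine to one
    campaign at a time instead of interleaving a few jobs of each user.
    """
    finish = virtual_completion_times(campaigns, processors)
    return sorted(finish, key=finish.get)


if __name__ == "__main__":
    demo = [Campaign("alice", 100.0, 1.0), Campaign("bob", 300.0, 1.0)]
    print(virtual_completion_times(demo, processors=10))
    print(real_order(demo, processors=10))
```

In this sketch the shares only influence the virtual schedule; the real schedule is free to run each campaign on all processors, which is where the trade-off between fairness (the virtual order) and efficiency (no switching between users) shows up.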
the SLURM implementation
A scheduling module for SLURM implementing OStrich.
slides from the SLURM User Group Meeting 2014 in Lugano, Switzerland.
the algorithm
slides from Google Tech Talk given by Krzysztof Rzadca in June 2013 (Warsaw, Poland).
Ostrich: Fair scheduling for multiple submissions (Joseph Emeras, Vinicius Pinheiro, Krzysztof Rzadca, and Denis Trystram): OStrich theory (published at PPAM 2013)