Grant publications
- Fanny Pascual and Krzysztof Rzadca.
Optimizing egalitarian performance when colocating tasks with types
for cloud data center resource management.
IEEE Transactions on Parallel and Distributed Systems,
30(11):2523-2535, November 2019.
[ bib | DOI | http ]
In data centers, up to dozens of tasks are colocated on a single physical machine. Machines are used more efficiently, but the performance of the tasks deteriorates, as the colocated tasks compete for shared resources. Since the tasks are heterogeneous, the resulting performance dependencies are complex. In our previous work [1], [2] we proposed a new combinatorial optimization model that uses two parameters of a task (its size and its type) to characterize how a task influences the performance of other tasks allocated to the same machine. In this paper, we study the egalitarian optimization goal: the aim is to optimize the performance of the worst-off task. This problem generalizes the classic makespan minimization on multiple processors (P||Cmax). We prove that polynomially-solvable variants of P||Cmax are NP-hard for this generalization, and that the problem is hard to approximate when the number of types is not constant. For a constant number of types, we propose a PTAS, a fast approximation algorithm, and a series of heuristics. We simulate the algorithms on instances derived from a trace of one of Google's clusters. Compared with baseline algorithms solving P||Cmax, our proposed type-aware algorithms lead to significantly better task performance. The notion of type enables us to extend standard combinatorial optimization methods to handle the degradation of performance caused by colocation. Types add a layer of additional complexity. However, our results (approximation algorithms and good average-case performance) show that types can be handled efficiently.
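To make the model concrete, here is a minimal Python sketch of a type-aware greedy heuristic in the spirit of the heuristics the paper evaluates; the influence matrix and the greedy rule are illustrative assumptions, not the paper's PTAS or its exact cost model.

    from collections import defaultdict

    # Assumed linear side-effects model: the cost perceived by a task of
    # type t is a weighted sum, over all types u, of the total size of
    # type-u tasks sharing its machine. The weights are invented.
    INFLUENCE = {
        ("cpu", "cpu"): 1.0, ("cpu", "io"): 0.2,
        ("io", "cpu"): 0.3,  ("io", "io"): 1.0,
    }

    def task_cost(task_type, loads):
        # Cost of one task, given per-type total sizes on its machine.
        return sum(INFLUENCE[(task_type, u)] * s for u, s in loads.items())

    def egalitarian_greedy(tasks, n_machines):
        # Type-aware analogue of LPT: place tasks, largest first, on the
        # machine where the worst-off task on that machine stays cheapest.
        machines = [defaultdict(float) for _ in range(n_machines)]
        assignment = {}
        for i, (size, typ) in sorted(enumerate(tasks), key=lambda x: -x[1][0]):
            best, best_worst = 0, float("inf")
            for m, loads in enumerate(machines):
                loads[typ] += size                       # tentative placement
                worst = max(task_cost(t, loads)
                            for t, s in loads.items() if s > 0)
                if worst < best_worst:
                    best, best_worst = m, worst
                loads[typ] -= size                       # undo
            machines[best][typ] += size
            assignment[i] = best
        return assignment

    tasks = [(4.0, "cpu"), (3.0, "io"), (2.0, "cpu"), (2.0, "io"), (1.0, "cpu")]
    print(egalitarian_greedy(tasks, n_machines=2))

With a single type and unit weights, the cost of every task on a machine equals the machine's load, so the sketch degenerates to the classic LPT heuristic for P||Cmax.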
Earlier publications
- Milosz Pacholczyk and Krzysztof Rzadca.
Fair non-monetary scheduling in federated clouds.
In CrossCloud'18: 5th Workshop on CrossCloud Infrastructures &
Platforms, EuroSys'18 Workshops, pages 3:1-3:6. ACM, 2018.
[ bib | DOI | http ]
In a hybrid cloud, individual cloud service providers (CSPs) often have an incentive to use each other's resources to off-load peak loads or to place load closer to the end user. However, CSPs have to keep track of contributions and gains in order to disincentivize long-term free-riding. We present CloudShare, a distributed version of the load balancing algorithm DirectCloud, which is based on the Shapley value, a powerful fairness concept from game theory. CloudShare coordinates CSPs through a ZooKeeper-based coordination layer; each CSP runs a broker that interacts with local resources (such as Kubernetes-managed clusters). We quantitatively evaluate our implementation by simulation. The results confirm that CloudShare generates, on average, fairer schedules than the popular FairShare algorithm. We believe our results show a viable alternative to monetary methods based on, e.g., spot markets.
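The abstract does not spell out CloudShare's internals, but the Shapley value itself is standard. The following Python sketch computes exact Shapley values for a small, hypothetical coalition of CSPs; the characteristic function served_load and the capacity/demand figures are invented for illustration.

    from itertools import permutations
    from math import factorial

    def shapley_values(players, v):
        # Exact Shapley values: average each player's marginal contribution
        # v(S | {p}) - v(S) over all join orders. Exponential in the number
        # of players -- fine for a handful of CSPs, illustrative only.
        phi = {p: 0.0 for p in players}
        for order in permutations(players):
            coalition = frozenset()
            for p in order:
                phi[p] += v(coalition | {p}) - v(coalition)
                coalition = coalition | {p}
        n_orders = factorial(len(players))
        return {p: total / n_orders for p, total in phi.items()}

    # Hypothetical characteristic function: a coalition of CSPs serves as
    # much of its pooled demand as its pooled capacity allows.
    CAPACITY = {"csp_a": 10, "csp_b": 6, "csp_c": 4}
    DEMAND = {"csp_a": 4, "csp_b": 9, "csp_c": 5}

    def served_load(coalition):
        return min(sum(CAPACITY[p] for p in coalition),
                   sum(DEMAND[p] for p in coalition))

    print(shapley_values(list(CAPACITY), served_load))

Each CSP's Shapley value is then a principled measure of its contribution, which a scheduler can track over time to deter free-riding.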
- Fanny Pascual, Krzysztof Rzadca, and Piotr Skowron.
Collective schedules: Scheduling meets computational social choice.
In AAMAS'18, pages 667-675. IFAAMAS, 2018.
[ bib | http ]
When scheduling public works or events in a shared facility, one needs to accommodate the preferences of a population. We formalize this problem by introducing the notion of a collective schedule. We show how to extend fundamental tools from social choice theory (positional scoring rules, the Kemeny rule, and the Condorcet principle) to collective scheduling. We study the computational complexity of finding collective schedules. We also experimentally demonstrate that optimal collective schedules can be found for instances with realistic sizes.
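As a rough illustration of how a scoring rule might account for job lengths, here is a Python sketch of one natural length-weighted extension of Borda; it is an assumption for illustration, not necessarily the exact rule analyzed in the paper.

    def collective_schedule(lengths, preferences):
        # Each voter awards a job a score equal to the total length of the
        # jobs that voter ranks below it; jobs are then scheduled in order
        # of decreasing total score.
        score = {j: 0.0 for j in lengths}
        for ranking in preferences:              # ranking: best job first
            remaining = sum(lengths[j] for j in ranking)
            for j in ranking:
                remaining -= lengths[j]
                score[j] += remaining            # total length ranked below j
        return sorted(lengths, key=lambda j: -score[j])

    lengths = {"roadworks": 5, "festival": 1, "bridge": 3}
    preferences = [
        ["festival", "roadworks", "bridge"],
        ["roadworks", "bridge", "festival"],
        ["festival", "bridge", "roadworks"],
    ]
    print(collective_schedule(lengths, preferences))

When all jobs have unit length, this rule reduces to the classic Borda count.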
- Fanny Pascual and Krzysztof Rzadca.
Colocating tasks in data centers using a side-effects performance
model.
European Journal of Operational Research, 268(2):450-462,
2018.
[ bib | DOI ]
In data centers, many tasks (services, virtual machines or computational jobs) share a single physical machine. We explore a new resource management model for such colocation. Our model uses two parameters of a task (its size and its type) to characterize how a task influences the performance of the other tasks allocated on the same machine. As a data center typically hosts many similar, recurring tasks (e.g. a webserver, a database, a CPU-intensive computation), the resource manager should be able to construct these types and their performance interactions. In particular, we minimize the total cost in a model in which each task's cost is a function of the total sizes of the tasks allocated on the same machine (each type is counted separately). We show that for a linear cost function the problem is strongly NP-hard, but polynomially-solvable in some particular cases. We propose an algorithm polynomial in the number of tasks (but exponential in the number of types and machines) and another algorithm polynomial in the number of tasks and machines (but exponential in the number of types and admissible sizes of tasks). We also propose a polynomial-time approximation algorithm, and, in the case of a single type, a polynomial-time exact algorithm. For convex costs, we prove that, even for a single type, the problem becomes NP-hard, and we propose an approximation algorithm. We experimentally verify our algorithms on instances derived from a real-world data center trace. While the exact algorithms are infeasible for large instances, the approximations and heuristics deliver reasonable performance.
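A minimal Python sketch of the linear variant of this cost model, computing the total cost of a given allocation; the influence weights are invented for illustration, as the abstract does not fix concrete values.

    from collections import defaultdict

    def total_cost(allocation, influence):
        # Each task pays a weighted sum, over types, of the total size of
        # each type colocated with it (itself included); the total cost is
        # the sum over all tasks.
        per_machine = defaultdict(lambda: defaultdict(float))
        for (size, typ), machine in allocation:
            per_machine[machine][typ] += size
        return sum(influence[(typ, u)] * s
                   for (_, typ), machine in allocation
                   for u, s in per_machine[machine].items())

    influence = {("cpu", "cpu"): 1.0, ("cpu", "io"): 0.2,
                 ("io", "cpu"): 0.3,  ("io", "io"): 1.0}
    allocation = [((4.0, "cpu"), 0), ((3.0, "io"), 0), ((2.0, "cpu"), 1)]
    print(total_cost(allocation, influence))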
- Pawel Janus and Krzysztof Rzadca.
SLO-aware colocation of data center tasks based on instantaneous
processor requirements.
In ACM SoCC'17, ACM Symposium on Cloud Computing, pages
256-268. ACM, 2017.
[ bib | DOI | http | .pdf ]
In a cloud data center, a single physical machine simultaneously executes dozens of highly heterogeneous tasks. Such colocation results in more efficient utilization of machines, but, when tasks' requirements exceed the available resources, some of the tasks might be throttled down or preempted. We analyze version 2.1 of the Google cluster trace, which records short-term (1-second) task CPU usage. Contrary to the assumptions made in many theoretical studies, we demonstrate that the empirical distributions do not follow any single distribution. However, high percentiles of the total processor usage (summed over at least 10 tasks) can be reasonably estimated by the Gaussian distribution. We use this result to build a probabilistic fit test, called the Gaussian Percentile Approximation (GPA), for standard bin-packing algorithms. To check whether a new task will fit into a machine, GPA checks whether the resulting distribution's percentile corresponding to the requested service level objective (SLO) is still below the machine's capacity. In our simulation experiments, GPA resulted in colocations exceeding the machines' capacity with a frequency similar to the requested SLO.
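The GPA test is simple enough to sketch directly from the description above. In the following Python version, the per-task means and variances, and the independence assumption behind summing them, are our illustrative assumptions.

    from statistics import NormalDist

    def gpa_fits(running, candidate, capacity, slo=0.99):
        # Approximate the machine's total CPU usage by a Gaussian with the
        # summed means and variances of the colocated tasks (independence
        # assumed), and admit the candidate only if the SLO percentile of
        # the resulting distribution stays within the machine's capacity.
        tasks = running + [candidate]            # (mean, variance) pairs
        mu = sum(m for m, _ in tasks)
        var = sum(v for _, v in tasks)
        return NormalDist(mu, var ** 0.5).inv_cdf(slo) <= capacity

    running = [(0.10, 0.004), (0.25, 0.010), (0.05, 0.001)]
    print(gpa_fits(running, candidate=(0.30, 0.020), capacity=1.0, slo=0.99))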
- Fanny Pascual and Krzysztof Rzadca.
Optimizing egalitarian performance in the side-effects model of
colocation for data center resource management.
In Euro-Par 2017 Proceedings, volume 10417 of LNCS, pages
206-219. Springer, 2017.
[ bib | DOI | http | .pdf ]
In data centers, up to dozens of tasks are colocated on a single physical machine. Machines are used more efficiently, but tasks' performance deteriorates, as colocated tasks compete for shared resources. As tasks are heterogeneous, the resulting performance dependencies are complex. In our previous work [18] we proposed a new combinatorial optimization model that uses two parameters of a task (its size and its type) to characterize how a task influences the performance of other tasks allocated to the same machine.
In this paper, we study the egalitarian optimization goal: maximizing the worst-off performance. This problem generalizes the classic makespan minimization on multiple processors (P||Cmax). We prove that polynomially-solvable variants of multiprocessor scheduling become NP-hard in this model, and that the problem is hard to approximate when the number of types is not constant. For a constant number of types, we propose a PTAS, a fast approximation algorithm, and a series of heuristics. We simulate the algorithms on instances derived from a trace of one of Google's clusters. Algorithms aware of jobs' types lead to better performance than algorithms solving P||Cmax.
The notion of type enables us to extend standard combinatorial optimization methods to handle the degradation of performance caused by colocation. Types add a layer of additional complexity. However, our results (approximation algorithms and good average-case performance) show that types can be handled efficiently.