School of Computer Science and Engineering, Southeast University, Nanjing 211189, China.
SING Group, Hong Kong University of Science and Technology, Hong Kong 999077, China.
Abstract
Cloud data centers, such as Amazon EC2, host myriad big data applications using Virtual Machines (VMs). As these applications are communication-intensive, optimizing network transfer between VMs is critical to the performance of these applications and to the network utilization of data centers. Previous studies have addressed this issue by scheduling network flows with coflow semantics or by optimizing VM placement with traffic considerations. However, coflow scheduling and VM placement have been conducted orthogonally. In fact, these two mechanisms are mutually dependent, and optimizing these two complementary degrees of freedom independently turns out to be suboptimal. In this paper, we present VirtCO, a practical framework that jointly schedules coflows and places VMs ahead of VM launch to optimize the overall performance of data center applications. We model the joint coflow scheduling and VM placement optimization problem, and propose effective heuristics for solving it. We further implement VirtCO with OpenStack and deploy it in a testbed environment. Extensive evaluation on real-world traces shows that, compared with state-of-the-art solutions, VirtCO reduces the average coflow completion time by up to 36.5%. This new framework is also compatible with and readily deployable within existing data center architectures.
Fig. 1 A motivating example. Two color-coded coflows reside in three VMs. The size of each VM is 1 slot, and each PM has 2 slots. One coflow has three flows and the other has two. Panels (b)-(d) show the VM placement and the scheduling on the egress ports: (b) the baseline, in which no VMs are co-resident and the coflow with the largest bottleneck is scheduled first; (c) the result when the VMs with the largest mutual communication are placed together, with coflow scheduling conducted independently; (d) the optimal case.
Fig. 2 Architecture and implementation of VirtCO.
4.1 Global coflow scheduling and VM placement
We begin by modeling the joint coflow scheduling and VM placement problem to optimize the average CCT over all coflows. For the virtualization environment, we define $H$ and $h_j$ as the set of physical hosts (PMs) and the $j$-th physical host within the data center, respectively. Applications request resources in the form of VMs. For each application, $V$ and $v_m$ denote the set of VMs and the $m$-th VM of the application, respectively. The VM placement is represented by a $|V|\times|H|$ matrix $X$, where $x_{mj}$ denotes whether $v_m$ is hosted by $h_j$: specifically, $x_{mj}=1$ if $v_m$ is hosted by $h_j$; otherwise, $x_{mj}=0$.
The coflow information associated with an application is characterized by a traffic matrix $D$, where $d_{mn}$ denotes the size of the flow to be transferred from $v_m$ to $v_n$. In the example depicted in Fig. 1, the traffic matrix of each coflow is built from its flow sizes in exactly this way. All incomplete coflows are included in the collection $\mathcal{C}$. To capture the network resource constraints, we define $B_j$ as the residual bandwidth on PM $h_j$; $B_j$ consists of the egress and ingress bandwidth, denoted by $B_j^{\mathrm{out}}$ and $B_j^{\mathrm{in}}$, respectively. $R_k$ denotes the rate allocated to coflow $C_k$, which is composed of the egress and ingress rates allocated to $C_k$ on each $h_j$, denoted as $r_{kj}^{\mathrm{out}}$ and $r_{kj}^{\mathrm{in}}$. The VM size, denoted as $s_m$, is expressed in slots, and the number of available slots in $h_j$ is $a_j$. Under any placement, $T_k$ denotes the CCT of coflow $C_k$.
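To make these definitions concrete, the following sketch instantiates the quantities for a small setting like the one in Fig. 1 (three VMs on two PMs); the flow sizes and bandwidths are illustrative values, not the paper's.

```python
import numpy as np

# Illustrative instantiation for a 3-VM application on 2 PMs (sizes made up).
H = 2                                   # number of physical hosts
V = 3                                   # number of VMs for the application
X = np.array([[1, 0],                   # x[m, j] = 1 iff VM m runs on host j
              [1, 0],
              [0, 1]])
D = np.array([[0, 2, 4],                # d[m, n] = bytes VM m sends to VM n
              [0, 0, 3],
              [0, 0, 0]])
B_out = np.array([10.0, 10.0])          # residual egress bandwidth per host
B_in = np.array([10.0, 10.0])           # residual ingress bandwidth per host
slots = np.array([2, 2])                # available slots per host (VM size = 1 slot)
```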
To optimize the average CCT, VirtCO leverages a minimum-remaining-time-first heuristic: the coflow in $\mathcal{C}$ with the minimum remaining CCT is scheduled with the highest priority. We describe the main framework of VirtCO in Algorithm 1. The algorithm is invoked whenever a new application enters the data center. Specifically, when a new application associated with a coflow arrives, the algorithm is triggered to determine its VM placement, scheduling priority, and bandwidth allocation (allowing preemption). Line 4 of Algorithm 1 invokes another algorithm to compute the minimum CCT for the arriving coflow; Section 4.2 details how a single CCT is minimized through VM placement and network scheduling. Among all incomplete coflows, the algorithm sets the priorities according to their remaining completion time in ascending order (lines 7-10). To avoid starvation, the algorithm checks the waiting time of the coflows in $\mathcal{C}$ and uses a threshold $\Delta$ to ensure that no coflow starves for an arbitrarily long period (lines 11-14). $\Delta$ is usually on the order of minutes and is determined empirically by cloud providers.
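As a rough sketch of this scheduling loop (not the paper's exact Algorithm 1), the following Python function orders incomplete coflows by their estimated remaining CCT and promotes any coflow whose waiting time exceeds the threshold $\Delta$; all field names are illustrative.

```python
import time

def reprioritize(coflows, delta):
    """Assign scheduling priorities to incomplete coflows.

    coflows : list of dicts with keys
              'remaining_cct' - estimated remaining completion time (s)
              'arrival'       - arrival timestamp (s)
    delta   : starvation threshold (s); coflows waiting longer than this
              are promoted to the head of the schedule.
    Returns the coflows in scheduling order (highest priority first).
    """
    now = time.time()
    starving = [c for c in coflows if now - c['arrival'] > delta]
    normal = [c for c in coflows if now - c['arrival'] <= delta]
    # Starving coflows first (oldest first), then minimum-remaining-time-first.
    starving.sort(key=lambda c: c['arrival'])
    normal.sort(key=lambda c: c['remaining_cct'])
    return starving + normal
```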
4.2 Minimizing the completion time of a single coflow
The problem of minimizing the CCT of a single coflow in the cloud environment is formally stated as follows: given the estimated traffic matrix $D$ of a coflow $C_k$, find a feasible placement $X$ of all its VMs onto PMs and allocate feasible bandwidth $r_{kj}^{\mathrm{out}}$ and $r_{kj}^{\mathrm{in}}$ to $C_k$ on each $h_j$ so as to minimize its completion time $T_k$. Although single-flow optimization heuristics would allocate the entire bandwidth of a link to the scheduled flow, a desirable property of a coflow is that completing any flow faster than the bottleneck flow does not improve the CCT. Therefore, the minimum completion time of a coflow is attained as long as all flows finish together with the bottleneck flow; that is, all flows in $C_k$ should finish at the CCT $T_k$.
Note that we enforce scheduling in the hypervisor layer at coflow granularity. Under placement $X$, the amount of data that coflow $C_k$ sends from $h_j$ is $S_j^{\mathrm{out}}$, and the amount it receives at $h_j$ is $S_j^{\mathrm{in}}$. $S_j^{\mathrm{out}}$ and $S_j^{\mathrm{in}}$ are computed by summing, over all VM pairs, the flow sizes $d_{mn}$ whose source VM (respectively destination VM) is placed on $h_j$.
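A minimal sketch of this aggregation under a fixed placement, together with the bottleneck-based minimum CCT it implies; treating traffic between co-resident VMs as staying inside the host is an assumption of this sketch.

```python
import numpy as np

def per_host_traffic(D, placement, n_hosts):
    """Aggregate one coflow's traffic per host under a given placement.

    D         : |V| x |V| numpy traffic matrix, D[m, n] = bytes VM m -> VM n
    placement : length-|V| array, placement[m] = host index of VM m
    Traffic between co-resident VMs stays inside the host and is not
    counted here -- an assumption of this sketch.
    """
    S_out, S_in = np.zeros(n_hosts), np.zeros(n_hosts)
    for m in range(D.shape[0]):
        for n in range(D.shape[1]):
            if D[m, n] > 0 and placement[m] != placement[n]:
                S_out[placement[m]] += D[m, n]
                S_in[placement[n]] += D[m, n]
    return S_out, S_in

def bottleneck_cct(S_out, S_in, B_out, B_in):
    """Minimum CCT under this placement: the most loaded port is the
    bottleneck, and no flow needs to finish before it (bandwidths > 0)."""
    return float(np.maximum(S_out / B_out, S_in / B_in).max())
```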
Corollary 1 Having allocated the bandwidth $r_{kj}^{\mathrm{out}}$ and $r_{kj}^{\mathrm{in}}$ to $C_k$ on one PM $h_j$, we can attain the minimum CCT through weighted fair sharing among the individual flows of $C_k$ on $h_j$, with each flow weighted by its fraction of the aggregate coflow traffic on that port (e.g., $d_{mn}/S_j^{\mathrm{out}}$ on the egress side).
Proof For notational simplicity, let $h_s$ denote the sender host and $h_r$ the receiver host. As all flows in a coflow complete at the same time, we assume that every flow finishes at time $T$. Then the egress and ingress bandwidth allocated to the two PMs equal $S_s^{\mathrm{out}}/T$ and $S_r^{\mathrm{in}}/T$, respectively. As the data sent must equal the data received, the aggregate egress traffic equals the aggregate ingress traffic. Let $F_{sr}$ denote the aggregate flow sent from $h_s$ to $h_r$, so that $S_s^{\mathrm{out}}=\sum_r F_{sr}$. Let $b_{sr}$ be the bandwidth shared by $F_{sr}$; setting $b_{sr}=F_{sr}/T$ and summing over all receivers shows that the total allocation on $h_s$ is exactly $S_s^{\mathrm{out}}/T$. Thus, with $b_{sr}$ as the bandwidth shared by all flows sent from $h_s$ to $h_r$, the data transfer finishes exactly at $T$. Therefore, to achieve the optimal time $T$, each $F_{sr}$ shares the egress bandwidth of $h_s$ with proportion $F_{sr}/S_s^{\mathrm{out}}$. Furthermore, as $F_{sr}$ aggregates all flows $d_{mn}$ from VMs on $h_s$ to VMs on $h_r$, each individual flow shares the bandwidth with the proportion $d_{mn}/S_s^{\mathrm{out}}$.
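The corollary translates into a simple rate split per hypervisor port. A minimal sketch on the egress side, assuming the coflow's egress rate on the host has already been allocated (names are illustrative):

```python
def flow_rates_on_egress(flows, r_out):
    """Split a coflow's egress rate on one hypervisor among its flows.

    flows : dict mapping (src_vm, dst_vm) -> remaining bytes on this host
    r_out : egress rate allocated to the coflow on this host
    Each flow gets r_out weighted by its share of the aggregate traffic,
    so all flows on this port finish together (Corollary 1).
    """
    total = sum(flows.values())
    if total == 0:
        return {f: 0.0 for f in flows}
    return {f: r_out * size / total for f, size in flows.items()}
```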
Corollary 1 indicates that enforcing scheduling at the granularity of coflows and hypervisors is theoretically sound: this control granularity attains the same performance as the finer granularity of individual flows and VMs. The design requires fewer rate limiters than other implementations and reduces the computational load of the core algorithms.
Given the above definitions, we formulate the problem of minimizing $T_k$ as problem (3): minimize the completion time $T_k$ subject to constraints (3a)-(3d). Constraint (3a) enforces that the completion time of all flows equals the CCT; with the minimum bandwidth allocated, the residual bandwidth is released to admit other coflows. Constraints (3c) and (3d) represent the placement constraints. However, problem (3) is difficult to solve directly because it is a nonlinear program with binary integer variables. To facilitate the solution, we define $\tau_k = 1/T_k$ and work with this reciprocal; problem (3) can then be rewritten as problem (4).
The boundary conditions of problem (4) carry over from the constraints of problem (3). We then introduce two auxiliary variables that substitute for the nonlinear product terms in problem (4); with this substitution, the problem is transformed into problem (5), which inherits the corresponding boundary conditions.
Problem (5) is an LP with a limited number of variables and constraints that can be solved in a timely manner. We analyze the computational complexity of the algorithm from two perspectives: bandwidth allocation and VM placement. For bandwidth allocation, as we only conduct rate limiting on each hypervisor, the number of variables is limited to the number of physical hosts $|H|$. For VM placement, each iteration of the computation generates the VM placement for one application, and the number of variables is bounded by $|V|\cdot|H|$. The computational overhead of running the LP is therefore bounded by $|H|$ and $|V|\cdot|H|$. In practice, the set of hosts available for hosting an application is usually limited to several adjacent racks, i.e., approximately dozens of hosts, and the number of VMs requested by an application is also very limited. Therefore, this small-scale optimization can be computed efficiently and with little overhead.
After the auxiliary variables are obtained from the LP solution, the placement variables $x_{mj}$ and the allocated rates $r_{kj}^{\mathrm{out}}$ and $r_{kj}^{\mathrm{in}}$ can be recovered by reversing the substitution.
In the previous computation, since we relax the binary integer variables to fractional ones, the solution may contain placement values $x_{mj}$ that are fractional. A fractional $x_{mj}$ means that VM $v_m$ would have to be split and placed on several PMs, which is not practicable. Thus, we resort to rounding: $v_m$ is placed on the PM $h_j$ with the largest fractional value, for which we set $x_{mj}=1$ (and the remaining entries of its row to 0). In summary, the rationale of our algorithm is to integrate coflow scheduling and VM placement: scheduling pursues the optimal CCT, and VM placement cooperates to approximate this goal. The heuristic for pursuing a minimal CCT is summarized in Algorithm 2.
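A minimal sketch of this rounding step, assuming the fractional placement is available as a numpy array (slot-capacity checks are omitted here):

```python
import numpy as np

def round_placement(x_frac):
    """Round a fractional placement to an integral one: each VM goes to the
    host with its largest fractional value (slot-capacity checks omitted)."""
    V, H = x_frac.shape
    x = np.zeros((V, H), dtype=int)
    x[np.arange(V), x_frac.argmax(axis=1)] = 1
    return x
```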
Finally, we analyze the performance of Algorithm 2 using the standard notion of the approximation ratio. The approximation ratio of Algorithm 2 is defined as the supremum of $T_k/T_k^{*}$, where $T_k^{*}$ is the minimum CCT of $C_k$ and $T_k$ is the CCT obtained by our algorithm.
Theorem 1 Algorithm 2 has an approximation ratio of 2, that is, $T_k \le 2\,T_k^{*}$.
Proof To prove this theorem, it suffices to show the equivalent proposition $\tau_k \ge \tau_k^{*}/2$, where $\tau_k$ and $\tau_k^{*}$ are the inverses of $T_k$ and $T_k^{*}$, respectively. Let $\tau^{\mathrm{LP}}$ denote the optimal objective of problem (5). Problem (5) is the relaxation of the binary constraints in problem (4), and the feasible region of the linear formulation sets an upper bound $\tau^{\mathrm{LP}} \ge \tau_k^{*}$.
Each multiplier $x_{mj}$ is mathematically equivalent to a graph cut in which VMs and hosts are vertices and the cut separates VM $v_m$ from the hosts other than $h_j$. The physical meaning of this cut is that we place $v_m$ on $h_j$ with proportion $x_{mj}$. In our heuristic, we pick, for each VM, only the $x_{mj}$ with the largest value and discard the rest, and we compare the weight of each selected cut with the weight of the corresponding cut in the optimal solution. Together, the selected cuts separate every VM from the hosts it is not assigned to, which is exactly the final placement.
Note that each cut is incident to two of these components, so each edge appears in two of the cuts and the total cut weight counts every edge at most twice. Therefore, the placement produced by the rounding is within a factor of two of the optimum, which gives $\tau_k \ge \tau_k^{*}/2$, i.e., $T_k \le 2\,T_k^{*}$.
The analytical result provides a worst-case guarantee for the algorithm. The derived bound is loose in theory, but in practice the algorithm is effective in improving the CCT of real applications in the cloud environment.
Fig. 3 Effectiveness of VirtCO in improving single CCT.
Fig. 4 Effectiveness of VirtCO in improving average CCT.
Fig. 5 Overhead introduced by VirtCO.
Fig. 6 Impact of coflow width.
Fig. 7 Impact of coflow length.
Fig. 8 Impact of coflow skew.