College of Computer Science, Chongqing University, Chongqing 400044, China
National Experimental Teaching Demonstration Center, Chongqing University, Chongqing 400044, China
College of Big Data and Software Engineering, Chongqing University, Chongqing 400044, China
School of Civil Engineering, The University of Sydney, Sydney 2006, Australia
Show Author Information
Hide Author Information
Abstract
When deploying workflows in cloud environments, the use of Spot Instances (SIs) is intriguing as they are much cheaper than on-demand ones. However, SIs are volatile and may be revoked at any time, which results in a more challenging scheduling problem involving execution interruption and hence hinders the successful handling of conventional cloud workflow scheduling techniques. Although some scheduling methods for SIs have been proposed, most of them are no more applicable to the latest SIs, as they have evolved by eliminating bidding and simplifying the pricing model. This study focuses on how to minimize the execution cost with a deadline constraint when deploying a workflow on volatile SIs in cloud environments. Based on Monte Carlo simulation and list scheduling, a stochastic scheduling method called MCLS is devised to optimize a utility function introduced for this problem. With the Monte Carlo simulation framework, MCLS employs sampled task execution time to build solutions via deadline distribution and list scheduling, and then returns the most robust solution from all the candidates with a specific evaluation mechanism and selection criteria. Experimental results show that the performance of MCLS is more competitive compared with traditional algorithms.
No abstract is available for this article. Click the button above to view the PDF directly.
References
[1]
J.Sahni and D. P.Vidyarthi, A cost-effective deadline-constrained dynamic scheduling algorithm for scientific workflows in a cloud environment, IEEE Trans. Cloud Comput., vol. 6, no. 1, pp. 2–18, 2018.
B.Javadi, R. K.Thulasiram, and R.Buyya, Characterizing spot price dynamics in public cloud environments, Future Gener. Comput. Syst., vol. 29, no. 4, pp. 988–999, 2013.
M.Adhikari, T.Amgoth, and S. N.Srirama, A survey on scheduling strategies for workflows in cloud environment and emerging trends, ACM Comput. Surv., vol. 52, no. 4, p. 68, 2019.
D. W.Wei, H. S.Ning, F. F.Shi, Y. L.Wan, J. B.Xu, S. K.Yang, and L.Zhu, Dataflow management in the internet of things: Sensing, control, and security, Tsinghua Science and Technology, vol. 26, no. 6, pp. 918–930, 2021.
S.Abrishami, M.Naghibzadeh, and D. H. J.Epema, Deadline-constrained workflow scheduling algorithms for infrastructure as a service clouds, Future Gener. Comput. Syst., vol. 29, no. 1, pp. 158–169, 2013.
L. W.Yang, L. J.Ye, Y. Q.Xia, and Y. F.Zhan, Look-ahead workflow scheduling with width changing trend in clouds, Future Gener. Comput. Syst., vol. 139, pp. 139–150, 2023.
M. A.Rodriguez and R.Buyya, Deadline based resource provisioningand scheduling algorithm for scientific workflows on clouds, IEEE Trans. Cloud Comput., vol. 2, no. 2, pp. 222–235, 2014.
H. R.Faragardi, M. R. S.Sedghpour, S.Fazliahmadi, T.Fahringer, and N.Rasouli, GRP-HEFT: A budget-constrained resource provisioning scheme for workflow scheduling in IaaS clouds, IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 6, pp. 1239–1254, 2020.
H.Topcuoglu, S.Hariri, and M. Y.Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. Syst., vol. 13, no. 3, pp. 260–274, 2002.
R.Ghafouri, A.Movaghar, and M.Mohsenzadeh, A budget constrained scheduling algorithm for executing workflow application in infrastructure as a service clouds, Peer-to-Peer Netw. Appl., vol. 12, no. 1, pp. 241–268, 2019.
Q. W.Wu, M. C.Zhou, Q. S.Zhu, Y. N.Xia, and J. H.Wen, MOELS: Multiobjective evolutionary list scheduling for cloud workflows, IEEE Trans. Autom. Sci. Eng., vol. 17, no. 1, pp. 166–176, 2020.
X. M.Zhou, G. X.Zhang, J.Sun, J. L.Zhou, T. Q.Wei, and S. Y.Hu, Minimizing cost and makespan for workflow scheduling in cloud using fuzzy dominance sort based HEFT, Future Gener. Comput. Syst., vol. 93, pp. 278–289, 2019.
G. J.Portella, G. N.Rodrigues, E. Y.Nakano, A.Boukerche, and A. C. M.Melo, A novel statistical and neural network combined approach for the cloud spot market, IEEE Trans. Cloud Comput., .
J.Li, Y. M.Zhu, J. D.Yu, C. N.Long, G. T.Xue, and S. Y.Qian, Online auction for IaaS clouds: Towards elastic user demands and weighted heterogeneous VMs, IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 9, pp. 2075–2089, 2018.
W.Voorsluys and R.Buyya, Reliable provisioning of spot instances for compute-intensive applications, in Proc. 2012 IEEE 26th Int. Conf. Advanced Information Networking and Applications, Fukuoka, Japan, 2012, pp. 542–549.
X.He, P.Shenoy, R.Sitaraman, and D.Irwin, Cutting the cost of hosting online services using cloud spot markets, in Proc. 24th Int. Symp. High-Performance Parallel and Distributed Computing, Portland, OR, USA, 2015, pp. 207–218.
S.Yang, S.Khuller, S.Choudhary, S.Mitra, and K.Mahadik, Scheduling ML training on unreliable spot instances, in Proc. 14th IEEE/ACM Int. Conf. Utility and Cloud Computing Companion, Leicester, UK, 2021, p. 29.
L.Teylo, A. L.Nunes, A. C. M. A.Melo, C.Boeres, L. M. de A.Drummond, and N. F.Martins, Comparing SARS-CoV-2 sequences using a commercial cloud with a spot instance based dynamic scheduler, in Proc. 2021 IEEE/ACM 21st Int. Symp. Cluster, Cloud and Internet Computing (CCGrid), Melbourne, Australia, 2021, pp. 247–256.
F.Xu, H. Y.Zheng, H.Jiang, W. J.Shao, H. K.Liu, and Z.Zhou, Cost-effective cloud server provisioning for predictable performance of big data analytics, IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 5, pp. 1036–1051, 2019.
S. J.Cao, K. F.Deng, K. J.Ren, X. Y.Li, T. F.Nie, and J. Q.Song, An optimizing algorithm for deadline constrained scheduling of scientific workflows in IaaS clouds using spot instances, in Proc. 2019 IEEE Int. Conf. Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom), Xiamen, China, 2019, pp. 1421–1428.
R. G.Martinez, A.Lopes, and L.Rodrigues, Planning workflow executions when using spot instances in the cloud, in Proc. 34th ACM/SIGAPP Symp. Applied Computing, Limassol, Cyprus, 2019, pp. 310–317.
H.Ghavamipoor, S. A. K.Mousavi, H. R.Faragardi, and N.Rasouli, A reliability aware algorithm for workflow scheduling on cloud spot instances using artificial neural network, in Proc. 2020 10th Int. Symp. Telecommunications (IST), Tehran, Iran, 2020, pp. 67–71.
A. C.Zhou, J. M.Lao, Z. B.Ke, Y.Wang, and R.Mao, FarSpot: Optimizing monetary cost for HPC applications in the cloud spot market, IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 2955–2967, 2022.
L.Teylo, L.Arantes, P.Sens, and L. M. A.Drummond, A dynamic task scheduler tolerant to multiple hibernations in cloud environments, Cluster Comput., vol. 24, no. 2, pp. 1051–1073, 2021.
T. P.Pham and T.Fahringer, Evolutionary multi-objective workflow scheduling for volatile resources in the cloud, IEEE Trans. Cloud Comput., vol. 10, no. 3, pp. 1780–1791, 2022.
F.Cao and M. X.Zhu, A fault-tolerant workflow mapping algorithm under end-to-end delay constraint, in 2011 IEEE Int. Conf. High Performance Computing and Communications, Banff, Canada, 2011, pp. 575–580.
H.Youness, A.Omar, and M.Moness, An optimized weighted average makespan in fault-tolerant heterogeneous MPSoCs, IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 8, pp. 1933–1946, 2021.
G. S.Yao, Y. S.Ding, and K. R.Hao, Using imbalance characteristic for fault-tolerant workflow scheduling in cloud systems, IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 12, pp. 3671–3683, 2017.
D.Poola, S. K.Garg, R.Buyya, Y.Yang, and K.Ramamohanarao, Robust scheduling of scientific workflows with deadline and budget constraints in clouds, in Proc. 2014 IEEE 28th Int. Conf. Advanced Information Networking and Applications, Victoria, Canada, 2014, pp. 858–865.
L. C.Canon and E.Jeannot, Evaluation and optimization of the robustness of DAG schedules in heterogeneous environments, IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 4, pp. 532–546, 2010.
X. Y.Tang, K. L.Li, G. P.Liao, K.Fang, and F.Wu, A stochastic scheduling algorithm for precedence constrained tasks on grid, Future Gener. Comput. Syst., vol. 27, no. 8, pp. 1083–1091, 2011.
T. P.Pham, S.Ristov, and T.Fahringer, Performance and behavior characterization of amazon EC2 spot instances, in Proc. 2018 IEEE 11th Int. Conf. Cloud Computing (CLOUD), San Francisco, CA, USA, 2018, pp. 73–81.
H.Wang, L.Cai, X.Hao, J.Ren, and Y. H.Ma, ETS-TEE: An energy-efficient task scheduling strategy in a mobile trusted computing environment, Tsinghua Science and Technology, vol. 28, no. 1, pp. 105–116, 2023.
Z. Y.Hu and D. S.Li, Improved heuristic job scheduling method to enhance throughput for big data analytics, Tsinghua Science and Technology, vol. 27, no. 2, pp. 344–357, 2022.
H.Arabnejad and J. G.Barbosa, List scheduling algorithm for heterogeneous systems by an optimistic cost table, IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 3, pp. 682–694, 2014.
R. N.Calheiros, R.Ranjan, A.Beloglazov, C. A. F.De Rose, and R.Buyya, CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms, Softw.: Pract. Exper., vol. 41, no. 1, pp. 23–50, 2011.
Wu Q, Fang J, Zeng J, et al. Monte Carlo Simulation-Based Robust Workflow Scheduling for Spot Instances in Cloud Environments. Tsinghua Science and Technology, 2024, 29(1): 112-126. https://doi.org/10.26599/TST.2022.9010065
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
10.26599/TST.2022.9010065.F001
Sample of workflow applications.
Resource allocation
In the resource allocation step, each task from the list is sequentially allocated to a resource by comparing all the candidate resources. Although the computation resources offered by the cloud provider are claimed to be infinite, it is unnecessary to try each of them for resource selection as the unused resources of the same type can be regarded as the same. Hence, the resource candidates that need to be considered include all the resources that have been used in the current solution and those that have not been used but can be added at any time (one resource for each type).
For traditional workflow scheduling models aiming to optimize the makespan, the resource allocation criterion is usually set to select a resource from all the resource candidates, which allows the earliest finish time (e.g., HEFT[
13
] and PEFT[
48
]). The considered model aims to achieve deadline-constrained cost optimization, and the resource allocation step should be aware of the deadline constraint and cost. To this end, the user-specified deadline for the workflow is first distributed to each workflow task, and the criterion for selecting a resource for task is updated as follows: meet ’s sub-deadline and minimize the cost increment of adding . Specifically, the sub-deadline of each task is calculated based on via
When no candidate resource meets the sub-deadline, the selection criterion turns to minimizing the finish time of the task instead. By doing so, the probability for the whole workflow to meet the overall deadline is increased.
For brevity, the stochastic list scheduling heuristic employing a -based task ordering before resource allocation is denoted as SLS-I, and the one employing a topological sort based ordering is denoted as SLS-II. A topological sort based ordering has greater uncertainty than a -based one, as the latter’s uncertainty only comes from whether to calculate transfer time.
Solution evaluation with the predicted interruption time
Given the predicted interruption time in runtime for the used SIs in a schedule solution , the actual makespan and cost for the solution can be acquired. The detail is given in Algorithm 2, where the actual start time and finish time of each task are calculated individually based on the task list (Lines 1–6). For each task, the actual start time is first calculated via Eq. (
4
) based on the actual finish time of its parent tasks and the actual ready time of the resource where it is allocated (Lines 23).
When is less than the predicted interruption time of SI obtained from , can finish executing without interruption. Otherwise, the execution is interrupted, and is moved to an OI for re-execution. If there is an OI that has been launched and is currently free and it is not slower than , is moved to this OI; otherwise, a new OI is launched for re-executing . Then, ’s execution time and its actual finish time are calculated via Eqs. (
2
) and (
5
), respectively. After traversing all the tasks in a workflow, its actual makespan and cost are calculated and returned.
MCLS
The main procedure of MCLS follows the Monte Carlo simulation framework, a popular approach for various problems which involve stochastic factors and are generally infeasible to resolve via deterministic computations. MCLS consists of two main phases, that is, “producing” and “selecting”. In the producing phase, an independent sample is randomly taken from the domain space for each random variable , a list scheduling method is leveraged to generate a schedule solution based on random samples, and the above steps are repeated until adequate candidate solutions are generated. Then, in the selecting phase, candidate solutions are evaluated, and the best one is returned. The whole algorithm of MCLS is described in Algorithm 3.
First, an empty solution pool is created, and SLS-I is performed to obtain a schedule solution , which is added to (Lines 12). In the producing phase with iterations, a sample of is taken for all used SIs in based on their distribution functions, which is denoted as , and then is evaluated based on via Algorithm 2 to compute its makespan and cost values (Lines 34).
Next, SLS-II is performed to produce a new solution . The makespan and cost of are calculated based on . A new solution is regarded as better than if they both meet the deadline constraint and incurs less cost or if one of them does not meet the deadline and the makespan of is less than . is added into if it is better than . Based on the samples , new solutions are built and evaluated, and this procedure is repeated for times.
In the selecting phase, when , samples are taken from their distribution functions to evaluate the candidate solutions in (Lines 14–17). Then, for each solution, its satisfaction rate and average cost are calculated, and then the utility value is obtained (Line 18). Then, solutions in whose ranks in terms of utility value are lower than are removed. By doing so, half of the solutions in survive the evaluation in the next round. When , the solution with the highest average utility value is returned as the final solution. The selection phase guarantees that the performance of the returned solution is robust toward various execution situations in runtime as it is the best one validated by a large number of samplings and evaluations.
MCLS builds schedule solutions in total. To build each solution via list scheduling, the computational complexity is , where is the number of tasks in a workflow and is the number of resource types. As is usually much less than , it can be simplified as . The computational complexity of the producing phase is . The computational complexity of the selecting phase is because can be cut for times at most, and an time complexity is required for a solution evaluation. Combining the above analysis, the computational complexity of MCLS can be deduced, i.e., .
10.26599/TST.2022.9010065.F002
Structure of four real-world workflow applications.
10.26599/TST.2022.9010065.F003
Utility value of each approach with respect to .
10.26599/TST.2022.9010065.F004
Utility value of each approach with respect to .