Open Access
Towards Efficient Serverless MapReduce Computing on Cloud-Native Platforms
Big Data Mining and Analytics 2025, 8(3): 575-591
Published: 04 April 2025
PDF (10.1 MB), Downloads: 14

MapReduce is one of the most classic and powerful parallel computing models in the field of big data. It remains active in the big data system ecosystem and is currently evolving towards cloud-native environments. Due to its elasticity and ease of use, Serverless computing is one of the most promising directions in cloud-native technology. Supporting MapReduce big data computing in a Serverless environment can give full play to Serverless's advantages. However, because of the different underlying system architecture, three issues arise when running MapReduce jobs in a Serverless environment. First, the scheduling strategy struggles to fully utilize the available resources. Second, reading Shuffle index data from cloud storage is inefficient and expensive. Third, cloud storage Input/Output (I/O) request latency exhibits a long-tail effect. To solve these problems, this paper proposes three strategies within a MapReduce parallel processing framework for the Serverless environment. Experimental results show that, compared with cutting-edge systems, our approach shortens job execution time by 25.6% on average and reduces job execution costs by 17.3%.
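For context, the classic MapReduce model the paper builds on can be sketched in a few lines. This is a minimal single-process word-count illustration of the map/shuffle/reduce phases, not the paper's Serverless framework or its scheduling and Shuffle strategies:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) pairs from each input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key.
    In a Serverless setting this stage typically goes through
    cloud storage, which is where the paper's Shuffle-index and
    I/O long-tail issues arise."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big compute"])))
print(counts)  # {'big': 2, 'data': 1, 'compute': 1}
```

In a real Serverless deployment each map and reduce task would run as an independent stateless function, with the shuffle materialized through an external store rather than in memory.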

Open Access
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
Big Data Mining and Analytics 2018, 1(1): 57-74
Published: 25 January 2018
PDF (2.2 MB), Downloads: 130

Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computational complexity; (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph with the parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and even better than, state-of-the-art dedicated systems.
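For context, the standard CGS update that formula decompositions such as ZenLDA's factorize is the well-known collapsed posterior over the topic assignment $z_i$ of token $w_i$ in document $d$ (this is the textbook CGS formula, not ZenLDA's specific decomposition, which the paper details):

```latex
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{k,\neg i} + \beta}{n^{(\cdot)}_{k,\neg i} + V\beta}
\,\bigl(n^{(k)}_{d,\neg i} + \alpha\bigr)
```

Here $n^{(w_i)}_{k,\neg i}$ counts how often word $w_i$ is assigned to topic $k$, $n^{(k)}_{d,\neg i}$ counts topic $k$'s assignments in document $d$ (both excluding token $i$), $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet priors. Decomposition approaches split this product into dense and sparse terms so that most of the sampling work scales with the number of nonzero counts rather than with the full topic count.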
