Journal of Wuhan Institute of Technology [ISSN: 1674-2869 / CN: 42-1779/TQ]

Volume:: 39
Issue:: 2017, No. 05

Pages:: 508-513

Column:: Electromechanical and Information Engineering

Publication date:: 2017-12-19

Article Info

Title:: Secondary Sort-Based Algorithm for Eliminating Normative Join Redundancy

Article ID:: 20170518

Author(s):: LIU Lizhi (刘黎志)¹,²; ZHANG Wei (张威)¹,². 1. Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology), Wuhan 430205, China; 2. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China

Keywords:: MapReduce; join redundancy; secondary sort; HBase

CLC number:: TP311

DOI:: 10.3969/j.issn.1674-2869.2017.05.018

Document code:: A

Abstract:: When the MapReduce framework is used to join two entities in a normalized one-to-many relationship, the attributes of the one-side entity are repeated many times in the join result. By optimizing the secondary sort algorithm and redefining the partitioning step of the Map stage and the sorting and grouping steps of the Shuffle stage, the Map-stage output delivered to the Reduce stage becomes a composite key, containing the attribute values of the one-side entity and the sort value of the many-side entity, together with a list of the many-side entity's attribute values. The Reduce stage splits the composite key, extracts the primary key of the one-side entity as the row key of an HBase table, and writes the one-side attribute values from the composite key and the list of many-side attribute values into the corresponding columns of that table. This preserves the join semantics while eliminating the redundancy. Experiments show that the optimized algorithm eliminates the redundancy of one-side attribute values in the join result and improves the efficiency of queries over the join result.
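The pipeline described in the abstract can be illustrated with a small in-memory simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the department/employee data, the function names, the column names `info:name` and `members:list`, and the idea that the mapper looks up one-side attributes from a small in-memory table are all hypothetical, and a plain dictionary stands in for the HBase table.

```python
from itertools import groupby
from typing import NamedTuple


class CompositeKey(NamedTuple):
    dept_id: str     # primary key of the "one" side
    dept_name: str   # remaining attribute of the "one" side
    sort_value: str  # sort value taken from the "many" side


# Hypothetical sample data: one department has many employees.
DEPARTMENTS = {"d1": "Research", "d2": "Sales"}
EMPLOYEES = [("e3", "d1", "Chen"), ("e1", "d2", "Wang"), ("e2", "d1", "Li")]


def map_phase(employees, departments):
    """Emit (composite key, many-side attributes) for each many-side record.

    Assumption: the one-side table is small enough to look up in memory.
    """
    for emp_id, dept_id, name in employees:
        key = CompositeKey(dept_id, departments[dept_id], emp_id)
        yield key, (emp_id, name)


def partition(key, num_reducers):
    """Partition on the one-side primary key only, so that all records of
    one department reach the same reducer."""
    return sum(key.dept_id.encode()) % num_reducers  # deterministic toy hash


def shuffle(pairs, num_reducers):
    """Sort each partition by the full composite key, then group by the
    one-side attributes only (the secondary-sort pattern)."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
        # Grouping ignores sort_value, so one group holds a whole department.
        for group_key, group in groupby(part, key=lambda kv: kv[0][:2]):
            yield group_key, [value for _, value in group]


def reduce_phase(grouped, table):
    """Split the key: the primary key becomes the row key, and the one-side
    attributes are stored once next to the list of many-side values."""
    for (dept_id, dept_name), members in grouped:
        table[dept_id] = {"info:name": dept_name, "members:list": members}


def run_join(num_reducers=2):
    table = {}  # dictionary standing in for the HBase table
    pairs = map_phase(EMPLOYEES, DEPARTMENTS)
    reduce_phase(shuffle(pairs, num_reducers), table)
    return table
```

Calling `run_join()` yields one row per department, with the department name stored once and its employees ordered by the sort value, rather than one flat joined row per employee.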



Memo

Memo:: Received: 2016-12-01. Author biography: LIU Lizhi, M.S., associate professor. E-mail: llz73@163.com

Last Update: 2017-10-26
