Parallel algorithm for mining frequent item sets based on Spark

doi:10.13196/j.cims.2023.04.020

Computer Integrated Manufacturing System, 2023, Vol. 29, Issue (4): 1267-1283. DOI: 10.13196/j.cims.2023.04.020



MAO Yimin¹, WU Bin¹, XU Chundong¹⁺, ZHANG Maosheng²

1. School of Information Engineering, Jiangxi University of Science and Technology
2. School of Human Settlements and Architectural Engineering, Xi'an Jiaotong University

Online: 2023-04-30 Published: 2023-05-17
Supported by:
Project supported by the National Natural Science Foundation, China (No. 41562019, 11864016), and the National Key Research and Development Program, China (No. 2018YFC1504705).


Abstract: In big data environments, the Spark-based FP-Growth algorithm suffers from low time and space efficiency when creating conditional frequent pattern trees (FP-trees), high communication overhead between nodes, and redundant search. To address these problems, a Parallel Algorithm For Mining Frequent Itemsets based on Spark (PAFMFI-Spark) is proposed. First, a Strategy of Non-negative Matrix Factorization (SNMF) is proposed, which answers support-count queries and decomposes the matrix that stores the support counts, addressing the low time and space efficiency of creating conditional FP-trees. Second, a Grouping Strategy based on Genetic Algorithm (GS-GA) is proposed, which distributes the frequent 1-itemsets evenly across nodes to address the high communication overhead between nodes. Finally, an Efficiently Reduce Tree Structure Strategy (ERTSS) is proposed, which reduces the structure of the FP-tree to address the redundant search problem. Experiments verify the feasibility of PAFMFI-Spark and, through comparison with other mining algorithms, show its performance advantage; the algorithm can effectively handle frequent itemset mining on a variety of data.
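The abstract does not give the details of SNMF or GS-GA, so the following Python sketch only illustrates the two underlying ideas under stated assumptions: a matrix of pairwise support counts that can be queried directly, and an even assignment of frequent 1-itemsets to groups. The function names, the use of item support as a load estimate, and the greedy heaviest-first assignment are all hypothetical; in particular, the greedy step is a simple stand-in for the genetic algorithm used by GS-GA, and no matrix factorization is performed.

```python
from collections import Counter
from itertools import combinations


def frequent_items(transactions, min_support):
    """Count each item once per transaction and keep those meeting min_support."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_support}


def pair_support_matrix(transactions, freq):
    """Build an upper-triangular matrix of support counts for frequent-item pairs.

    Assumption: the "matrix of support counts" is modeled here as pairwise
    co-occurrence counts; the paper's actual layout may differ.
    """
    # Order items by descending support, breaking ties by name.
    order = sorted(freq, key=lambda item: (-freq[item], item))
    index = {item: k for k, item in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for t in transactions:
        kept = sorted({index[item] for item in t if item in index})
        for a, b in combinations(kept, 2):
            matrix[a][b] += 1
    return index, matrix


def pair_support(index, matrix, x, y):
    """Look up the support count of the pair {x, y} without rescanning the data."""
    a, b = sorted((index[x], index[y]))
    return matrix[a][b]


def balanced_groups(freq, n_groups):
    """Greedy stand-in for GS-GA: give the heaviest item to the lightest group.

    Assumption: an item's support count approximates the work it generates.
    """
    groups = [[] for _ in range(n_groups)]
    loads = [0] * n_groups
    for item in sorted(freq, key=lambda i: (-freq[i], i)):
        target = min(range(n_groups), key=lambda g: loads[g])
        groups[target].append(item)
        loads[target] += freq[item]
    return groups, loads


# Toy data, invented for illustration only.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "d"]]
freq = frequent_items(transactions, min_support=2)
index, matrix = pair_support_matrix(transactions, freq)
groups, loads = balanced_groups(freq, n_groups=2)
```

On the toy data, `pair_support(index, matrix, "a", "b")` is answered from the matrix alone, and `groups` holds the item assignment that a parallel run would send to each of the two workers.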

Key words: big data, Spark framework, parallel frequent itemset mining, frequent pattern growth algorithm, non-negative matrix factorization


CLC Number: TP311

MAO Yimin, WU Bin, XU Chundong, ZHANG Maosheng. Parallel algorithm for mining frequent item sets based on Spark[J]. Computer Integrated Manufacturing System, 2023, 29(4): 1267-1283.


