Computer Integrated Manufacturing System ›› 2023, Vol. 29 ›› Issue (4): 1267-1283.DOI: 10.13196/j.cims.2023.04.020

Previous Articles     Next Articles

Parallel algorithm for mining frequent item sets based on Spark

MAO Yimin1,WU Bin1,XU Chundong1+,ZHANG Maosheng2   

  1. 1.School of Information Engineering,Jiangxi University of Science and Technology
    2.School of Human Settlements and Architectural Engineering,Xi'an Jiaotong University
  • Online:2023-04-30 Published:2023-05-17
  • Supported by:
    Project supported by the National Natural Science Foundation,China (No.41562019,11864016),and the National Key Research and Development Program,China (No.2018YFC1504705).

基于Spark的并行频繁项集挖掘算法

毛伊敏1,吴斌1,许春冬1+,张茂省2   

  1. 1.江西理工大学信息工程学院
    2.西安交通大学人居环境与建筑工程学院
  • 基金资助:
    国家自然科学基金资助项目(41562019,11864016);国家重点研发计划资助项目(2018YFC1504705)。

Abstract: Aiming at the problem of low space-time efficiency in creating conditional Frequent Pattern Tree (FP-tree),high communication overhead between nodes and redundant search in the Spark-based FP-Growth algorithm,a Parallel Algorithm For Mining Frequent Itemset based on Spark (PAFMFI-Spark) was proposed.A Strategy of Non-Negative Matrix Factorization (SNMF) was proposed,which provided the query of support counts and decomposed the matrix of support counts,thereby solving the problem of low space-time efficiency in creating conditional FP-tree.A Grouping Strategy based on Genetic Algorithm (GS-GA) was proposed,which evenly grouped frequent 1 item sets to solve the problem of high communication overhead between nodes.An Efficiently Reduce Tree Structure Strategy (ERTSS) was proposed,which reduced the structure of FP-tree to solve the redundant search problem.The feasibility of the PAFMFI-Spark algorithm was verified by the experiment,and its performance advantage was proved by comparing with other mining algorithms,which could effectively process frequent itemset mining of various data.

Key words: big data, Spark framework, parallel mining frequent itemsets, frequent pattern growth algorithm, non-negative matrix factorization

摘要: 针对大数据环境下基于Spark的频繁模式增长(FP-Growth)算法存在创建条件频繁模式树(FP-tree)时空效率低,节点间通信开销大,以及冗余搜索等问题,提出了基于Spark的并行频繁项集挖掘算法(PAFMFI-Spark)。首先,该算法提出非负矩阵分解策略(SNMF),通过提供支持度计数查询和分解储存支持度计数的矩阵,解决了创建条件FP-tree的时空效率低的问题;其次,提出基于遗传算法的分组策略(GS-GA),均衡分配频繁1项集至各节点,解决了节点间的通信开销大的问题;最后,提出高效缩减树结构策略(ERTSS),缩减FP-tree树结构,解决了冗余搜索的问题。实验结果验证了PAFMFI-Spark算法的可行性以及相较于其他挖掘算法的性能优势,所提算法能有效适应各种数据的频繁项集挖掘。

关键词: 大数据, Spark框架, 并行频繁项集挖掘, 频繁模式增长算法, 非负矩阵分解

CLC Number: