Computer Integrated Manufacturing System ›› 2023, Vol. 29 ›› Issue (12): 4001-4020.DOI: 10.13196/j.cims.2021.0491

Previous Articles     Next Articles

Parallel spectral clustering algorithm using KD tree and chaotic mayfly optimization algorithm

HU Jian1,2,LIU Xiangmin1,MAO Yimin1+,CHEN Zhigang3   

  1. 1.School of Information Engineering,Jiangxi University of Science and Technology
    2.School of Information Engineering,Gannan University of Science and Technology
    3.School of Computing,Central South University
  • Online:2023-12-31 Published:2024-01-10
  • Supported by:
    Project supported by the National Natural Science Foundation,China(No.41562019),the National Key Research and Development Program,China(No.2018YFC1504705),the Scientific and Technological Innovation 2030—“New Generation Artificial Intelligence”Major Project ,China(No.2020AAA0109605),and the Science and Technology Foundation of Jiangxi Provincial Education Department,China(No.GJJ151528,GJJ209405).

基于KD树和混沌蜉蝣优化的并行谱聚类算法

胡健1,2,刘祥敏1,毛伊敏1+,陈志刚3   

  1. 1.江西理工大学信息工程学院
    2.赣南科技学院信息工程学院
    3.中南大学计算机学院
  • 基金资助:
    国家自然科学基金资助项目(41562019);国家重点研发计划资助项目(2018YFC1504705);科技创新2030—“新一代人工智能”重大项目子课题(2020AAA0109605);江西省教育厅科技资助项目(GJJ151528,GJJ209405)。

Abstract: In the big data environment,the parallel spectral clustering algorithm suffers from imbalanced node load,redundant computation,high overhead for matrix multiplication and initially cluster center sensitivity.To address these problems,a Parallel Spectral Clustering algorithm using KD-tree and chaotic Mayfly Optimization algorithm(PSC-MO)was proposed.The KD-tree data partitioning strategy based on sampling was proposed to divide the data and ensure the load balancing among nodes.During the construction of the sparse similarity matrix,an Optimized Partition Assignment strategy(OPA)and two KD-tree pruning strategies based on triangular inequalities were proposed to perform t nearest neighbor search across partitions,avoiding excessive redundant computations.Then,the regularization theorem was proposed to optimize the Laplacian matrix regularization processes by multiplying the elements correspondingly instead of matrix multiplication,which effectively reduced the time overhead.The Chaotic Mayfly Optimization Algorithm(CMO)was proposed to obtain the best position as the initial cluster center for k-means parallel clustering,which solved the problem that the parallel spectral clustering algorithm was sensitive to the initial cluster center.Experimental results showed that the PSC-MO algorithm not only had excellent clustering performance,but also exhibited both outstanding data scalability and system scalability with large-scale datasets.

Key words: big data, parallelization, MapReduce model, spectral clustering, K-dimension tree, chaotic mayfly optimization algorithm

摘要: 针对大数据环境下并行谱聚类算法存在的节点负载不均衡、冗余计算、矩阵相乘时间开销大以及初始簇中心敏感等问题,提出了基于KD(k-dimension)树和混沌蜉蝣优化算法的并行谱聚类算法(PSC-MO)。首先,提出基于采样的KD-tree数据分区策略(DPS)划分数据,保证了节点间负载均衡;其次,在构建稀疏相似矩阵过程中,提出优化的分区分配策略(OPA)和基于三角不等式的KD树剪枝策略以进行跨分区的t近邻搜索,避免了过多的冗余计算;然后,提出正规化定理,通过元素对应相乘的方式代替矩阵相乘以优化Laplacian矩阵正规化过程,有效地减少了时间开销;最后,提出混沌蜉蝣优化算法(CMO),得到最佳位置作为初始簇中心后进行k-means并行聚类,解决了算法对初始簇中心敏感的问题。实验表明,PSC-MO算法不但具有良好的聚类效果,而且在大规模数据集上表现出了良好的数据和系统可扩展性。

关键词: 大数据, 并行化, MapReduce模型, 谱聚类, KD树, 混沌蜉蝣优化算法

CLC Number: