Computer Integrated Manufacturing System, 2024, Vol. 30, Issue (9): 3183-3198. DOI: 10.13196/j.cims.2022.0084


Parallel support vector machine algorithm based on relative entropy and cosine similarity

MAO Yimin1, GUO Binbin1, YI Jianbing1+, CHEN Zhigang2

  1. School of Information Engineering, Jiangxi University of Science and Technology
  2. College of Computer Science and Engineering, Central South University
  • Online: 2024-09-30 Published: 2024-10-09
  • Supported by:
    The National Natural Science Foundation of China (No. 41562019) and the Science and Technology Innovation 2030 "Next Generation Artificial Intelligence" Major Project, China (No. 2020AAA0109605).

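  • About the authors:
    MAO Yimin (1970-), female, born in Yili, Xinjiang, professor, Ph.D.; research interests: data mining and big data. E-mail: lycmym@163.com

    GUO Binbin (1996-), male, born in Ji'an, Jiangxi, master's student; research interests: data mining and big data. E-mail: guobinbin@mail.jxust.edu.cn

    +YI Jianbing (1980-), male, born in Yichun, Jiangxi, associate professor, Ph.D.; research interest: big data; corresponding author. E-mail: yijianbing8@163.com

    CHEN Zhigang (1964-), male, born in Changsha, Hunan, professor, Ph.D.; research interest: big data. E-mail: czg@csu.edu.cn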

Abstract: To address the problems of parallel support vector machine (SVM) algorithms in big data environments, namely large deviation in subset distribution, low parallel efficiency, and inaccurate filtering of non-support vectors, a Parallel Support Vector Machine algorithm based on Relative Entropy and Cosine Similarity (RC-PSVM) was proposed. First, a Data Partitioning strategy based on Relative Entropy (DPRE) was proposed, which balanced the relative entropy between the current subset and the original data set and assigned each sample to a suitable subset, reducing the deviation of the subset distribution. Then, a Redundancy Level Detection Strategy based on Cosine Similarity (CS-RLDS) was designed, which calculated the cosine similarity between the normal vectors of local SVMs in adjacent layers and compared it with a preset threshold to identify and stop redundant levels, improving parallel efficiency. Finally, a Non-Support Vector Filtering strategy (NSVF) was developed, which computed a support vector similarity from the distances between a sample and the decision boundaries of multiple local SVM models to identify non-support vectors, addressing the inaccurate filtering of non-support vectors. Experiments showed that RC-PSVM achieved better classification performance and ran more efficiently on big data.

Key words: big data, MapReduce framework, parallel support vector machine, relative entropy, cosine similarity
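
The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that subset distributions are compared as discrete class-label distributions and that relative entropy is the standard Kullback-Leibler divergence. The function names, the smoothing constant `eps`, and the threshold value `0.99` are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def class_distribution(labels, classes):
    """Empirical class-label distribution of a sample set (assumed representation)."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(P || Q) between two discrete distributions.

    eps is a hypothetical smoothing constant that avoids division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, e.g. SVM normal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_redundant_level(w_prev, w_curr, threshold=0.99):
    """Flag a level as redundant when adjacent-layer normal vectors nearly coincide.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    return cosine_similarity(w_prev, w_curr) >= threshold
```

Under these assumptions, a subset whose class distribution matches the original data set has relative entropy 0, and a larger value signals a more biased subset. Likewise, two local models whose normal vectors point in almost the same direction yield a cosine similarity close to 1, which a CS-RLDS-style check would treat as a sign that further levels add little.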

