Computer Integrated Manufacturing System, 2024, Vol. 30, Issue (5): 1719-1732. DOI: 10.13196/j.cims.2023.0376


Outlier detection algorithm based on mapping distance ratio outlier factor

ZHANG Zhongping1,2, YAO Chunchen1+, SUN Guangxu1, LIU Shuo1, ZHANG Ruibo3, WEI Yonghui4,5

  1. College of Information Science and Engineering, Yanshan University
  2. Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province
  3. School of International Education, Wuhan University of Technology
  4. Liren College, Yanshan University
  5. School of Information and Communication Technology, Mongolian University of Science and Technology
  • Online: 2024-05-31 Published: 2024-06-12
  • Supported by:
    Project supported by the National Natural Science Foundation, China (No. 61972334), the Innovation Capability Improvement Plan of Hebei Province, China (No. 222567626H), the Local Science and Technology Development Fund Guided by the Central Government, China (No. 226Z1707G), the Intelligent Image Workpiece Recognition Project of Sida Railway, China (No. x2021134), and the Performance Appraisal Management System Project of Qinhuangdao Urban and Health Industry Development Co., Ltd., China (No. x2022247).

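  • About the authors: ZHANG Zhongping (1972-), male, from Songyuan, Jilin, professor, Ph.D., research interests: big data, data mining, semi-structured data, E-mail: zpzhang@ysu.edu.cn; +YAO Chunchen (2000-), male, from Benxi, Liaoning, master's student, research interest: data mining, corresponding author, E-mail: ycc6688win@163.com; SUN Guangxu (2000-), male, from Chuzhou, Anhui, master's student, research interest: data mining, E-mail: 1957996858@qq.com; LIU Shuo (1997-), male, from Langfang, Hebei, master's student, research interest: data mining, E-mail: 1332494613@qq.com; ZHANG Ruibo (2002-), male, from Qinhuangdao, Hebei, undergraduate student, research interests: big data, data mining, E-mail: 1277505954@qq.com; WEI Yonghui (1982-), male, from Langfang, Hebei, senior experimentalist, master's degree, Outstanding Science and Technology Commissioner of Hebei Province, research interests: machine vision, natural language processing, intelligent embedded systems, E-mail: wyh4919@ysu.edu.cn.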

Abstract: Proximity-based outlier detection methods spend considerable time processing normal points and have difficulty detecting local outliers while detecting global ones. To address these problems, an outlier detection algorithm based on the Mapping Distance Ratio Outlier Factor (MDROF) is proposed. First, to reduce the time spent on normal points during detection, the concept of difference similarity is introduced, and a difference similarity pruning factor is defined to filter out most normal points in the data set. Second, the mapping k distance is defined; the local outlier degree of a data object is described by the ratio of its mapping distance to its reachable distance, and its global outlier degree is described by its reachable density. Finally, the mapping distance ratio outlier factor is defined by incorporating the average rank of the mutual nearest neighbors of each data object, and is used to detect outliers. The proposed algorithm is compared with other classical outlier detection algorithms on artificial and real data sets in terms of precision, AUC value, and outlier discovery curve. The experimental results show that MDROF outperforms the comparison algorithms in both the accuracy and the stability of outlier detection.

Key words: data mining, outlier detection, difference similarity pruning, mapping k distance, mapping distance ratio
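
The abstract does not give the formulas for the mapping k distance or the difference similarity pruning factor, so they cannot be reproduced here. As a rough illustration of the two proximity quantities it builds on (reachable distance and reachable density) and of the AUC metric used in the evaluation, the following is a minimal sketch that assumes the classical LOF-style definitions rather than the MDROF ones. All function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def knn(X, k):
    """Return (distances, indices) of each row's k nearest neighbors, excluding itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]
    return np.take_along_axis(d, idx, axis=1), idx

def local_reachability_density(X, k):
    """LOF-style reachable density (an assumption, not the MDROF definition)."""
    dist, idx = knn(X, k)
    k_dist = dist[:, -1]                      # k-distance of every point
    # reach-dist_k(p, o) = max(d(p, o), k-distance(o)) for each neighbor o of p
    reach = np.maximum(dist, k_dist[idx])
    return 1.0 / reach.mean(axis=1)

def auc(scores, labels):
    """AUC: probability that an outlier scores above a normal point (ties count 0.5)."""
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

# Synthetic data (hypothetical): one dense cluster plus three isolated points.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.5, size=(60, 2))
outliers = np.array([[6.0, 6.0], [-7.0, 5.0], [5.0, -6.0]])
X = np.vstack([cluster, outliers])
labels = np.r_[np.zeros(60, dtype=int), np.ones(3, dtype=int)]

lrd = local_reachability_density(X, k=5)
scores = -lrd                                 # lower density means more outlying
print("AUC:", auc(scores, labels))
```

On data this well separated, the isolated points should receive the lowest densities. Real data sets, and the MDROF factor itself, would need the definitions given in the full paper.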

