Computer Integrated Manufacturing System, 2024, Vol. 30, Issue (8): 2663-2671. DOI: 10.13196/j.cims.2023.BPM02


Generalization- and difficulty-aware oversampling technique for software defect prediction

FAN Hongqi1,2, YAN Yuanting1,2+, ZHANG Yiwen1,2, ZHANG Yanping1,2

  1. Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University
  2. School of Computer Science and Technology, Anhui University
  • Online: 2024-08-31 Published: 2024-09-03
  • Supported by:
    Project supported by the National Natural Science Foundation, China (No. 61806002, 62272001).

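  • About the authors:
    FAN Hongqi (born 1997), male, from Fuyang, Anhui; master's student; research interests: software defect prediction and machine learning. E-mail: 1601260972@qq.com

    +YAN Yuanting (born 1986), male, from Xuancheng, Anhui; associate professor, Ph.D.; research interests: data mining, granular computing and machine learning; corresponding author. E-mail: ytyan@ahu.edu.cn

    ZHANG Yiwen (born 1976), male, from Ma'anshan, Anhui; professor, Ph.D.; research interests: service computing, cloud computing and big data analysis.

    ZHANG Yanping (born 1962), female, from Chaohu, Anhui; professor, Ph.D.; research interests: computational intelligence, granular computing and machine learning.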

Abstract: The class-imbalanced distribution of software defect data poses a major challenge for software defect prediction. Synthetic oversampling is the most widely used technique for addressing this problem, but designing a sampling strategy that avoids the over-generalization risk caused by introducing abnormal samples remains an open challenge. To address this problem, a Generalization and Difficulty-aware Oversampling (GDOS) method is proposed that combines sample learning difficulty with the generalization effect of synthesis when oversampling the minority class. For each oversampling seed sample, GDOS evaluates the selection weights of its assistant minority samples by measuring a safe factor and a generalization factor simultaneously, based on local prior probability and the sample distribution along the potential synthesis direction. By suppressing the probability of synthesizing samples in potential over-generalization regions and raising the probability of synthesizing along relatively safe directions, GDOS helps ensure that high-quality samples are synthesized. Experimental comparison with nine state-of-the-art methods, covering both classical sampling methods and sampling methods designed specifically for software defect prediction, on twenty-six datasets from the PROMISE repository shows that GDOS performs better in terms of MCC, pd, pf and F-measure.
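The weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas for the safe factor or the generalization factor, so both definitions below are assumptions made for illustration only: the safe factor is taken as the minority share among an assistant's k nearest neighbours, and the generalization factor penalises synthesis directions whose enclosing ball contains majority samples. The function name `gdos_like_oversample` and its parameters are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def _knn(point, pool, k, skip=None):
    """Indices of the k points in `pool` closest to `point` (index `skip` excluded)."""
    order = sorted((j for j in range(len(pool)) if j != skip),
                   key=lambda j: math.dist(point, pool[j]))
    return order[:k]


def gdos_like_oversample(X, y, n_new, k=5, minority=1, seed=0):
    """Weighted SMOTE-style sketch; both factors are illustrative assumptions."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj = [X[i] for i, label in enumerate(y) if label != minority]
    if len(min_idx) < 2:
        raise ValueError("need at least two minority samples")

    # Assumed safe factor: share of minority samples among the k nearest
    # neighbours of each minority sample, measured in the full data set.
    safe = {}
    for i in min_idx:
        nn = _knn(X[i], X, k, skip=i)
        safe[i] = sum(y[j] == minority for j in nn) / len(nn)

    synthetic = []
    for _ in range(n_new):
        i = rng.choice(min_idx)                      # seed sample
        pool = [j for j in min_idx if j != i]
        cand = [pool[p] for p in _knn(X[i], [X[j] for j in pool], k)]

        weights = []
        for j in cand:                               # candidate assistant
            # Assumed generalization factor: count majority samples inside
            # the ball whose diameter is the segment from seed to assistant.
            mid = [(a + b) / 2 for a, b in zip(X[i], X[j])]
            radius = math.dist(X[i], X[j]) / 2
            intruders = sum(math.dist(m, mid) <= radius for m in maj)
            gen = 1.0 / (1.0 + intruders)
            weights.append((safe[j] + 1e-6) * gen)   # small floor keeps weights positive

        j = rng.choices(cand, weights=weights, k=1)[0]
        u = rng.random()
        # Linear interpolation between seed and the chosen assistant.
        synthetic.append([a + u * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic
```

Multiplying the two factors means an assistant is chosen often only when it sits in a minority-dominated neighbourhood and the path to it is free of majority samples. Other ways of combining them would fit the abstract's description equally well.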

Key words: software defect prediction, class imbalance, oversampling, overgeneralization
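The four evaluation measures named in the abstract can be computed from a binary confusion matrix. The sketch below uses their standard definitions, with the defective class as positive: pd is the probability of detection (recall), pf is the probability of false alarm, and MCC is the Matthews correlation coefficient. It assumes F-measure means the balanced F1 score, which the abstract does not state. The function name `defect_metrics` is hypothetical.

```python
import math


def defect_metrics(y_true, y_pred, positive=1):
    """Return pd, pf, F-measure (assumed F1) and MCC for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 when a denominator is zero, instead of raising.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # pd: probability of detection
    false_alarm = ratio(fp, fp + tn)   # pf: probability of false alarm
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"pd": recall, "pf": false_alarm, "F-measure": f_measure, "MCC": mcc}
```

Higher pd, F-measure and MCC are better, while lower pf is better, so a sampler that raises pd only by flagging more modules as defective also raises pf.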

