Computer Integrated Manufacturing Systems ›› 2022, Vol. 28 ›› Issue (7): 2075-2082. DOI: 10.13196/j.cims.2022.07.013



Improved knowledge distillation method with curriculum learning paradigm

ZHANG Shaowei1, WANG Chaofei2, YANG Ke3, LUO Xianguang4, WU Cheng2, LI Qiang1+

1. School of Microelectronics, Tianjin University
    2. Department of Automation, Tsinghua University
    3. College of Chemistry, Beijing University of Chemical Technology
    4. CRRC Zhuzhou Locomotive Co., Ltd.
  • Online: 2022-07-31  Published: 2022-07-30
  • Supported by:
    Project supported by the National Natural Science Foundation of China (Nos. 62071323, 61471263, 61872267), the Independent Innovation Foundation of Tianjin University, China (No. 2021XZC-0024), and the Science and Technology Research and Development Project of CRRC, China (No. 2018CCA017).


Abstract: With the advent of the Industry 4.0 era, neural networks have been widely applied to automating industrial systems. However, large-scale neural networks consume large amounts of storage, memory bandwidth and computing resources, which makes them difficult to use efficiently in industrial scenarios where computation is limited; lightweight networks therefore have broader application prospects. Knowledge distillation extracts the knowledge of a large-scale, high-performance teacher network to guide the training of a lightweight, lower-performance student network, and has been verified to improve the performance of lightweight networks. However, existing knowledge distillation methods all adopt the traditional data input strategy: the training set is shuffled and mini-batches are sampled at random, without considering the influence of sample order on how the student network learns. To solve this problem, the curriculum learning paradigm was introduced into the knowledge distillation scenario. Imitating real teaching, distillation was performed with an easy-to-hard sample input strategy, in which the difficulty of each sample was evaluated cooperatively by the teacher and student networks, so as to combine the teacher network's experience with the student network's needs and obtain the most reasonable curriculum design. Experiments on the CIFAR dataset showed that the proposed method substantially improved the accuracy of traditional knowledge distillation baselines across a variety of network architectures, and that the proposed curriculum learning paradigm could also be applied to other mainstream knowledge distillation methods to further improve their performance. Ablation experiments showed that cooperative difficulty evaluation by the teacher and student networks had clear advantages over using either network alone. These results verify the effectiveness of introducing the curriculum learning paradigm into knowledge distillation; the proposed algorithm is practical, widely applicable, and offers a new line of exploration for knowledge distillation research.
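To make the procedure concrete, the sketch below shows one way the easy-to-hard schedule could be realized in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the distillation loss is the standard Hinton-style formulation, while difficulty_scores, curriculum_order, and the weighted cross-entropy mix (weight beta) are hypothetical choices for how the teacher and student networks might cooperate on difficulty evaluation.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Standard Hinton-style knowledge distillation loss: KL divergence
        # between temperature-softened teacher and student distributions,
        # plus ordinary cross-entropy on the ground-truth labels.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard

    @torch.no_grad()
    def difficulty_scores(teacher, student, images, labels, beta=0.5):
        # Cooperative difficulty estimate (assumed form): a weighted mix of
        # the per-sample cross-entropy under the teacher (its experience)
        # and under the current student (its needs).
        t_loss = F.cross_entropy(teacher(images), labels, reduction="none")
        s_loss = F.cross_entropy(student(images), labels, reduction="none")
        return beta * t_loss + (1.0 - beta) * s_loss

    @torch.no_grad()
    def curriculum_order(teacher, student, loader):
        # Rank all training samples from easy to hard; `loader` must iterate
        # the dataset without shuffling so positions map back to indices.
        scores = [difficulty_scores(teacher, student, x, y) for x, y in loader]
        return torch.argsort(torch.cat(scores))  # ascending: easiest first

Feeding the resulting index order to torch.utils.data.Subset, or expanding a prefix of it as training progresses, would then replace uniform shuffling with the curriculum schedule.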

Key words: knowledge distillation, teacher network, student network, random sampling, curriculum learning, difficulty evaluation

CLC number: