Computer Integrated Manufacturing Systems ›› 2019, Vol. 25 ›› Issue (4): 791-797. DOI: 10.13196/j.cims.2019.04.001


Parallel distributed process mining algorithm based on Spark

HU Xiaoqiang (胡小强)1, WU Xuan (吴翾)2, WEN Lijie (闻立杰)1+, WANG Jianmin (王建民)1

  1. School of Software, Tsinghua University
  2. College of Computer Science and Technology, Jilin University
  • Online: 2019-04-30 Published: 2019-04-30
  • Supported by:
    Project supported by the National Key Research and Development Plan, China (No. 2016YFB1001101), the National Natural Science Foundation, China (No. 61472207, 71690231), the Beijing Key Laboratory for Industrial Big Data System and Application, China, and the Beijing National Research Center for Information Science and Technology (BNRist), China.

Abstract: Traditional process discovery algorithms mine large-scale event logs inefficiently. To address this problem, a method is proposed that uses a Spark cluster to accelerate process mining. The method targets process mining algorithms that are based on the activity relations in the log, and it accelerates the stage in which those relations are extracted: activity relations are extracted in a parallel, distributed manner, which transforms the event log into an activity relation matrix. The process model is then mined from this matrix by following the remaining steps of the original algorithm. A distributed α-Mine algorithm and a distributed Flexible Heuristic Miner algorithm were implemented on Spark. The results show that the proposed method takes less time than the best current algorithm implementation and that mining efficiency improves markedly.

Key words: process mining algorithm, Spark cluster, big data, parallel distributed processing
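
The relation-extraction stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python in place of Spark, and every name in it (`directly_follows`, `merge_counts`, `relation_matrix`, `footprint`) is hypothetical. It assumes the activity relation of interest is the directly-follows relation and that the log is already split into partitions of traces. On a Spark RDD, the map step would roughly correspond to `flatMap` over traces and the merge step to `reduceByKey`; that mapping is likewise an assumption.

```python
from collections import Counter
from functools import reduce


def directly_follows(trace):
    """Map step: count pairs (a, b) where b immediately follows a in one trace."""
    return Counter(zip(trace, trace[1:]))


def merge_counts(left, right):
    """Reduce step: add up partial pair counts (associative and commutative)."""
    return left + right


def relation_matrix(partitions):
    """Build a directly-follows count matrix from a partitioned event log.

    `partitions` is a list of partitions, each a list of traces; in a real
    cluster each partition would be handled by a separate worker.
    """
    partials = [
        reduce(merge_counts, map(directly_follows, part), Counter())
        for part in partitions
    ]
    counts = reduce(merge_counts, partials, Counter())
    activities = sorted({act for part in partitions for trace in part for act in trace})
    matrix = {a: {b: counts[(a, b)] for b in activities} for a in activities}
    return activities, matrix


def footprint(matrix, a, b):
    """Classify the pair (a, b) using the usual footprint symbols."""
    ab, ba = matrix[a][b] > 0, matrix[b][a] > 0
    if ab and ba:
        return "||"   # both orders observed: parallel
    if ab:
        return "->"   # only a before b: causal
    if ba:
        return "<-"   # only b before a: reverse causal
    return "#"        # never adjacent: unrelated


if __name__ == "__main__":
    # Toy log (assumed data), split into two partitions.
    log = [
        [("a", "b", "c", "d"), ("a", "c", "b", "d")],
        [("a", "e", "d"), ("a", "b", "c", "d")],
    ]
    activities, matrix = relation_matrix(log)
    print(activities)
    print(footprint(matrix, "b", "c"))  # "||"
```

Because the merge step is a plain sum of counts, the result does not depend on how the traces are partitioned, which is what makes this stage straightforward to distribute. The later stages of a relation-based miner would then read only the matrix, not the original log.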
