Real-time data stream processing and key techniques oriented to large-scale sensor data

• Article •

QI Kai-yuan1,2,3, HAN Yan-bo1, ZHAO Zhuo-feng1, MA Qiang2,3

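1. Cloud Computing Research Center, North China University of Technology, Beijing 100144, China; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. University of Chinese Academy of Sciences, Beijing 100190, China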

Received: 2013-03-25 Revised: 2013-03-25 Online: 2013-03-25 Published: 2013-03-25

Abstract

Abstract: With the growth of the Internet of Things, performing real-time computation on high-speed sensor data streams against large-scale historical sensor data has become a new challenge for cloud manufacturing. To address it, a data stream processing method for large-scale historical data, Real-Time MapReduce (RTMR), is proposed, which improves the stream processing capability of MapReduce through intermediate result caching, pipelining, and localization. To construct RTMR clusters adaptively, a model analysis method is used to configure node types and topology according to application characteristics and the cluster environment. To balance load across the cluster, idle and overloaded nodes are grouped by computing the load state transition relation, which quickly decomposes the NP-hard dynamic load balancing problem into smaller sub-problems; execution time and data movement cost are combined as the optimization objective of each sub-problem, which speeds up the response to load skew. Experiments show that the proposed methods and techniques ensure the scalability of data stream processing over large-scale historical data.

Key words: data stream processing, large-scale data processing, MapReduce, adaptive architecture, load balancing
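The abstract names intermediate result caching as one way RTMR adapts MapReduce to streams, but gives no implementation details. The sketch below is only an illustration of the general idea, not the authors' design: historical records are reduced once into a per-key cache, and each arriving stream record is merged into the cached value instead of triggering a recomputation over the history. The class name `CachedReducer`, its methods, and the sample sensor keys are all hypothetical, and the sketch assumes the combine function is associative.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Record = Tuple[Hashable, float]  # (key, value), e.g. (sensor id, reading)


class CachedReducer:
    """Hypothetical helper: keeps per-key intermediate results so that new
    stream records are merged incrementally rather than recomputed."""

    def __init__(self, combine: Callable[[float, float], float],
                 initial: float = 0.0) -> None:
        self.combine = combine      # assumed associative, e.g. sum or max
        self.initial = initial      # starting value for unseen keys
        self.cache: Dict[Hashable, float] = {}

    def load_history(self, records: Iterable[Record]) -> None:
        """Reduce historical records once and keep the results in memory."""
        for key, value in records:
            current = self.cache.get(key, self.initial)
            self.cache[key] = self.combine(current, value)

    def process(self, record: Record) -> float:
        """Merge one stream record into the cached result for its key."""
        key, value = record
        updated = self.combine(self.cache.get(key, self.initial), value)
        self.cache[key] = updated
        return updated


def demo() -> List[float]:
    reducer = CachedReducer(combine=lambda a, b: a + b)
    reducer.load_history([("s1", 2.0), ("s2", 5.0), ("s1", 3.0)])
    stream = [("s1", 1.0), ("s3", 4.0), ("s2", 0.5)]
    return [reducer.process(r) for r in stream]


if __name__ == "__main__":
    print(demo())
```

Each call to `process` touches only the cached value for one key, so the per-record cost does not depend on the size of the historical data. A distributed version would additionally need to partition the cache across nodes, which this sketch does not attempt.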

CLC number: TP393

QI Kai-yuan, HAN Yan-bo, ZHAO Zhuo-feng, MA Qiang. Real-time data stream processing and key techniques oriented to large-scale sensor data[J]. .
