基于多层卷积神经网络特征和双向长短时记忆单元的行为识别
Action recognition with hierarchical convolutional neural networks features and bi-directional long short-term memory model
Received: 2016-08-12    Revised: 2017-05-26
DOI: 10.7641/CTA.2017.60607
  2017,34(6):790-796
Keywords: action recognition; convolutional neural networks; recurrent neural networks; bi-directional recurrent neural networks
Funding: National Natural Science Foundation of China; Provincial Natural Science Foundation
Authors (affiliation, e-mail):
GE Rui, Soochow University, forgerui@163.com
WANG Zhaohui, Soochow University
XU Xin, Soochow University
JI Yi, Soochow University
LIU Chunping, Soochow University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; Collaborative Innovation Center of Novel Software Technology and Industrialization
GONG Shengrong, Changshu Institute of Technology; Soochow University, shrgong@suda.edu.cn
Abstract (Chinese, translated)
      Robust action recognition in video is a highly challenging task because of its complexity, and effectively extracting robust spatio-temporal features is the key to solving it. In this paper, we propose using a bi-directional long short-term memory (Bi-LSTM) model as the main framework to capture the bi-directional spatio-temporal features of video sequences. First, to strengthen the feature representation, traditional hand-crafted features are replaced with hierarchical convolutional neural network features. These multi-layer convolutional features fuse low-level shape information with high-level semantic information and thus capture rich spatial information. The extracted convolutional features are then fed into the Bi-LSTM, which contains two LSTM layers of opposite directions: the forward layer captures the video's evolution from beginning to end, while the backward layer models it in the reverse direction. Finally, the evolution representations from the two directions are fused in a Softmax layer to obtain the classification result. Experimental results on the UCF101 and HMDB51 datasets show that the proposed method achieves good performance on action recognition.
Abstract (English)
      Robust action recognition in videos is a challenging task due to its complexity, so effectively capturing robust spatio-temporal features is essential. In this paper, we propose a bi-directional long short-term memory (Bi-LSTM) model as the main framework to capture bi-directional spatio-temporal features. First, to boost the feature representation, traditional hand-crafted descriptors are replaced by hierarchical convolutional neural network features. The features from multiple convolutional layers fuse low-level basic shapes with high-level semantic content to yield powerful spatial features. The extracted convolutional features are then fed into the Bi-LSTM, which has two LSTM layers of opposite directions: the forward layer captures the evolution of the video from beginning to end, and the backward layer models the evolution in the opposite direction. The two directional evolution representations are then fused in a Softmax layer to obtain the final classification result. Experiments on the UCF101 and HMDB51 datasets show that our method achieves performance comparable to state-of-the-art action recognition methods.
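Both abstracts describe the same pipeline: per-frame convolutional features are processed by a forward LSTM (front-to-back) and a backward LSTM (back-to-front), the two resulting representations are fused, and a Softmax layer produces the class probabilities. As a rough illustration only, not the authors' implementation, the flow can be sketched in NumPy; every size, weight, and the random "features" array are stand-ins for real per-frame CNN features:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell: the four gates are computed from [h, x]
    with one stacked weight matrix (toy initialization)."""
    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        scale = 1.0 / np.sqrt(hidden_size)
        # Rows: input, forget, cell-candidate, output gates stacked.
        self.W = rng.uniform(-scale, scale,
                             (4 * hidden_size, input_size + hidden_size))
        self.b = np.zeros(4 * hidden_size)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([h, x]) + self.b
        H = self.hidden_size
        i = sigmoid(z[0:H])          # input gate
        f = sigmoid(z[H:2 * H])      # forget gate
        g = np.tanh(z[2 * H:3 * H])  # cell candidate
        o = sigmoid(z[3 * H:4 * H])  # output gate
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new

def bilstm_classify(features, fwd, bwd, W_out, b_out):
    """features: (T, D) array of per-frame CNN features (hypothetical input).
    Runs one LSTM front-to-back and one back-to-front, concatenates the
    two final hidden states, and applies a softmax classifier."""
    H = fwd.hidden_size
    h_f, c_f = np.zeros(H), np.zeros(H)
    for x in features:               # forward layer: front-to-back evolution
        h_f, c_f = fwd.step(x, h_f, c_f)
    h_b, c_b = np.zeros(H), np.zeros(H)
    for x in features[::-1]:         # backward layer: back-to-front evolution
        h_b, c_b = bwd.step(x, h_b, c_b)
    logits = W_out @ np.concatenate([h_f, h_b]) + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()               # class probabilities

rng = np.random.default_rng(0)
T, D, H, C = 16, 32, 24, 5           # frames, feature dim, hidden size, classes
fwd, bwd = LSTMCell(D, H, rng), LSTMCell(D, H, rng)
W_out = rng.uniform(-0.1, 0.1, (C, 2 * H))
b_out = np.zeros(C)
probs = bilstm_classify(rng.standard_normal((T, D)), fwd, bwd, W_out, b_out)
```

In a real system the two LSTMs would be trained jointly with the classifier, and the input at each time step would be the fused multi-layer CNN feature of that frame rather than random noise.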