Video Feature Learning Method Based on Dynamic Temporal Shift - Computer Technology and Development

Article Info

Title:: Video Feature Learning Method Based on Dynamic Temporal Shift

Article ID:: 1673-629X(2022)12-0043-07

Author(s):: TAN Wei-feng (谈伟峰); CHENG Chun-ling (程春玲); MAO Yi (毛毅); School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China

Keywords:: video action recognition; fully connected neural network; temporal feature learning; dynamic temporal shift; global spatiotemporal feature learning

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2022.12.007

Abstract:: Video action recognition aims to classify the actions in different video clips. Because an action extends continuously over the whole time dimension of a clip, learning the temporal features of continuous actions is an important direction in video action recognition. Existing methods mainly learn temporal features by adding more convolution operations, which captures temporal information but increases model complexity and computational cost. The temporal shift operation instead models temporal information by shifting channel features along the time dimension, which reduces computation; however, it only considers temporal feature learning on low-level channels, lacks a basis for channel selection, and ignores the influence of temporal shift on the overall spatiotemporal feature structure. To address this, a video feature learning method based on Dynamic Temporal Shift (DTS) is proposed. First, a two-layer fully connected neural network learns the correlation between features at multiple time steps on channels of different levels to obtain the attention distribution over all channels, and the parameters of this network are then fixed to preserve global feature information. Next, a DTS module is designed that dynamically selects channels for shifting according to the channel attention distribution. In addition, to eliminate the influence of shifting features along the time dimension on the global spatiotemporal feature structure, the global information is used to further learn global spatiotemporal features. The method achieves good recognition results on the public UCF101 and Something-something v2 datasets, which verifies its effectiveness.
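The two steps named in the abstract (channel attention from a two-layer fully connected network, then attention-guided channel shifting) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the code assumes temporal average pooling before the fully connected layers, a ReLU and sigmoid as activations, selection of the highest-scoring channels, an alternating forward/backward split, and zero padding at the clip boundaries. The function names `channel_attention` and `dynamic_temporal_shift` and all parameters are hypothetical. Spatial dimensions are omitted, so each clip is a plain `T x C` list (frames by channels).

```python
import math


def channel_attention(x, w1, w2):
    """Score each channel with a two-layer fully connected network.

    x:  T x C nested list (frames x channels), spatial dims omitted.
    w1: H x C weights of the first layer (assumed shape).
    w2: C x H weights of the second layer (assumed shape).
    Returns C scores in (0, 1).
    """
    n_frames, n_channels = len(x), len(x[0])
    # Assumption: summarise each channel by its mean over time.
    pooled = [sum(x[t][c] for t in range(n_frames)) / n_frames
              for c in range(n_channels)]
    # First layer followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second layer followed by a sigmoid, one score per channel.
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


def dynamic_temporal_shift(x, scores, n_shift):
    """Shift the n_shift highest-scoring channels along the time axis.

    Assumption: selected channels alternate between a forward shift
    (frame t receives frame t-1) and a backward shift (frame t receives
    frame t+1), with zeros filling the vacated boundary frame.
    """
    n_frames, n_channels = len(x), len(x[0])
    ranked = sorted(range(n_channels), key=lambda c: scores[c], reverse=True)
    selected = ranked[:n_shift]
    forward, backward = selected[0::2], selected[1::2]

    out = [row[:] for row in x]  # unselected channels stay in place
    for c in forward:
        for t in range(n_frames):
            out[t][c] = x[t - 1][c] if t > 0 else 0.0
    for c in backward:
        for t in range(n_frames):
            out[t][c] = x[t + 1][c] if t < n_frames - 1 else 0.0
    return out


# Toy clip: 3 frames, 4 channels, with hand-picked illustrative weights.
clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
w1 = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.1, 0.0]]
w2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
scores = channel_attention(clip, w1, w2)
shifted = dynamic_temporal_shift(clip, scores, n_shift=2)
print(scores)
print(shifted)
```

In contrast to a fixed shift, which always moves the same channel indices, the set of shifted channels here depends on the input clip through `scores`. That dependence is the property the abstract refers to as dynamic channel selection.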

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
