Order-aware Convolutional Pooling for Video Based Action Recognition

2016-01-31
Most video-based action recognition approaches create the video-level representation by temporally pooling the features extracted at each frame. The pooling methods they adopt, however, usually neglect, completely or partially, the dynamic information contained in the temporal domain. This may undermine the discriminative power of the resulting video representation, since the order of the video sequence can reveal the evolution of a specific event or action. To overcome this drawback and explore the importance of temporal order information, in this paper we propose a novel temporal pooling approach to aggregate the frame-level features. Inspired by the capacity of Convolutional Neural Networks (CNNs) to exploit the internal structure of images for information abstraction, we propose to apply the temporal convolution operation to the frame-level representations to extract the dynamic information. However, directly implementing this idea on the original high-dimensional features would inevitably result in parameter explosion. To tackle this problem, we view the temporal evolution of the feature value at each feature dimension as a 1D signal and learn a unique convolutional filter bank for each of these 1D signals. We conduct experiments on two challenging video-based action recognition datasets, HMDB51 and UCF101, and demonstrate that the proposed method is superior to conventional pooling methods.
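The per-dimension idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `order_aware_pool`, the array shapes, the ReLU nonlinearity, and the final max over time are all assumptions made for illustration, and the filters here are fixed by hand rather than learned. Each of the D feature dimensions is treated as a 1D signal over T frames and filtered with its own bank of K filters of length L.

```python
import numpy as np

def order_aware_pool(frames, filters):
    """Pool frame-level features with one 1D filter bank per feature dimension.

    frames:  (T, D) array, one D-dimensional feature vector per frame.
    filters: (D, K, L) array, K temporal filters of length L for each dimension
             (hypothetical layout; in the paper the filters are learned).
    Returns a (D * K,) video-level vector.
    """
    T, D = frames.shape
    D_f, K, L = filters.shape
    if D_f != D:
        raise ValueError("filters must provide one bank per feature dimension")
    if T < L:
        raise ValueError("video is shorter than the temporal filter length")

    # Sliding temporal windows for every dimension: shape (T - L + 1, D, L).
    windows = np.lib.stride_tricks.sliding_window_view(frames, L, axis=0)

    # Cross-correlate each dimension's signal with its own K filters
    # (no kernel flip, as is usual in CNNs): shape (T - L + 1, D, K).
    responses = np.einsum("tdl,dkl->tdk", windows, filters)

    # Assumed nonlinearity and aggregation: ReLU, then max over time.
    responses = np.maximum(responses, 0.0)
    return responses.max(axis=0).reshape(-1)

# A one-dimensional "video" whose feature rises over four frames.
ramp = np.array([[0.0], [1.0], [2.0], [3.0]])
# One dimension, one filter of length 2 that responds to increases.
rise_filter = np.array([[[-1.0, 1.0]]])

forward = order_aware_pool(ramp, rise_filter)         # array([1.])
backward = order_aware_pool(ramp[::-1], rise_filter)  # array([0.])
```

Reversing the frame order changes the pooled output, whereas mean or max pooling over frames gives the same result for both orders. Note also that a temporal filter spanning all D dimensions would need D × L weights, while each per-dimension filter needs only L, which is the saving the abstract alludes to.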

Related Articles

2014-12-02

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensi…

2017-05-08

Deep convolutional networks have achieved great success for image recognition. However, for action r…

2016-12-02

We introduce the concept of "dynamic image", a novel compact representation of videos useful for vid…

2014-06-09
1406.2199 | cs.CV

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for …

2017-11-30

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study t…

2016-11-21
1611.06678 | cs.CV

The CNN-encoding of features from entire videos for the representation of human actions has rarely b…

2015-05-19
1505.04868 | cs.CV

Visual features are of vital importance for human action understanding in videos. This paper present…

2017-11-11
1711.04161 | cs.CV

From the frame/clip-level feature learning to the video-level representation building, deep learning…

2017-08-24
1708.07335 | cs.CV

Frame-level visual features are generally aggregated in time with the techniques such as LSTM, Fishe…

2017-11-27

Recently, substantial research effort has focused on how to apply CNNs or RNNs to better extract tem…

2017-11-28
1711.10305 | cs.CV

Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recog…

2018-04-09
1804.03247 | cs.CV

In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activi…

2015-04-07

Classifying videos according to content semantics is an important problem with a wide range of appli…

2018-04-19

Acquiring spatio-temporal states of an action is the most crucial step for action classification. In…

2016-04-15
1604.04494 | cs.CV

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Rec…