Integration of CNN and LSTM Networks for Behavior Feature Recognition: An Analysis

Teh Noranis Mohd Aris (1), Chen Ningning (2)
(1) Department of Computer Science, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
(2) Department of Computer Science, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
How to cite (IJASEIT):
Aris, Teh Noranis Mohd, and Chen Ningning. “Integration of CNN and LSTM Networks for Behavior Feature Recognition: An Analysis”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 5, Oct. 2024, pp. 1793-9, doi:10.18517/ijaseit.14.5.10116.
This study explores an integration model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for behavior feature recognition. First, a straightforward three-dimensional deep CNN was introduced for behavior recognition to capture both static and dynamic characteristics, and its convergence speed was analyzed. Subsequent experiments used the VGG16 CNN model, replacing its fully connected layers with global average pooling. A comparative experiment between the models was then conducted on the MSRC-12 behavior dataset. Because of the complexity of the LSTM, a simpler gated recurrent unit (GRU) model of comparable effectiveness was included in the comparison. The experimental results showed that the GRU-CNN model performed best, outperforming other algorithms reported in the literature on the same dataset. Under the same experimental parameters, the GRU-CNN model converges significantly faster than the LSTM-CNN model and trains more quickly. In addition, the best accuracy was obtained by tuning the dropout rate and the number of epochs: with cross-validation, the GRU-CNN model achieved good results at a hidden-node dropout rate of 0.5. The number of epochs had a negligible impact on the GRU-CNN model, whereas the accuracy of the CNN and LSTM-CNN models increased significantly with more epochs, further validating the effectiveness of the GRU-CNN model. These experiments also indicate that deep-learning-based convolutional neural networks outperform traditional machine learning methods for human behavior recognition. Using depth images instead of conventional images allows better extraction of spatial features, and integration with long short-term memory networks enhances the extraction of temporal features from sequences.
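
As a rough illustration of the pipeline summarized above, the following Keras sketch (not the authors' code) wires a VGG16 backbone, with its fully connected head replaced by global average pooling, to a GRU that models the per-frame feature sequence, followed by the dropout rate of 0.5 reported in the experiments. The sequence length, frame size, GRU width, weight initialization, and the replication of single-channel depth frames to three channels are illustrative assumptions; the 12 output classes correspond to the MSRC-12 gesture categories.

# Minimal GRU-CNN sketch under the assumptions stated above.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

SEQ_LEN, H, W, N_CLASSES = 16, 224, 224, 12  # assumed sequence length and frame size

# Frame encoder: VGG16 without its fully connected layers; global average
# pooling turns each frame into a single 512-dimensional feature vector.
backbone = VGG16(include_top=False, weights=None, input_shape=(H, W, 3))
frame_encoder = models.Sequential([backbone, layers.GlobalAveragePooling2D()])

# Depth frames are assumed to be replicated to three channels to fit VGG16.
inputs = layers.Input(shape=(SEQ_LEN, H, W, 3))
features = layers.TimeDistributed(frame_encoder)(inputs)  # per-frame CNN features
x = layers.GRU(128)(features)                              # temporal modelling over the sequence
x = layers.Dropout(0.5)(x)                                 # dropout rate reported in the study
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

Swapping layers.GRU for layers.LSTM in the same position yields the LSTM-CNN variant, which allows the two recurrent units to be compared under otherwise identical settings.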
