Robust Estimation of Crowd Density Using Vision Transformers

Chuho Yi (1), Jungwon Cho (2)
(1) Department of AI Convergence, Hanyang Women's University, Seoul, Republic of Korea
(2) Department of Computer Education, Jeju National University, Jeju, Republic of Korea
How to cite (IJASEIT):
Yi, Chuho, and Jungwon Cho. "Robust Estimation of Crowd Density Using Vision Transformers". International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 5, Oct. 2024, pp. 1528-33, doi:10.18517/ijaseit.14.5.11267.
Estimating and predicting crowd density at large events or during disasters is of paramount importance for emergency management: it supports planning effective evacuation routes, optimizing rescue operations, and deploying emergency services efficiently. Traditionally, camera-based surveillance systems have been used to monitor crowd movements, but accurately estimating crowd density from such systems presents several challenges. These stem primarily from the combination of dense, overlapping crowds and the limited ability of two-dimensional cameras to capture three-dimensional space; optical distortion, environmental factors, and variation in camera angle further complicate the task. To address these challenges, this paper introduces a robust method for calculating crowd density that leverages vision transformers. By combining the transformer output with a two-stage neural network, the method mitigates the limitations of traditional approaches. A key advantage of the proposed system is its robustness across different camera specifications, installation locations, and image aspect ratios. The method applies and evaluates several deep learning techniques, introducing improvements to existing network structures that make them better suited to the problem. Extensive experimental verification demonstrates that the proposed method consistently produces accurate crowd density estimates, even in diverse and complex crowd environments, underscoring its potential for improving emergency management and crowd control in real-world situations.
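The abstract itself contains no code, but the architecture it outlines, a vision-transformer backbone whose patch features feed a two-stage network that regresses a density map, can be illustrated with a short PyTorch sketch. This is not the authors' implementation: the backbone choice (torchvision's vit_b_16), the layer widths, and the class names are assumptions made purely for illustration.

```python
# Minimal sketch (NOT the paper's published code): a ViT backbone whose patch
# tokens are refined by a two-stage convolutional head into a crowd-density
# map. Backbone, layer sizes, and names are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16


class ViTDensityEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        vit = vit_b_16(weights=None)       # pretrained weights would be loaded in practice
        self.conv_proj = vit.conv_proj     # 16x16 patch embedding -> 768-d tokens
        self.class_token = vit.class_token
        self.encoder = vit.encoder         # transformer encoder blocks
        # Stage 1: map patch tokens to coarse density features.
        self.stage1 = nn.Sequential(
            nn.Conv2d(768, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True)
        )
        # Stage 2: upsample and regress a non-negative per-pixel density map.
        self.stage2 = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                  # x: (B, 3, 224, 224)
        b = x.shape[0]
        feat = self.conv_proj(x)           # (B, 768, 14, 14)
        tokens = feat.flatten(2).transpose(1, 2)             # (B, 196, 768)
        cls = self.class_token.expand(b, -1, -1)
        tokens = self.encoder(torch.cat([cls, tokens], 1))   # (B, 197, 768)
        feat = tokens[:, 1:].transpose(1, 2).reshape(b, 768, 14, 14)
        return self.stage2(self.stage1(feat))                # (B, 1, 56, 56)


model = ViTDensityEstimator()
density = model(torch.randn(1, 3, 224, 224))
# Integrating the density map gives the estimated head count for the image.
print(density.shape, density.sum().item())
```

Summing the predicted density map yields the crowd count, which is the standard readout in density-based counting; the two-stage split here mirrors the coarse-features-then-refinement structure the abstract describes.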

This work is licensed under a Creative Commons Attribution 4.0 International License.