Multi-Modal Deep Learning based Metadata Extensions for Video Clipping

Woo-Hyeon Kim; Geon-Woo Kim; Joo-Chang Kim

doi:10.18517/ijaseit.14.1.19047

Woo-Hyeon Kim ⁽¹⁾, Geon-Woo Kim ⁽²⁾, Joo-Chang Kim ⁽³⁾

(1) Department of Computer Science, Kyonggi University, Suwon 16227, Republic of Korea

(2) Department of Computer Science, Kyonggi University, Suwon 16227, Republic of Korea

(3) Division of AI Computer Science and Engineering, Kyonggi University, Suwon 16227, Republic of Korea

How to cite (IJASEIT) :

Kim, Woo-Hyeon, et al. “Multi-Modal Deep Learning Based Metadata Extensions for Video Clipping”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 1, Feb. 2024, pp. 375-80, doi:10.18517/ijaseit.14.1.19047.

General video search and recommendation systems rely primarily on metadata and personal information. Metadata, which includes file names, keywords, tags, and genres, among others, describes a video's content. A video platform scores the relevance of a user's search query against video metadata and presents results in descending order of relevance. Recommendations are drawn from videos whose metadata is judged similar to that of the video the user is currently watching. Most platforms provide search and recommendation by applying separate algorithms to metadata and to personal information, so metadata plays a vital role in video search. Video service platforms develop various algorithms to give users more accurate search results and recommendations, and quantifying video similarity is essential to improving that accuracy. However, basic metadata is supplied mainly by content producers, so it can be abused. Additionally, the similarity between related video segments may diminish depending on their duration. This paper proposes a metadata expansion model that uses object recognition and Speech-to-Text (STT) technology. The model selects key objects by analyzing how frequently they appear in the video, and it separately extracts the audio and transcribes it into text to obtain the script. Scripts are quantified by tokenizing them into words using text-mining techniques. By augmenting metadata with key objects and script tokens, video content search and recommendation platforms are expected to deliver results closer to user search terms and to recommend related content.
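The expansion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that per-frame object labels have already been produced by some object detector and that a transcript has already been produced by an STT service, and neither is called here. The function names (`select_key_objects`, `tokenize_script`, `extend_metadata`), the stop-word list, the `top_k` and `top_tokens` cut-offs, and the `key_objects` and `script_tokens` field names are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
              "is", "are", "it", "this", "that", "with", "for"}


def select_key_objects(frame_detections, top_k=3):
    """Return the top_k labels ranked by the number of frames they appear in.

    frame_detections: list with one list of label strings per sampled frame,
    assumed to come from an object detector run beforehand.
    """
    counts = Counter()
    for labels in frame_detections:
        # set() counts a label once per frame, however many instances it has.
        counts.update(set(labels))
    return [label for label, _ in counts.most_common(top_k)]


def tokenize_script(transcript):
    """Lowercase the transcript, split it into words, and count non-stop-words."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def extend_metadata(metadata, frame_detections, transcript, top_k=3, top_tokens=5):
    """Return a copy of the producer-supplied metadata with two added fields."""
    extended = dict(metadata)  # leave the original metadata untouched
    extended["key_objects"] = select_key_objects(frame_detections, top_k)
    token_counts = tokenize_script(transcript)
    extended["script_tokens"] = [t for t, _ in token_counts.most_common(top_tokens)]
    return extended
```

Calling `extend_metadata({"title": "Park clip"}, frames, transcript)` returns a new dictionary that keeps the original fields and adds the two derived ones, so a search or recommendation component could match queries against producer-supplied tags and content-derived terms alike. Counting each label once per frame is one possible reading of "frequency of appearance"; counting every detected instance would be an equally plausible alternative.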

This work is licensed under a Creative Commons Attribution 4.0 International License.
