Cite Article
Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
BibTeX
@article{IJASEIT10237,
  author = {Ruhaila Maskat and Nurazzah Abdul Rahman},
  title = {Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words},
  journal = {International Journal on Advanced Science, Engineering and Information Technology},
  volume = {10},
  number = {4},
  year = {2020},
  pages = {1380--1386},
  keywords = {text analytics; social media; data pre-processing; normalization; malay language.},
  abstract = {As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.},
  issn = {2088-5334},
  publisher = {INSIGHT - Indonesian Society for Knowledge and Human Development},
  url = {http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237},
  doi = {10.18517/ijaseit.10.4.10237}
}
EndNote
%A Maskat, Ruhaila
%A Abdul Rahman, Nurazzah
%D 2020
%T Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
%B 2020
%9 text analytics; social media; data pre-processing; normalization; malay language.
%! Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
%K text analytics; social media; data pre-processing; normalization; malay language.
%X As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.
%U http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237
%R doi:10.18517/ijaseit.10.4.10237
%J International Journal on Advanced Science, Engineering and Information Technology
%V 10
%N 4
%@ 2088-5334
IEEE
Ruhaila Maskat and Nurazzah Abdul Rahman, "Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words," International Journal on Advanced Science, Engineering and Information Technology, vol. 10, no. 4, pp. 1380-1386, 2020. [Online]. Available: http://dx.doi.org/10.18517/ijaseit.10.4.10237.
RefMan/ProCite (RIS)
TY  - JOUR
AU  - Maskat, Ruhaila
AU  - Abdul Rahman, Nurazzah
PY  - 2020
TI  - Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
JF  - International Journal on Advanced Science, Engineering and Information Technology; Vol. 10 (2020) No. 4
Y2  - 2020
SP  - 1380
EP  - 1386
SN  - 2088-5334
PB  - INSIGHT - Indonesian Society for Knowledge and Human Development
KW  - text analytics; social media; data pre-processing; normalization; malay language.
N2  - As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.
UR  - http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237
DO  - 10.18517/ijaseit.10.4.10237
RefWorks
RT Journal Article
ID 10237
A1 Maskat, Ruhaila
A1 Abdul Rahman, Nurazzah
T1 Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
JF International Journal on Advanced Science, Engineering and Information Technology
VO 10
IS 4
YR 2020
SP 1380
OP 1386
SN 2088-5334
PB INSIGHT - Indonesian Society for Knowledge and Human Development
K1 text analytics; social media; data pre-processing; normalization; malay language.
AB As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.
LK http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237
DO 10.18517/ijaseit.10.4.10237