Cite Article

Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words

BibTeX

@article{IJASEIT10237,
   author = {Ruhaila Maskat and Nurazzah Abdul Rahman},
   title = {Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words},
   journal = {International Journal on Advanced Science, Engineering and Information Technology},
   volume = {10},
   number = {4},
   year = {2020},
   pages = {1380--1386},
   keywords = {text analytics; social media; data pre-processing; normalization; Malay language},
   abstract = {

As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.

},
   issn = {2088-5334},
   publisher = {INSIGHT - Indonesian Society for Knowledge and Human Development},
   url = {http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237},
   doi = {10.18517/ijaseit.10.4.10237}
}

EndNote

%A Maskat, Ruhaila
%A Abdul Rahman, Nurazzah
%D 2020
%T Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
%B 2020
%9 text analytics; social media; data pre-processing; normalization; malay language.
%! Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
%K text analytics; social media; data pre-processing; normalization; Malay language
%X 

As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.

%U http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237
%R doi:10.18517/ijaseit.10.4.10237
%J International Journal on Advanced Science, Engineering and Information Technology
%V 10
%N 4
%@ 2088-5334

IEEE

Ruhaila Maskat and Nurazzah Abdul Rahman, "Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words," International Journal on Advanced Science, Engineering and Information Technology, vol. 10, no. 4, pp. 1380-1386, 2020. [Online]. Available: http://dx.doi.org/10.18517/ijaseit.10.4.10237.

RefMan/ProCite (RIS)

TY  - JOUR
AU  - Maskat, Ruhaila
AU  - Abdul Rahman, Nurazzah
PY  - 2020
TI  - Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
JF  - International Journal on Advanced Science, Engineering and Information Technology
VL  - 10
IS  - 4
Y2  - 2020
SP  - 1380
EP  - 1386
SN  - 2088-5334
PB  - INSIGHT - Indonesian Society for Knowledge and Human Development
KW  - text analytics; social media; data pre-processing; normalization; Malay language
N2  - 

As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.

UR  - http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237
DO  - 10.18517/ijaseit.10.4.10237

RefWorks

RT Journal Article
ID 10237
A1 Maskat, Ruhaila
A1 Abdul Rahman, Nurazzah
T1 Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words
JF International Journal on Advanced Science, Engineering and Information Technology
VO 10
IS 4
YR 2020
SP 1380
OP 1386
SN 2088-5334
PB INSIGHT - Indonesian Society for Knowledge and Human Development
K1 text analytics; social media; data pre-processing; normalization; Malay language
AB 

As more data are being introduced, it brings along with it missing values, inconsistencies, and heterogeneities, or so-called unclean aspects. Text analytics relies on clean data to produce reliable results. Pre-processing is an essential phase in text analytics, specifically language detection and normalization. The problem with conducting text analytics on Malay social media text is how substantially it has transformed from formal Malay in terms of spelling and construction, making it difficult to process them. Recent advances have shown works to normalize yet cherry-picked specific types of Malay social media text where their descriptions were listed in simple and narrow categorizations. A formal categorization is necessary to provide significant description of the different patterns of Malay social media text, allowing the selection of suitable methods in handling them. In this paper, we propose an inexhaustive formal categorization for Malay social media text based on inherent nature. We refer to them as Social Media Malay Language (SMML) to differentiate them from the standard Malay language. They are spelling variations, Malay-English mix sentences, loan words/phrases, slang-based words, and vowel-less words. Also, in this work, we conducted a normalization on two of the SMML categories, spelling variations, and vowel-less words, using two similarity matching techniques (i.e., nGram Tversky Index and Levenshtein). Our result shows that similarity-matching techniques can detect both categories, but a more sophisticated technique is necessary to improve the precision score. The normalization of the rest of the categories is extensive research works.

LK http://ijaseit.insightsociety.org/index.php?option=com_content&view=article&id=9&Itemid=1&article_id=10237
DO 10.18517/ijaseit.10.4.10237