Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM
Over the past few years, Twitter has experienced massive growth and the volume of its online content has increased rapidly. This content has been a rich source for several studies that focused on natural language processing (NLP) research. However, Twitter data pose numerous challenges and obstacles to NLP tasks. For the English language, Twitter has an NLP tool that provides tweet-specific NLP tasks, which present significant opportunities for English NLP research and applications. Part-of-speech (POS) tagging for English tweets is one of the tasks that is offered and facilitated by such a tool. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, which is one of the NLP tasks that directly affects the performance of other subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the ‘Mixed,’ ‘MSA,’ and ‘GLF’ datasets with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of the behavior of using hashtags in Arabic tweets, which is a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers that are developed based on two approaches: Conditional Random Fields and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves the state-of-the-art results for the ‘Mixed’ dataset with 96.5% accuracy. However, the specific-dialect taggers trained on the ‘MSA’ and ‘GLF’ datasets achieve an accuracy of 95.6% and 95%, respectively. The results for the ‘Mixed’ dataset indicate the effectiveness of developing a joint POS tagger without the need for a dialect-specific POS tagger.
The field of natural language processing (NLP) has witnessed a boom in language representation models with the introduction of pretrained language models that are trained on massive textual data…
Approaches for developing Dialogue Systems (DSs) are typically categorized into rule-based and data-driven. Data-driven DSs require a massive quantity of training data, while rule-based DSs rely…
Over the past few years, Twitter has experienced massive growth and the volume of its online content has increased rapidly. This content has been a rich source for several studies that focused on…