An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus

Hebah Abdulaziz ElGibreen

Associate Professor

Director of the university's Center for Advanced Studies in Artificial Intelligence (Thakaa)

Computer and Information Sciences

Information Technology Department | Female Students' Campus, Office 74, Building 6, Third Floor

Due to rapid developments in technology and the sudden expansion of social media use, Dialectal Arabic has become an important source of data that needs to be addressed when building Arabic corpora. In this paper, thirty-three Arabic corpora are surveyed to show that, despite all of the developments in the literature, Saudi dialect (SD) corpora still need further expansion. This paper contributes to the literature on SD corpora by creating the largest Saudi corpus, the King Saud University Saudi Corpus (KSUSC), with more than 1B total words, including more than 119M SD words. The KSUSC is not only the newest and largest SD corpus but is also diverse, covering 26 domains in text collected from five different sources. This paper also contributes to the literature by developing a new incremental preprocessing system that is used to create relevant lexicons, which are then used to clean and normalize the collected data. This incremental system is scalable and can be adapted for different resources and dialects. Moreover, the collection process for building the KSUSC is discussed in detail, and the challenges in collecting SD text with respect to each platform are highlighted. Finally, different design criteria are proposed and applied to the KSUSC, leading to the conclusion that the resulting corpus can be of great benefit to researchers who are interested in integrating the corpus with their own work or using its resulting lexicons in Saudi-based NLP tasks.

Publisher name

IEEE Access

Volume number

Pages

88405-88428

More publications

Reducing Children’s Obesity in the Age of Telehealth and AI/IoT Technologies in Gulf Countries

Childhood obesity has become one of the major health issues in the global population. The increasing prevalence of childhood obesity is associated with serious health issues and comorbidities…

By M. Faisal, H. Elgibreen, N. Alafif, C. Joumaa

2022

Telepresence Robot System for People with Speech or Mobility Disabilities

Due to an increase in the number of disabled people around the world, inclusive solutions are becoming a priority. People with disabilities may encounter many problems and may not be able to…

By Hebah ElGibreen, Ghada Al Ali, Rawan AlMegren, Reema AlEid, Samar AlQahtani

2022

Published in:

Sensors

A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions

A number of AI-generated tools are used today to clone human voices, leading to a new technology known as Audio Deepfakes (ADs). Despite being introduced to enhance human lives as audiobooks, ADs…

By Zaynab Almutairi, Hebah Elgibreen

2022