Transformer-based automatic Arabic text diacritization

Authors

  • Ali Assad, Wasit University, Iraq
  • Abdul Hadi M. Alaidi, Wasit University, Iraq
  • Amjad Yousif Sahib, Wasit University, Iraq
  • Haider TH. Salim ALRikabi, Wasit University, Iraq
  • Ahmed Magdy, Suez Canal University, Egypt
  • Ahmed Magdy Suez Canal University, Egypt

DOI:

https://doi.org/10.37868/sei.v6i2.id305

Abstract

In Arabic natural language processing (NLP), automatic text diacritization remains a major challenge, and progress has been slow compared to other language processing tasks. This work proposes automatic diacritical marking of Arabic text using the first transformer-based model designed solely for this task. By exploiting the attention mechanism, the system captures more of the innate patterns of Arabic, surpassing both rule-based alternatives and neural network techniques. The model trained on the Clean-50 dataset achieved a diacritic error rate (DER) of 2.03%, while the model trained on the Clean-400 dataset achieved a DER of 1.37%. Compared to state-of-the-art results, the improvement on Clean-50 is minimal; on the larger Clean-400 dataset, however, it is notable, indicating that this approach can deliver more accurate solutions for applications requiring precise diacritical marks when larger training sets are available. Furthermore, when given extended input text with overlapping windows, this method achieves a DER of 1.21% on the Clean-400 dataset.
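The overlapping-window idea mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the window and overlap sizes are hypothetical, and the merge rule (drop the first `overlap` predictions of every window after the first) is one common choice for stitching windowed sequence predictions back together.

```python
def overlapping_windows(chars, window=400, overlap=50):
    """Split a long character sequence into overlapping windows.

    Hypothetical sizes; the paper's actual window/overlap values are not
    given in the abstract. Each window shares `overlap` characters with
    its predecessor so predictions near a window edge still see context.
    """
    step = window - overlap
    windows = []
    start = 0
    while start < len(chars):
        windows.append(chars[start:start + window])
        if start + window >= len(chars):
            break
        start += step
    return windows


def merge_predictions(window_preds, overlap=50):
    """Stitch per-window diacritic predictions into one sequence.

    Keeps the first window whole, then discards the first `overlap`
    (context-only) predictions of every subsequent window.
    """
    merged = list(window_preds[0])
    for preds in window_preds[1:]:
        merged.extend(preds[overlap:])
    return merged
```

For example, with `window=4` and `overlap=2`, a 10-character input yields windows covering positions 0-3, 2-5, 4-7, and 6-9, and merging the (hypothetical) per-window predictions reconstructs one label per input character.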

Published

2024-11-29

How to Cite

[1]
A. Assad, A. H. M. Alaidi, A. Y. Sahib, H. T. S. ALRikabi, and A. Magdy, “Transformer-based automatic Arabic text diacritization”, Sustainable Engineering and Innovation, vol. 6, no. 2, pp. 285-296, Nov. 2024.

Section

Articles