Topic Identification of Noisy Texts: Statistical Approaches


This paper addresses the problem of automatic topic identification of noisy Arabic texts. Several existing works in this field rely on statistical and machine learning approaches for different text categories. Unfortunately, most of the proposed approaches are suited to clean, long texts. In this investigation, we carried out a comparative study of two different statistical approaches based on tf-idf. Different configurations were used in both approaches to provide a broad comparison. Furthermore, an in-house corpus called ANTSIX, which contains discussion forum texts related to six different topics, was created to evaluate the proposed approaches. Experimental results show that the two statistical approaches are suitable for topic identification of noisy Arabic texts, but each technique has advantages and drawbacks.

International Journal of Hidden Data Mining and Scientific Knowledge Discovery