Toward Accurate Social Media Language Identification: Combining Language Features with a Graphical Approach


Language identification (LID), the task of automatically recognizing the natural language in which a text is written, is a fundamental step in many NLP applications. It faces three major issues: noisy and short texts, multilingual texts, and similar languages. In this work, we address the benchmark problem of LID for Twitter messages and present an effective approach (HAG) that deals with all three difficulties. HAG combines a graphical approach with a statistical approach and, unlike existing LID approaches, does not require a large training set. Furthermore, it can be trained on any source of text, including noisy and short texts. The empirical evaluation was conducted on a Twitter corpus of tweets written in 31 languages, while our proposal was trained on a small corpus of discussion forum messages. In addition, we compared our proposal with state-of-the-art LID tools, including LangDetect, and the results show that the HAG approach maintains consistently high performance (accuracy above 87%).

The 3rd International Conference on Pattern Analysis and Intelligent Systems