A recent overview of the state-of-the-art elements of text classification

1. Introduction

Text classification is the problem of constructing models that can classify new documents into pre-defined classes (Liu, 2006, Manning, Raghavan, Schütze, 2008). Currently, it is a sophisticated process involving not only the training of models, but also numerous additional procedures, e.g. data pre-processing, transformation, and dimensionality reduction. Text classification remains a prominent research topic, utilising various techniques and their combinations in complex systems. Researchers are either developing new classification systems or improving existing ones and their elements to yield better results, i.e. higher computational efficiency (Altınel, Ganiz, Diri, 2015, Pinheiro, Cavalcanti, Tsang, 2017, Wang, Wu, Zhang, Xu, Lin, 2013).
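As a minimal illustrative sketch, the phases named above (pre-processing, transformation, dimensionality reduction, and training) can be chained as follows. The sketch is not taken from any of the reviewed studies: the toy corpus, the function names, the document-frequency threshold, and the choice of a multinomial naive Bayes learner are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def preprocess(text):
    """Pre-processing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def select_vocabulary(token_lists, min_df=2):
    """Dimensionality reduction: keep terms found in at least min_df documents."""
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term for term, count in doc_freq.items() if count >= min_df}


def transform(tokens, vocabulary):
    """Transformation: bag-of-words counts restricted to the vocabulary."""
    return Counter(t for t in tokens if t in vocabulary)


def train(vectors, labels, vocabulary):
    """Training: multinomial naive Bayes with add-one smoothing."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        term_counts[label].update(vec)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values()) + len(vocabulary)
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            t: math.log((term_counts[label][t] + 1) / total) for t in vocabulary
        }
        model[label] = (log_prior, log_likelihood)
    return model


def classify(text, model, vocabulary):
    """Classification: assign a new document to the highest-scoring class."""
    vec = transform(preprocess(text), vocabulary)
    scores = {
        label: log_prior + sum(n * log_likelihood[t] for t, n in vec.items())
        for label, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus with two pre-defined classes.
corpus = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday project meeting notes", "ham"),
]
token_lists = [preprocess(text) for text, _ in corpus]
labels = [label for _, label in corpus]
vocabulary = select_vocabulary(token_lists, min_df=2)
vectors = [transform(tokens, vocabulary) for tokens in token_lists]
model = train(vectors, labels, vocabulary)

print(classify("buy cheap pills", model, vocabulary))
print(classify("notes from the monday meeting", model, vocabulary))
```

In this sketch each phase is a separate function, so any one of them (for example, the term-selection rule or the learner) could be swapped without touching the others.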

Literature overviews of text classification usually reveal its crucial elements, techniques, and solutions, and propose directions for the further development of this research area. The existing reviews remain useful, as they address the significant problems of text classification (Aas, Eikvil, 1999, Aggarwal, Zhai, 2012). However, these works are somewhat outdated, as they do not include the latest studies. Furthermore, their explanations of text classification have some limitations: for example, they emphasise only machine learning techniques or algorithms, omit some essential elements of text classification, or focus on a particular research domain (Adeva, Atxa, Carrillo, Zengotitabengoa, 2014, Guzella, Caminhas, 2009, Ittoo, Nguyen, van den Bosch, 2016). We reiterate that these are excellent works that are still useful to the research community. However, with the increasing interest in text classification, an up-to-date systematic overview is needed to better understand what has been achieved in this field.

In this study, we aim to overcome the limitations mentioned above by presenting an up-to-date and holistic summary of text classification. We devote significant effort to generating a research map for text classification that assists in recognising its main elements, examining both the most recent and earlier studies. More specifically, in addition to the understandable need to complement the existing reviews, the objectives of this study are as follows:

1. To extract and present the essential phases of the text classification process, including the most common vocabulary, as a baseline framework. This framework could be referred to as the map of text classification.
2. To enumerate both the older and newer techniques utilised in each phase of text classification. These techniques are identified systematically via a qualitative analysis.
3. To perform a quantitative analysis of the system to exhibit the research trends in this area.

To the best of our knowledge, there are no similar recent studies in the form of an overview of the investigated field. Furthermore, we believe that this study significantly systematises and enhances the knowledge regarding the modelling of classification systems. The results concerning the text classification process and its elements are particularly relevant. Moreover, we show that it is possible to identify, explore, and develop new aspects of text classification or, alternatively, to upgrade its existing components. In addition, our study constitutes a relevant and modern complement to the current reviews.

The paper is structured as follows. Section 2 presents a comprehensive description of the existing reviews. Section 3 describes the text classification process and explains the review procedure. Section 4 explains the problems, objective, and components of text classification via a qualitative analysis. Section 5 presents a quantitative analysis of the journals and conference proceedings on text classification. Finally, Section 6 concludes the study.

Fig. 1. Flowchart of the text classification process with the state-of-the-art elements.

Fig. 4. Different types of dimensionality reduction.

Fig. 5. Basic steps of model evaluation.

Fig. 6. Distribution of the articles over selected topics, i.e. classification systems and application areas, document representation, feature construction, feature weight, feature selection, feature projection, instance selection, learning methods, and evaluation methods. The research sample contains 233 articles.
