International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 175 - Number 32
Year of Publication: 2020
Authors: Sri Winiarti, Ulaya Ahdiani, Romakh Fitriani
DOI: 10.5120/ijca2020920879
Sri Winiarti, Ulaya Ahdiani, Romakh Fitriani. Scanning of Thesis Script Similarity with Vector Space Model. International Journal of Computer Applications. 175, 32 (Nov 2020), 38-46. DOI=10.5120/ijca2020920879
The rapid growth of online textual data has increased the need for information retrieval (IR) methods that are time efficient. Text classification is the process of finding the category of a document based on its content. However, few studies discuss text classification using cascading texts. In general, text classification uses the Vector Space Model (VSM) proposed by Salton, Wong, and Yang (1975) as a model for representing documents and queries. One limitation of VSM is its space requirement, because each document must be represented using all the words in the dictionary (i.e. the vocabulary). With the convenience that search engines give users searching for information online, the internet has become the dominant center of data and information. Students are no exception: their level of internet use for finding references is very high. Statistics released by the Indonesian Internet Service Providers Association (APJII) in 2019 stated that Indonesian internet users had reached 171.17 million people, or around 64.8%. Young people aged 15-34 years account for the largest share of users, at 49.52%. Because of the large number of scientific publications published each year, similarities between publications are possible. The highest level of similarity in thesis texts is found in the title and the theoretical study. In searching for references for theoretical studies, students tend to plagiarize scientific work by copying part or even all of its content without mentioning the original source. Therefore, this research aims to create a document similarity detection system using the Vector Space Model (VSM) method. The data sets used to detect similarities were 443 undergraduate thesis titles and 442 theoretical studies from thesis texts. In the accuracy test carried out on 132 queries from 321 thesis texts, a mean average precision of 0.996 was obtained.
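The general idea of ranking documents with a vector space model and scoring the ranking can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the term weighting or the similarity measure, so the smoothed TF-IDF weights, the cosine similarity, the whitespace tokenizer, and all function names below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, with no stemming or stop-word removal.
    return text.lower().split()


def build_idf(docs_tokens):
    # Smoothed inverse document frequency over the collection vocabulary.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector: only terms that occur in the text are stored.
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    # Returns (score, document index) pairs, most similar first.
    docs_tokens = [tokenize(d) for d in documents]
    idf = build_idf(docs_tokens)
    vectors = [tfidf_vector(t, idf) for t in docs_tokens]
    q = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


def average_precision(ranked_ids, relevant):
    # Mean of precision@k taken at each rank k where a relevant document appears.
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    # runs: list of (ranked_ids, relevant_set) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0
```

Storing each vector as a dictionary of only the terms that occur in the text is one common way to reduce the space cost noted above, where every document is nominally represented over the full vocabulary.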