Please use this identifier to cite or link to this item: http://dspace.vgtu.lt/handle/1/3965

Title: The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures
Authors: Stefanovič, Pavel; Kurasova, Olga; Štrimaitis, Rokas
Keywords: self-organizing maps; text mining; text similarity measures; n-grams; frequency matrix
Issue Date: 2019
Publisher: MDPI
Citation: Stefanovič, P.; Kurasova, O.; Štrimaitis, R. The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures. Appl. Sci. 2019, 9, 1870.
Series/Report no.: 9;9
Abstract: This paper proposes a word-level n-gram based approach for finding similarity between texts. The approach combines two separate, independent techniques: the self-organizing map (SOM) and text similarity measures. The SOM is distinctive in that it presents the results of data clustering and dimensionality reduction in visual form. Four measures are evaluated: cosine, Dice, extended Jaccard, and overlap. Texts must first be converted to a numerical representation. To do this, each text is split into word-level n-grams, and a bag of n-grams is created. The n-gram frequencies are then calculated to form the frequency matrix of the dataset. Various filters are applied when creating the bag of n-grams: stemming algorithms, number and punctuation removers, stop-word lists, and others. All experiments were carried out on a corpus of plagiarized short answers.
Description: This article belongs to the Special Issue Advances in Deep Learning
URI: http://dspace.vgtu.lt/handle/1/3965
ISSN: 2076-3417
Appears in Collections: Moksliniai straipsniai / Research articles
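
The numerical part of the pipeline described in the abstract (word-level n-grams, a frequency matrix, and the four similarity measures) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the example texts, and the vector forms of the Dice, extended Jaccard, and overlap measures are assumptions chosen for illustration, and the filtering steps (stemming, stop-word removal) and the SOM stage are omitted.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n=2):
    """Split text into lowercase word-level n-grams (assumed whitespace tokenizer)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def frequency_matrix(texts, n=2):
    """Return (vocabulary, rows), with one n-gram frequency row per text."""
    counts = [Counter(word_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*counts))
    rows = [[c[g] for g in vocab] for c in counts]
    return vocab, rows


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# Vector forms of the four measures (assumed definitions, not taken from the paper).
def cosine(x, y):
    d = sqrt(dot(x, x)) * sqrt(dot(y, y))
    return dot(x, y) / d if d else 0.0


def dice(x, y):
    d = dot(x, x) + dot(y, y)
    return 2 * dot(x, y) / d if d else 0.0


def extended_jaccard(x, y):
    d = dot(x, x) + dot(y, y) - dot(x, y)
    return dot(x, y) / d if d else 0.0


def overlap(x, y):
    d = min(dot(x, x), dot(y, y))
    return dot(x, y) / d if d else 0.0


# Hypothetical example texts, not from the corpus used in the paper.
texts = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "dogs bark loudly at night",
]
vocab, rows = frequency_matrix(texts, n=2)

for name, measure in [("cosine", cosine), ("dice", dice),
                      ("extended_jaccard", extended_jaccard), ("overlap", overlap)]:
    print(name, round(measure(rows[0], rows[1]), 3), round(measure(rows[0], rows[2]), 3))
```

The first two texts share four of their five bigrams, so every measure scores them well above the third text, which shares none. In the approach the abstract describes, the same frequency matrix would also serve as input to the SOM.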

Files in This Item:

File: The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures.pdf
Size: 1.76 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
