International Journal of Artificial Intelligence and Applications (IJAIA), Vol.14, No.2, March 2023
DOI: 10.5121/ijaia.2023.14204
A COMPARISON OF DOCUMENT SIMILARITY
ALGORITHMS
Nicholas Gahman and Vinayak Elangovan
Computer Science Program, Division of Science and Engineering,
Penn State University at Abington, Pennsylvania, USA
ABSTRACT
Document similarity is an important part of Natural Language Processing and is most commonly used for
plagiarism detection and text summarization. Thus, finding the overall most effective document similarity
algorithm could have a major positive impact on the field of Natural Language Processing. This report sets
out to examine numerous document similarity algorithms and to determine which ones are the most
useful. It approaches the question by grouping the algorithms into three categories: statistical algorithms,
neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category
are then compared in our work using a series of benchmark datasets and evaluation metrics chosen to
cover the range of tasks in which each algorithm could be used.
KEYWORDS
Natural Language Processing, Document Similarity
1. INTRODUCTION
Document similarity analysis is a Natural Language Processing (NLP) task where two or more
documents are analyzed to recognize the similarities between these documents. Document
similarity is heavily used in text summarization, recommender systems, plagiarism-detection as
well as in search engines. Identifying the level of similarity or dissimilarity between two or more
documents based on their content is the main objective of document similarity analysis.
Commonly used techniques like cosine similarity, Euclidean distance, etc., compare documents' text
features to produce a similarity score between 0 and 1, where 1 indicates complete similarity and 0
indicates no similarity. Although there are numerous algorithms used for document similarity, no
algorithm is universally recognized as the most effective or efficient in a given area.
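As a brief illustration of this kind of scoring (the library calls and example sentences below are ours, added for illustration only, and are not part of the comparison carried out in this paper), the following minimal Python sketch computes cosine similarity between TF-IDF vectors of two short documents.

# Minimal sketch: cosine similarity between two documents using TF-IDF vectors.
# Assumes scikit-learn is installed; the example sentences are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "The cat sat on the mat."
doc_b = "A cat was sitting on a mat."

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([doc_a, doc_b])   # 2 x vocabulary-size matrix

# cosine_similarity on the two row vectors yields a 1x1 matrix; [0, 0] is the score.
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Similarity: {score:.3f}")   # closer to 1.0 means more similar word usage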
This paper seeks to address this gap by categorizing document similarity algorithms into three
types: statistical algorithms, neural networks, and
corpus/knowledge-based algorithms. Each category's algorithms will be closely inspected, and
several algorithms from each category will be judged to be the best of said category. Once the
most effective algorithms of each category are found, they will be run through four different
datasets and three different metrics. Finally, once this data is obtained, it will be analyzed to
determine which of the algorithms are most effective.
This paper is organized as follows. Section 2, Relevant Works, discusses a variety of similar research
work. Section 3, Proposed Methodology, gives a general description of the three
categories of document similarity algorithms and discusses the methodology and the comparison
process in more detail. Section 4, Results, discusses the specifics and
limitations of each algorithm's implementation and presents the results of the comparison
process, followed by conclusions in Section 5.
2. RELEVANT WORKS
Jesus M. Sanchez-Gomez, along with other researchers, compared term-weighting schemes and
similarity measures when used for extractive multi-document text summarization [1]. The result
of this research is that overall cosine similarity appears to be the most effective statistical
similarity measure in this area. Kazuhiro Seki created a text similarity measure that was able to
work across languages [2]. The proposal accomplished this by using two Neural Machine
Translation models to build word embeddings that it can use to compute document similarity.
The NMTs in question generated multiple candidate translations to account for mistranslations.
The results of the translations are matrices that are normalized and transformed, creating
multilingual word embeddings that can be compared using cosine similarity. The proposal has
significant flexibility, as the study explicitly states that it is possible to add other similarity scores
onto this system. This proposal was compared to other multilingual similarity algorithms such as
Doc2Vec, Sec2Vec, and the S2Net Siamese Neural Network, with the result being that it
outperformed all other algorithms. When the proposal’s sentence retrieval was compared to
Google Translate, they had similar results, despite the proposal having a much lower BLEU score
due to its relatively small training data and models.
Emrah Inan proposed a document similarity algorithm that combines the usage of word
embeddings with knowledge-based measures through a semantic lexicon named ConceptNet [3].
Once the sentences are run through ConceptNet, the proposal then produces a vector
representation of the transformed sentences using a breadth first traversal algorithm. Finally, a
soft cosine similarity measure is used to compare the vectorized sentences. The similarity from
the ConceptNet metric and the similarity from the dependency parser model are then combined to
produce the final similarity score. The proposal, named SimiT, is given the Microsoft Research
Paraphrase Corpus (MRPC) as an evaluation task to determine if it can detect and understand
paraphrases. Its result is measured using Pearson correlation and is then compared to several
other similarity algorithms’ results. Among these algorithms are basic cosine similarity,
Word2Vec cosine and soft cosine, and several state-of-the-art methods such as BERT and its
variations. Of the compared algorithms, SimiT performed very well, exceeding the performance
of all but the state-of-the-art methods. Even among the state-of-the-art methods, SimiT was still
useful due to its incredibly low run time.
Table 1. Chosen datasets' tasks and purposes.

MRPC: Testing if the algorithm can detect paraphrases.
AFS: Testing if the algorithm can detect similar arguments.
SICK-R: Testing if the algorithm can see the lexical similarity of sentences.
SICK-E: Testing if the algorithm can see semantic similarities/differences between sentences.
Aminul Islam and Diana Inkpen collaborated on a method to give knowledge-based measures the
ability to determine sentence similarity [4]. Knowledge-based measures are designed primarily
to compare concepts, and while turning a word into a concept is easy, it is much harder to do so
for a sentence. As part of the process, the authors introduce a word-level string similarity method
based on three modifications of the longest common subsequence: the Normalized Longest
Common Subsequence (NLCS), the Normalized Maximal Consecutive Longest Common
Subsequence starting from the first character (NMCLCS1), and the Normalized Maximal
Consecutive Longest Common Subsequence starting from any character (NMCLCSn). The scores
from each of the three modified LCS algorithms are combined, then divided by three, such that
there is an average between 0 and 1.
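Transcribing this description into a formula (the notation here is ours, summarizing the procedure of [4] as described above), the word-level string similarity between two words r and s is the unweighted average

\[ \alpha(r, s) = \frac{\mathrm{NLCS}(r, s) + \mathrm{NMCLCS}_1(r, s) + \mathrm{NMCLCS}_n(r, s)}{3}, \]

which stays between 0 and 1 because each of the three normalized LCS variants does.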
To develop a sentence similarity algorithm that uses knowledge-based measures, the authors first
process the sentences by removing all special characters, punctuation, capital letters, and stop
words. At this point, the lengths of the two sentences are stored for later use. From here, they
then find all of the words in both sentences that match with each other using the knowledge-
based measures or string similarity measures, count the number of words where this is the case,
and remove all matching words from the sentence. Then, all the remaining words are compared
using the knowledge-based measure, and the results are put in a matrix whose length is equal to
the length of the first sentence and width is equal to the length of the second sentence. Another
matrix is created using the same method, with the string similarity method replacing the
knowledge-based measure. The knowledge-based measure matrix is added to the string
similarity matrix to produce a joint matrix. Next, the highest value in the joint matrix is found
and added to a list, after which the column and row the value was in is deleted. This continues
until the highest value is zero or until there are no rows or columns left. The final similarity score
for the two sentences is the number of matching words plus the sum of the values in the list,
multiplied by the combined length of the two sentences without stop words; the result is then
divided by two times the length of the first sentence without stop words times the length of the
second sentence without stop words (see the formula sketched below). Table 2 summarizes the
advantages and disadvantages of different algorithm categories.
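Written out (our own transcription of the description above, where δ is the number of matching words, the ρ_i are the values collected from the joint matrix, and m and n are the stop-word-free lengths of the two sentences), the score is

\[ \mathrm{Sim} = \frac{\left(\delta + \sum_i \rho_i\right)\,(m + n)}{2\,m\,n}. \]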
Table 2. Algorithm categories.

Basic Statistical Techniques
Advantages: Simple, easy to use.
Disadvantages: Does not obtain enough semantic information to make accurate predictions about entailment vs. contradiction.

Neural Networks
Advantages: Incredibly effective, achieves state-of-the-art results.
Disadvantages: Incredibly computationally expensive and memory intensive, difficult to debug.

Knowledge/Corpus-Based Measures
Advantages: Can trivially obtain and process the semantic information needed for accurate predictions.
Disadvantages: Relies heavily on large corpora and semantic networks to work properly; is designed for word/concept similarity and thus difficult to scale.
Yigit Sever and Gonenc Ercan perform a comparison of cross-lingual semantic similarity methods
by building a cross-lingual textual similarity evaluation dataset utilizing seven different
languages [5]. Wordnets organize words into concepts (synsets), and while some concepts are specific
to a given language, others are shared between languages. By linking several of these synsets together, a
foundation is built for the evaluation dataset. The resulting dataset can determine the
effectiveness of both unsupervised and supervised text similarity algorithms, allowing the two
categories to be easily compared. Word embeddings are subject to the hubness problem, where
many of the vector embeddings are similar to many other vector embeddings. To investigate how hubs
such as these can affect the similarity result, the evaluation task is split into two different tasks:
alignment and pseudo-retrieval.
In addition to building the evaluation dataset, the paper also compares several text similarity
models, both unsupervised and supervised. Among the algorithms tested are a machine translation
to monolingual baseline (MT), cosine similarity between sentence embeddings (SEMB), word
mover's distance (WMD), Sinkhorn distance (SNK), and a Siamese long short-term memory network (LSTM).
Concerning retrieval, WMD surprisingly performed the best, followed by SNK. LSTM
performed the third best, but it was noted that it was trained using a limited number of instances,
so it may be possible for it to score higher in this area. Concerning the matching task, SNK
performed the best, followed by WMD, followed by SEMB.
Yuki Arase and Junichi Tsujii developed a method of improving the BERT model through
transfer fine-tuning [6]. Their pretraining method focused on semantic pair modeling,
allowing the proposed method to have significant improvements over the normal BERT model.
Furthermore, in addition to performing better than the baseline BERT model, it is also more cost-
effective, as it focuses on phrase alignments, which can be automatically generated.
As part of the evaluation process, the paper used two different benchmarks: the GLUE
Benchmark, and the PAWS dataset. The GLUE Benchmark used consists of nine different
evaluation datasets whose tasks cover various parts of natural language understanding. The most
relevant task for the paper, however, is Semantic Equivalence, whose subtasks consist of
paraphrase understanding and an understanding of Semantic Text Similarity (STS). The PAWS
dataset, like the MRPC dataset, focuses on paraphrase understanding, but utilizes controlled word
swapping and back translation to determine if an algorithm is sensitive to context and word order.
In the GLUE Benchmark, the proposed method performed better than its BERT
counterpart in all areas except QNLI and SST, while in the PAWS benchmark the proposed
method performed better than BERT in all areas. Furthermore, the evaluation showed that the
proposed method performed better when the fine-tuned training corpus used is smaller, which
would make creating such training corpora much easier.
3. PROPOSED METHODOLOGY
Statistical techniques are the simplest of the three types of document similarity algorithms.
They compare text by first turning the sentences into vectors, and then comparing said vectors.
The most used and most effective way of comparison is through cosine similarity, but other
methods such as Euclidean distance are occasionally used. Of the possible preprocessors,
Sentence-BERT was judged to be the most suitable for the purposes of this paper, as it
performed incredibly well among the algorithms tested for this purpose and was the second most
computationally efficient algorithm tested [7]. Neural networks are another possible avenue for
document similarity, and a very effective one. The basic idea of neural network-based techniques is
that they are first fed training documents consisting of pairs of texts that are either similar to each other or
different. Gradually, the neural network learns which pairs of texts are similar and which pairs
are different through understanding semantic information. Generally, the text is run through a
tokenizer first, which allows the semantic information to be more understandable to the network.
Some models, like BERT and XLNet, are pre-trained, which allows the network to have
accurate predictions without needing long training times and massive corpora for each
individual task. Figure 1 shows a typical methodology of document analysis using statistical
algorithms.
Figure 1. Methodology of statistical algorithms.
Corpus-based measures find text similarity by using large corpora to determine the similarity
between different words [8]. Two methods that utilize this technique are Pointwise Mutual
Information and Latent Semantic Analysis. Pointwise Mutual Information accomplishes this by
determining how often the compared words appear together, while Latent Semantic Analysis
accomplishes this by performing singular value decomposition on the corpus. Knowledge-based
measures follow a similar path by converting words into concepts, and then using semantic
networks to compare those concepts [8].
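For reference, a standard formulation of Pointwise Mutual Information (the textbook definition, not a formula quoted from [8]) for two words w_1 and w_2, with corpus co-occurrence probability p(w_1, w_2) and marginal probabilities p(w_1) and p(w_2), is

\[ \mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}, \]

so word pairs that co-occur more often than chance would predict receive a positive score.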
Two neural-network systems are included, as they appear to be the most effective overall. These are
knowledge-distilled MT-DNN, and XLNet [9]. The Multi-Task Deep Neural Network proposed
in [10] is improved using knowledge distillation. Several “teacher” MT-DNNs are developed,
one for each task. Each “teacher” MT-DNN then generates a set of “soft” targets, which are
combined with the correct “hard” targets for their respective task. A single “student” MT-DNN
then uses these combined targets as the target for each task. The end result of this process is that
the “student” MT-DNN is able to significantly outperform other MT-DNNs that do not use
knowledge distillation as part of their training. XLNet combines the usage of both
autoregressive language modeling and autoencoding to build a pretrained model that can obtain
context from both previous and future words without relying on data corruption like BERT [11].
As a result, XLNet is able to significantly outperform the baseline BERT in most NLP tasks. In
addition, a statistical technique is included, dubbed Sentence-BERT + cosine similarity.
Sentence-BERT is a modification of BERT that uses siamese and triplet network structures to derive sentence
embeddings that are both semantically meaningful and can easily be compared using cosine
similarity [7]. In addition to these algorithms, a combination of the Lin knowledge-based
measure proposed in [8] and a string similarity method proposed in [12] and [4] is also used.
The Lin knowledge-based measure is designed specifically to understand semantic similarity,
while the string similarity method was designed to understand lexical similarity. As a result, the
combination of the two algorithms would be able to understand far more information than each
of the algorithms individually.
The datasets used will be the Microsoft Research Paraphrase Corpus (MRPC) [13], the
Argument Facet Similarity (AFS) dataset [14], and the Sentences Involving Compositional
Knowledge (SICK) dataset [15]. The MRPC dataset is designed to test whether the algorithm
can detect paraphrases, while the AFS dataset is designed to test whether the algorithm can
detect similar arguments. The SICK dataset is split into two tasks, named SICK-R and SICK-E.
SICK-R tests the algorithm on the sentences’ lexical similarity, while SICK-E tests whether the
algorithm is capable of high-level semantic knowledge, by deciding whether the sentences are
similar, have no relation, or have the opposite meaning. The evaluations used for the tasks will
be the Pearson correlation and Spearman rank correlation for AFS and SICK-R, and
classification accuracy for MRPC and SICK-E.
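For clarity, these metrics can be computed with standard library calls; the sketch below uses SciPy and NumPy on placeholder arrays (the values shown are illustrative, not results from this study).

# Illustrative only: computing the three evaluation metrics used in this paper.
# The gold/pred values are placeholders, not actual experiment outputs.
import numpy as np
from scipy.stats import pearsonr, spearmanr

gold_scores = np.array([4.5, 3.2, 1.0, 2.8])       # human similarity ratings (e.g., SICK-R, AFS)
pred_scores = np.array([4.1, 3.5, 1.3, 2.2])       # model similarity scores

pearson_r, _ = pearsonr(gold_scores, pred_scores)      # linear correlation
spearman_rho, _ = spearmanr(gold_scores, pred_scores)  # rank correlation

gold_labels = np.array([1, 0, 1, 1])               # paraphrase/entailment labels (MRPC, SICK-E)
pred_labels = np.array([1, 0, 0, 1])
accuracy = (gold_labels == pred_labels).mean()     # classification accuracy

print(pearson_r, spearman_rho, accuracy)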
4. RESULTS
Overall, the implementation of Sentence-BERT + cosine similarity went well. For the most part,
the corpora, once downloaded, were trivial to work with. The biggest issue came from the
AFS dataset, as it was stored in a comma separated value format, which made it difficult to
distinguish the commas in sentences from the commas separating the data. This issue was
solved by editing the dataset to make every comma in a sentence appear twice, and then using
regex to split the sentences on only the single commas from the CSV format. Once the
sentences were split, the double commas could be converted to single commas trivially. The
evaluations were somewhat difficult to code, but once their respective formulas were found
implementing them was relatively trivial. Unfortunately, due to how SICK-E's classification
works, Sentence-BERT + cosine similarity is unable to properly classify the information, and a
score could not be found.
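For context, a minimal sketch of the Sentence-BERT + cosine similarity pipeline is shown below; it uses the sentence-transformers library, and the checkpoint name and example sentences are assumptions chosen for illustration rather than the exact configuration used in this study.

# Minimal sketch of Sentence-BERT + cosine similarity (illustrative configuration only).
# Assumes the sentence-transformers package is installed; the model name is a placeholder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any SBERT checkpoint would work here

sentence_a = "The company reported strong quarterly earnings."
sentence_b = "Quarterly profits at the firm were robust."

emb_a = model.encode(sentence_a, convert_to_tensor=True)
emb_b = model.encode(sentence_b, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
score = util.cos_sim(emb_a, emb_b).item()
print(f"Similarity: {score:.3f}")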
Unfortunately, the implementation of XLNet was not so simple. The original plan was to use
the implementation linked in the original paper [16], but after much consideration the plan was
changed to use Hugging Face Transformers [17] instead, as it would be significantly easier to
import and use. Initially, there were major issues with the XLNet model, as the input
dimensions would consistently be off. This was solved through usage of padding. Then,
training and evaluation became an issue, as running either would consistently fail from using too
much GPU memory. This was solved by reducing batch steps. After this, the accuracy on the
classification datasets could be found without issue, but unfortunately, finding an XLNet model
capable of regression was still a major problem.
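A minimal sketch of a Hugging Face-based XLNet classification setup is given below; the checkpoint name, padding length, and label interpretation are placeholders rather than the exact values used in this work, and the snippet only shows scoring a single sentence pair rather than full fine-tuning.

# Illustrative sketch of XLNet sentence-pair classification via Hugging Face Transformers.
# The checkpoint and sequence length are placeholders; training code is omitted.
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)

# Padding to a fixed length avoids the mismatched input dimensions mentioned above.
inputs = tokenizer(
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, num_labels)
predicted_class = logits.argmax(dim=-1).item()   # label meaning depends on dataset encoding
print(predicted_class)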
Implementing MT-DNN was a major issue from the very beginning. It took several days to
figure out that a TensorFlow import that kept failing required an older version of Python (3.6),
and another day after that to realize the GitHub project used [18] was designed for a Linux OS.
After those two problems were found, the implementation was moved to a cloud
computing service known as Paperspace Gradient. Once that was accomplished, the model could be
set up. Import errors made it difficult at times, but installing the missing packages via apt install
cleared up the problems. Rewriting code in the program was often necessary to account for
running Python programs with python3 instead of python2, as well. Furthermore, for the MT-
DNN to work quickly, reinstalling a CUDA driver that was only partially installed became
necessary, which was difficult because simply deleting it did not work due to its dependencies.
Ultimately, manually deleting the system programs that relied on said driver, then
deleting the driver solved the issue. Uploading the datasets seemed simple at first, but quickly
became difficult, as each task had a specific data orientation that the uploaded datasets did not
match. After preprocessing the datasets again to account for this and splitting them up into
train/test/dev splits, evaluation was thankfully simple.
Implementing the knowledge-based measure was for the most part simple, although there were
some hiccups along the way. The initial plan was to use the implementation mentioned in [8],
but it was not linked anywhere in the paper and it could not be found online. Ultimately, the
implementation was changed to include Sematch [19,20] as the implementation for the
knowledge-based measure. Originally, the plan was to combine multiple knowledge-based
measures together, but it quickly became clear that doing so would be difficult to impossible in
the time remaining. Ultimately, the knowledge-based measure chosen was the Lin method, as it
seemed the most effective of the individual algorithms mentioned in [8] and was easily
normalized to between 0 and 1. Having chosen the algorithm and GitHub implementation, the
most difficult problem of implementation made itself known. Knowledge-based measures work
by turning words into concepts and comparing those concepts, and thus they are designed for
word similarity instead of sentence similarity. All the evaluation metrics compared sentence
similarity, so it was necessary to convert word similarity to sentence similarity. Thankfully,
several papers that did just that were found [12, 4]. They achieved this feat by building a matrix
of comparisons between each word in the two sentences. Once this was done, the papers took the
highest value in the matrix, added it to a list, deleted the row and column the value was in, and
continued until the highest value was zero or until there were no columns or rows left. The
authors also combined the knowledge-based measure with a string similarity measure so that the
algorithm would be more sensitive to lexical information. It was trivial to use those papers to
convert the implementations’ word similarity to sentence similarity, and from there things
proceeded smoothly.
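A condensed sketch of this word-to-sentence conversion is shown below; it uses Sematch's WordNet-based Lin measure for the word-level scores and the greedy matrix-reduction procedure described above. The tokenization, the omission of the exact-match count and string similarity component, and the environment details are simplifications, so this is an outline of the approach rather than the exact implementation used.

# Sketch: sentence similarity from a word-level (Lin) measure via greedy matrix reduction.
# Uses Sematch's WordNetSimilarity; tokenization and normalization are simplified here.
import numpy as np
from sematch.semantic.similarity import WordNetSimilarity

wns = WordNetSimilarity()

def sentence_similarity(sent_a, sent_b):
    words_a = sent_a.lower().split()
    words_b = sent_b.lower().split()

    # Word-by-word similarity matrix using the Lin measure.
    matrix = np.array([[wns.word_similarity(a, b, 'lin') for b in words_b]
                       for a in words_a])

    # Greedily take the highest value, delete its row and column,
    # and repeat until nothing positive remains or the matrix is empty.
    collected = []
    while matrix.size and matrix.max() > 0:
        i, j = np.unravel_index(matrix.argmax(), matrix.shape)
        collected.append(matrix[i, j])
        matrix = np.delete(np.delete(matrix, i, axis=0), j, axis=1)

    # Simplified combination (no exact-match count or string similarity term here).
    m, n = len(words_a), len(words_b)
    return sum(collected) * (m + n) / (2.0 * m * n)

print(sentence_similarity("A dog chased the cat", "A hound pursued a kitten"))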
Table 3. Result analysis per task/evaluation combination.

Algorithm                           SICK-R    SICK-R    SICK-E    AFS       AFS       MRPC
                                    Pearson   Spearman  Accuracy  Pearson   Spearman  Accuracy
Sentence-BERT + Cosine Similarity   73.61%    75.038%   N/A       28.034%   75.063%   68.017%
MT-DNN                              92.185%   88.567%   90.706%   80.860%   78.178%   88.286%
XLNet2 (second implementation)      47.697%   N/A       57.383%   72.752%   N/A       66.236%
XLNet1 (first implementation)       N/A       N/A       56.911%   N/A       N/A       66.983%
Lin Measure + String Similarity     60.546%   75.038%   N/A       32.273%   75.063%   70.172%
During the XLNet implementation, it became clear that the Transformers XLNet code being
worked with was designed for classification problems and could thus not handle continuous
data. As such, it was decided to also use a different implementation, linked in the original
XLNet paper [16], to ensure that XLNet data concerning SICK-R and AFS could be
gathered. The main issue was the matter of installing CUDA 10.0, as Paperspace Gradient
automatically installed CUDA 10.2, which needed to be uninstalled before CUDA 10.0 was
installed. There were also issues with the Nvidia driver, which would prevent CUDA from fully
installing. In the end, the issue was resolved by removing all traces of CUDA from the system,
partially installing CUDA 10.0, manually removing the NVIDIA driver and its dependencies, and
then running apt --fix-broken install to properly install CUDA 10.0. After this, implementation
largely went smoothly. The only real roadblocks after this point were that XLNet only used
Pearson correlation as a metric, and it did not naturally include MRPC processing. Coding an
MRPC processor subclass based on the repository's basic MNLI Matched processor became necessary, after
which the evaluation went smoothly. Unfortunately, adding Spearman rank correlation was a
much more arduous task, and there was no time left to finish the job. All results used 60% train,
20% test and 20% dev datasets. Figure 2 shows the accuracy of document similarity algorithms.
Figure 2. Result analysis per task/evaluation combination.
Both Sentence-BERT + cosine similarity and the unofficial XLNet code were implemented on an
Alienware m15 R4 with an Intel i7-10870H CPU, 32 GB of RAM, and an NVIDIA GeForce RTX
3070 GPU. The MT-DNN, the second XLNet implementation, and the Lin measure + string similarity
were implemented on Paperspace, with an Intel Xeon E5-2623 v4 CPU, 30 GB of RAM, and an
NVIDIA P5000 GPU.
Of the algorithms tested, the MT-DNN is by far the most effective: it was the only algorithm to
successfully process all evaluation metrics, and it also had the highest score on every one of
them. Lin Measure + String Similarity and Sentence-BERT + Cosine Similarity had roughly
the same results, with Sentence-BERT + Cosine Similarity performing better in the SICK-R
Pearson task, but slightly worse in the AFS Pearson and MRPC tasks. The official XLNet code
performed incredibly well in the AFS Pearson task but performed badly in the SICK-R Pearson
task and had a slightly worse MRPC score than Lin + String Similarity and Sentence-BERT + Cosine
Similarity. However, unlike Lin + String Similarity and Sentence-BERT + Cosine Similarity, the
second XLNet implementation was able to give a score for SICK-E, so it has a major advantage
in flexibility. The first XLNet implementation had similar scores to the second XLNet
implementation, but was unable to handle continuous data, so in terms of versatility and results it
performed the worst of all tested algorithms.
In the future, XLNet could be further investigated. The second implementation of XLNet is
theoretically capable of using Spearman rank correlation as a metric, making it the only algorithm
other than MT-DNN that can process all evaluation metrics.
5. CONCLUSIONS
This research gave a comprehensive study of four algorithms: Sentence-BERT + Cosine
Similarity, MT-DNN, XLNet, and Lin Measure + String Similarity. Using the MRPC, AFS,
SICK-E and SICK-R datasets, and the Spearman rank correlation, Pearson correlation, and accuracy
evaluation metrics, the research examined each of the four algorithms' strengths and weaknesses,
and thus gives an answer to the question of which document similarity algorithm is most effective.
From the data obtained in this research, the MT-DNN is clearly the most effective document
similarity algorithm tested.
ACRONYMS
MRPC: Microsoft Research Paraphrase Corpus.
AFS: Argument Facet Similarity.
SICK: Sentences Involving Compositional Knowledge.
BERT: Bidirectional Encoder Representations from Transformers.
MT-DNN: Multi-Task Deep Neural Network.
REFERENCES
[1] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, C. J. Pérez. “The impact of term-weighting schemes
and similarity measures on extractive multi-document text summarization.” Expert Systems with
Applications, vol. 169, 2021. doi:10.1016/j.eswa.2020.114510
[2] K. Seki. “Cross-lingual text similarity exploiting neural machine translation models.” Journal of
Information Science, vol. 47, no. 3, pp. 404–418, June 2021.
[3] E. Inan. “SimiT: A Text Similarity Method Using Lexicon and Dependency Representations.” New
Generation Computing, vol. 38, no. 3, pp. 509–530, July 2020.
[4] A. Islam and D. Inkpen, “Semantic text similarity using corpus-based word similarity and string
similarity,” ACM Transactions on Knowledge Discovery from Data, 2008. [Online]. Available:
https://www.academia.edu/4418471/Semantic_text_similarity_using_corpus_based_word_similarity_
and_string_similarity. [Accessed: 28-Jul-2021].
[5] Y. Sever and G. Ercan, “Evaluating cross-lingual textual similarity on dictionary alignment problem,”
SpringerLink, 2020. [Online]. Available: https://link-springer-
com.ezaccess.libraries.psu.edu/article/10.1007/s10579-020-09498-1. [Accessed: 29-Jul-2021].
[6] Y. Arase and J. Tsujii, “Transfer fine-tuning of BERT with phrasal paraphrases,” ScienceDirect,
March 2021. [Online].
https://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/article/pii/S0885230820300978.
[Accessed: 29-Jul-2021].
[7] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-
Networks,” arXiv.org, 27-Aug-2019. [Online]. Available: https://arxiv.org/abs/1908.10084.
[Accessed: 03-Jun-2021].
[8] R. Mihalcea, C. Corley, and C. Strapparava, “Corpus-based and Knowledge-based Measures of Text
Semantic Similarity.” [Online]. Available:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.aaai06.pdf. [Accessed: 02-Jun-2021].
[9] “Semantic textual similarity,” NLPProgress. [Online]. Available:
http://nlpprogress.com/english/semantic_textual_similarity.html. [Accessed: 02-Jun-2021].
[10] X. Liu, P. He, W. Chen, and J. Gao, “Improving Multi-Task Deep Neural Networks via Knowledge
Distillation for Natural Language Understanding,” arXiv.org, 20-Apr-2019. [Online]. Available:
https://arxiv.org/abs/1904.09482. [Accessed: 03-Jun-2021].
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized
Autoregressive Pretraining for Language Understanding,” arXiv.org, 02-Jan-2020. [Online].
Available: https://arxiv.org/abs/1906.08237. [Accessed: 03-Jun-2021].
[12] C. F. Ho, M. A. A. Murad, R. A. Kadir, and S. C. Doraisamy, “Word Sense Disambiguation-based
Sentence Similarity,” ACL Anthology, 2010. [Online]. Available: https://aclanthology.org/C10-
2048/. [Accessed: 07-Jul-2021].
[13] W. B. Dolan and C. Brockett, “Automatically Constructing a Corpus of Sentential Paraphrases,” ACL
Anthology, 2005. [Online]. Available: https://aclanthology.org/I05-5002/. [Accessed: 22-Jul-
2021].
[14] A. Misra, B. Ecker, and M. Walker, "Measuring the Similarity of Sentential Arguments in
Dialogue". In The 17th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), Los
Angeles, California, USA, 2016.
[15] “The SICK data set,” SICK. [Online]. Available: http://marcobaroni.org/composes/sick.html.
[Accessed: 03-Jun-2021].
[16] Zihangdai, “zihangdai/xlnet,” GitHub, 16-Dec-2019. [Online]. Available:
https://github.com/zihangdai/xlnet/. [Accessed: 03-Jun-2021].
[17] T. Wolf, “huggingface/transformers,” GitHub, 16-Jun-2021. [Online].
Available: https://github.com/huggingface/transformers. [Accessed: 16-Jun-2021].
[18] Namisan, “namisan/mt-dnn,” GitHub, 16-Feb-2021. [Online]. Available:
https://github.com/namisan/mt-dnn/. [Accessed: 03-Jun-2021].
[19] G. Zhu and C. A. Iglesias, “Sematch: Semantic similarity framework for Knowledge Graphs,”
ScienceDirect, 15-Aug-2017. [Online]. Available: https://www-sciencedirect-
com.ezaccess.libraries.psu.edu/science/article/pii/S0950705117302447?via%3Dihub. [Accessed: 06-
Jul-2021].
[20] G. Zhu and C. A. Iglesias, “gsi-upm/sematch,” GitHub, 26-Mar-2019. [Online].
Available: https://github.com/gsi-upm/sematch. [Accessed: 06-Jul-2021].
[21] AndriyMulyar, “AndriyMulyar/semantic-text-similarity,” GitHub, 08-Oct-2019. [Online]. Available:
https://github.com/AndriyMulyar/semantic-text-similarity. [Accessed: 03-Jun-2021].
More Related Content

Similar to A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS (20)

PDF
2015-SemEval2015_poster
hpcosta
 
PPTX
IR.pptx
MahamSajid4
 
PDF
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
PDF
Text smilarity02 corpus_based
cyan1d3
 
PDF
Aq35241246
IJERA Editor
 
PDF
P13 corley
UKM university
 
PDF
20433-39028-3-PB.pdf
IjictTeam
 
PDF
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE...
cscpconf
 
PDF
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET Journal
 
PDF
Great model a model for the automatic generation of semantic relations betwee...
ijcsity
 
PDF
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
ijnlc
 
PDF
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
kevig
 
PDF
Volume 2-issue-6-2016-2020
Editor IJARCET
 
PDF
Volume 2-issue-6-2016-2020
Editor IJARCET
 
PDF
AINL 2016: Kravchenko
Lidia Pivovarova
 
PDF
P036401020107
theijes
 
PDF
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
pathsproject
 
PDF
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET Journal
 
PPTX
Text similarity measures
ankit_ppt
 
PDF
L0261075078
inventionjournals
 
2015-SemEval2015_poster
hpcosta
 
IR.pptx
MahamSajid4
 
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Text smilarity02 corpus_based
cyan1d3
 
Aq35241246
IJERA Editor
 
P13 corley
UKM university
 
20433-39028-3-PB.pdf
IjictTeam
 
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE...
cscpconf
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET Journal
 
Great model a model for the automatic generation of semantic relations betwee...
ijcsity
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
ijnlc
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
kevig
 
Volume 2-issue-6-2016-2020
Editor IJARCET
 
Volume 2-issue-6-2016-2020
Editor IJARCET
 
AINL 2016: Kravchenko
Lidia Pivovarova
 
P036401020107
theijes
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
pathsproject
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET Journal
 
Text similarity measures
ankit_ppt
 
L0261075078
inventionjournals
 

More from gerogepatton (20)

PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
PDF
6th International Conference on Advanced Machine Learning (AMLA 2025)
gerogepatton
 
PDF
OWE-CVD: An Optimized Weighted Ensemble for Heart Disease Prediction
gerogepatton
 
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
PDF
Balancing Privacy and Innovation – A VAE Framework for Synthetic Healthcare D...
gerogepatton
 
PDF
4th International Conference on Computer Science and Information Technology (...
gerogepatton
 
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
PDF
11th International Conference on Artificial Intelligence and Soft Computing (...
gerogepatton
 
PDF
The Role of Artificial Intelligence in Ensuring the Cyber Security of SCADA S...
gerogepatton
 
PDF
4th International Conference on Artificial Intelligence Advances (AIAD 2025)
gerogepatton
 
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
PDF
7th International Conference on Machine Learning & Applications (CMLA 2025)
gerogepatton
 
PDF
6th International Conference on Advances in Artificial Intelligence Technique...
gerogepatton
 
PDF
5th International Conference on AI, Machine Learning in Communications and Ne...
gerogepatton
 
PDF
13th International Conference on Artificial Intelligence, Soft Computing (AIS...
gerogepatton
 
PDF
Empowering Cloud-native Security: the Transformative Role of Artificial Intel...
gerogepatton
 
PDF
4th International Conference on Artificial Intelligence Advances (AIAD 2025)
gerogepatton
 
PDF
May 2025 - Top 10 Read Articles in Artificial Intelligence and Applications (...
gerogepatton
 
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
PDF
14th International Conference on Artificial Intelligence and Soft Computing (...
gerogepatton
 
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
6th International Conference on Advanced Machine Learning (AMLA 2025)
gerogepatton
 
OWE-CVD: An Optimized Weighted Ensemble for Heart Disease Prediction
gerogepatton
 
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
Balancing Privacy and Innovation – A VAE Framework for Synthetic Healthcare D...
gerogepatton
 
4th International Conference on Computer Science and Information Technology (...
gerogepatton
 
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
11th International Conference on Artificial Intelligence and Soft Computing (...
gerogepatton
 
The Role of Artificial Intelligence in Ensuring the Cyber Security of SCADA S...
gerogepatton
 
4th International Conference on Artificial Intelligence Advances (AIAD 2025)
gerogepatton
 
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
7th International Conference on Machine Learning & Applications (CMLA 2025)
gerogepatton
 
6th International Conference on Advances in Artificial Intelligence Technique...
gerogepatton
 
5th International Conference on AI, Machine Learning in Communications and Ne...
gerogepatton
 
13th International Conference on Artificial Intelligence, Soft Computing (AIS...
gerogepatton
 
Empowering Cloud-native Security: the Transformative Role of Artificial Intel...
gerogepatton
 
4th International Conference on Artificial Intelligence Advances (AIAD 2025)
gerogepatton
 
May 2025 - Top 10 Read Articles in Artificial Intelligence and Applications (...
gerogepatton
 
International Journal of Artificial Intelligence & Applications (IJAIA)
gerogepatton
 
14th International Conference on Artificial Intelligence and Soft Computing (...
gerogepatton
 
Ad

Recently uploaded (20)

PDF
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PDF
Ethics and Trustworthy AI in Healthcare – Governing Sensitive Data, Profiling...
AlqualsaDIResearchGr
 
PPTX
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
PDF
Zilliz Cloud Demo for performance and scale
Zilliz
 
DOCX
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
PDF
GTU Civil Engineering All Semester Syllabus.pdf
Vimal Bhojani
 
PPTX
Types of Bearing_Specifications_PPT.pptx
PranjulAgrahariAkash
 
PDF
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
PPTX
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
PPTX
Mechanical Design of shell and tube heat exchangers as per ASME Sec VIII Divi...
shahveer210504
 
PPTX
Evaluation and thermal analysis of shell and tube heat exchanger as per requi...
shahveer210504
 
PDF
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
PPT
PPT2_Metal formingMECHANICALENGINEEIRNG .ppt
Praveen Kumar
 
PPTX
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
PPTX
GitOps_Without_K8s_Training simple one without k8s
DanialHabibi2
 
PPTX
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
PPTX
Hashing Introduction , hash functions and techniques
sailajam21
 
PDF
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Ethics and Trustworthy AI in Healthcare – Governing Sensitive Data, Profiling...
AlqualsaDIResearchGr
 
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
Zilliz Cloud Demo for performance and scale
Zilliz
 
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
GTU Civil Engineering All Semester Syllabus.pdf
Vimal Bhojani
 
Types of Bearing_Specifications_PPT.pptx
PranjulAgrahariAkash
 
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
Mechanical Design of shell and tube heat exchangers as per ASME Sec VIII Divi...
shahveer210504
 
Evaluation and thermal analysis of shell and tube heat exchanger as per requi...
shahveer210504
 
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
PPT2_Metal formingMECHANICALENGINEEIRNG .ppt
Praveen Kumar
 
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
GitOps_Without_K8s_Training simple one without k8s
DanialHabibi2
 
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
Hashing Introduction , hash functions and techniques
sailajam21
 
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
Ad

A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS

  • 1. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.14, No.2, March 2023 DOI: 10.5121/ijaia.2023.14204 41 A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS Nicholas Gahman and Vinayak Elangovan Computer Science Program, Division of Science and Engineering, Penn State University at Abington, Pennsylvania, USA ABSTRACT Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism-detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field of Natural Language Processing. This report sets out to examine the numerous document similarity algorithms, and determine which ones are the most useful. It addresses the most effective document similarity algorithm by categorizing them into 3 types of document similarity algorithms: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are also compared in our work using a series of benchmark datasets and evaluations that test every possible area that each algorithm could be used in. KEYWORDS Natural Language Processing, Document Similarity 1. INTRODUCTION Document similarity analysis is a Natural Language Processing (NLP) task where two or more documents are analyzed to recognize the similarities between these documents. Document similarity is heavily used in text summarization, recommender systems, plagiarism-detection as well as in search engines. Identifying the level of similarity or dissimilarity between two or more documents based on their content is the main objective of document similarity analysis. Commonly used techniques like cosine similarity, Euclidean distance, etc., compares the document’s text features to provide a similarity score between 0 and 1, where 1 indicates complete similarity and 0 indicates no similarity. Although there are numerous algorithms used for document similarity, there are no algorithms universally recognized as the most effective or efficient in a given area. This paper seeks to rectify this issue by categorizing the document similarity algorithms into 3 different types of document similarity algorithms: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. Each category’s algorithms will be closely inspected, and several algorithms from each category will be judged to be the best of said category. Once the most effective algorithms of each category are found, they will be run through four different datasets and three different metrics. Finally, once this data is obtained, it will be analyzed to determine which of the algorithms are most effective. This paper is organized as follows: In section 2: Relevant Works, a variety of similar research work are discussed. In section 3: Proposed Methodology: a general description of three categories of document similarity algorithms is given, and the methodology and the comparison process are more thoroughly discussed. Finally, in section 4: Results, the specifics, and
  • 2. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.14, No.2, March 2023 42 limitations of implementation of each algorithm are discussed, the results of the comparison process are given followed by conclusions in section 5. 2. RELEVANT WORKS Jesus M. Sanchez-Gomez, along with other researchers, compared term-weighting schemes and similarity measures when used for extractive multi-document text summarization [1]. The result of this research is that overall cosine similarity appears to be the most effective statistical similarity measure in this area. Kazuhiro Seki created a text similarity measure that was able to work across languages [2]. The proposal accomplished this by using two Neural Machine Translation models to build word embeddings that it can use to compute document similarity. The NMTs in question translated multiple possible translations to account for mistranslations. The result of the translations are matrices that are normalized and transformed, creating multilingual word embeddings that can be compared using cosine similarity. The proposal has significant flexibility, as the study explicitly states that it is possible to add other similarity scores onto this system. This proposal was compared to other multilingual similarity algorithms such as Doc2Vec, Sec2Vec, and the S2Net Siamese Neural Network, with the result being that it outperformed all other algorithms. When the proposal’s sentence retrieval was compared to Google Translate, they had similar results, despite the proposal having a much lower BLEU score due to its relatively small training data and models. Emrah Inam proposed a document similarity algorithm that combines the usage of word embeddings with knowledge-based measures through a semantic lexicon named ConceptNet [3]. Once the sentences are run through ConceptNet, the proposal then produces a vector representation of the transformed sentences using a breadth first traversal algorithm. Finally, a soft cosine similarity measure is used to compare the vectorized sentences. The similarity from the ConceptNet metric and the similarity from the dependency parser model is then combined to produce the final similarity score. The proposal, named SimiT, is given the Microsoft Research Paraphrase Corpus (MRPC) as an evaluation task to determine if it can detect and understand paraphrases. Its result is measured using Pearson correlation and is then compared to several other similarity algorithms’ results. Among these algorithms are basic cosine similarity, Word2Vec cosine and soft cosine, and several state-of-the-art methods such as BERT and its variations. Of the compared algorithms, SimiT performed very well, exceeding the performance of all but the state-of-the-art methods. Even among the state-of-the-art methods, SimiT was still useful due to its incredibly low run time. Table 1. Chosen dataset’s tasks and purpose. MRPC AFS SICK-R SICK-E Benchmark Purpose Testing if algorithm can detect paraphrases Testing if algorithm can detect similar arguments Testing if algorithm can see lexical similarity of sentences Testing if algorithm can see semantic similarities/differen ces between sentences Aminul Islam and Diana Inkpen collaborated on a method to give knowledge-based measures the ability to determine sentence similarity [4]. Knowledge-based measures are designed primarily to compare concepts, and while turning a word into a concept is easy it is much harder to do so for a sentence. 
As part of the process, the authors introduce a string similarity word method using three modifications of the longest common subsequence: The Normalized Longest
  • 3. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.14, No.2, March 2023 43 Common Subsequence (NLCS), the Normalized Maximal Consecutive Longest Common Subsequence starting from the first character (NMCLCS1), and the Normalized Maximal Consecutive Longest Common Subsequence starting from any character (NMCLCSn). The scores from each of the three modified LCS algorithms are combined, then divided by three, such that there is an average between 0 and 1. To develop a sentence similarity algorithm that uses knowledge-based measures, the authors first process the sentences by removing all special characters, punctuation, capital letters, and stop words. At this point, the lengths of the two sentences are stored for later use. From here, they then find all of the words in both sentences that match with each other using the knowledge- based measures or string similarity measures, count the number of words where this is the case, and remove all matching words from the sentence. Then, all the remaining words are compared using the knowledge-based measure, and the results are put in a matrix whose length is equal to the length of the first sentence and width is equal to the length of the second sentence. Another matrix is created using the same method, with the string similarity method replacing the knowledge-based measure. The knowledge-based measure matrix is added with the string similarity matrix to produce a joint matrix. Next, the highest value in the joint matrix is found and added to a list, after which the column and row the value was in is deleted. This continues until the highest value is zero or until there are no rows or columns left. The final algorithm to determine the similarity of the two sentences is the number of matching words plus all the values in the list, times the length of the two sentences without stop words combined. The result is then divided by two times the length of the first sentence without stop words, times the length of the second sentence without stop words. Table 2 summarizes the advantages and disadvantages of different algorithm categories. Table 2. Algorithm categories. Advantages Disadvantages Basic Statistical Techniques Simple, easy to use Does not obtain enough semantic information to make accurate predictions about entailment vs contradiction Neural Networks Incredibly effective, achieves state-of-the-art results Incredibly computationally expensive and memory intensive, difficult to debug Knowledge/Corpus- Based Measures Can trivially obtain and process the semantic information needed for accurate predictions Relies heavily on large corpora and semantic networks to work properly, is designed for word/concept similarity and thus difficult to scale Yigit Sever and GonencErcan perform a comparison of cross-lingual semantic similarity methods by building a cross-lingual textual similarity evaluation dataset utilizing seven different languages [5]. Wordnets measure concepts, and while some concepts are specific to a given language, others are shared between languages. By linking several of these synsets together, a foundation is built for building the evaluation dataset. The resulting dataset can determine the effectiveness of both unsupervised and supervised text similarity algorithms, allowing the two categories to be easily compared. Word embeddings are victim to the hubness problem, where many of the vector embeddings are similar to other vector embeddings. 
To investigate how hubs such as these can affect the similarity result, the evaluation task is split into two different tasks: alignment and pseudo-retrieval.
  • 4. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.14, No.2, March 2023 44 In addition to building the evaluation dataset, the paper also compares several text similarity models, both unsupervised and supervised. Among the algorithms tested is machine translation the monolingual baseline (MT), cosine similarity between sentence embeddings (SEMB), word mover’s distance (WMD), Winkhorn (SNK), and Siamese long-short term memory (LSTM). Concerning retrieval, WMD surprisingly performed the best, followed by SNK. LSTM performed the third best, but it was noted that it was trained using a limited number of instances, so it may be possible for it to score higher in this area. Concerning the matching task, SNK performed the best, followed by WMD, followed by SEMB. Yuki Arase and Junichi Tsuji developed a method of improving the BERT model through transfer fine tuning [6]. Their method of pretraining focused on semantic pair modeling, allowing the proposed method to have significant improvements over the normal BERT model. Furthermore, in addition to performing better than the baseline BERT model, it is also more cost- effective, as it focuses on phrase alignments, which can be automatically generated. As part of the evaluation process, the paper used two different benchmarks: the GLUE Benchmark, and the PAWS dataset. The GLUE Benchmark used consists of nine different evaluation datasets whose tasks cover various parts of natural language understanding. The most relevant tasks for the paper, however, is that of Semantic Equivalence, whose subtasks consist of paraphrase understanding and an understanding of Semantic Text Similarity (STS). The PAWS dataset, like the MRPC dataset, focuses on paraphrase understanding, but utilizes controlled word swapping and back translation to determine if an algorithm is sensitive to context and word order. In the GLUE Benchmark, the proposed method successfully performed better than their BERT counterpart in all areas except for QNLI and SST, while in the PAWS benchmark the proposed method performed better than BERT in all areas. Furthermore, the evaluation showed that the proposed method performed better when the fine-tuned training corpus used is smaller, which would make creating said training corpi much easier. 3. PROPOSED METHODOLOGY Statistical techniques are the simplest of the three types of document similarity algorithms. They compare text by first turning the sentences into vectors, and then comparing said vectors. The most used and most effective way of comparison is through cosine similarity, but other methods such as Euclidean distance are occasionally used. Of the possible preprocessors, Sentence-BERT was judged to be most ideal for the purposes of this paper. This is because it performed incredibly well among the algorithms tested for this purpose and was the second most computationally efficient algorithm tested [7]. Neural networks are another possible avenue of document similarity, and a very effective one. The basics of neural network-based techniques is that they are first fed training documents of pairs of texts that are either similar to each other or different. Gradually, the neural network learns which pairs of texts are similar and which pairs are different through understanding semantic information. Generally, the text is run through a tokenizer first, which allows the semantic information to be more understandable to the network. 
Some models, like BERT and XLNet, are pre-trained, which allows the network to have accurate predictions without needing long training times and massive corpora for each individual task. Figure 1 shows a typical methodology of document analysis using statistical algorithms.
  • 5. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.14, No.2, March 2023 45 Figure 1. Methodology of statistical algorithms. Corpus-based measures find text similarity by using large corpora to determine the similarity between different words [8]. Two methods that utilize this technique are Pointwise Mutual Information, and Latent Semantic Analysis. Pointwise Mutual Information accomplishes this by determining how often the compared words appear together, while Latent Semantic Analysis accomplishes this by performing singular value decomposition on the corpus. Knowledge-based measures follow a similar path by converting words into concepts, and then using semantic networks to compare those concepts [8]. Two neural-network systems are included, as they seem to be overall most effective. These are knowledge-distilled MT-DNN, and XLNet [9]. The Multi-Task Deep Neural Network proposed in [10] is improved using knowledge distillation. Several “teacher” MT-DNNs are developed, one for each task. Each “teacher” MT-DNN then generates a set of “soft” targets, which are combined with the correct “hard” targets for their respective task. A single “student” MT-DNN then uses these combined targets as the target for each task. The end result of this process is that the “student” MT-DNN is able to significantly outperform other MT-DNNs that do not use knowledge distillation as part of their training. XLNet combines the usage of both autoregressive language modeling and autoencoding to build a pretrained model that can obtain context from both previous and future words without relying on data corruption like BERT [11]. As a result, XLNet is able to significantly outperform the baseline BERT in most NLP tasks. In addition, a statistical technique is also included, dubbed Sentence-BERT + cosine similarity. Sentence-BERT is a modification of BERT that uses siamese triplet networks to derive sentence embeddings that are both semantically meaningful and can easily be compared using cosine similarity [7]. In addition to these algorithms, a combination of the Lin knowledge-based measure proposed in [8] and a string similarity method proposed in [12] and [4] is also used. The Lin knowledge-based measure is designed specifically to understand semantic similarity, while the string similarity method was designed to understand lexical similarity. As a result, the combination of the two algorithms would be able to understand far more information than each of the algorithms individually. The datasets used will be the Microsoft Research Paraphrase Corpus (MRPC) [13], the Argument Facet Similarity (AFS) dataset [14], and the Sentences Involving Composition Knowledge (SICK) dataset [15]. The MRPC dataset is designed to test whether the algorithm can detect paraphrases, while the AFS dataset is designed to test whether the algorithm can detect similar arguments. The SICK dataset is split into two tasks, named SICK-R and SICK-E. SICK-R tests the algorithm on the sentences’ lexical similarity, while SICK-E tests whether the algorithm is capable of high-level semantic knowledge, by deciding whether the sentences are similar, have no relation, or have the opposite meaning. The evaluations used for the tasks will be the Pearson correlation and Spearman rank correlation for AFS and SICK-R, and classification accuracy for MRPC and SICK-E.
4. RESULTS

Overall, the implementation of Sentence-BERT + cosine similarity went well. For the most part, the datasets, once downloaded, were trivial to work with. The biggest issue came from the AFS dataset, which is stored in comma-separated-value format, making it difficult to distinguish the commas within sentences from the commas separating fields. This was solved by editing the dataset so that every comma inside a sentence appears twice, then using a regular expression to split each row only on the single commas acting as CSV delimiters; once the fields were split, the doubled commas were trivially converted back to single commas (a minimal sketch of this workaround is given below). The evaluation metrics were somewhat difficult to code, but once their formulas were found, implementing them was relatively straightforward. Unfortunately, because of how SICK-E's classification works, Sentence-BERT + cosine similarity is unable to properly classify the sentence pairs, so no SICK-E score could be obtained.
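The comma-doubling workaround just described can be sketched as follows. The sketch assumes the commas inside the sentences have already been doubled, and the example row is hypothetical rather than taken from the actual AFS data.

```python
# Sketch of the comma-doubling workaround for the AFS CSV: commas inside the
# sentences are doubled (",,"), so only single commas act as field delimiters.
import re

def split_afs_row(row: str) -> list[str]:
    # Split only on single commas: a comma that is neither preceded nor
    # followed by another comma is treated as the CSV delimiter.
    fields = re.split(r"(?<!,),(?!,)", row.rstrip("\n"))
    # Restore the original single commas inside each sentence.
    return [field.replace(",,", ",") for field in fields]

example = "3.2,Guns deter crime,, studies show,Gun ownership,, however,, is risky"
print(split_afs_row(example))
# -> ['3.2', 'Guns deter crime, studies show', 'Gun ownership, however, is risky']
```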
The implementation of XLNet was not so simple. The original plan was to use the implementation linked in the original paper [16], but after much consideration the plan was changed to use Hugging Face Transformers [17] instead, as it is significantly easier to import and use. Initially there were major issues with the XLNet model, as the input dimensions were consistently wrong; this was solved by padding the inputs. Training and evaluation then became a problem, as either would consistently fail by exhausting GPU memory; this was solved by reducing the batch size. After this, accuracy on the classification datasets could be obtained without issue, but finding an XLNet model capable of regression remained a major problem.

Implementing MT-DNN was a major issue from the very beginning. It took several days to discover that a failing TensorFlow import required an older version of Python (3.6), and another day to realize that the GitHub project used [18] was designed for a Linux OS. Once those two problems were identified, the implementation was moved to a cloud-computing service, Paperspace Gradient, where the model could be set up. Import errors caused difficulties at times, but installing the missing packages via apt cleared them up, and code frequently had to be rewritten to run under python3 rather than python2. Furthermore, for the MT-DNN to run quickly, a partially installed CUDA driver had to be reinstalled, which was difficult because simply deleting it failed due to its dependencies; manually removing the system packages that relied on the driver and then deleting the driver solved the issue. Uploading the datasets seemed simple at first but quickly became difficult, as each task expected a specific data layout that the uploaded datasets did not match. After preprocessing the datasets again to account for this and splitting them into train/test/dev sets, evaluation was thankfully simple.

Implementing the knowledge-based measure was mostly straightforward, although there were some hiccups along the way. The initial plan was to use the implementation mentioned in [8], but it was not linked anywhere in that paper and could not be found online. Ultimately, Sematch [19, 20] was adopted as the implementation of the knowledge-based measure. The original plan was to combine multiple knowledge-based measures, but it quickly became clear that doing so would be difficult or impossible in the time remaining. The Lin method was chosen because it seemed the most effective of the individual algorithms discussed in [8] and is naturally normalized between 0 and 1. With the algorithm and its GitHub implementation chosen, the most difficult implementation problem appeared: knowledge-based measures work by turning words into concepts and comparing those concepts, so they are designed for word similarity rather than sentence similarity, whereas all the evaluation tasks compare sentences. Fortunately, several papers address exactly this conversion [12], [4]. They build a matrix of word-to-word similarities between the two sentences, repeatedly take the highest value in the matrix, add it to a list, and delete that value's row and column, continuing until the highest remaining value is zero or no rows or columns are left. The authors also combine the knowledge-based measure with a string-similarity measure so that the algorithm is more sensitive to lexical information.
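The greedy matrix-matching step just described, which lifts a word-level measure such as the Lin similarity to sentence level, can be sketched as follows. The word_sim argument is a placeholder for the word-level measure (in this study it would come from Sematch), and the length-normalized aggregation at the end is a simplifying assumption rather than the exact formula used in [12] or [4].

```python
# Sketch of the greedy matrix-matching conversion from word similarity to
# sentence similarity. word_sim is any word-level measure returning a score
# in [0, 1]; the final length-normalized sum is a simplifying assumption.
from typing import Callable, List

def sentence_similarity(s1: List[str], s2: List[str],
                        word_sim: Callable[[str, str], float]) -> float:
    # Build the matrix of word-to-word similarities between the two sentences.
    matrix = [[word_sim(w1, w2) for w2 in s2] for w1 in s1]
    rows, cols = set(range(len(s1))), set(range(len(s2)))
    picked = []
    # Greedily take the highest remaining value, then discard its row and
    # column, stopping when the best value is zero or a sentence runs out.
    while rows and cols:
        i, j = max(((i, j) for i in rows for j in cols),
                   key=lambda ij: matrix[ij[0]][ij[1]])
        if matrix[i][j] <= 0:
            break
        picked.append(matrix[i][j])
        rows.discard(i)
        cols.discard(j)
    return sum(picked) / max(len(s1), len(s2)) if picked else 0.0
```

In this study the resulting semantic score was further combined with a string-similarity score to increase sensitivity to lexical information; that weighting is omitted from the sketch.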
It was trivial to use those papers to convert the implementation's word similarity into sentence similarity, and from there things proceeded smoothly. Table 3 summarizes the results per task/evaluation combination.

Table 3. Result analysis per task/evaluation combination.

Algorithm                         | SICK-R Pearson | SICK-R Spearman | SICK-E Accuracy | AFS Pearson | AFS Spearman | MRPC Accuracy
Sentence-BERT + Cosine Similarity | 73.61%         | 75.038%         | N/A             | 28.034%     | 75.063%      | 68.017%
MT-DNN                            | 92.185%        | 88.567%         | 90.706%         | 80.860%     | 78.178%      | 88.286%
XLNet2 (second implementation)    | 47.697%        | N/A             | 57.383%         | 72.752%     | N/A          | 66.236%
XLNet1 (first implementation)     | N/A            | N/A             | 56.911%         | N/A         | N/A          | 66.983%
Lin Measure + String Similarity   | 60.546%        | 75.038%         | N/A             | 32.273%     | 75.063%      | 70.172%

During the XLNet implementation it became clear that the Transformers XLNet code being worked with was designed for classification problems and could not handle continuous data. It was therefore decided to also set up the implementation linked in the original XLNet paper [16], so that XLNet results for SICK-R and AFS could be gathered. The main issue was installing CUDA 10.0, as Paperspace Gradient automatically installs CUDA 10.2, which had to be uninstalled first. There were also issues with the NVIDIA driver, which prevented CUDA from fully installing. In the end, the problem was resolved by removing all traces of CUDA from the system, partially installing CUDA 10.0, manually removing the NVIDIA driver and its dependencies, and then running apt --fix-broken install to complete the CUDA 10.0 installation. After this, implementation largely went smoothly. The only remaining roadblocks were that this XLNet code used only the Pearson correlation as a metric and did not include MRPC processing; coding an MRPC processor subclass based on its MNLI Matched processor solved the latter, after which evaluation went smoothly. Adding the Spearman rank correlation was a much more arduous task, and there was no time left to finish it. All results used 60% train, 20% test, and 20% dev splits. Figure 2 shows the accuracy of the document similarity algorithms.
Figure 2. Result analysis per task/evaluation combination.

Both Sentence-BERT + cosine similarity and the unofficial XLNet code were implemented on an Alienware m15 R4 with an Intel i7-10870H CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 3070 GPU. The MT-DNN, the second XLNet implementation, and the Lin measure + string similarity were implemented on Paperspace, with an Intel Xeon E5-2623 v4 CPU, 30 GB of RAM, and an NVIDIA P5000 GPU.

Of the algorithms tested, the MT-DNN is by far the most effective: it was the only algorithm to produce a score for every evaluation metric, and it also achieved the highest score on every metric. Lin Measure + String Similarity and Sentence-BERT + Cosine Similarity produced roughly the same results, with Sentence-BERT + Cosine Similarity performing better on the SICK-R Pearson task but slightly worse on the AFS Pearson and MRPC tasks. The official XLNet code performed remarkably well on the AFS Pearson task but poorly on the SICK-R Pearson task, and its MRPC score was slightly worse than those of Lin + String Similarity and Sentence-BERT + Cosine Similarity. However, unlike those two algorithms, the second XLNet implementation was able to produce a SICK-E score, giving it a major advantage in flexibility. The first XLNet implementation had scores similar to the second but could not handle continuous data, so in terms of versatility and results it performed the worst of all the algorithms tested. In future work, XLNet could be investigated further: the second implementation is theoretically capable of using the Spearman rank correlation as a metric, which would make it the only algorithm other than MT-DNN able to process every evaluation metric.

5. CONCLUSIONS

This research gave a comprehensive study of four algorithms: Sentence-BERT + Cosine Similarity, MT-DNN, XLNet, and Lin Measure + String Similarity. Using the MRPC, AFS, SICK-E, and SICK-R datasets and the Spearman rank correlation, Pearson correlation, and accuracy evaluation metrics, the research examined each of the four algorithms' strengths and weaknesses and thus answered the question of which document similarity algorithm is most effective.
From the data obtained in this research, the MT-DNN is clearly the most effective document similarity algorithm tested.

ACRONYMS

MRPC: Microsoft Research Paraphrase Corpus.
AFS: Argument Facet Similarity.
SICK: Sentences Involving Compositional Knowledge.
BERT: Bidirectional Encoder Representations from Transformers.
MT-DNN: Multi-Task Deep Neural Network.

REFERENCES

[1] J. M. Sanchez-Gomez, M. A. Vega-Rodríguez, and C. J. Pérez, "The impact of term-weighting schemes and similarity measures on extractive multi-document text summarization," Expert Systems with Applications, vol. 169, 2021. doi:10.1016/j.eswa.2020.114510.
[2] K. Seki, "Cross-lingual text similarity exploiting neural machine translation models," Journal of Information Science, vol. 47, no. 3, pp. 404–418, June 2021.
[3] E. Inan, "SimiT: A Text Similarity Method Using Lexicon and Dependency Representations," New Generation Computing, vol. 38, no. 3, pp. 509–530, July 2020.
[4] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, 2008. [Online]. Available: https://www.academia.edu/4418471/Semantic_text_similarity_using_corpus_based_word_similarity_and_string_similarity. [Accessed: 28-Jul-2021].
[5] Y. Sever and G. Ercan, "Evaluating cross-lingual textual similarity on dictionary alignment problem," SpringerLink, 2020. [Online]. Available: https://link.springer.com/article/10.1007/s10579-020-09498-1. [Accessed: 29-Jul-2021].
[6] Y. Arase and J. Tsujii, "Transfer fine-tuning of BERT with phrasal paraphrases," ScienceDirect, March 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0885230820300978. [Accessed: 29-Jul-2021].
[7] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," arXiv.org, 27-Aug-2019. [Online]. Available: https://arxiv.org/abs/1908.10084. [Accessed: 03-Jun-2021].
[8] R. Mihalcea, C. Corley, and C. Strapparava, "Corpus-based and Knowledge-based Measures of Text Semantic Similarity." [Online]. Available: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.aaai06.pdf. [Accessed: 02-Jun-2021].
[9] "Semantic textual similarity," NLPProgress. [Online]. Available: http://nlpprogress.com/english/semantic_textual_similarity.html. [Accessed: 02-Jun-2021].
[10] X. Liu, P. He, W. Chen, and J. Gao, "Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding," arXiv.org, 20-Apr-2019. [Online]. Available: https://arxiv.org/abs/1904.09482. [Accessed: 03-Jun-2021].
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," arXiv.org, 02-Jan-2020. [Online]. Available: https://arxiv.org/abs/1906.08237. [Accessed: 03-Jun-2021].
[12] C. F. Ho, M. A. A. Murad, R. A. Kadir, and S. C. Doraisamy, "Word Sense Disambiguation-based Sentence Similarity," ACL Anthology, 2010. [Online]. Available: https://aclanthology.org/C10-2048/. [Accessed: 07-Jul-2021].
[13] W. B. Dolan and C. Brockett, "Automatically Constructing a Corpus of Sentential Paraphrases," ACL Anthology, 2005. [Online]. Available: https://aclanthology.org/I05-5002/. [Accessed: 22-Jul-2021].
[14] A. Misra, B. Ecker, and M. Walker, "Measuring the Similarity of Sentential Arguments in Dialogue," in The 17th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), Los Angeles, California, USA, 2016.
[15] "The SICK data set," SICK. [Online]. Available: http://marcobaroni.org/composes/sick.html. [Accessed: 03-Jun-2021].
[16] Zihangdai, "zihangdai/xlnet," GitHub, 16-Dec-2019. [Online]. Available: https://github.com/zihangdai/xlnet/. [Accessed: 03-Jun-2021].
[17] T. Wolf, "huggingface/transformers," GitHub, 16-Jun-2021. [Online]. Available: https://github.com/huggingface/transformers. [Accessed: 16-Jun-2021].
[18] Namisan, "namisan/mt-dnn," GitHub, 16-Feb-2021. [Online]. Available: https://github.com/namisan/mt-dnn/. [Accessed: 03-Jun-2021].
[19] G. Zhu and C. A. Iglesias, "Sematch: Semantic similarity framework for Knowledge Graphs," ScienceDirect, 15-Aug-2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950705117302447. [Accessed: 06-Jul-2021].
[20] G. Zhu and C. A. Iglesias, "gsi-upm/sematch," GitHub, 26-Mar-2019. [Online]. Available: https://github.com/gsi-upm/sematch. [Accessed: 06-Jul-2021].
[21] AndriyMulyar, "AndriyMulyar/semantic-text-similarity," GitHub, 08-Oct-2019. [Online]. Available: https://github.com/AndriyMulyar/semantic-text-similarity. [Accessed: 03-Jun-2021].