Parallel and Distributed
Algorithms for Large
Text Datasets Analysis
Illia Ovchynnikov
Faculty of Electrical Engineering, Automatic Control and Informatics
Opole University of Technology
Supervised by
Dr Mariusz PELC
PhD Borys KUZIKOV
A thesis submitted for the degree of
Bachelor of Computer Science
Opole 2016
Distributed Algorithms for the Analysis of Large Text Datasets
Illia Ovchynnikov
Faculty of Electrical Engineering, Automatic Control and Informatics
Opole University of Technology (Politechnika Opolska)
Supervised by
dr hab. inż. Mariusz Pelc, Professor of Opole University of Technology
PhD Borys Kuzikov
Engineering thesis
First-cycle (Bachelor's) studies
Computer Science
Opole 2016
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License.
Abstract
The exponential growth of the data universe raises new challenges in handling large data sets and requires the search for new data processing solutions. Technological progress has begun to lag behind the real need for computing power. New systems for distributed data processing have appeared to face this problem, and text, being the most popular data type, has become one of the major targets for such systems.
The aim of this study is to explore the possibility of using distributed systems for data processing in the context of near duplicate text detection. The study begins with a discussion of the Big Data concept and moves towards a review of available software frameworks for Big Data processing. Subsequent to this, a set of algorithms used for determining the level of document duplication is investigated.
The results of the background research were applied to develop a prototype that can serve as a basis for an anti-plagiarism software solution. The system performance testing results showed the implemented distributed system to be more effective in the analysis of large text data sets.
Contents
1 Introduction 1
1.1 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Project Background 3
2.1 Big Data Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Big Data Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Systems for Big Data Analytics . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Comparison of processing frameworks . . . . . . . . . . . . . . 9
2.4 Near Duplicate Text Detection . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Document Pre-Processing . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Shingling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.3 Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Determining the Level of Duplication . . . . . . . . . . . . . . 17
3 System Implementation 20
3.1 Software Requirements Specification . . . . . . . . . . . . . . . . . . . 20
3.1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Product Perspective . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Functional Requirements . . . . . . . . . . . . . . . . . . . . . 21
3.2 Implementation Stage . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Testing Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Systems Performance Testing and Comparison 30
5 Conclusion 33
References 34
List of Figures
1 Sectors gains from the use of Big Data. . . . . . . . . . . . . . . . . . 4
2 Apache Hadoop Architecture. . . . . . . . . . . . . . . . . . . . . . . 7
3 Apache Spark Technology Stack. . . . . . . . . . . . . . . . . . . . . . 8
4 Spark deployment configurations in a Hadoop cluster. . . . . . . . . . 9
5 Data flows in Hadoop and Spark systems. . . . . . . . . . . . . . . . 10
6 Comparison of APIs in Hadoop and Spark. . . . . . . . . . . . . . . . 10
7 Example of Plagiarism. . . . . . . . . . . . . . . . . . . . . . . . . . . 11
8 Example of Plagiarism after text normalization. . . . . . . . . . . . . 12
9 Similarity of two documents using ”bag of words” technique: correct
result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
10 Similarity of two documents using ”bag of words” technique: false result. 13
11 Similarity of two documents using shingling technique . . . . . . . . . 14
12 Example of Plagiarism after text normalization and shingling. . . . . 15
13 Partial visualization of shingles intersection for case study example. . 15
14 Example of String hash function in Java. . . . . . . . . . . . . . . . . 16
15 Hashing algorithm that ignores word order . . . . . . . . . . . . . . . 16
16 Example of Plagiarism after text normalization, shingling and hashing 17
17 Internal data representation in Spark-enabled Anti-Plagiarism software. 18
18 Grouping documents by hashed shingle in Spark-enabled Anti-Plagiarism
software. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
19 Mapping documents in Spark-enabled Anti-Plagiarism software to 1,
if their neighbors (duplicated shingles) are in tuples, 0 otherwise. . . . 19
20 Reducing tuples by key, represented by document meta data object . 19
21 UML Use cases diagram for The Plagio system . . . . . . . . . . . . . 21
22 UML Update library use case . . . . . . . . . . . . . . . . . . . . . . 21
23 UML Determining plagiarism level of the document use case . . . . . 22
24 UML Plagio class diagram . . . . . . . . . . . . . . . . . . . . . . . . 23
25 UML ShinglesAlgorithm and HashShinglesAlgorithm classes diagram 24
26 UML Texts class diagram . . . . . . . . . . . . . . . . . . . . . . . . 24
27 UML Metadata and DuplicationReport structures class diagram . . . 25
28 UML GUI PlagioMain and PlagioGUI classes diagram . . . . . . . . 25
29 User interface of The Plagio system . . . . . . . . . . . . . . . . . . . 26
30 UML Sequence diagram for The Plagio System . . . . . . . . . . . . 27
31 Successfully passed unit testing of The Plagio System . . . . . . . . . 28
32 Successfully passed two-phase testing process of The Plagio System . 29
33 Systems initialization overhead test results . . . . . . . . . . . . . . . 31
34 Systems initialization overhead test visualization . . . . . . . . . . . . 31
35 Text shingling test results . . . . . . . . . . . . . . . . . . . . . . . . 31
36 Text shingling test visualization . . . . . . . . . . . . . . . . . . . . . 32
37 Full-cycle documents analysis test results . . . . . . . . . . . . . . . . 32
38 Full-cycle documents analysis test visualization . . . . . . . . . . . . 32
Chapter 1
Introduction
Over the last decade, the data universe has grown exponentially. According to estimates of the International Data Corporation (IDC), data production will reach 40 zettabytes [15]. Interacting with data sources such as the Internet, each of our actions leaves a trail of information in the data universe, and we saturate it with new data every day.
The most popular data type in this universe is text. Nearly 80% of produced data is stored in text documents [17], such as books, emails, notes, and logs. This high rate is caused by the peculiarity of our interaction with computer systems, and even though this may change with the appearance of new human–machine interaction interfaces in the near future, text will keep its leading position in the digital data flow.
Due to its nature, text is one of the least structured data sources, which increases the difficulty of processing this type of data. Together with its leading position in the data flow, this fact gives text mining high research interest and very high commercial potential.
1.1 Statement of the Problem
With the beginning of the digital age, computing systems have greatly increased their capacity. However, even such great technological progress does not allow them to keep pace with data growth, and text data sets appear that are so large that traditional data processing applications are inadequate [22].
Analysis of large text data sets is not only about volume; it is also about the analysis of the text itself. Effective mining of useful information is a challenging task, as it involves dealing with the peculiarities of text data: its heterogeneity, lack of defined structure and foreignness of origin to computers. Such analysis may involve
many disciplines, such as information retrieval, clustering and categorization, text
analysis and extraction, visualization, machine learning and data mining.
The Computer Science department of Sumy State University has faced many of these problems with its anti-plagiarism software, which was developed for determining the plagiarism percentage of student documents. The continuously increasing amount of data in its knowledge base caused significant speed degradation. The current solution has reached its scalability limit and requires a change in the data processing concept.
In order to address this problem, I am going to explore a new method of detecting near duplicate text documents with respect to existing examples, using the most straightforward method of problem solving - splitting the work into subtasks and parallelizing and distributing their execution.
1.2 Research Objectives
The problem of large data set analysis is very broad, and there is no single solution to all the challenges that it brings. In this work I focus on the problems that arose during the development of an anti-plagiarism system.
The main objective of the thesis is to conduct research into new architectural approaches to large data set processing and to develop a scalable anti-plagiarism system. The new solution must be resistant to rapid data growth and exceed the speed of the current software solution.
It is equally important to confirm the effectiveness of the new system through a quality comparison and testing. Both the new and the current software will be tested under different conditions, with a comparison of quantitative indicators.
Chapter 2
Project Background
This chapter describes the general context of this study, to solidify the project background and allow more specific and relevant research. It is divided into two main sections, focused on modern approaches to dealing with the new generation of digital data - Big Data - and on the theoretical framework for near duplicate text detection.
2.1 Big Data Concept
The term Big Data describes large volumes of structured and unstructured data coming from many different sources. It has been gaining public attention against the background of new challenges in data processing. Big Data pushes traditional data management techniques to their limits, which gives rise to novel, more formalized approaches.
Big Data is not only about size; it is about the value within the data. In the "3D Data Management: Controlling Data Volume, Velocity and Variety" research report [10], an analyst from META Group defined the new challenges in data management and decomposed the Big Data concept into three dimensions: data volume, velocity and variety.
Data volume. Many hardware and software systems have started to generate large amounts of data. There are estimates that Google processes about 24 petabytes per day [9], including software logs, avatars and messages, and statistical and service information. Storing such big data flows requires special conditions and new solutions for storing large volumes of data. Data volume is closely related and directly proportional to another Big Data dimension - data velocity.
Data velocity. To cope with the constant increase in the amount of data, it is not enough just to expand the storage space. It is also necessary to process information with high velocity and extract valuable data without informational noise.
Data variety. Because of different data sources, formats and quality, the problem of heterogeneous and unstructured data arises. Modern companies store and process huge volumes of diverse information: checks, transactions, web traffic and call-center records, publications, emails, equipment journals, sensor readings, software logs and much more.
The science of Big Data is very young but has great prospects. The authors of the McKinsey report "Big data: The next frontier for innovation, competition, and productivity" conducted a study of the potential of different sectors in the United States to capture value from Big Data. They used an index that combined five metrics: (1) the amount of data available for use and analysis; (2) variability in performance; (3) the number of stakeholders (customers and suppliers) with which an organization deals on average; (4) transaction intensity; and (5) the turbulence inherent in a sector [8, p. 8]. The results, shown in Figure 1, demonstrated that the introduction of Big Data analysis systems in different sectors would lead to very strong productivity growth.
Figure 1: Sectors gains from the use of Big Data.
Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis [10].
Big Data has already become an important factor in manufacturing, marketing, business analytics and science. For example, the website ancestry.com tries to trace the history of all mankind based on all currently available data: handwritten notes, books and even DNA analysis. To date, it has managed to gather about five billion profiles of people who lived in very different historical periods, and 45 million family trees [2] describing relationships within families.
Thus, when we talk about Big Data, we understand that it relates to three aspects: a large amount of information, its diversity and the need to process data very quickly. On the other hand, this term is often understood as a very specific set of approaches and technologies for addressing these challenges. Large text data sets are a particular case of the Big Data concept and inherit all of its challenges.
2.2 Big Data Technologies
Big Data requires exceptional technologies for efficient and fast processing of large data sets in a reasonable time. There are many combinations of hardware and software that allow effective Big Data solutions to be created for various business disciplines, from social media and mobile applications to intelligent analysis and visualization of business data. Some techniques were developed for far smaller data sets and have been successfully adapted for Big Data analysis; others have been developed recently for effective data extraction from large data sets.
Almost every method of processing large data sets lies at the intersection of several disciplines, such as statistics, mathematics and computer science. Despite the relative youth of the research topic, there are many different Big Data processing methods. However, there is no single approach or method for handling large data sets. Some algorithms are suitable for high-velocity data, while others cope better with text data.
from McKinsey Global Institute [8, p. 27-31] are:
• Data Mining. A set of techniques for data extraction that combines associa-
tion rule learning, cluster analysis, regression analysis and classification meth-
ods;
• Crowdsourcing. A technique that involves handing over certain data categorization and enrichment tasks to a wide, undefined group of people;
• Data Fusion and Integration. A set of techniques for integrating heteroge-
neous data from multiple sources to enable in-depth analysis;
• Machine learning, including supervised and unsupervised learning and the use of models built on the basis of statistical analysis;
• Artificial neural networks and Genetic algorithms;
• Pattern recognition and Visualization.
This research focuses on Data Mining and cluster analysis approaches based on a distributed computing system, where the processing of large volumes of data requires not one high-performance machine but a group of computers, across which the data is clustered and spread between the working nodes.
2.3 Systems for Big Data Analytics
New challenges require new approaches to building systems for data processing. Big Data gives rise to new software frameworks and programming models that can cope with large data flows.
Most Big Data analytics systems are based on the divide-and-conquer algorithm design paradigm and split a problem into smaller tasks until the individual problems can be solved independently. The received results are then combined to answer the original question. To implement this behavior, distributed systems for storing and processing data were created. Such systems can efficiently perform calculations on multiple computers in a cluster in parallel.
In this section, two popular systems for Big Data analytics are described and compared from the perspective of improving the existing anti-plagiarism system.
2.3.1 Apache Hadoop
Apache Hadoop is a software framework which offers a full-stack ecosystem for distributed data storage and processing. The Hadoop solution makes it possible to create highly scalable distributed systems across clusters of computers. It was designed with the fundamental assumption that failures during data processing are unavoidable and should be handled by the framework automatically [7]. This approach allows highly available services to be built without thinking about failures.
The Apache Hadoop framework is split into independent modules that are responsible for different aspects of the distributed environment:
Figure 2: Apache Hadoop Architecture.
Source: Hortonworks Inc. Apache Hadoop YARN: Present and Future [7].
The Hadoop data core is represented by a distributed file system, called the Hadoop Distributed File System (HDFS), which provides scalable and fault-tolerant data storage spread across the Hadoop cluster. This layer provides data distribution over the network, communication between nodes and data replication. To achieve reliability, HDFS splits big files into smaller parts and replicates them across the data cluster. In case of failure, HDFS recovers lost data from the replicated copies.
The next abstraction level is Hadoop YARN - the data operating system. It is responsible for computing resource management and task scheduling. It manages the way applications use resources and provides a higher-level API for interaction with a Hadoop-enabled cluster.
The processing part of the framework is represented by an implementation of the MapReduce programming model, called Hadoop MapReduce. This framework is designed for disk-based parallel processing of data stored in the Hadoop cluster.
The MapReduce model is a sequence of a map() step, which performs data filtering and sorting, and a reduce() step, which summarizes the results. Hadoop MapReduce takes chunks of files from HDFS via the YARN data operating system and passes them through the MapReduce model, where the user-defined map() and reduce() methods extract the valuable information.
From version 2.0, after the introduction of the new architecture and the decoupling of its modules, Apache Hadoop created a favorable environment for additional software, such as Apache Hive and Apache HBase, that can be installed on top of a Hadoop cluster [20].
2.3.2 Apache Spark
From its very beginning in 2009, Apache Spark has focused on large-scale data processing [3] on top of the Hadoop ecosystem [13]. Spark's closest relative in the Apache Hadoop framework is Hadoop MapReduce, but in contrast to its disk-based MapReduce paradigm, Spark relies on in-memory algorithms.
Spark defines the concept of cluster memory - memory shared across the nodes in a cluster, where Spark's in-memory primitives are executed. In addition to typical data processing scenarios similar to MapReduce, Spark's approach of performing calculations in cluster memory allows it to implement stream processing and SQL techniques, interactive and analytic queries, machine learning and graph processing.
These functionalities are divided into four components, displayed in Figure 3.
Figure 3: Apache Spark Technology Stack.
Source: Apache Spark - Lightning-Fast Cluster Computing [3].
Apache Spark introduces an abstraction that describes data partitioned across many nodes, called Resilient Distributed Datasets (RDDs). An RDD is very similar to the standard data collections available in most programming languages, such as Java Collections, with the one difference that its elements can be stored on many computers in a cluster. This architectural solution decreases the complexity of interacting with distributed data, because operating on an RDD is similar to manipulating a logical data collection. A high-level API for RDD transformations is provided in the Java, Scala, Python and R programming languages.
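For illustration only, the sketch below uses the Java API to create an RDD in Spark's local mode and run a transformation and an action on it as if it were an ordinary collection; it is not taken from the Plagio code.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        // Local mode: the "cluster" runs inside the current JVM, no setup required
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In a real cluster the partitions of this RDD may live on many machines,
        // but the code manipulates it like a logical collection
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("hello spark", "hello hadoop"));
        int totalLength = lines
                .map(String::length)      // transformation, executed lazily
                .reduce(Integer::sum);    // action, triggers the computation

        System.out.println("Total length: " + totalLength);
        sc.close();
    }
}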
Apache Spark requires a cluster manager and a distributed data storage [21]. It can use Hadoop YARN or Apache Mesos for cluster management, but it also has a native standalone Spark cluster manager. Spark can access data in the Hadoop Distributed File System (HDFS), Cassandra, HBase, Hive and any Hadoop data source [3] and use them as distributed data storage.
Typical Spark deployment configurations in a Hadoop cluster are shown in Figure 4. However, there are many more possible combinations of distributed data storage and cluster managers.
Figure 4: Spark deployment configurations in a Hadoop cluster.
Source: Ion Stoica. Spark and Hadoop: Working Together [16].
It is worth noting that even though Spark is still quite young, it is used at a wide range of organizations, such as Amazon, IBM, NASA, Yahoo!, eBay and others [4]. It gained its popularity through rapid in-memory calculation and by being a general engine for a wide range of large-scale data processing [3].
2.3.3 Comparison of processing frameworks
Apache Spark's developers claim their processing framework runs programs up to 100x faster than Apache Hadoop MapReduce [3]. Such a high result is achieved due to a fundamental difference in the computing approach: Hadoop MapReduce is a disk-based algorithm which interacts with the disk after each map() or reduce() operation, while Apache Spark tries to perform calculations entirely in cluster memory and refers to disk-based storage only if the data is too big to fit into memory. These algorithm behaviors are visually presented in Figure 5.
Figure 5: Data flows in Hadoop and Spark systems.
Source: Nicole Hemsoth. Flink Sparks Next Wave of Distributed Data Processing [6].
One of Spark's biggest advantages is its API. It offers over 80 high-level operations [3] for transforming, filtering and grouping data. In contrast, the MapReduce model is based on the two operations map() and reduce(), which do not always fulfill all requirements for data transformation.
Figure 6: Comparison of APIs in Hadoop and Spark.
Apache Hadoop's documentation is much more mature than Apache Spark's. Nevertheless, Spark's documentation is easy to understand, has a faster learning curve and is full of basic examples that allow you to see it in action.
An equally important feature of Apache Spark is its pseudo-distributed mode, which does not require distributed data storage and can run locally with no configuration. This mode is intensively used for development purposes without access to a physical cluster. Moreover, Apache Spark comes with an interactive shell for running commands [12].
Considering the richer API, faster computations and easier development process, Apache Spark has been chosen as the basis for the research in this thesis.
2.4 Near Duplicate Text Detection
Plagiarism has become a serious problem in education, industry and the scientific community in recent years. The rapid development of the Internet, together with increasing computer literacy, contributes to the penetration of plagiarism into various spheres of human activity. Data spreads over the Internet instantly, and respecting copyright becomes increasingly difficult and sometimes even impossible.
The relevance of the problem raises the need to develop methods of near duplicate text detection. In this section, the set of algorithms used for determining the level of duplication is described, using the two excerpts from Figure 7 as a case study.
Original fragment: "Similarly, when he leaps into the open grave at Ophelia's funeral, ranting in high heroic terms, he is acting out for Laertes, and perhaps for himself as well, the folly of excessive, melodramatic expressions of grief."
Plagiarism: "And when he leaps in ... Ophelia's open grave ranting in High [Heroic] Terms, Hamlet is acting out the folly of excessive, melodramatic expressions of grief."
Figure 7: Example of Plagiarism.
Source: Examples of Plagiarism - Academic Integrity at Princeton University [1]
2.4.1 Document Pre-Processing
Text is not just a set of words. A text document can contain different types of text data: numbers, special characters, service symbols, punctuation and more. However, not all text data is important. Such data can confuse algorithms or impair the results of text processing. That is why nearly every algorithm that works with text data begins its work with text normalization.
The process of text normalization focuses on transforming text into a single canonical form [18] to make further processing more precise.
The most common techniques of text normalization are:
Case distinction normalization. Case-sensitive algorithms treat the same letters in different cases differently. For example, the word "Apple" is not equal to "apple" for them: despite the lexical identity, the technical representation of these words is different. Case distinction normalization is based on converting all characters in the text to upper- or lower-case so that case-sensitive algorithms ignore the text case.
Text formatting cleaning. Text decoration has no lexical meaning and is used exclusively to improve visual appearance; therefore text normalization algorithms usually strip text formatting.
Unwanted symbols cleaning. This technique focuses on cleaning the text of undesirable symbols that have no semantic meaning for the algorithm but may pollute it with unnecessary information or even break its normal behavior. The composition of the unwanted symbols list depends on the algorithm's requirements, but it is very common to clean the text of whitespace, special characters and punctuation.
Stop words clean-up. Stop words are the most common words in natural language. Stop words such as "actually", "too", "very", "a" and "the" carry no weight in the sentence and can therefore be ignored to reduce text noise.
These common techniques clean the text of undesirable noise and focus further processing algorithms on valuable information only. The result of applying these methods to the text examples of the case study (Figure 7) is shown in Figure 8.
Original fragment: similarly when leaps open grave ophelias funeral ranting high heroic terms acting out laertes perhaps well folly excessive melodramatic expressions grief
Plagiarism: when leaps ophelias open grave ranting high heroic terms hamlet acting out folly excessive melodramatic expressions grief
Figure 8: Example of Plagiarism after text normalization.
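As a simple illustration of these normalization steps, the sketch below lower-cases the text, removes non-letter characters and filters out a few stop words. It is only a sketch: it does not reproduce the exact behavior of the Texts class developed in this thesis, and the stop word list is an assumption made purely for the example.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class NormalizationSketch {
    // Illustrative stop word list; a real list would be much larger
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "too", "very", "actually"));

    static String normalize(String text) {
        return Arrays.stream(text
                        .toLowerCase()                      // case distinction normalization
                        .replaceAll("[^a-z\\s]", " ")       // unwanted symbols cleaning
                        .split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word)) // stop words clean-up
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Similarly, when he leaps into the open grave..."));
        // prints: similarly when he leaps into open grave
    }
}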
2.4.2 Shingling
To find duplicate documents, their parts should be compared with each other according to defined rules. The most straightforward method is to treat a document as a "bag of words", which means a document such as "a bump on the log in the hole in the bottom of the sea" will be exploded into a set of words similar to this [11]:
[a, in, of, on, the, log, sea, bump, hole, bottom]
To determine document similarity, the Jaccard index can be used. It is defined as the number of words common to both documents divided by the number of words in their union [19]. A comparison of "a frog on the bump on the log in the hole in the bottom of the sea" with the document above is visualized in Figure 9 using an Euler diagram.
Figure 9: Similarity of two documents using ”bag of words” technique: correct result
Source: Dan Lecocq. Near-Duplicate Detection [11].
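A minimal sketch of this comparison - building the word sets and computing the Jaccard index as the size of their intersection divided by the size of their union - could look as follows (illustrative only, not part of the thesis software):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
    // Jaccard index: |A ∩ B| / |A ∪ B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> doc1 = new HashSet<>(Arrays.asList(
                "a bump on the log in the hole in the bottom of the sea".split(" ")));
        Set<String> doc2 = new HashSet<>(Arrays.asList(
                "a frog on the bump on the log in the hole in the bottom of the sea".split(" ")));
        System.out.println(jaccard(doc1, doc2)); // close to 1.0, as suggested by Figure 9
    }
}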
According to the diagram, these documents are almost identical. Nevertheless, this technique is quite unreliable. Consider a new comparison of the documents "Your mother drives you in the car" and "In mother Russia, car drives you!" [11], depicted in Figure 10.
Figure 10: Similarity of two documents using ”bag of words” technique: false result.
Source: Dan Lecocq. Near-Duplicate Detection [11].
Despite the fact that these documents have completely different meanings, the "bag of words" technique shows them as very similar. This exposes the biggest flaw of this method - ignoring the context of words.
13
To fix issues that were identified in the previous method, text chunks should be
endowed with surrounding context. This can by achieved with a so-called shingling
approach, when documents are exploded in a set of overlapping phrases (shingles).
Considering last example, document can be described as a set of shingles like this:
[’your mother drives’, ’mother drives you’, ’drives you in’, ’you in the’,
’in the car’]
As a result of applying the shingling approach, it is clearly visible in Figure 11 that the compared documents are quite different, as we expected.
Figure 11: Similarity of two documents using shingling technique
Source: Dan Lecocq. Near-Duplicate Detection [11].
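A shingling step of this kind can be sketched in a few lines of Java. The snippet below builds overlapping word triples from an already normalized text; it is a sketch, not the ShinglesAlgorithm class implemented in this thesis.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShinglingSketch {
    // Explodes a normalized text into overlapping word n-grams (shingles)
    static List<String> shingles(String normalizedText, int size) {
        String[] words = normalizedText.split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Produces: [your mother drives, mother drives you, drives you in, you in the, in the car]
        System.out.println(shingles("your mother drives you in the car", 3));
    }
}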
Continuing the analysis of the case study, the two normalized text fragments have been exploded into sets of shingles, listed in Figure 12 and partially visualized in Figure 13.
Original fragment:
[similarly when leaps, when leaps open, leaps open grave, open grave ophelias, grave ophelias funeral, ophelias funeral ranting, funeral ranting high, ranting high heroic, high heroic terms, heroic terms acting, terms acting out, acting out laertes, out laertes perhaps, laertes perhaps well, perhaps well folly, well folly excessive, folly excessive melodramatic, excessive melodramatic expressions, melodramatic expressions grief]
Plagiarism:
[when leaps ophelias, leaps ophelias open, ophelias open grave, open grave ranting, grave ranting high, ranting high heroic, high heroic terms, heroic terms hamlet, terms hamlet acting, hamlet acting out, acting out folly, out folly excessive, folly excessive melodramatic, excessive melodramatic expressions, melodramatic expressions grief]
Figure 12: Example of Plagiarism after text normalization and shingling.
Figure 13: Partial visualization of shingles intersection for case study example.
2.4.3 Hashing
The newly created shingles are sufficient to detect similar parts in documents. However, there is still scope for algorithm optimization using hashing.
Hashing is the transformation of an object into a fixed-length value that represents the original object. Java supports such a transformation out of the box for all of its objects: the hashCode() method, which implements a hash function, returns an integer representation of the object, as shown in Figure 14.
Original text                  Object hash code
"Cats rule this world"         432893073
"Funny video"                  -228607931
new int[]{2, 3, 6, 21}         1829164700
Figure 14: Example of String hash function in Java.
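For instance, obtaining such hash codes takes a single call; the printed values follow String.hashCode(), which is computed deterministically from the characters of the string.

public class HashCodeSketch {
    public static void main(String[] args) {
        // Every Java object provides hashCode(); for String it is derived from its characters
        System.out.println("Cats rule this world".hashCode());
        System.out.println("Funny video".hashCode());
    }
}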
Hashing the shingles solves multiple problems at once:
Decreased memory consumption. In a distributed environment it is very important to decrease the amount of data transferred between the computers in a cluster. A Java String takes roughly three times more memory than an Integer value [14].
Increased computation speed. Operations on text data are more complicated and slower, while handling integers is more lightweight and precise.
New opportunities for improvements. Manipulating hashes does not only speed up computations, but also adds new operations that can be applied, such as arithmetic addition and subtraction.
The new hashing algorithm improves calculation speed and memory consumption and becomes resistant to changes of word order within shingles. The improved hashing algorithm is illustrated in Figure 15.
"similarly when leaps"  →  hash("similarly") + hash("when") + hash("leaps")  =  1970083000 + 3648314 + 102845963  =  2076577277
Figure 15: Hashing algorithm that ignores word order
By representing text shingles as numbers, the new algorithm uses a fundamental mathematical property - the commutative law of addition - to ignore the word positions within an individual shingle.
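A sketch of this order-insensitive hashing scheme is shown below. The real HashShinglesAlgorithm class may differ in details, but the idea is simply to sum the hashCode() values of the words of a shingle.

import java.util.List;
import java.util.stream.Collectors;

public class HashShinglesSketch {
    // The sum of the word hash codes is commutative, so the word order inside a
    // shingle does not change the result (integer overflow is acceptable here)
    static int hashShingle(String shingle) {
        int sum = 0;
        for (String word : shingle.split("\\s+")) {
            sum += word.hashCode();
        }
        return sum;
    }

    static List<Integer> hashShingles(List<String> shingles) {
        return shingles.stream().map(HashShinglesSketch::hashShingle).collect(Collectors.toList());
    }
}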
The resulting table is shown in Figure 16:
Original fragment:
[2076577277, 109911951, 204879450, -324321700, -837324887, 42438961, 471996614, -239171645, -1107300931, 1761513003, -1312590107, -1482876578, -738707801, -735172569, -577523273, 1585625505, -1979663435, -1901721856, 1007501756]
Plagiarism:
[-319860910, -320091550, -324321700, 1080413148, 1080197940, -239171645, -1107300931, 1959864098, 1757666974, 1647527013, -1325227282, 1582090273, -1979663435, -1901721856, 1007501756]
Figure 16: Example of Plagiarism after text normalization, shingling and hashing
2.4.4 Determining the Level of Duplication
The final stage is to determine the level of plagiarism in a document based on the information already obtained - the hashed shingles. This phase entirely depends on the type of system where the calculations are performed.
Since the Apache Spark processing framework was chosen as a result of the comparison in Section 2.3.3, the new algorithm for calculating the level of duplication relies on Spark's high-level transformation API.
Internally, the system stores hashed shingles and the corresponding documents in tuples. They can be treated as an associative array distributed over the nodes in the cluster, as partially shown in Figure 17, where the key is the hash of a shingle and the value is the document's metadata.
(2076577277, {name=original, totalShingles=19}),
(109911951, {name=original, totalShingles=19}),
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(-1979663435, {name=original, totalShingles=19}),
(-1901721856, {name=original, totalShingles=19}),
(1007501756, {name=original, totalShingles=19});
(-319860910, {name=plagiarism, totalShingles=15, marked=true}),
(-320091550, {name=plagiarism, totalShingles=15, marked=true}),
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(-1979663435, {name=plagiarism, totalShingles=15, marked=true}),
(-1901721856, {name=plagiarism, totalShingles=15, marked=true}),
(1007501756, {name=plagiarism, totalShingles=15, marked=true});
Tuple (shingle; document meta data object)
Figure 17: Internal data representation in Spark-enabled Anti-Plagiarism software.
The marked flag determines whether a document will be added to the Duplication Report or not.
The next step focuses on grouping elements by the tuple's key, so that its value becomes the set of metadata objects of the documents that contain the hashed shingle.
(2076577277, {name=original, totalShingles=19}),
(109911951, {name=original, totalShingles=19}),
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(-319860910, {name=plagiarism, totalShingles=15, marked=true}),
(-320091550, {name=plagiarism, totalShingles=15, marked=true}),
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(-1979663435, {name=original, totalShingles=19}, {name=plagiarism,
totalShingles=15, marked=true}),
(-1901721856, {name=original, totalShingles=19}, {name=plagiarism,
totalShingles=15, marked=true}),
(1007501756, {name=original, totalShingles=19}, {name=plagiarism,
totalShingles=15, marked=true});
Tuple (shingle; set of document meta data objects)
Figure 18: Grouping documents by hashed shingle in Spark-enabled Anti-Plagiarism
software.
After that, the system maps each document metadata object from the tuple's value set to 1 if the set contains several document metadata objects (meaning the shingle appears in more than one document), and to 0 otherwise. Filtering out non-marked document metadata objects also happens at this step.
({name=plagiarism, totalShingles=15, marked=true}, 1),
({name=plagiarism, totalShingles=15, marked=true}, 1),
({name=plagiarism, totalShingles=15, marked=true}, 1),
({name=plagiarism, totalShingles=15, marked=true}, 1),
({name=plagiarism, totalShingles=15, marked=true}, 1),
({name=plagiarism, totalShingles=15, marked=true}, 1);
Tuple (document metadata object, 1 (has coincidences) / 0)
Figure 19: Mapping documents in Spark-enabled Anti-Plagiarism software to 1, if
their neighbors (duplicated shingles) are in tuples, 0 otherwise.
The resulting tuple list is reduced by the tuple's key using the addition operation; strictly speaking, equal document metadata objects are merged while their values are summed up.
({name=plagiarism, totalShingles=15, marked=true}, 6),
Tuple (unique document metadata object, total coincidences)
Figure 20: Reducing tuples by key, represented by document meta data object
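Putting the steps of Figures 17-20 together, a sketch of this transformation chain in Spark's Java API could look as follows. It assumes the Spark 2.x Java API (where flat-map functions return an Iterator) and a simplified Metadata class defined for the example; the real Plagio classes and method signatures may differ.

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class DuplicationPipelineSketch {

    // Simplified stand-in for the document metadata object shown in Figure 17
    public static class Metadata implements Serializable {
        final String name;
        final int totalShingles;
        final boolean marked;

        Metadata(String name, int totalShingles, boolean marked) {
            this.name = name;
            this.totalShingles = totalShingles;
            this.marked = marked;
        }
        // equals() and hashCode() based on the fields are omitted for brevity, but they
        // are required so that reduceByKey() can merge tuples of the same document
    }

    // Input: (hashed shingle, document metadata) tuples, as in Figure 17
    static JavaPairRDD<Metadata, Integer> coincidences(JavaPairRDD<Integer, Metadata> pairs) {
        return pairs
                // Figure 18: group the document metadata objects by hashed shingle
                .groupByKey()
                // Figure 19: for every marked document, emit 1 if the shingle also
                // occurs in another document, and 0 otherwise
                .flatMapToPair(group -> {
                    List<Metadata> docs = new ArrayList<>();
                    group._2().forEach(docs::add);
                    List<Tuple2<Metadata, Integer>> out = new ArrayList<>();
                    for (Metadata doc : docs) {
                        if (doc.marked) {
                            out.add(new Tuple2<>(doc, docs.size() > 1 ? 1 : 0));
                        }
                    }
                    return out.iterator();
                })
                // Figure 20: sum up the coincidences per unique document metadata object
                .reduceByKey(Integer::sum);
    }
}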
As a result of the previous operations, all the information necessary for calculating the duplication level is available. It can easily be calculated using this formula:

D = (C / S) × 100%

where:
C is the total number of coinciding hashes;
S is the number of shingles (hashes) in the entire document.

Using the result above and this formula, the level of plagiarism of the document named "plagiarism" can be calculated this way:

D = (6 / 15) × 100% = 40%
Chapter 3
System Implementation
This chapter describes the practical part of the thesis and realizes the main objective of this research - the development of a scalable anti-plagiarism system. The chapter is divided into sections that follow the main stages of the System Development Life Cycle (SDLC) and covers the development of the software solution.
3.1 Software Requirements Specification
3.1.1 Purpose
The software system will be an anti-plagiarism solution for the Computer Science department of Sumy State University. The purpose is to develop a scalable anti-plagiarism system that is resistant to rapid data growth and exceeds the speed of the current software solution.
The previous software solution became extremely slow because of an overgrown relational database and a single-threaded algorithm. The task was to revise the approach to data processing and storage.
3.1.2 Product Perspective
The Plagio system is a software solution for finding near duplicate text documents using distributed processing solutions such as Apache Spark. The scope of the project encompasses both core-side and user-side (GUI) functionalities. The diagram below illustrates user interaction with the Plagio system.
Figure 21: UML Use cases diagram for The Plagio system
3.1.3 Functional Requirements
The main functionalities of the anti-plagiarism system are the determination of the plagiarism level and the enrichment of the document database. This section outlines each use case separately.
Use case: Update library
Figure 22: UML Update library use case
Actors: Administrator
Precondition: an Apache Spark cluster is available and accessible from the user's machine.
Basic flow of the event:
1. The Administrator inputs the address of the Apache Spark Master Node, the documents library and input documents paths, the shingle size, enables/disables text normalization and enables library update mode.
2. The input data is validated.
3. The GUI passes the input data to the Plagio engine, which processes the documents on the Apache Spark cluster.
4. New documents are cached in the distributed storage; the GUI shows a confirmation message.
Alternative flows:
2a. The input data is not valid.
[2a1] The GUI shows an error message.
Use case: Determine plagiarism level of the document
Figure 23: UML Determining plagiarism level of the document use case
Actors: Administrator
Precondition: an Apache Spark cluster is available and accessible from the user's machine.
Basic flow of the event:
1. The Administrator inputs the address of the Apache Spark Master Node, the documents library and input documents paths, the shingle size, and enables/disables text normalization and document caching.
2. The input data is validated.
3. The GUI passes the input data to the Plagio engine, which processes the documents on the Apache Spark cluster.
4. Duplication reports are printed. New documents are cached in the distributed storage if the user enabled document caching mode. The GUI shows a confirmation message.
Alternative flows:
2a. The input data is not valid.
[2a1] The GUI shows an error message.
Design Constraints:
1. Low system requirements for client machine.
2. Fault tolerance.
3.2 Implementation Stage
With scalability in mind, the Plagio system has been implemented using Apache Spark on top of the Hadoop Distributed File System (HDFS). The general application is represented by a driver program which governs the Spark cluster in accordance with the implemented algorithm. HDFS is responsible for storing the shingles library across the nodes in the cluster.
The implemented system consists of two parts: the core engine and the GUI interface.
The core part is represented by the Plagio class and is responsible for managing the Apache Spark cluster in terms of data retrieval, processing and calculations. It was developed in the form of a software library with a simple API to keep it flexible and facilitate its use in other projects.
Figure 24: UML Plagio class diagram
The shingles algorithm is represented by two classes: ShinglesAlgorithm and HashShinglesAlgorithm. The first class implements the classical algorithm of text shingling, while the second extends its capabilities with the shingle hashing algorithm described in Section 2.4.3.
Figure 25: UML ShinglesAlgorithm and HashShinglesAlgorithm classes diagram
The Texts class implements the text normalization algorithms: text formatting, special character and stop word cleaning.
Figure 26: UML Texts class diagram
There are two data structures: Metadata and DuplicationReport. The first is used in the algorithm for storing metadata about documents, such as the document name and the number of shingles. The results of the plagiarism search are written into DuplicationReport, which stores information such as the duplication level and the number of coincidences.
Figure 27: UML Metadata and DuplicationReport structures class diagram
The GUI module is an intermediary between the user and the core library and is represented by two classes: PlagioGUI and PlagioMain. The first class describes the structure of the interface using the Swing GUI library, while PlagioMain contains the glue code between the Plagio class and the user input from PlagioGUI.
Figure 28: UML GUI PlagioMain and PlagioGUI classes diagram
Figure 29: User interface of The Plagio system
The interface provides three working modes: library update, duplication report and combined mode. The library update mode performs no plagiarism detection and is responsible for enriching the knowledge base with new documents, while the main responsibility of the duplication report and combined modes is to perform plagiarism detection and prepare duplication reports. In addition, the combined mode updates the knowledge base the same way the library update mode does. The workflow of this mode is described in Figure 30 in the form of a UML sequence diagram.
Figure 30: UML Sequence diagram for The Plagio System
3.3 Testing Stage
Every Plagio build passes through a two-phase testing process. If the build passes both stages, it can be released to production.
The first testing stage focuses on checking the core functionalities: library update and determining the level of plagiarism. Thanks to Spark's built-in local standalone mode, it is possible to test Spark-enabled programs on a virtual local cluster. Based on the data acquired during the manual determination of the plagiarism level in Section 2.4, a test case using the unit testing methodology has been implemented. This test case simulates the submission of the two text excerpts specified in Section 2.4 to the system and asks Plagio to provide duplication reports. The tests are considered passed if the level of plagiarism and the number of coinciding shingles match the manually calculated values.
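The following fragment is a simplified illustration of that approach rather than the actual PlagioTest class: assuming JUnit 4 and the Spark Java API, it runs a tiny job on Spark's local mode and checks a duplication level against a manually computed value.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class LocalModeTestSketch {

    @Test
    public void duplicationLevelMatchesManualCalculation() {
        // "local[*]" starts a virtual local cluster inside the JVM, so no physical
        // Spark cluster is needed to exercise the processing logic
        SparkConf conf = new SparkConf().setAppName("plagio-test-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Simplified stand-in for the hashed shingles of a library document and a suspect one
            List<Integer> library = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
            List<Integer> suspect = Arrays.asList(3, 4, 5, 100, 200);   // 3 of 5 shingles coincide

            long coincidences = sc.parallelize(suspect)
                                  .intersection(sc.parallelize(library))
                                  .count();
            double duplicationLevel = 100.0 * coincidences / suspect.size();

            assertEquals(60.0, duplicationLevel, 0.001);
        }
    }
}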
This test is run before packaging Plagio's distribution by the Maven build automation tool. A successfully passed test run returns the following results:
-----------------------------------------------------------------------
T E S T S
-----------------------------------------------------------------------
Running eu.ioservices.plagio.test.PlagioTest
-> testEmptyLib() test
[DuplicationReport{docCoincidences=0, duplicationLevel=0.0,
documentMetadata=DocumentMetadata{documentId=’file:/C:/orig.txt’,
totalShingles=19, isMarked=true}}]
-> testLibrary() test
[DuplicationReport{docCoincidences=6, duplicationLevel=40.0,
documentMetadata=DocumentMetadata{documentId=’file:/C:/plag.txt’,
totalShingles=15, isMarked=true}}]
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
-----------------------------------------------------------------------
BUILD SUCCESS
-----------------------------------------------------------------------
Figure 31: Successfully passed unit testing of The Plagio System
The second testing stage checks Plagio's user interface to ensure that it meets its specification. Plagio's GUI is a mediator between the user and the core system; its main responsibility is to validate user input and pass the data to the core for further processing. Using the same input data as in the test above, the GUI test repeats the steps from the core unit testing, paying attention to the correct interaction between the components. The results of a successfully passed test must match the unit testing results from the test above:
Figure 32: Successfully passed two-phase testing process of The Plagio System
Chapter 4
Systems Performance Testing and
Comparison
The purpose of this chapter is to test the current anti-plagiarism solution and the Plagio system with the future rapid growth of data in mind. The aim of the testing was not to test the systems under equal conditions but to conduct a performance comparison of the approaches in a more general context.
The tests in this section were performed on platforms based on an Intel Core i5-3317U 1.7 GHz CPU with 10 GB RAM and an Intel Core i5-3210M 2.5 GHz CPU with 8 GB RAM. As the test data, 500 MB of free plain-text e-books from Project Gutenberg were used.
The Apache Spark cluster was deployed on both machines, while the Plagio system controlled it from the weaker platform. The Sumy State University anti-plagiarism application was tested on only one computer, because it does not support a distributed mode.
Regarding the test results, it was assumed that they would show the Apache Spark based solution to be excessive for small text data sets. This assumption was made because Spark must manage the distributed environment and perform data distribution over the nodes, data synchronization and task scheduling, which leads to additional data transfer and computation overhead. Moreover, the available testing environment does not meet Spark's minimum requirements, so it does not reveal all of Spark's performance capabilities.
The first test measured the time the systems need to initialize themselves before being ready for data processing:
System    Start 1   Start 2   Start 3   Start 4   Start 5   Start 6
Plagio    3.10s     1.70s     1.65s     1.75s     1.60s     1.65s
SSU       0.01s     0.01s     0.01s     0.01s     0.01s     0.01s
Figure 33: Systems initialization overhead test results
Figure 34: Systems initialization overhead test visualization
The results clearly show the superiority of the SSU anti-plagiarism system, which loads almost instantly, while the Spark-enabled solution needs time to initialize its context and connect to the Apache Spark master node in the cluster.
The second test focused exclusively on data processing with the shingling algorithm:
System    20 MB     50 MB       150 MB      250 MB      500 MB
Plagio    1028s     2002.12s    4621.53s    6621.53s    11041.51s
SSU       956.1s    2013.36s    6240.3s     9297.2s     19756.33s
Figure 35: Text shingling test results
Figure 36: Text shingling test visualization
The results confirm our assumption that the Spark-enabled Plagio solution may be excessive for small text data sets, but once the amount of data increased and exceeded 100 MB, it surpassed its opponent.
The last test examined performance in the most common use case: shingling, checking for plagiarism and caching the new shingles:
System    20 MB       50 MB      150 MB      250 MB      500 MB
Plagio    1364.12s    2712.4s    8249.1s     13725.4s    28941.7s
SSU       2747.5s     6137.3s    21192.1s    38369.3s    102252.8s
Figure 37: Full-cycle documents analysis test results
Figure 38: Full-cycle documents analysis test visualization
Due to an inefficient algorithm, the SSU anti-plagiarism solution compares each hashed shingle with the records in a relational database, which results in high overhead from SQL queries.
Chapter 5
Conclusion
With a constantly growing data flow, it has become extremely important to extract only the valuable information in time. Unfortunately, the data flow increases faster than technological progress, and modern computers fall behind very large data sets. That is where distributed data processing systems come to the front line.
This research project focused on developing a software solution for near duplicate text detection based on distributed systems for data processing. The effectiveness of the implemented system has been confirmed through a quality comparison and testing.
According to the test results, the implemented system proved effective in comparison to the anti-plagiarism system developed at Sumy State University. Moreover, the research results confirmed the initial assumption that a Spark-enabled solution may be inefficient for small data sets, although it showed decent results even for small data. The two systems can be used together to achieve the best computation speed for both small and big data sets.
Despite the positive results of the system performance testing, there are opportunities to improve the current version of the prototype. Possible improvements include optimizing the shingling algorithm and the load distribution factor for more effective distribution of file chunks over the cluster's nodes.
References
[1] Examples of Plagiarism - Academic Integrity at Princeton University. https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/, 2011. [Online; accessed January 22, 2016].
[2] Ancestry.com LLC fourth quarter and full year 2012 financial results, 2013. http://corporate.ancestry.com/press/press-releases/2013/03/ancestrycom-llc-reports-fourth-quarter-and-full-year-2012-financial-results/.
[3] Apache Spark - Lightning-Fast Cluster Computing. http://spark.apache.org/, 2016. [Online; accessed January 18, 2016].
[4] Andy Konwinski, Sean Owen, Reynold Xin. Powered By Spark - Spark - Apache Software Foundation. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark, 2015. [Online; accessed January 20, 2016].
[5] DIS Group. Hadoop technology. http://www.dis-group.ru/files/hadoop_ot_dis_group.pdf, 2013. [Online; accessed January 15, 2016].
[6] Nicole Hemsoth. Flink Sparks Next Wave of Distributed Data Processing. http://www.nextplatform.com/2015/02/22/flink-sparks-next-wave-of-distributed-data-processing/, 2015. [Online; accessed January 20, 2016].
[7] Hortonworks Inc. Apache Hadoop YARN: Present and Future. http://hortonworks.com/blog/apache-hadoop-yarn/. [Online; accessed January 15, 2016].
[8] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute Reports, 2011. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation.
[9] Jeffrey Dean, Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM - 50th anniversary issue: 1958-2008, 2008. http://dl.acm.org/citation.cfm?doid=1327452.1327492.
[10] Doug Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. Technical report, META Group, 2001.
[11] Dan Lecocq. Near-Duplicate Detection. https://moz.com/devblog/near-duplicate-detection/, 2015. [Online; accessed January 23, 2016].
[12] Saggi Neumann. Spark vs. Hadoop MapReduce · Xplenty. https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/, 2014. [Online; accessed January 21, 2016].
[13] Madhukara Phatak. History of Apache Spark: Journey from Academia to Industry. http://blog.madhukaraphatak.com/history-of-spark/, 2015. [Online; accessed January 18, 2016].
[14] Vladimir Roubtsov. Java Tip 130: Do you know your data size? http://www.javaworld.com/article/2077496/testing-debugging/java-tip-130--do-you-know-your-data-size-.html, 2002. [Online; accessed January 24, 2016].
[15] SIGNIANT. The historical growth of data: Why we need a faster transfer solution for large data sets. http://www.signiant.com/articles/file-transfer/the-historical-growth-of-data-why-we-need-a-faster-transfer-solution-for-large-data-sets/, 2015. [Online; accessed January 5, 2016].
[16] Ion Stoica. Spark and Hadoop: Working Together. https://databricks.com/blog/2014/01/21/spark-and-hadoop.html, 2014. [Online; accessed January 20, 2016].
[17] Ah-Hwee Tan. Text Mining: The state of the art and the challenges. http://www.javaworld.com/article/2077496/testing-debugging/java-tip-130--do-you-know-your-data-size-.html, 2002. [Online; accessed January 14, 2016].
[18] Wikipedia. Text normalization - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Text_normalization, 2013. [Online; accessed January 22, 2016].
[19] Wikipedia. Jaccard index - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Jaccard_index, 2015. [Online; accessed January 23, 2016].
[20] Wikipedia. Apache Hadoop - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Apache_Hadoop, 2016. [Online; accessed January 17, 2016].
[21] Wikipedia. Apache Spark - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Apache_Spark, 2016. [Online; accessed January 19, 2016].
[22] Wikipedia. Big data - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Big_data, 2016. [Online; accessed January 7, 2016].
 
PPTX
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
Copia de Strategic Roadmap Infographics by Slidesgo.pptx (1).pdf
ssuserd4c6911
 
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
What does good look like - CRAP Brighton 8 July 2025
Jan Kierzyk
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
Growth of Public Expendituuure_55423.ppt
NavyaDeora
 
ER_Model_Relationship_in_DBMS_Presentation.pptx
dharaadhvaryu1992
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
Early_Diabetes_Detection_using_Machine_L.pdf
maria879693
 
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
Ad

Parallel and Distributed Algorithms for Large Text Datasets Analysis

  • 8. Chapter 1 Introduction Over the last decade, the data universe has grown exponentially. According to estimates of the International Data Corporation (IDC), data production will reach 40 zettabytes [15]. Interacting with data sources such as the Internet, our every action leaves a trail of information in the data universe, and we saturate it with new data every day. The most popular data type in this universe is text. Nearly 80% of produced data is stored in text documents [17], such as books, emails, notes, and logs. This high share stems from the way we interact with computer systems, and although this may change with the appearance of new human–machine interaction interfaces in the near future, text will keep its leading position in the digital data flow. Due to its nature, text is one of the least structured data sources, which increases the difficulty of processing this type of data. Together with its leading position in the data flow, this makes text mining a field of high interest and very high commercial potential. 1.1 Statement of the Problem With the beginning of the digital age, computing systems have greatly increased their capacity. However, even such progress does not keep pace with data growth, and text data sets appear that are so large that traditional data processing applications are inadequate [22]. Analysis of large text data sets is not only about volume; it is also about the analysis of the text itself. Effective mining of useful information is a challenging task, as it involves dealing with the peculiarities of text data: its heterogeneity, lack of defined structure and foreignness of origin to computers. This analysis may involve
  • 9. many disciplines, such as information retrieval, clustering and categorization, text analysis and extraction, visualization, machine learning and data mining. Computer department of Sumy State University has faced many of these prob- lems with their anti-plagiarism software, that has been developed for determining the plagiarism percentage of student documents. The continuously increasing amount of data in its knowledge base caused significant speed degradation. Current solution reached its scalability limit and requires changing in a data processing concept. In order to address this problem, I am going to explore a new method of detecting near duplicated text document to an existing example using the most straightforward method of problem solving - splitting into subtasks, with parallelizing and distributing of their execution. 1.2 Research Objectives A problem of large data sets analysis is very comprehensive and there is no single solution to all the challenges that it brings. In this work I focused on problems that raised during development of anti-plagiarism system. The main objective of the thesis is to conduct the research involving new archi- tectural approaches in large data sets processing and develop scalable anti-plagiarism system. New solution must be resistant to a rapid growth of data and exceed the speed of current software solution. It is equally important to confirm effectiveness of a new system through a quality comparison and testing. Both, new and current software, will be tested in different conditions with comparison of quantitative indicators. 2
  • 10. Chapter 2 Project Background This chapter describes the general context of this study to solidify project back- ground and carry out more specific and relevant research. It is divided into two main sections, focused on modern approaches in dealing with new generation of digital data - Big Data and theoretical framework for near duplicate text detection. 2.1 Big Data Concept A Big Data term describes large volumes of structured and unstructured data com- ing from many different sources. It’s gaining more public attention on the background of new challenges in data processing. Big Data pushes traditional data management techniques to their limits what gives rise to novel, more formalized approaches. Big Data is not all about size, it’s about the value within the data. In a ”3D Data Management: Controlling Data Volume, Velocity and Variety” research report [10], analyst from META Group defined new challenges in data managing and exploded Big Data concept into three dimensions: data volume, velocity and variety. Data volume. Many hardware and software are being started to generate a large amount of data. There are estimates that Google processes about 24 petabytes [9] per day, including software logs, our avatars and messages, statistical and service information. Storing such big data flows requires special conditions and new solu- tions for storing large volumes of data. Data volume is closely related and directly proportional to another Big Data dimension - data velocity. Data velocity. To cope with the constant increasing in the number of data it isn’t enough just to expand the storage space. It is also necessary to process information with high velocity and extract valuable data without informational noise. Data variety. Because of different data sources, formats and quality, the problem of heterogeneity and unstructured data arises. Modern companies store and process 3
  • 11. huge volumes of diverse information: checks, transactions, web traffic and call records in the call-center, publications, emails, journals equipment, sensor readings, software logs and much more. The science of Big Data is very young but has great prospects. Authors of ”Big data: The next frontier for innovation, competition, and productivity” McKinsey report conducted a study about potential of different sectors in the United States to capture value from Big Data. They used index that combined five metrics: (1) the amount of data available for use and analysis; (2) variability in performance; (3) number of stakeholders (customers and suppliers) with which an organization deals on average; (4) transaction intensity; and (5) turbulence inherent in a sector [8, p. 8]. Results, described in Figure 1, had showed that introduction of Big Data analysis systems in different sectors will lead very strong productivity growth. Figure 1: Sectors gains from the use of Big Data. Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis [10]. Big Data has already become the important factor of manufacturing, marketing, business analytics and science. For example, the website ancestry.com tries to trace history of all mankind, based on all currently available data: handwritten notes, books and even DNA analysis. To date, they have managed to raise about five billion profiles of people who lived in very different historical periods, and 45 million family trees [2] describing relationships within families. 4
  • 12. Thus, when we talk about Big Data, we understand that it is related to three aspects: a large amount of information, its diversity and the need to process data very quickly. On the other hand, this term is often understood as a very specific set of approaches and technologies to address these challenges. Large text data sets are a particular case for Big Data concept and take over all its challenges. 2.2 Big Data Technologies Big Data requires exceptional technologies for efficient and fast processing of large data sets in a reasonable time. There are many combinations of hardware and software that allow you to create effective Big Data solutions for various business disciplines, from social media and mobile applications to intelligent analysis and visualization of business data. Some techniques have been developed for far smaller data sets and successfully have been adapted for the use for Big Data analysis, others have been developed recently for effective extracting data from large data sets. Almost every method of processing large data is at the intersection of several disciplines such as Statistics, Mathematics and Computer Science. Despite the rel- ative youth of researching topic there many different Big Data processing methods. However, there is no single approach or method to handle large data sets. Some algorithms are suitable for high velocity data while others better cope with text data. The most common techniques for processing large data, proposed by researchers from McKinsey Global Institute [8, p. 27-31] are: • Data Mining. A set of techniques for data extraction that combines associa- tion rule learning, cluster analysis, regression analysis and classification meth- ods; • Crowdsourcing. A technique that involves the transfer of certain production functions for categorization and data enrichment; • Data Fusion and Integration. A set of techniques for integrating heteroge- neous data from multiple sources to enable in-depth analysis; • Machine learning, including training with or without teacher and using of models built on the basis of statistical analysis; • Artificial neural network and Genetic algorithms; • Pattern recognition and Visualization; 5
  • 13. The research focuses on Data Mining and Cluster analysis approaches based on a distributed computing system, where processing of large volumes of data requires not one high-performance machine, but a group of computers, where data would be clustered and spread between working nodes. 2.3 Systems for Big Data Analytics New challenges require creating new approaches to building systems for data pro- cessing. Big Data gives rise to new software frameworks and programming models, which can cope with large data flows. Most of Big Data Analytics systems based on divide and conquer algorithm design paradigm and divide a problem to smaller tasks until the individual problems can be solved independently. Received results are combined to answer the original question. To implement this behavior, distributed systems for storing and processing data were created. Such systems can effectively conduct calculations on multiple computers in a cluster in parallel manner. In this section, two popular systems for Big Data Analytics are described and compared from the perspective of improving of the existing anti-plagiarism system. 2.3.1 Apache Hadoop Apache Hadoop is a software framework, which offers full-stack ecosystem for distributed data storing and processing. Hadoop solution allows to create highly scalable distributed systems across clusters of computers. It was designed with a fundamental assumption that during data processing losses are unavoidable and they should be handled by framework automatically [7]. This approach allows to build high-available services without thinking about failures. Apache Hadoop framework is highly modularized into independent modules, that are responsible for different aspects of distributed environment: 6
  • 14. Figure 2: Apache Spark Architecture. Source: Hortonworks Inc. Apache Hadoop YARN: Present and Future [7]. Hadoop's data core is a distributed file system, called the Hadoop Distributed File System (HDFS), that provides scalable and fault-tolerant data storage spread across the Hadoop cluster. This layer handles data distribution over the network, communication between nodes and data replication. To achieve reliability, HDFS splits big files into smaller parts and replicates them across the cluster. In case of failure, HDFS recovers lost data from the replica copies. On the next abstraction level is Hadoop YARN - the data operating system. It is responsible for computing resource management and task scheduling: it manages the way applications use resources and provides a higher-level API for interacting with a Hadoop-enabled cluster. The processing part of the framework is an implementation of the MapReduce programming model, called Hadoop MapReduce. This framework is designed for disk-based parallel processing of data stored in the Hadoop cluster. The MapReduce model is a sequence of a map() method, which performs data sorting and filtering, and a reduce() method, which summarizes the results. Hadoop MapReduce takes chunks of files from HDFS via the YARN data operating system and passes them through the MapReduce model, where the defined map() and reduce() methods extract the valuable information. From version 2.0, after introducing a new architecture and decoupling its modules, Apache Hadoop has created a favorable environment for additional software,
  • 15. such as Apache Hive and Apache HBase, that can be installed on top of Hadoop cluster [20]. 2.3.2 Apache Spark From its very beginning in 2009 Apache Spark focuses on large-scale data pro- cessing [3] on top of Hadoop ecosystem [13]. The closest Spark’s relative in Apache Hadoop framework is Hadoop MapReduce, but in contrast to its disk-based MapRe- duce paradigm, Spark relies on in-memory algorithms. Spark defines a concept of cluster’s memory - shared memory across nodes in cluster, where Spark’s in-memory primitives are executed. In addition to typical data processing scenarios similar to MapReduce, Spark’s approach to perform calculations in cluster’s memory allows to implement stream processing and SQL techniques, inter- active and analytic queries, machine learning problem solving and graphs processing. These functionalities are divided into four components, displayed in Figure 3. Figure 3: Apache Spark Technology Stack. Source: Apache SparkTM- Lightning-Fast Cluster Computing [3]. Apache Spark introduces an abstraction that describes partitioned data across many nodes, called Resilient Distributed Datasets (RDD). It is very similar to stan- dard data collections available in the most programming languages, such as Java Collections, but with one difference that their elements can be stored on many com- puters in cluster. Such architectural solutions decrease complexity of interacting with distributed data because operating RDD is similar to manipulating logical data col- lections. The high-level API for RDD transformations is provided in Java, Scala, Python and R programming languages. Apache Spark requires a cluster manager and distributed data storage [21]. It can use Hadoop YARN or Apache Mesos for cluster managing, but has native standalone Spark cluster also. Spark can access data in Hadoop Distributed File System (HDFS), 8
  • 16. Cassandra, HBase, Hive and any Hadoop data source [3] and use them as a distributed data storage. A typical Spark in Hadoop cluster deployment configurations are described in Fig- ure 4. However, there are many more possible combinations with different distributed data storages and cluster managers. Figure 4: Spark deployment configurations in a Hadoop cluster. Source: Ion Stoica. Spark and Hadoop: Working Together [16]. It’s worth to say that even though it is still young enough, it’s used at a wide range of organizations, such as Amazon, IBM, NASA, Yahoo!, eBay and others [4]. Its popularity it received through the rapid in-memory calculation and being a general engine for wide range of large-scale data processing [3]. 2.3.3 Comparison of processing frameworks Apache Spark developers claim their processing framework runs programs up to 100x faster than Apache Hadoop MapReduce [3]. Such high result is achieved due to fundamental difference in computing approach: Hadoop MapReduce is a disk-based algorithm which interact with disk after each map() or reduce() operations, while Apache Spark tries to perform calculations entirely in cluster’s memory and refers to disk-based storage only if the data is too big to fit into the memory. These algorithm behaviors are visually presented in the Figure 5. 9
  • 17. Figure 5: Data flows in Hadoop and Spark systems. Source: Nicole Hemsoth. Flink Sparks Next Wave of Distributed Data Processing [6]. One of the biggest Spark’s advantage is API. It offers over 80 high-level opera- tions [3] for transforming, filtering and grouping data. Instead, MapReduce model is based on two operations map() and reduce() that do not always fulfills all require- ments for data transformation. Figure 6: Comparison of API on in Hadoop and Spark. Apache Hadoop documentation is much more mature than Apache Spark. Never- theless, Spark’s documentation is easy to understand, has faster learning curve and is full of basic examples, that allows you to see it in action. Equally important feature of Apache Spark is a pseudo-distributed mode, that doesn’t require distributed data storage and can run locally with no configuration. 10
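To make the API contrast behind Figure 6 concrete, the classic word-count example is sketched below in both models. This is an illustrative sketch rather than code from the thesis: the Hadoop part uses the standard Mapper/Reducer classes, and the Spark equivalent is shown as comments using RDD operations of the Java API (Spark 2.x signatures assumed).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hadoop MapReduce: the whole computation must be expressed as map() and reduce().
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                     // emit (word, 1)
            }
        }
    }

    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();  // summarize per word
            ctx.write(word, new IntWritable(sum));
        }
    }

    // The same computation chained with Spark's higher-level RDD operations:
    //   JavaRDD<String> lines = sc.textFile("hdfs://.../books");
    //   JavaPairRDD<String, Integer> counts = lines
    //           .flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
    //           .mapToPair(w -> new Tuple2<>(w, 1))
    //           .reduceByKey((a, b) -> a + b);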
  • 18. This mode is intensively used for developing purposes without access to physical clus- ter. Moreover, Apache Spark comes with interactive shell for running commands [12]. Considering richer API, faster computations and easier development processes, Apache Spark has been chosen as a basis for researches in this thesis. 2.4 Near Duplicate Text Detection Plagiarism has become a serious problem in education, industry and the scientific community within recent years. The rapid development of the Internet in addition to increasing computer literacy contributes to penetration of plagiarism in various spheres of human activity. Data spreads over the Internet instantly and respecting copyrights becomes increasingly difficult and sometimes even impossible. The relevance of the problem raises the need to develop methods of near dupli- cate text detection. In this section a set of algorithms is described that is used for determining the level of duplication between two excerpts from the Figure 7 as a case study. Similarly, when he leaps into the open grave at Ophelia’s funeral, ranting in high heroic terms, he is acting out for Laertes, and perhaps for himself as well, the folly of ex- cessive, melodramatic expressions of grief. And when he leaps in ... Ophelia’s open grave ranting in High [Heroic] Terms, Hamlet is acting out the folly of excessive, melodramatic expressions of grief. Original fragment Plagiarism Figure 7: Example of Plagiarism. Source: Examples of Plagiarism - Academic Integrity at Princeton University [1] 2.4.1 Document Pre-Processing Text is not just a set of words. Any text document can contain different types of text data: numbers, special characters, service symbols, punctuation and more. However, not all text data are important. Such data can confuse algorithms or impair results of text processing. That’s why nearly every algorithm that works with text data, begins its work with text normalization. The process of text normalization focuses on transforming text into a single canon- ical form [18] to make further processing more precise. The most common techniques of text normalizations are: 11
  • 19. Case distinction normalization. Case-sensitive algorithms treat same letters in opposite case differently. For example, the word ”Apple” is not equal to ”apple” there. Despite the lexical identity, technical representation of these words is different. Case distinction normalization is based on converting all characters in the text to upper- or lower-case to make case-sensitive algorithms ignore text case. Text formatting cleaning. Text decorating has no lexical meaning and is used exclusively for improving visual appearance, therefore text normalization algorithms usually escapes text formatting. Unwanted symbols cleaning. This technique focuses on cleaning text from undesirable symbols that have no semantic meaning for algorithm but may noise it with unnecessary information or even broke its normal behavior. The composition of the unwanted symbols list depends on algorithms requirements but it’s very common to clean text from whitespace, special characters and punctuation. Stop words clean up. Stop words are the most common words in natural language. Such stop words as ”actually”, ”too”, ”very”, ”a”, ”the” have no weight in the sentence therefore could be ignored to reduce text noise. These common techniques allow to clean text from undesirable text noise and focus further processing algorithms on valuable information only. The result of applying these methods to text examples of case study (Figure 7) are described in Figure 8 similarly when leaps open grave ophelias funeral ranting high heroic terms acting out laertes perhaps well folly excessive melodramatic expres- sions grief when leaps ophelias open grave ranting high heroic terms ham- let acting out folly excessive melo- dramatic expressions grief Original fragment Plagiarism Figure 8: Example of Plagiarism after text normalization. 2.4.2 Shingling To find duplicate documents their parts should be compared with others according to defined rules. The most straightforward method is to treat document as a ”bag of words”, that means a document ”a bump on the log in the hole in the bottom of the sea” will be exploded into set of words similar to this [11]: [a, in, of, on, the, log, sea, bump, hole, bottom] 12
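As a rough illustration of the pre-processing techniques above (case normalization, cleaning of formatting and unwanted symbols, stop-word removal), the normalization step might look like the following plain-Java sketch; the class name and the tiny stop-word list are illustrative, not the thesis implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public final class TextNormalizer {
        // Illustrative stop-word list; a real system would use a much larger one.
        private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
                "a", "an", "the", "and", "of", "in", "to", "for", "too", "very", "actually"));

        /** Lower-cases the text, strips everything except letters, digits and spaces,
         *  and drops stop words - mirroring the techniques listed above. */
        public static List<String> normalize(String text) {
            String cleaned = text.toLowerCase()
                    .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ")  // unwanted symbols and punctuation
                    .replaceAll("\\s+", " ")                   // collapse whitespace
                    .trim();
            List<String> words = new ArrayList<>();
            for (String word : cleaned.split(" ")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    words.add(word);
                }
            }
            return words;
        }
    }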
  • 20. To determine document similarity, a Jaccard index can be used. It is defined as the number of words the documents have in common (the intersection) divided by the number of distinct words that occur in either of them (the union) [19]. A comparison of "a frog on the bump on the log in the hole in the bottom of the sea" with the document above is visualized in Figure 9 using an Euler diagram. Figure 9: Similarity of two documents using the "bag of words" technique: correct result. Source: Dan Lecocq. Near-Duplicate Detection [11]. According to the diagram, these documents are almost identical. Nevertheless, this technique is quite unreliable. Consider a new comparison of the documents "Your mother drives you in the car" and "In mother Russia, car drives you!" [11], depicted in Figure 10. Figure 10: Similarity of two documents using the "bag of words" technique: false result. Source: Dan Lecocq. Near-Duplicate Detection [11]. Despite the fact that these documents have completely different meanings, the "bag of words" technique shows they are very similar. This exposes the biggest flaw of this method - ignoring the context of words.
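A minimal sketch of the Jaccard index on two word sets, matching the definition above (shared words over all distinct words); illustrative code, not taken from the thesis:

    import java.util.HashSet;
    import java.util.Set;

    public final class Jaccard {
        /** Jaccard index: |A ∩ B| / |A ∪ B|. */
        public static double index(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) {
                return 1.0;                        // two empty documents are identical
            }
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            // For the two "mother" sentences of Figure 10 the index comes out relatively
            // high, even though their meanings differ - the flaw discussed above.
            return (double) intersection.size() / union.size();
        }
    }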
  • 21. To fix issues that were identified in the previous method, text chunks should be endowed with surrounding context. This can by achieved with a so-called shingling approach, when documents are exploded in a set of overlapping phrases (shingles). Considering last example, document can be described as a set of shingles like this: [’your mother drives’, ’mother drives you’, ’drives you in’, ’you in the’, ’in the car’] As a result of applying shingles approach it’s clearly visible in Figure 11 that compared documents are quite different, as we expected before. Figure 11: Similarity of two documents using shingling technique Source: Dan Lecocq. Near-Duplicate Detection [11]. Continuing analysis of the case study, two normalized text fragments have been exploded into a sets of shingles, described in Figure 12 and partially visualized in Figure 13. 14
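The shingling step itself is small; below is a sketch under the assumption that the text has already been normalized into a word list (the class and method names are illustrative, not the thesis code).

    import java.util.ArrayList;
    import java.util.List;

    public final class Shingler {
        /** Splits a normalized word list into overlapping shingles of the given size,
         *  e.g. size 3 turns "your mother drives you" into
         *  ["your mother drives", "mother drives you"]. */
        public static List<String> shingles(List<String> words, int size) {
            List<String> result = new ArrayList<>();
            for (int i = 0; i + size <= words.size(); i++) {
                result.add(String.join(" ", words.subList(i, i + size)));
            }
            return result;
        }
    }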
  • 22. [similarly when leaps, when leaps open, leaps open grave, open grave ophelias, grave ophelias funeral, ophelias funeral ranting, funeral ranting high, ranting high heroic, high heroic terms, heroic terms acting, terms acting out, acting out laertes, out laertes perhaps, laertes perhaps well, perhaps well folly, well folly excessive, folly excessive melodramatic, excessive melodramatic expressions, melodramatic expressions grief] [when leaps ophelias, leaps ophelias open, ophelias open grave, open grave ranting, grave ranting high, ranting high heroic, high heroic terms, heroic terms hamlet, terms hamlet acting, hamlet acting out, acting out folly, out folly excessive, folly excessive melodramatic, excessive melodramatic expres- sions, melodramatic expressions grief] Original fragment Plagiarism Figure 12: Example of Plagiarism after text normalization and shingling. Figure 13: Partial visualization of shingles intersection for case study example. 2.4.3 Hashing Newly created shingles are sufficient to detect similar parts in documents. How- ever, there is a scope for algorithm optimization using hashing. Hashing is transformation of an object into fixed-length value that represents the original object. Java supports such transformation out-of-the-box for all its objects. 15
  • 23. The hashCode() method, which implements a hash function, returns an integer representation of an object: for example, the strings "Cats rule this world" and "Funny video" and an int[] array each map to a single integer hash code. Figure 14: Example of the String hash function in Java. Hashing the shingles solves multiple problems at once: Decreasing memory consumption. In a distributed environment it is very important to decrease the amount of data transferred between the computers in the cluster; the Java String representation takes about 3 times more memory than an Integer value [14]. Increasing computation speed. Operations on text data are more complicated and slower, while handling integers is more lightweight and precise. Opening new opportunities for improvements. Manipulating hashes not only speeds up computations but also enables new operations, such as arithmetic addition and subtraction. The new hashing algorithm has improved calculation speed and memory consumption and has become resistant to changes of word order inside shingles. The improved hashing algorithm is illustrated in Figure 15: the shingle "similarly when leaps" is hashed as hash("similarly") + hash("when") + hash("leaps") = 1970083000 + 3648314 + 102845963 = 2076577277. Figure 15: Hashing algorithm that ignores word order. By treating text shingles as numbers, the new algorithm uses a fundamental mathematical property - the commutative law - to ignore word positions within individual shingles. The resulting table is shown in Figure 16:
  • 24. [2076577277, 109911951, 204879450, -324321700, -837324887, 42438961, 471996614, -239171645, -1107300931, 1761513003, -1312590107, -1482876578, -738707801, -735172569, -577523273, 1585625505, -1979663435, -1901721856, 1007501756] [-319860910, -320091550, -324321700, 1080413148, 1080197940, -239171645, -1107300931, 1959864098, 1757666974, 1647527013, -1325227282, 1582090273, -1979663435, -1901721856, 1007501756] Original fragment Plagiarism Figure 16: Example of Plagiarism after text normalization, shingling and hashing 2.4.4 Determining the Level of Duplication The final stage is to define the level of plagiarism in document based on already received information - hashed shingles. This phase entirely depends on the type of system where the calculations are performed. Due to the choice of Apache Spark processing framework as a result of the com- parison in Section 2.3.3 new algorithm for calculation the level of duplication will rely on Spark‘s high-level transformation API. Internally, system stores hashed shingles and the corresponding document in tu- ples. You may treat them as an associative array distributed over nodes in cluster the way it’s partially shown in Figure 17, where key is hash of a shingle and value is document’s metadata. 17
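For completeness, the order-insensitive shingle hashing from Section 2.4.3 (Figure 15) can be reconstructed in a few lines of plain Java; this is an illustrative reading of that scheme, not the exact HashShinglesAlgorithm code.

    public final class ShingleHasher {
        /** Hashes a shingle as the sum of its words' hashCode() values (Figure 15).
         *  Addition is commutative, so "open grave ophelias" and "ophelias open grave"
         *  produce the same hash; int overflow simply wraps and keeps that property. */
        public static int hash(String shingle) {
            int sum = 0;
            for (String word : shingle.split(" ")) {
                sum += word.hashCode();
            }
            return sum;
        }
    }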
  • 25. (2076577277, {name=original, totalShingles=19}), (109911951, {name=original, totalShingles=19}), . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (-1979663435, {name=original, totalShingles=19}), (-1901721856, {name=original, totalShingles=19}), (1007501756, {name=original, totalShingles=19}); (-319860910, {name=plagiarism, totalShingles=15, marked=true}), (-320091550, {name=plagiarism, totalShingles=15, marked=true}), . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (-1979663435, {name=plagiarism, totalShingles=15, marked=true}), (-1979663435, {name=plagiarism, totalShingles=15, marked=true}), (-1979663435, {name=plagiarism, totalShingles=15, marked=true}), Tuple (shingle; document meta data object) Figure 17: Internal data representation in Spark-enabled Anti-Plagiarism software. The marked flags determines whether documents will be added into Duplication Report or not. The next step focuses on grouping elements by tuple’s key. where its value becomes a set with ids of the documents that contain the hashed shingle. (2076577277, {name=original, totalShingles=19}), (109911951, {name=original, totalShingles=19}), . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (-319860910, {name=plagiarism, totalShingles=15, marked=true}), (-320091550, {name=plagiarism, totalShingles=15, marked=true}), . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (-1979663435, {name=original, totalShingles=19}, {name=plagiarism, totalShingles=15, marked=true}), (-1901721856, {name=original, totalShingles=19}, {name=plagiarism, totalShingles=15, marked=true}), (1007501756, {name=original, totalShingles=19}, {name=plagiarism, totalShingles=15, marked=true}); Tuple (shingle; set of document meta data objects) Figure 18: Grouping documents by hashed shingle in Spark-enabled Anti-Plagiarism software. After that, system maps each value from a set of tuple’s values to 1, if there are several document metadata objects (meaning the shingle appears in these documents) and 0 otherwise. Filtering of non-marked document meta objects occurs here also. 18
  • 26. ({name=plagiarism, totalShingles=15, marked=true}, 1), ({name=plagiarism, totalShingles=15, marked=true}, 1), ({name=plagiarism, totalShingles=15, marked=true}, 1), ({name=plagiarism, totalShingles=15, marked=true}, 1), ({name=plagiarism, totalShingles=15, marked=true}, 1), ({name=plagiarism, totalShingles=15, marked=true}, 1); Tuple (document meta data object, 1 if it has a coincidence / 0 otherwise) Figure 19: Mapping documents in Spark-enabled Anti-Plagiarism software to 1 if their neighbours (duplicated shingles) are in the tuples, 0 otherwise. The resulting tuple list is reduced by tuple key using the addition operation; strictly speaking, equal document metadata objects are united at the moment their values are summed up. ({name=plagiarism, totalShingles=15, marked=true}, 6), Tuple (unique document meta data object, total coincidences) Figure 20: Reducing tuples by key, represented by the document metadata object. After this operation, all the information needed to calculate the duplication level is available. It can be computed with the formula D = (C / S) × 100%, where C is the total number of coinciding hashes and S is the number of shingles (hashes) in the entire document. Using the result from Figure 20 and the formula above, the level of plagiarism of the document named "plagiarism" is D = (6 / 15) × 100% = 40%.
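The group-map-reduce sequence of Figures 17-20 can be sketched with Spark's Java API roughly as follows. The sketch simplifies the thesis design: plain String document identifiers stand in for the document metadata objects, the filtering of non-marked documents is omitted, and Spark 2.x Java signatures are assumed.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public final class DuplicationLevels {

        /** shingleToDoc: (hashed shingle, document id) tuples, as in Figure 17.
         *  totalShingles: number of shingles per analysed document (S in D = C / S x 100%). */
        public static Map<String, Double> compute(JavaPairRDD<Integer, String> shingleToDoc,
                                                  Map<String, Long> totalShingles) {
            // Figure 18: group the tuples by hashed shingle.
            JavaPairRDD<Integer, Iterable<String>> grouped = shingleToDoc.groupByKey();

            // Figure 19: a shingle shared by several documents yields (document, 1) for each of them.
            JavaPairRDD<String, Integer> hits = grouped.values().flatMapToPair(docs -> {
                List<String> sharing = new ArrayList<>();
                docs.forEach(sharing::add);
                List<Tuple2<String, Integer>> out = new ArrayList<>();
                if (sharing.size() > 1) {
                    for (String doc : sharing) out.add(new Tuple2<>(doc, 1));
                }
                return out.iterator();
            });

            // Figure 20: sum the coincidences per document, then apply D = C / S x 100%.
            Map<String, Integer> coincidences = hits.reduceByKey(Integer::sum).collectAsMap();
            Map<String, Double> levels = new HashMap<>();
            coincidences.forEach((doc, c) -> levels.put(doc, 100.0 * c / totalShingles.get(doc)));
            return levels;
        }
    }

On the case-study data this pipeline would yield 6 coincidences for the "plagiarism" document, i.e. the 40% duplication level computed above.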
  • 27. Chapter 3 System Implementation This chapter describes practical part of thesis and implements the main object of this research - develop scalable anti-plagiarism system. The chapter is divided into four sections to meet main requirements of System Development Life Cycle (SDLC) and covers all topics on the development of the software solution. 3.1 Software Requirements Specification 3.1.1 Purpose Software system will be an Anti-Plagiarism solution for Computer department of Sumy State University. The purpose is to develop scalable anti-plagiarism system that must be resistant to rapid data growth and exceed speed of current software solution. Previous software solution has become extremely slow for overgrown relational database and single threaded algorithm. The task was to review the approach to data processing and storing. 3.1.2 Product Perspective The Plagio system is a software solution for finding near duplicate text documents using distributed processing solutions, such as Apache Spark. The scope of the project encompasses both core-side and user-side (GUI) functionalities. The diagram below illustrates user interaction with the Plagio system. 20
  • 28. Figure 21: UML Use cases diagram for The Plagio system 3.1.3 Functional Requirements The main functionalities of anti-plagiarism are determination of plagiarism level and documents database enrichment. This section outlines each use cases separately. Use case: Update library Figure 22: UML Update library use case Actors: Administrator Precondition: Apache Spark cluster is available and accessible form user’s ma- chine. Basic flow of the event: 1. Administrator inputs address of Apache Spark Master Node, documents library and input documents paths, shingles’ size, enables/disables text normalization and enables library update mode. 2. Inputted data are validated. 3. GUI passes input data to the Plagio engine which processes documents on Apache Spark cluster. 21
  • 29. 4. New documents are cached in distributed storage, GUI shows confirmation mes- sage Alternative flows: 2a Inputed data aren’t validated. [2a1] GUI shows error message. Use case: Determine plagiarism level of the document Figure 23: UML Determining plagiarism level of the document use case Actors: Administrator Precondition: Apache Spark cluster is available and accessible form user’s ma- chine. Basic flow of the event: 1. Administrator inputs address of Apache Spark Master Node, documents library and input documents paths, shingles’ size and enables/disables text normaliza- tion and document caching. 2. Inputed data are validated. 3. GUI pases input data to the Plagio engine which processes documents on Apache Spark cluster. 4. Duplication reports are printed. New documents are cached in distributed stor- age, if user enabled document caching mode. GUI shows confirmation message Alternative flows: 2a Inputed data aren’t validated. [2a1] GUI shows error message. Design Constraints: 1. Low system requirements for client machine. 2. Fault tolerance. 22
  • 30. 3.2 Implementation Stage With scalability in mind, the Plagio system has been implemented using Apache Spark on top of Hadoop Distributed File System (HDFS). General application is represented by a driver program which governs Spark cluster in accordance to imple- mented algorithm. HDFS is responsible for storing shingles library across nodes in cluster. The implemented system consists of two parts: the core engine and GUI interface. The core part is represented by class Plagio and is responsible for managing Apache Spark cluster in aspects of data retrieving, processing and calculations. It was developed in the form of software library with simple API to keep flexibility and facilitate its use in other projects. Figure 24: UML Plagio class diagram The shingles algorithm is represented by two classes: ShinglesAlgorithm and HashShinglesAlgorithm. The first class implements classical algorithm of text shin- gling, while the second expands its capabilities with shingles hashing algorithm de- scribed in Section 2.4.3. 23
  • 31. Figure 25: UML ShinglesAlgorithm and HashShinglesAlgorithm classes diagram The Texts class implements text normalizations algorithms: text formatting, spe- cial characters and stopwords cleaning. Figure 26: UML Texts class diagram There are two data structures: Metadata and DuplicationReport. The first is used in the algorithm for storing metadata about documents, such as document name and shingles number. The results of plagiarism search are written into DuplicationRepot, where information like duplication level and amount of coincides are stored. 24
  • 32. Figure 27: UML Metadata and DuplicationReport structures class diagram The GUI module is an intermediary between user and the core library which is represented with two clases: PlagioGUI and PlagioMain. The first class describes structure of the interface using Swing GUI library, while PlagioMain contains a glue code between Plagio class and user input from PlagioGUI. Figure 28: UML GUI PlagioMain and PlagioGUI classes diagram 25
  • 33. Figure 29: User interface of The Plagio system Interface provides three working modes: library update, duplication report and combined mode. The library update mode performs no plagiarism detection and is responsible for enriching knowledge base with new documents, while the main re- sponsibility of duplication report and combined modes are to do plagiarism detection and prepare duplication reports. In addition, the combined mode updates knowledge base the same way library update mode do. The workflow of this mode described in Figure 30 in the form of UML sequence diagram. 26
  • 34. Figure 30: UML Sequence diagram for The Plagio System 3.3 Testing Stage Every Plagio build passes a two-phase testing process; if the build passes both stages, it can be released to production. The first testing stage focuses on checking the core functionalities: library update and determining the level of plagiarism. Thanks to Spark's built-in local standalone mode, it is possible to test Spark-enabled programs on a virtual local cluster. Based on
  • 35. the data, acquired during manual determination of plagiarism level in Section 2.4, a test case using unit testing methodology has been implemented. This test case simulates submitting of two text excerpts specified in Section 2.4 to the system and asks Plagio to provide duplication reports. Test are considered to be passed if the level of plagiarism and number of shingles coincides match up with manually calculated values. This test is run before packaging Plagio’s distribution by Maven build automation tool. Successfully passed test returns the following results: ----------------------------------------------------------------------- T E S T S ----------------------------------------------------------------------- Running eu.ioservices.plagio.test.PlagioTest -> testEmptyLib() test [DuplicationReport{docCoincidences=0, duplicationLevel=0.0, documentMetadata=DocumentMetadata{documentId=’file:/C:/orig.txt’, totalShingles=19, isMarked=true}}] -> testLibrary() test [DuplicationReport{docCoincidences=6, duplicationLevel=40.0, documentMetadata=DocumentMetadata{documentId=’file:/C:/plag.txt’, totalShingles=15, isMarked=true}}] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0 ----------------------------------------------------------------------- BUILD SUCCESS ----------------------------------------------------------------------- Figure 31: Successfully passed unit testing of The Plagio System The second testing stage checks Plagio’s user interface to ensure that it meets its specifications. Assuming that Plagio’s GUI is a mediator between user and core system. The main interface responsibility is to validate user input and pass data to core for further processing. Using the same input data as in the test above, GUI test repeats steps from core unit testing paying attention to the correct interaction between the components. Results of successfully passed test must match up with unit testing results from the test above: 28
  • 36. Figure 32: Successfully passed two-phase testing process of The Plagio System 29
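A stripped-down sketch of the first-stage check described in Section 3.3 - asserting that the manually derived case-study values (6 coinciding hashes out of 15 shingles, i.e. 40%) are reproduced - might look like the JUnit test below; the duplicationLevel() helper is a hypothetical stand-in for the real Plagio call, shown only to illustrate the test idea.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class DuplicationLevelTest {

        // Hypothetical helper mirroring D = C / S x 100% from Section 2.4.4.
        private static double duplicationLevel(int coincidences, int totalShingles) {
            return 100.0 * coincidences / totalShingles;
        }

        @Test
        public void caseStudyFragmentIsFortyPercentDuplicate() {
            // Values derived manually in Section 2.4: 6 shared hashes, 15 shingles.
            assertEquals(40.0, duplicationLevel(6, 15), 1e-9);
        }
    }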
  • 37. Chapter 4 Systems Performance Testing and Comparison The purpose of this chapter is to test the current anti-plagiarism solution and The Plagio System with the rapid growth of data in mind. The aim was not to test the systems under identical conditions but to compare the performance of the two approaches in a more general context. The tests in this section were performed on platforms based on an Intel® Core™ i5-3317U 1.7 GHz with 10 GB RAM and an Intel® Core™ i5-3210M 2.5 GHz with 8 GB RAM. As test data, 500 MB of free plain-text e-books from Project Gutenberg were used. The Apache Spark cluster was deployed on both machines, while the Plagio System controlled it from the weaker Intel® Core™ i5 platform. The Sumy State University anti-plagiarism application was tested on only one computer because it does not support a distributed mode. Regarding the test results, it was assumed that they would show the Apache Spark based solution to be excessive for small text data sets. This assumption was made because Spark must handle the distributed environment and perform data distribution over the nodes, data synchronization and task scheduling, which lead to additional data transfer and computation overhead. Moreover, the available testing environment does not meet Spark's minimum requirements and does not reveal all of its performance capabilities. The first test measured the time the systems need to initialize themselves and become ready for data processing:
  • 38. System / start #     1       2       3       4       5       6
        Plagio               3.10s   1.70s   1.65s   1.75s   1.60s   1.65s
        SSU                  0.01s   0.01s   0.01s   0.01s   0.01s   0.01s
Figure 33: Systems initialization overhead test results. Figure 34: Systems initialization overhead test visualization. The results clearly show the superiority of the SSU Anti-Plagiarism solution, which loads almost instantly, while the Spark-enabled solution needs time to initialize its context and connect to the Apache Spark master node in the cluster. The second test focused exclusively on data processing with the shingling algorithm:
        System / documents   20 MB    50 MB     150 MB    250 MB    500 MB
        Plagio               1028s    2002.12s  4621.53s  6621.53s  11041.51s
        SSU                  956.1s   2013.36s  6240.3s   9297.2s   19756.33s
Figure 35: Text shingling test results.
  • 39. Figure 36: Text shingling test visualization. The results confirm our assumption that the Spark-enabled Plagio solution may be excessive for small text data sets, but once the amount of data increased beyond 100 MB it surpassed its opponent. The last test examined performance in the most common use case: shingling, checking for plagiarism and caching new shingles:
        System / documents   20 MB     50 MB    150 MB    250 MB    500 MB
        Plagio               1364.12s  2712.4s  8249.1s   13725.4s  28941.7s
        SSU                  2747.5s   6137.3s  21192.1s  38369.3s  102252.8s
Figure 37: Full-cycle documents analysis test results. Figure 38: Full-cycle documents analysis test visualization. Due to its inefficient algorithm, the SSU Anti-Plagiarism solution compares each hashed shingle with records in a relational database, which resulted in a high overhead from SQL queries.
  • 40. Chapter 5 Conclusion With the constantly growing data flow, it has become extremely important to extract only the valuable information in time. Unfortunately, the data flow increases faster than technological progress, and modern computers fall behind very large data sets. That is where distributed data processing systems come to the front line. This research project focused on developing a software solution for near duplicate text detection based on distributed systems for data processing. The effectiveness of the implemented system has been confirmed through a quality comparison and testing. According to the test results, the implemented system proved effective in comparison with the anti-plagiarism software developed at Sumy State University. Moreover, the results confirmed the initial assumption that a Spark-enabled solution may be inefficient for small data sets, although it showed a decent result even for small data. The two systems can be used together to achieve the best computation speed for both small and big data sets. Despite the positive results of the performance testing, there are opportunities to improve the current version of the prototype. Possible improvements include optimizing the shingling algorithm and the load distribution factor for more effective distribution of file chunks over the cluster's nodes.
  • 41. References [1] Examples of Plagiarism - academic integrity at Princeton University. https: //www.princeton.edu/pr/pub/integrity/pages/plagiarism/, 2011. [Online; accessed January 22, 2016]. [2] Ancestry.com LLC fourth quarter and full year 2012 financial results. 2013. http: //corporate.ancestry.com/press/press-releases/2013/03/ancestrycom- llc-reports-fourth-quarter-and-full-year-2012-financial-results/. [3] Apache SparkTM - lightning-fast cluster computing. http://spark.apache.org/, 2016. [Online; accessed January 18, 2016]. [4] Reynold Xin Andy Konwinski, Sean Owen. Powered by spark - spark - apache software foundation. https://cwiki.apache.org/confluence/display/SPARK/ Powered+By+Spark, 2015. [Online; accessed January 20, 2016]. [5] DIS group. Hadoop technology. http://www.dis-group.ru/files/ hadoop ot dis group.pdf, 2013. [Online; accessed January 15, 2016]. [6] Nicole Hemsoth. Flink sparks next wave of distributed data process- ing. http://www.nextplatform.com/2015/02/22/flink-sparks-next-wave- of-distributed-data-processing/, 2015. [Online; accessed January 20, 2016]. [7] Hortonworks Inc. Apache Hadoop YARN: Present and Future. http:// hortonworks.com/blog/apache-hadoop-yarn/. [Online; accessed January 15, 2016]. [8] Brad Brown Jacques Bughin Richard Dobbs Charles Roxburgh Angela Hung Byers James Manyika, Michael Chui. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute Reports, 2011. http://www.mckinsey.com/insights/business technology/ big data the next frontier for innovation. 34
  • 42. [9] Sanjay Ghemawat Jeffrey Dean. Mapreduce: simplified data processing on large clusters. Communications of the ACM - 50th anniversary issue: 1958 - 2008, 2008. http://dl.acm.org/citation.cfm?doid=1327452.1327492. [10] Doug Laney. 3D Data Management: Controlling data volume, velocity and variety. Technical report, META Group, 2001. [11] Dan Lecocq. Near-duplicate detection. https://moz.com/devblog/near- duplicate-detection/, 2015. [Online; accessed January 23, 2016]. [12] Saggi Neumann. Spark vs. Hadoop MapReduce · Xplenty. https: //www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/, 2014. [Online; accessed January 21, 2016]. [13] Madhukara Phatak. History of Apache Spark : Journey from Academia to Indus- try. http://blog.madhukaraphatak.com/history-of-spark/, 2015. [Online; accessed January 18, 2016]. [14] Vladimir Roubtsov. Java Tip 130: Do you know your data size? http://www.javaworld.com/article/2077496/testing-debugging/java- tip-130--do-you-know-your-data-size-.html, 2002. [Online; accessed January 24, 2016]. [15] SIGNIANT. The historical growth of data: Why we need a faster transfer solu- tion for large data sets. http://www.signiant.com/articles/file-transfer/ the-historical-growth-of-data-why-we-need-a-faster-transfer- solution-for-large-data-sets/, 2015. [Online; accessed January 5, 2016]. [16] Ion Stoica. Spark and Hadoop: Working together. https://databricks.com/ blog/2014/01/21/spark-and-hadoop.html, 2014. [Online; accessed January 20, 2016]. [17] Ah-Hwee Tan. Text Mining: The state of the art and the chal- lenges. http://www.javaworld.com/article/2077496/testing-debugging/ java-tip-130--do-you-know-your-data-size-.html, 2002. [Online; accessed January 14, 2016]. [18] Wikipedia. Text normalization- Wikipedia, the free encyclopedia. https:// en.wikipedia.org/wiki/Text normalization, 2013. [Online; accessed January 22, 2016]. 35
  • 43. [19] Wikipedia. Jaccard index - Wikipedia, the free encyclopedia. https:// en.wikipedia.org/wiki/Jaccard index, 2015. [Online; accessed January 23, 2016]. [20] Wikipedia. Apache Hadoop - Wikipedia, the free encyclopedia. https:// en.wikipedia.org/wiki/Apache Hadoop, 2016. [Online; accessed January 17, 2016]. [21] Wikipedia. Apache Spark - Wikipedia, the free encyclopedia. https:// en.wikipedia.org/wiki/Apache Spark, 2016. [Online; accessed January 19, 2016]. [22] Wikipedia. Big data - Wikipedia, the free encyclopedia. http:// en.wikipedia.org/wiki/Big data, 2016. [Online; accessed January 7, 2016]. 36