International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 5, October 2013
DOI: 10.5121/ijcsit.2013.5507

THE COMPARISON OF THE TEXT
CLASSIFICATION METHODS TO BE USED
FOR THE ANALYSIS OF MOTION DATA IN
DLP ARCHITECT
Murat TOPALOĞLU
Kesan Yusuf Çapraz School of Applied Sciences, Trakya University, Kesan,
22880, Turkey

ABSTRACT
Text classification is used to prevent the leakage of highly important institutional data through unauthorized channels. The results obtained from the text classification process should be integrated into the DLP architecture immediately. The data flowing through the network requires instant control, and the flow of sensitive data should be prevented. Machine learning methods are required to perform the text classification that will be integrated into the DLP architecture. This paper presents the experimental results of the comparison of text classification methods to be used in the interface written on the ICAP protocol, prepared in the networked architecture developed for the DLP system. The text classification method to be used in the instant control of sensitive data has also been chosen. The DLP text classification architecture developed helps decide the classification method through the examination of data in motion. The chosen text classification method is applied over the ICAP protocol, providing analysis and confidentiality of the sensitive data.

KEYWORDS
Decision support systems, Data Leak Prevention, Data in Motion, Security, ICAP

1. Introduction

This study presents the experimental results of comparing the text classification methods to be used for the interface to be written in ICAP (Internet Content Adaptation Protocol) within the networked architecture developed for DLP (Data Loss Prevention). In addition, it examines methods for classifying data in motion for data loss prevention. The networked architecture developed for DLP controls the data flow using ICAP. Our purpose is to choose the most suitable text classification method for the interface, which will be written for ICAP in the C programming language; this study determines the classification method to be chosen and programmed.
Text classification is the process of dividing written documents into certain classes depending on their contents [1]. The aim of text classification is to determine which preset category a document falls into by taking its features into consideration. It is essential that data be used only under the conditions the institution allows, and that probable damage to the institution's data organization be minimized. Here, the aim of the text classification is to determine which category the data belongs to: sensitive, confidential, or normal.
DLP text classification allows the data shared on the institutional network to be classified, and it automatically blocks data of high importance to the institution. The DLP architecture developed here compares the performance values of the classification algorithms across the feature extraction techniques and weighting methods used in text classification.
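As a concrete illustration of this gating step, the following minimal sketch shows how a classifier's label could drive the allow/block decision for a document in motion. It is not from the paper; the names gate, classify, and Verdict are illustrative assumptions.

```python
# Minimal sketch of the DLP gate: a trained classifier labels each
# intercepted document, and the flow is blocked when the label is one
# of the sensitive categories named in the paper.
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"

SENSITIVE_LABELS = {"sensitive", "confidential", "secret"}

def gate(document_text: str, classify) -> Verdict:
    """Classify a document in motion and decide whether it may pass."""
    label = classify(document_text)   # e.g. "sensitive" or "normal"
    return Verdict.BLOCK if label in SENSITIVE_LABELS else Verdict.ALLOW
```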

2. Materials and Methods

The text2arff software developed by Amasyalı et al. performs feature extraction on texts with various methods, digitizes the texts with the help of weighting methods, and converts them to ARFF, the input file format of the WEKA [2] program; it has been used to produce the arff files in the system developed here [3].
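For illustration, the sketch below mirrors that pipeline with scikit-learn vectorizers in place of text2arff and WEKA. Treating the paper's "frequency" value as a max_features cap, and the 2-grams as character bigrams, are both assumptions.

```python
# Illustrative analogue of the text2arff step: extract 2-gram or word
# features and weight them with tf or tf*idf, as in Table 1.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["quarterly earnings rose sharply",
        "troop movements in the region are classified"]

# Experiments 1-2 style: 2-gram features weighted with tf*idf.
bigram_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(2, 2),
                               max_features=50)
X_bigrams = bigram_tfidf.fit_transform(docs)

# Experiments 3-4 style: word features weighted with raw term frequency (tf).
word_tf = CountVectorizer(analyzer="word", max_features=50)
X_words = word_tf.fit_transform(docs)

print(X_bigrams.shape, X_words.shape)   # (2, n_features) document-term matrices
```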
Sixteen different feature vectors of 2-grams and words, obtained by parsing, have been extracted using grammatical and statistical features together with the K-Means clustering algorithm, and classification has been carried out over these feature vectors. Whether a document contains sensitive data is determined with Naive Bayes, one of the machine learning methods, Support Vector Machines (SVM), the k-nearest neighbor algorithm (IBk), and decision trees (J48). Ten-fold cross-validation has been chosen as the test option for the DLP architecture, and the f-measure value has been used to determine the performance of the classifiers.
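A minimal sketch of that comparison follows, with scikit-learn counterparts standing in for the WEKA implementations (NuSVC for the nu-SVC SVM, KNeighborsClassifier for IBk; DecisionTreeClassifier only approximates J48), scored with 10-fold cross-validated f-measure as in the paper.

```python
# Compare the four classifier families on a document-term matrix.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import NuSVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y):
    """X: document-term matrix (non-negative, for MultinomialNB); y: labels."""
    models = {
        "Naive Bayes": MultinomialNB(),
        "SVM (nu-SVC)": NuSVC(),
        "kNN (IBk)": KNeighborsClassifier(n_neighbors=4),
        "Decision tree (~J48)": DecisionTreeClassifier(),
    }
    for name, model in models.items():
        f1 = cross_val_score(model, X, y, cv=10, scoring="f1_macro")
        print(f"{name}: mean f-measure = {f1.mean():.4f}")
```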

2.1. Data Set

The DLP architecture decides whether a data flow will be allowed according to the categories obtained from the text classification. The data set consists of 800 documents in English ranging from 1 KB to 36 KB. The sensitive data comprise 400 documents labeled as secret, confidential, or sensitive, involving warfare correspondence belonging to the USA [4]. The normal data have been taken from 400 news items concerning economics, sports, social, and other subjects. Classification begins with 10% of these data: 40 documents involving sensitive data (class 1) and 40 documents involving normal data (class 2) form the two initial classes, and the number is increased by five documents per class each time. For instance, the number of documents per class was 40 for the first trial, 45 for the second, and 50 for the third; in the last trial it reached 400 per class, adding up to 800 in total. Classification has been performed using the two feature vectors (2-gram and words) separately for each data set. During the preparation of the feature vectors, the maximum frequency value for 2-grams and words was set to 10 and 50 in turn, and k was fixed at the constant value 50.
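The incremental set-up can be sketched as follows; sensitive_docs and normal_docs are assumed lists of 400 raw texts each, and vectorize and evaluate stand for the feature-extraction and cross-validation steps sketched earlier.

```python
# Grow the training set from 40 to 400 documents per class in steps of 5
# (80 .. 800 documents in total), scoring the classifier at each size.
def training_sizes(start=40, stop=400, step=5):
    """Per-class document counts: 40, 45, 50, ..., 400."""
    return range(start, stop + 1, step)

def run_learning_curve(sensitive_docs, normal_docs, vectorize, evaluate):
    curve = []
    for n in training_sizes():
        docs = sensitive_docs[:n] + normal_docs[:n]
        labels = ["sensitive"] * n + ["normal"] * n   # class 1 / class 2
        score = evaluate(vectorize(docs), labels)
        curve.append((2 * n, score))                  # (total docs, f-measure)
    return curve
```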


3. Experiments and Results

While the weighting was being carried out, the features of the arff files produced with the text2arff software were determined according to the choices given in Table 1.
Table 1. Parameter choices of the experiments performed

Experiment     Method   Tf / Tf×idf   Frequency   k
Experiment 1   2-Gram   Tf×idf        10          50
Experiment 2   2-Gram   Tf×idf        50          50
Experiment 3   Words    Tf            10          50
Experiment 4   Words    Tf            50          50

The choices given above have been applied with four machine learning methods: Naïve Bayes, SVM, IBk, and J48. For the k-nearest neighbor algorithm, k values from 1 to 30 have been evaluated, and the k value giving the best outcome has been determined separately for each experiment. nu-SVC has been used as the SVM type in the Support Vector Machines.
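A sketch of that k search, assuming the feature matrix X and labels y from the earlier sketches:

```python
# Evaluate k = 1 .. 30 with 10-fold cross-validated f-measure and keep
# the best value, as done separately for each experiment in the paper.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X, y, k_max=30):
    best_value, best_score = 1, -1.0
    for k in range(1, k_max + 1):
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X, y, cv=10, scoring="f1_macro").mean()
        if score > best_score:
            best_value, best_score = k, score
    return best_value, best_score   # e.g. k = 4 in experiment 1, k = 7 in experiment 2
```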
In the first experiment, the training sets produced by the text2arff software are learned by the classification algorithms in runs carried out with the WEKA software. The average f-measure outcomes of this learning process are shown in Figure 1; the classifiers are compared and their performances evaluated according to these outcomes.


Figure 1. Experiment 1 results

Table 2 shows the highest and lowest accuracy rates of the classifiers used in experiment 1 at the end of the learning process. According to this table, the Naive Bayes classifier's rate of classifying the samples accurately is higher than that of the other classifiers.
Table 2. The highest and lowest percentages of accurately classified samples in experiment 1

                Naive Bayes   J48       SVM       IBk
Maximum (Max)   96.8852       94.1538   96.25     84.4444
Minimum (Min)   85.7143       76.25     75.5932   69.2105

In the algorithms above, all parameters have been used with their default values; only for the k-nearest neighbor algorithm has k been set to 4.
Among the algorithms, the highest success belongs to Naive Bayes. A fluctuating structure has been observed in the Support Vector Machines, which rank second in learning rate. The decision trees are more stable than the support vector machines and rank third in learning rate. The worst learning behavior is seen in the k-nearest neighbor algorithm.
The average f-measure outcomes obtained at the end of the learning process in the second experiment are shown in Figure 2; the classifiers are compared and their performances evaluated according to these outcomes.

Figure 2. Experiment 2 results

Table 3 shows the highest and lowest accuracy rates of the classifiers used in experiment 2 at the end of the learning process. According to this table, the IBk classifier's rate of classifying the samples accurately is higher than that of the other classifiers.
Table 3. The highest and lowest percentages of accurately classified samples in experiment 2

                Naive Bayes   J48       SVM       IBk
Maximum (Max)   95.9649       94.4444   96.25     96.25
Minimum (Min)   89.1667       75        75.5932   80.4255

The Support Vector Machines have produced the same outcomes as in experiment 1. The decision trees have also produced a successful outcome: since their frequency value in experiment 2 of Table 1 was increased to 50, they have worked in a more stable manner. The k-nearest neighbor value has been chosen as k = 7. In comparison to experiment 1, the k-nearest neighbor algorithm has increased its success and yielded a better outcome.
The average f-measure outcomes obtained at the end of the learning process in the third experiment are shown in Figure 3; the classifiers are compared and their performances evaluated according to these outcomes.

Figure 3. Experiment 3 results

Table 4 shows the highest and lowest accuracy rates of the classifiers in experiment 3 at the end of the learning process. According to this table, the Naive Bayes classifier's rate of classifying the samples accurately is higher than that of the other classifiers.
Table 4. The highest and lowest percentages of accurately classified samples in experiment 3

                Naive Bayes   J48       SVM       IBk
Maximum (Max)   98.75         93        96.5909   90
Minimum (Min)   92.6316       81.6667   76.1538   61.8182


As for the method, words have been used instead of n-grams, and among all the algorithms the highest success again belongs to Naive Bayes. With the words method, the values of all algorithms except the k-nearest neighbor algorithm have increased. The value of the k-nearest neighbor algorithm has been determined as k = 1; for this algorithm, as the number of texts grew, the classification success decreased.
The average f-measure outcomes obtained at the end of the learning process in the fourth experiment are shown in Figure 4; the classifiers are compared and their performances evaluated according to these outcomes.

Figure 4. Experiment 4 results

Table 5 shows the highest and lowest accuracy rates of the classifiers used in experiment 4 at the end of the learning process. According to this table, the Naive Bayes classifier's rate of classifying the samples accurately is higher than that of the other classifiers.
Table 5. The highest and lowest percentages of accurately classified samples in experiment 4

                Naive Bayes   J48       SVM       IBk
Maximum (Max)   98.75         92.5      96.9118   95.375
Minimum (Min)   93.3333       81.9231   73.5714   72.7273

Words have been used instead of n-grams again, and the frequency value has been changed to 50 during the formation of the arff files. Naive Bayes has shown the highest success and the most stable learning among the algorithms. In the support vector machines, the fluctuation lessens and the success rate reaches its highest point. The decision trees show their lowest success rate in experiment 4; however, learning continues to increase after a certain point. The k-nearest neighbor algorithm's success is as high as in experiment 2, rising steadily from the lowest learning level; its value has been chosen as k = 12. As stated in Table 1, the success of the k-nearest neighbor algorithm increased when the frequency value was raised to 50 in the formation of the experiment's arff files.

4. Discussion

The following problems and discussion points apply to the algorithms to be used in the DLP text classification architecture. The differences among the algorithms may be due to the following reasons:

 Operations performed at the data preprocessing and data completion stage, such as attribute selection, affect model extraction and can therefore influence the analysis outcomes.
 Data prepared with different preprocessing approaches lead to different analysis results.
The factors affecting the classification can be as follows:

 Differences among the algorithms,
 Features specific to the data set,
 Incompatibility between the method and the problem.

The features specific to the data set can be as follows:

 Class ambiguity,
 Insufficient number of samples.

Class ambiguity refers to situations in which the features given in the classification problem do not allow any classification algorithm to make a distinction. Another factor that makes classification more difficult is the scarcity of data: classifying cases that are not represented by enough examples to constrain the classifiers' generalization mechanism is most likely to amount to random guessing. Naïve Bayes, a linear classifier under the normal distribution assumption [5], can turn into a nonlinear classifier when a kernel density estimator is used [6].
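To illustrate the point from [6], the sketch below swaps each per-feature Gaussian of Naive Bayes for a one-dimensional kernel density estimate, which makes the decision boundary nonlinear while keeping the naive independence assumption. This is illustrative code (dense inputs, an assumed fixed bandwidth), not the classifier used in the paper.

```python
# Naive Bayes whose per-feature class-conditional densities are KDEs.
import numpy as np
from sklearn.neighbors import KernelDensity

class KDENaiveBayes:
    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_priors_ = {c: np.log(np.mean(y == c)) for c in self.classes_}
        # One 1-D KDE per (class, feature): independence is kept, only the
        # per-feature density model changes from Gaussian to KDE.
        self.kdes_ = {
            c: [KernelDensity(bandwidth=self.bandwidth).fit(X[y == c][:, [j]])
                for j in range(X.shape[1])]
            for c in self.classes_
        }
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = np.vstack([
            self.log_priors_[c] + sum(kde.score_samples(X[:, [j]])
                                      for j, kde in enumerate(self.kdes_[c]))
            for c in self.classes_
        ])
        return self.classes_[np.argmax(scores, axis=0)]
```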

5. Conclusion

Naive Bayes has given the highest learning performance in the DLP text classification architecture using the words-based weighting, and this high learning rate did not change for the words weighting prepared with an altered frequency value. The learning curve of the k-nearest neighbor classifier reached the intended level on the training data prepared for experiment 4 in the DLP text classification architecture; yet its success level does not match that of Naive Bayes.


When all these outcomes are analyzed, it has been determined that the experiment 4 setting and the Naive Bayes classifier will form the basis of the DLP to be developed, because the Naive Bayes classifier achieved 98.75% accuracy on the training data prepared with the words feature extraction method, and its learning curve (the f-measure vs. data-size graph) has a less fluctuating structure. Therefore, during the preparation of the training data, words will be chosen as the feature extraction method, idf as the weighting method, 50 as the frequency value, and 50 for k, the repetition number. Naive Bayes has then been employed as the classifier in the training process, with the default parameters of this classifier in the WEKA environment. Besides this, the learning curve provides important information about the performance of the classifier: with 550 training documents instead of 800, approximately the same accuracy level was reached, with a considerable gain in performance.
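The chosen configuration could be assembled as in the sketch below; scikit-learn again stands in for WEKA, the idf weighting follows the conclusion's stated choice, and reading the frequency value as a cap on the 50 most frequent word features is an assumption.

```python
# Final configuration sketch: word features, frequency cap of 50,
# idf-based weighting, Naive Bayes with default parameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

final_model = make_pipeline(
    TfidfVectorizer(analyzer="word", max_features=50),  # words + idf weighting
    MultinomialNB(),                                    # default parameters
)
# Usage: final_model.fit(train_docs, train_labels)
#        final_model.predict(["document intercepted by the ICAP interface"])
```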

References

[1] Jackson, P. & Moulinier, I. (2002). "Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization". Amsterdam.
[2] Weka Project – http://sourceforge.net/projects/weka/
[3] Amasyalı, M. F., Davletov, F., Torayew, A., & Çiftçi, Ü. (2010). "text2arff: Türkçe Metinler İçin Özellik Çıkarım Yazılımı" [text2arff: Feature Extraction Software for Turkish Texts]. SİU 2010, Diyarbakır.
[4] Torture Archive – http://www.aladin0.wrlc.org/gsdl/cgi-bin/library?c=torture&a=q
[5] Manning, C. D., Raghavan, P., & Schütze, H. (2008). "Introduction to Information Retrieval". Cambridge University Press, New York.
[6] John, G. H., & Langley, P. (1995). "Estimating Continuous Distributions in Bayesian Classifiers". Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345.
