Identifying and classifying
unknown Network Disruption
Introduction
With the evolution of modern technology and the drastic increase in the scale of network communication, more
and more network disruptions in traffic and private protocols are taking place. Identifying and classifying
unknown network disruptions can support operations and even help to maintain backup systems. Furthermore,
research on identifying and classifying unknown network disruptions supports tasks such as detecting illegal
network monitoring, intrusion detection, and day-to-day analysis of the network, all of which help to ensure
correct network behaviour. Network disruptions can be identified in several ways. The traditional method,
which relies on fixed port numbers, can easily be cheated by changing the port numbers in the system. Deep
Packet Inspection is a protocol identification technique widely used by organizations around the world at
present, but it has limitations: for example, resource consumption can be very high when working with its
feature database.
Problem Statement
The main objective is to predict the network fault severity at a particular location based on the available log
data. The data was collected from the Kaggle data repositories and consists of various features that help
determine the fault severity in the network. The datasets/log files used here are event_type.csv,
log_feature.csv, resource_type.csv, and severity_type.csv.
The target class variable, Severity type, has 3 classes (0, 1, and 2) representing the fault severity of the network.
“Fault severity” is a measurement of actually reported faults from users of the network and is the target variable.
Related Works
• Hong et al. proposed an application layer protocol identification method that combines traditional Deep
Packet Inspection with clustering methods. It can effectively classify and identify unknown application layer
protocols, which can in turn help to protect against network disruptions.
• Peng et al. proposed a way of classifying and identifying network disruptions that uses mathematical statistics
to calculate the k value and the initial cluster centers of the K-Means clustering algorithm.
• Similarly, Zhang et al. proposed a way of identifying and classifying the network by combining the
traditional AGNES hierarchical clustering algorithm with the features of bitstream data frames. This method
has been shown to identify the number of clusters automatically and to classify unknown bitstream data
frames.
Contribution of objective
• As the world moves towards a new age of technology and the number of users on different networks
increases minute by minute, more and more network disruptions emerge and can pose a very serious threat to
organizations.
• An artificial intelligence method is used in this paper to explore autonomous classification and identification
of unknown network protocols, in order to reduce the time and labor cost of network disruption classification
and identification. We first take a dataset in which each row corresponds to a location and a time point. The
data is pre-processed and modeled using three machine learning algorithms. Finally, we compare which of the
three algorithms gives the best accuracy.
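The comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn is installed, uses synthetic three-class data from make_classification in place of the Kaggle logs, and the names models, scores, and best are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the merged log data (assumption): 3 classes,
# mirroring severity classes 0, 1, 2.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three algorithm families named in the slides, with default-like settings.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the same split and record its held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
print("best:", best)
```

Using one shared train/test split keeps the comparison fair; which model wins on synthetic data says nothing about the results on the real logs.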
Block Diagram
[Figure: block diagram with the elements Training Dataset, Testing Dataset, Algorithm, Evaluation, Model,
Production data, and Prediction.]
Machine Learning Workflow
We can define the machine learning workflow in 5 stages.
• Gathering data
• Data pre-processing
• Researching the model that will be best for the type of data
• Training and testing the model
• Evaluation
A machine learning model is a piece of code that an engineer or data scientist trains on data according to the
needs of the project, so that it learns from the data and can predict or give the solution we want whenever we
ask. When we give the model new data, it returns a predicted value based on its training. The trained model
might or might not perform well on the test data, for various reasons, so before training any model we need to
make sure that the chosen algorithm is appropriate for the target we want to predict and for the data we are
using.
Supervised Learning
Supervised learning is a branch of machine learning in which each row of the dataset is tagged with a
particular label, known as the target class. Supervised learning is divided into two categories:
“Classification” and “Regression”.
Classification:
• A classification problem is one where the target variable is categorical (i.e., the output variable consists of
classes such as Class A or Class B; there may be 2 classes or more).
Regression:
• A regression problem is one where the target variable is continuous (i.e., the output is numeric). It can be
thought of as a problem where we forecast the future or estimate a quantity we do not currently know
(examples: house price prediction, stock market trends).
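The difference between the two problem types shows up in the type of the target. The sketch below is an illustration only, assuming scikit-learn; the six-row dataset and the price values are made up.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# One numeric feature, six rows (made-up values).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

# Classification: the target is a category (here classes 0, 1, 2).
y_class = [0, 0, 1, 1, 2, 2]
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
pred_class = clf.predict([[5.5]])[0]

# Regression: the target is a continuous number (here a made-up price).
y_price = [10.5, 20.1, 29.8, 40.2, 50.0, 61.3]
reg = DecisionTreeRegressor(random_state=0).fit(X, y_price)
pred_price = reg.predict([[2.1]])[0]

print("predicted class:", pred_class)
print("predicted price:", pred_price)
```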
Unsupervised Learning
Unsupervised learning is another branch of machine learning in which, unlike supervised learning, the rows of
the data have no labels, so the model tries to segregate the data based on the available features. In simple
terms, it groups the data into clusters. A key difficulty in unsupervised learning is finding the optimal k value
(the number of clusters we would like to make).
Clustering:
• Clustering is the process of learning to assign labels to examples from an unlabelled dataset. Because the
dataset is completely unlabelled, deciding whether the learned model is optimal is much more complicated
than in supervised learning.
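One common way to approach the choice of k is to try several values and score each clustering. The sketch below uses the silhouette score, which is an assumption of this example rather than a method named in the slides; it relies on scikit-learn and on synthetic data from make_blobs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Unlabelled toy data; the true labels returned by make_blobs are discarded.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Cluster with each candidate k and record the mean silhouette score
# (higher means tighter, better separated clusters).
silhouette_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print("best k:", best_k)
```

The score is only a heuristic: without labels there is no ground truth to check the chosen k against, which is the difficulty the slide describes.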
Overview of the Machine Learning Models
Machine Learning
• Supervised
  ∙ Classification: SVM, K-Nearest Neighbors, Naïve Bayes, Decision Tree, Random Forest, Neural Networks
  ∙ Regression: Linear Regression, SVR, GPR, Ensemble Methods, Decision Tree, Neural Networks
• Unsupervised
  ∙ Clustering: DBSCAN, Hierarchical, Gaussian Mixture, K-Means, HDBSCAN
Training and Testing the model
• Training is the most important part of building any machine learning project. We train the model on the
available data so that it learns the patterns in the data. Once the model has learned from the training data,
we give it a second dataset to evaluate how well it is performing. If it performs well, we then test it on the
test data to obtain its final performance, which can be measured with metrics such as accuracy, recall, and
precision, for example through a classification report.
• The whole process of building and deploying a model uses 3 different datasets: ‘Training data’,
‘Validation data’, and ‘Testing data’. These can be produced with train_test_split(); each call splits the
data into two parts, so obtaining three sets takes two calls.
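A three-way split can be sketched with two calls to scikit-learn's train_test_split. The 60/20/20 proportions, the synthetic data, and the choice of a random forest as the model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_classes=3, random_state=1,
)

# First call: hold out 20% of all rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
# Second call: 25% of the remaining 80% becomes validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1
)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Validation data guides model choices; the test data is used once at the end.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
test_report = classification_report(y_test, model.predict(X_test))
print(f"validation accuracy: {val_accuracy:.3f}")
print(test_report)
```

The classification report lists precision, recall, and F1 score per class, which covers the metrics mentioned above in a single call.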
Methodologies
Dataset descriptions:
∙ event_type.csv: type of event related to the main dataset
∙ log_feature.csv: features extracted from log files
∙ resource_type.csv: resource type related to the main dataset
∙ severity_type.csv: severity type of a warning message coming from the log
All of the above CSV files (that is, all files other than train.csv, test.csv, and sample_submission.csv) have
been merged into a single CSV file based on a specific primary key.
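The merge step might look like the sketch below, which uses pandas. The slides do not name the primary key or the column names, so the key "id" and every column and value here are assumptions; small in-memory frames stand in for the CSV files so the example runs on its own.

```python
from functools import reduce

import pandas as pd

# Tiny in-memory stand-ins for the four CSV files. With the real files these
# would come from pd.read_csv("event_type.csv") and so on. The shared key
# "id" and all other column names and values are hypothetical.
event_type = pd.DataFrame(
    {"id": [1, 2, 3], "event_type": ["type_a", "type_b", "type_a"]}
)
log_feature = pd.DataFrame(
    {"id": [1, 2, 3], "log_feature": ["feat_x", "feat_y", "feat_z"], "volume": [6, 14, 2]}
)
resource_type = pd.DataFrame(
    {"id": [1, 2, 3], "resource_type": ["res_1", "res_2", "res_2"]}
)
severity_type = pd.DataFrame(
    {"id": [1, 2, 3], "severity_type": ["sev_1", "sev_2", "sev_1"]}
)

tables = [event_type, log_feature, resource_type, severity_type]

# Join the tables one after another on the shared key.
merged = reduce(lambda left, right: left.merge(right, on="id", how="inner"), tables)
print(merged)
```

If a file holds several rows per key, a plain merge multiplies rows, so such files are usually aggregated or pivoted to one row per key first.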
Algorithms
The Random Forest Classifier
• Random Forest is a popular supervised machine learning algorithm. After the decision tree, it is one of the
most widely used algorithms, and it performs well on many kinds of datasets, for both classification and
regression. It is based on ensemble learning, the process of combining multiple models to solve a complex
problem: the final result is either the average of all the trees' outputs (regression) or the mode of their
predictions (classification).
• A greater number of trees in the forest generally leads to higher accuracy and helps reduce
overfitting.
Note: This might not apply to every case.
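The effect of the number of trees can be checked empirically. The sketch below assumes scikit-learn and synthetic data; the tree counts 1, 10, and 100 are arbitrary choices, and the resulting accuracies are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=800, n_features=15, n_informative=6,
    n_classes=3, random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Train forests of increasing size and record held-out accuracy for each.
# scikit-learn combines the trees by averaging their predicted class
# probabilities, which is closely related to taking the mode of their votes.
accuracy_by_trees = {}
for n_trees in (1, 10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=7)
    forest.fit(X_train, y_train)
    accuracy_by_trees[n_trees] = forest.score(X_test, y_test)

for n_trees, acc in accuracy_by_trees.items():
    print(f"{n_trees:>3} trees: {acc:.3f}")
```

Accuracy typically rises and then flattens as trees are added, while training time keeps growing, which is one reason the note above holds.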
Decision Tree
A decision tree, as the name suggests, creates a tree of nodes in which each internal node denotes a test on an
attribute and each branch represents an outcome of the test. The last nodes are called leaf nodes (terminal
nodes): no further nodes are attached to them, and each one holds a class label. The decision tree is one of
the most popular algorithms in machine learning and, like a random forest, can be used for both classification
and regression. Decision trees are also an exception when it comes to data scaling and data transformation:
because a decision tree works like a flowchart of branches, scaling and transforming the data may be
optional.
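The flowchart structure and the point about scaling can both be illustrated with a small sketch. It assumes scikit-learn and NumPy, uses synthetic data, and the feature names f0 to f3 are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic 3-class data with four features (assumption).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=3,
)

# Fit a shallow tree on the raw features and print it as nested tests.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X, y)
print(export_text(tree_raw, feature_names=["f0", "f1", "f2", "f3"]))

# Fit the same tree on standardized features. Each split only asks whether a
# feature is above or below a threshold, so rescaling a feature moves the
# threshold but should leave the grouping of the rows unchanged.
X_scaled = StandardScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X_scaled, y)

agreement = np.mean(tree_raw.predict(X) == tree_scaled.predict(X_scaled))
print("share of identical predictions:", agreement)
```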
Gradient Boosting
• Gradient boosting is a technique for building predictive models, most commonly used for regression and
classification. The prediction model is typically an ensemble of decision trees. Like other boosting methods,
gradient boosting builds the model in stages, and it generalizes them by allowing the optimization of an
arbitrary differentiable loss function.
• The diagram below explains how gradient boosted trees are trained for regression problems.
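The stage-wise idea can be written out by hand for the squared-error loss. This is a simplified sketch for regression, not the implementation used in the project or in any library: it assumes NumPy and scikit-learn, synthetic data, and arbitrary values for learning_rate, n_stages, and the tree depth.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_stages = 100

# Stage 0: start from a constant prediction (the mean minimises squared error).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_stages):
    # For squared-error loss, the negative gradient with respect to the
    # current prediction is the residual (up to a constant factor).
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a damped copy of the new tree's output to the running prediction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

Each stage fits a small tree to what the current ensemble still gets wrong. With a different differentiable loss, only the residual line changes, to the negative gradient of that loss.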
Data Overview
Visual Analysis
Algorithm Results
Random Forest Classifier
Decision Tree Classifier
Gradient Boosting
Conclusion and Future Scope
• The main objective of the project, discussed throughout, is to classify and identify unknown network
disruptions using ML algorithms. In this method, we first extracted the disrupted data information from the
network traffic. The dataset was then cleaned and pre-processed to bring the data to a common scale that the
machine can work with, and in the process we merged all the files into one to get a better understanding of
the data and to help classify and identify the fault severity. Finally, feature engineering was used to select the
feature vectors intelligently, so that unknown network disruptions can be classified and identified efficiently
and accurately. This method makes full use of the advantages of machine learning algorithms. While
maintaining classification and identification accuracy, it avoids the complex steps of manually extracting
features and reduces both the training time of the algorithm and the amount of labelled data required.
• As future scope, we hope to try out different algorithms to optimize the feature output process, increasing
the feature similarity within the same kind of disruption data and widening the differences between different
kinds, to improve the model's representation capability. We will also do further research on encrypted
traffic and try to use neural networks to find the potential characteristics of encrypted data.
References
1. Hong Z, Gong Q, Feng W, Li Y. Unknown Application Layer Protocol Identification Based on Adaptive
Clustering. Computer Engineering and Applications, 2020, 56(05): 109-117.
2. Zhang F, Zhou H, Zhang J, Liu Y, Zhang C. A protocol classification algorithm based on improved AGNES.
Computer Engineering and Science, 2017, 39(04): 796-803.
3. Li R, Xiao X, Ni S, et al. Byte segment neural network for network traffic classification[C]//2018 IEEE/ACM
26th International Symposium on Quality of Service (IWQoS). IEEE, 2018: 1-10.
4. Guo L. Research on Multi-Business Identification Technology Oriented High-Speed Network Management
and Control. Doctor, The PLA Information Engineering University, Zhengzhou, Henan, China, 2012.
5. Wang W, Zhu M, Zeng X, et al. Malware traffic classification using convolutional neural network for
representation learning[C]//2017 International Conference on Information Networking (ICOIN). IEEE, 2017:
712-717.
6. Feng W, Hong Z, Wu L, Fu M. Review of network protocol identification techniques. Computer Applications,
2019, 39: 3604-3614.
About TechieYan Technologies
We offer project trainings, engineering workshops, internships, and laboratory setup. We work on projects
related to robotics, Python, deep learning, artificial intelligence, IoT, embedded systems, MATLAB, HFSS,
PCB design, VLSI, and current IEEE projects.
Address: 16-11-16/V/24, Sri Ram Sadan, Moosarambagh, Hyderabad 500036
Phone: 91 7075575787
Website: https://techieyantechnologies.com
