HYDERABAD
OCTOBER 13–14, 2015
Classifying issues from SR text
descriptions in Azure ML
George Simov
Data Scientist
Agenda
• Text Analytics concepts and terms
• Azure ML capabilities for text classification
• Implementation Details
• Spam Detection Model – binary classification
• Model for classifying issues from SR text descriptions – multi-class
classification
• Operationalization of the model
Text Analytics
Def: The term text analytics describes a set of linguistic, statistical, and machine learning
techniques that model and structure the information content of textual sources for
business intelligence, exploratory data analysis, research, or investigation.
Text Classification
• Binary Classification (for example: Spam Detection)
• Multiclass Classification (for example: Product classification by text description)
Text Clustering
• Grouping identical or similar text documents based on a distance/similarity function (usually cosine
similarity in the vector space model)
Sentiment Analysis
• Identify and extract subjective information in source materials
• Positive, Negative, Neutral
Named Entity Recognition
• Subtask of information extraction that seeks to locate and classify elements in text into
pre-defined categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
Text Representation – transform text into numerical vectors
Bag Of Words Model (Vector Space Model)
• Each dimension (axis) corresponds to a document feature.
• Features: words or phrases (bag of words model)
• TF (term frequency): number of occurrences of each word in a document
• TF-IDF (term frequency – inverse document frequency) table: weight assigned to each term describing a document:
Wij = TF * IDF = tfij * log(N / dfi)
TF – Term Frequency
IDF – Inverse Document Frequency
Wij – weight of the i-th term in the j-th document
tfij – frequency of the i-th term in document j
N – the total number of documents in the collection
dfi – the number of documents containing the i-th term
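As an illustration (not part of the original deck), a minimal R sketch that computes these weights by hand for a toy two-document corpus; the variable names mirror the formula above:

# toy corpus: one string per document
docs <- c("text classification is an important area in text analytics",
          "spam detection is a binary text classification problem")
tokens <- strsplit(tolower(docs), "\\s+")   # simple whitespace tokenizer
vocab  <- sort(unique(unlist(tokens)))
N      <- length(docs)

# tf[i, j] = number of occurrences of term i in document j
tf <- sapply(tokens, function(doc) sapply(vocab, function(t) sum(doc == t)))

# dfi = number of documents containing term i; Wij = tfij * log(N / dfi)
dfi <- rowSums(tf > 0)
W   <- tf * log(N / dfi)
round(W, 3)   # terms occurring in every document get weight 0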
• N-grams – representing text features
Example: Text classification is an important area in text analytics
2-grams:
Text classification | classification is | is an | an important | important area | area in | in text | text analytics
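A small R helper (illustrative, not from the deck) that reproduces the 2-grams above:

# build word n-grams by sliding a window of n words over the text
ngrams <- function(text, n = 2) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
ngrams("Text classification is an important area in text analytics", n = 2)
# "Text classification" "classification is" "is an" "an important" ...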
Example
Azure ML Text Classification Workflow
Step 1. Data Preparation
SQL queries, Excel, R, …
Result:
Label Text
1 This is a spam
0 This is a text that is not a spam
1 This is another spam
Step 2. Text Preprocessing
- Lower case, Remove stop words, Remove numbers, Stemming, Synonyms…anything that might be helpful
Implementation in AML – R script module
process.text <- function(textVector, b_tolower, b_removeWords, b_stemDocument, b_removeNumbers)
{
  library("tm")
  print("replace special characters with space")
  textVector <- gsub("[^0-9a-z]", " ", textVector, ignore.case = TRUE)
  if (b_removeNumbers == TRUE) {
    textVector <- gsub("[^a-z]", " ", textVector, ignore.case = TRUE)
  }
  # collapse repeated whitespace, then trim leading/trailing whitespace
  textVector <- gsub("\\s+", " ", textVector)
  textVector <- gsub("^\\s", "", textVector)
  textVector <- gsub("\\s$", "", textVector)
  # ...
  theCorpus <- Corpus(VectorSource(textVector))
  if (b_tolower == TRUE) {
    print("tolower ....")
    theCorpus <- tm_map(theCorpus, content_transformer(tolower))
  }
  if (b_removeWords == TRUE) {
    print("remove stopwords ....")
    theCorpus <- tm_map(theCorpus, removeWords, stopwords("english"))
  }
  if (b_stemDocument == TRUE) {
    print("word stemming ....")
    theCorpus <- tm_map(theCorpus, stemDocument, "english")
  }
  # ...
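A hypothetical call, assuming the elided tail of the function converts the corpus back to a character vector and returns it:

# assumed usage; process.text is expected to return the cleaned text
df <- data.frame(Label = c(1, 0, 1),
                 Text  = c("This is a spam",
                           "This is a text that is not a spam",
                           "This is another spam"),
                 stringsAsFactors = FALSE)
df$Text <- process.text(df$Text, b_tolower = TRUE, b_removeWords = TRUE,
                        b_stemDocument = TRUE, b_removeNumbers = FALSE)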
Step 3. Feature Representation and Extraction – 2 AML modules
- Feature Hashing (see the hashing-trick sketch after this list)
Parameters: Hashing bit size, N-grams
- Filter Based Feature Selection
- Properties
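To make the idea concrete, a toy R implementation of the hashing trick; the rolling hash below is our own stand-in, not the hash function the AML module actually uses:

# hashing trick: map every token to one of 2^bits buckets, no vocabulary needed
hash_token <- function(token, bits = 10) {
  # illustrative polynomial rolling hash over the character codes
  Reduce(function(h, ch) (h * 31 + ch) %% 2^bits, utf8ToInt(token), 0)
}

hash_features <- function(text, bits = 10) {
  counts <- numeric(2^bits)                      # one column per bucket
  for (tok in strsplit(tolower(text), "\\s+")[[1]]) {
    idx <- hash_token(tok, bits) + 1             # R vectors are 1-indexed
    counts[idx] <- counts[idx] + 1               # collisions simply add up
  }
  counts
}

x <- hash_features("text classification is an important area in text analytics")
which(x > 0)   # indices of the non-zero hashed features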
Step 4. Train Model
- Many binary and multiclass learners: linear regression, logistic regression, boosted decision tree, SVM, decision forest, ….
Step 5. Evaluate model
- Cross Validate Model
- Score Model
- Evaluate Model
Step 6. Visualization of the results and numerical metrics
- Binary Classifiers – Precision, Recall, F1 Score, AUC, AUC graphics
- Multiclass – confusion table, custom script for precision/recall calculations (sketch after this list)
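A minimal R sketch (with made-up toy labels) of the per-class precision/recall computation such a custom script performs:

# per-class metrics from a confusion matrix (rows: actual, cols: predicted)
actual    <- factor(c("A", "A", "B", "C", "B", "A", "C", "C"))
predicted <- factor(c("A", "B", "B", "C", "B", "A", "A", "C"),
                    levels = levels(actual))
cm <- table(actual, predicted)

precision <- diag(cm) / colSums(cm)   # TP / (TP + FP), per class
recall    <- diag(cm) / rowSums(cm)   # TP / (TP + FN), per class
f1        <- 2 * precision * recall / (precision + recall)
data.frame(precision, recall, f1)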
Spam Detection in answers.microsoft.com forums
Business Scenario:
Automatic spam detection in answers.microsoft.com threads.
Today, many volunteers and MS FTEs spend a lot of time and
effort cleaning spam messages out of the forums. The
solution is automatic spam detection.
Example POC: spam detection in AML, based on the message
content.
Spam Detection Model
Spam Detection Data
Spam Detection Experiment Results – test data
Predicting Products/Issues by SR problem description
Business Scenario:
The Azure support portal (Ibiza) wants to get rid of the user selections for the product and the
problem/issue, because users make mistakes or select “Other” when they are confused about what to select.
This leads to SR mis-routing and hence slows down issue resolution. (We have seen up to 9
SR transfers during an SR life cycle.)
Azure Support Portal
Customer ‘accuracy’ compared
to SE selection:
• ~ 75% - Service (level 0)
• ~ 50% - Feature (level 1)
• ~ 25% - Issue (level 2)
Why?
• Too many topics, customers
cannot discriminate
• Poorly defined topics
• Customers seldom traverse up
the tree to find more relevant
topics
• Customers don’t know how to
classify their symptoms
• Enter anything to talk to assisted
support
Consequence
• Less self-help, more support
volume
• Poor routing, more MPI
Support Topic Taxonomy
Note: current MOP experience; per the POR, it is to be
replaced with a text-input-only ‘Maven’ UI
Current Office 365 online support experience
Note: current MOP experience; per the
POR, it is to be replaced with a
text-input-only ‘Maven’ UI
Predicting O365 Products by Problem Description
Predicting O365 Products by Problem Description - RESULTS
Predicting O365 Products by Problem Description - RESULTS
Predicting O365 Issues by Problem Description - Model
Predicting O365 Issues by Problem Description - RESULTS
Category/Issue Frequencies – result after executing the preprocessing R script
Predicting O365 Issues by Problem Description:
Cross-Validation RESULTS
Predicting O365 Issues by Problem Description – Analysis
1. Accuracy is not high enough to rely on the problem descriptions alone (i.e., to drop the manual selections completely)
2. Idea for the functionality, based on the results from the ML model:
- Sort/rank the products and the problems/issues in the selection list boxes by the probability returned from the
ML model.
Expected result: fewer wrong selections, based on the assumption that the user will find the correct
selection options at the beginning of the list.
This is an example of the usefulness of ML models even when they cannot solve a problem completely.
Operationalization
1. Azure ML automatically creates a REST web service (a hedged call sketch follows this list)
2. Azure ML provides an easy way to deploy the production version of the model to a production environment.
3. Performance – slower than TLC
4. Poor debugging capabilities
5. Poor code instrumentation/troubleshooting capabilities
6. Scalability – deployment on a limited set of machines (16)
Consider all of the above pros/cons when deciding whether to put an AML model in production.
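For illustration only: a minimal R sketch of calling such a request-response endpoint with httr. The URL, API key, input column name, and example text below are placeholders, not the actual service definition; copy the real values from the web service dashboard in AML:

library(httr)
library(jsonlite)

# placeholders: take the real endpoint URL and API key from the AML dashboard
url     <- "https://<region>.services.azureml.net/workspaces/<ws>/services/<svc>/execute?api-version=2.0"
api_key <- "<your-api-key>"

# classic AML request-response services expect an "Inputs" JSON envelope
body <- list(Inputs = list(input1 = list(
  ColumnNames = list("Text"),
  Values      = list(list("my vm cannot connect to the storage account"))
)))

resp <- POST(url,
             add_headers(Authorization = paste("Bearer", api_key)),
             content_type_json(),
             body = toJSON(body, auto_unbox = TRUE))
content(resp)   # scored label and class probabilities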
Thank you

Editor's Notes

  • #17: But the reality is, we ask the customer to select from too many topics, many of which are confused with others. Customers’ ability to reliably select the right symptom falls to 25% when compared with what the Support Engineer would choose. (NOTE: we are moving to PFAs for a better ‘ground truth’.) While the symptom tree (called support topics) is only 3 levels deep, it is very broad and growing as new products and features are introduced to O365. As you can see, the top ‘Service’ level has 19 classes. Each service has between 3 and 23 issue groups (which we call features), and each feature bucket contains anywhere from 4 to 38 issues. Overall, there are 1,300 possible topics to choose from! The dilemma is how to surface a reduced but relevant taxonomy to a customer. What’s the cost? When customers do not properly self-classify, they can’t be provided the best self-help. They consequently submit a service request or call assisted support, where the cost per service request is high. Incorrect symptom self-classification also increases the chance of mis-routes – the wrong team getting the request. Even if one argues that customers self-select at 80% accuracy rather than 25%, the costs, with a volume of 100,000 cases per month, are in the millions per year.
  • #18: Here is a screenshot of the existing customer experience in Office 365, where the customer first selects the top service level, then the feature and symptom level, and then describes their issue.
  • #27: # of apps supporting big data / data lake solutions (COSMOS/HDI etc.); # of apps enabled for near-real-time services; # of apps supporting data insights; # of applications supporting self-service capabilities