HYDERABAD
OCTOBER 13–14, 2015
Classifying issues from SR text
descriptions in Azure ML
George Simov
Data Scientist
Agenda
• Text Analytics concepts and terms
• Azure ML capabilities for text classification
• Implementation Details
• Spam Detection Model – binary classification
• Model for classifying issues from SR text descriptions – multi-class
classification
• Operationalization of the model
Text Analytics
Def: The term text analytics describes a set of linguistic, statistical, and machine learning
techniques that model and structure the information content of textual sources for
business intelligence, exploratory data analysis, research, or investigation.
Text Classification
• Binary Classification (for example: Spam Detection)
• Multiclass Classification (for example: Product classification by text description)
Text Clustering
• Grouping identical or similar text documents based on a distance/similarity function (usually cosine
similarity in the vector space model)
Sentiment Analysis
• Identify and extract subjective information in source materials
• Positive, Negative, Neutral
Named Entity Recognition
• Subtask of information extraction that seeks to locate and classify elements in text into
pre-defined categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
Text Representation – transform text into numerical vectors
Bag Of Words Model (Vector Space Model)
• Each dimension (axis) corresponds to a document feature.
• Features: words or phrases (bag of words model)
• TF (term frequency): number of occurrences of each word in a document
• TF-IDF (term frequency – inverse document frequency) table: weight assigned to each term describing a document:
Wij = TF * IDF = tfij * log(N / dfi)
TF – Term Frequency
IDF – Inverse Document Frequency
Wij – weight of the i-th term in the j-th document
tfij – frequency of the i-th term in document j
N – the total number of documents in the collection
dfi – the number of documents containing the i-th term
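As an illustration (not part of the original deck), a minimal R sketch that computes these weights by hand for a toy two-document corpus; the variable names mirror the formula above:

# toy corpus: one string per document
docs <- c("text classification is an important area in text analytics",
          "spam detection is a binary text classification problem")
tokens <- strsplit(tolower(docs), "\\s+")   # simple whitespace tokenizer
vocab  <- sort(unique(unlist(tokens)))
N      <- length(docs)

# tf[i, j] = number of occurrences of term i in document j
tf <- sapply(tokens, function(doc) sapply(vocab, function(t) sum(doc == t)))

# dfi = number of documents containing term i; Wij = tfij * log(N / dfi)
dfi <- rowSums(tf > 0)
W   <- tf * log(N / dfi)
round(W, 3)   # terms occurring in every document get weight 0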
• N-grams – representing text features
Example: Text classification is an important area in text analytics
2-grams:
Text classification | classification is | is an | an important | important area | area in | in text | text analytics
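A small R helper (illustrative, not from the deck) that reproduces the 2-grams above:

# build word n-grams by sliding a window of n words over the text
ngrams <- function(text, n = 2) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
ngrams("Text classification is an important area in text analytics", n = 2)
# "Text classification" "classification is" "is an" "an important" ...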
Example
Azure ML Text Classification Workflow
Step 1. Data Preparation
SQL queries, Excel, R, …
Result:
Label Text
1 This is a spam
0 This is a text that is not a spam
1 This is another spam
Step 2. Text Preprocessing
- Lower case, Remove stop words, Remove numbers, Stemming, Synonyms…anything that might be helpful
Implementation in AML – R script module
process.text <- function(textVector, b_tolower, b_removeWords, b_stemDocument, b_removeNumbers)
{
  library("tm")
  print("replace special characters with space")
  textVector <- gsub("[^0-9a-z]", " ", textVector, ignore.case = TRUE)
  if (b_removeNumbers == TRUE) {
    textVector <- gsub("[^a-z]", " ", textVector, ignore.case = TRUE)
  }
  # collapse repeated whitespace, then trim leading/trailing whitespace
  textVector <- gsub("\\s+", " ", textVector)
  textVector <- gsub("^\\s", "", textVector)
  textVector <- gsub("\\s$", "", textVector)
  # ...
  theCorpus <- Corpus(VectorSource(textVector))
  if (b_tolower == TRUE) {
    print("tolower ....")
    theCorpus <- tm_map(theCorpus, content_transformer(tolower))
  }
  if (b_removeWords == TRUE) {
    print("remove stopwords ....")
    theCorpus <- tm_map(theCorpus, removeWords, stopwords("english"))
  }
  if (b_stemDocument == TRUE) {
    print("word stemming ....")
    theCorpus <- tm_map(theCorpus, stemDocument, "english")
  }
  # ...
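A hypothetical call, assuming the elided tail of the function converts the corpus back to a character vector and returns it:

# assumed usage; process.text is expected to return the cleaned text
df <- data.frame(Label = c(1, 0, 1),
                 Text  = c("This is a spam",
                           "This is a text that is not a spam",
                           "This is another spam"),
                 stringsAsFactors = FALSE)
df$Text <- process.text(df$Text, b_tolower = TRUE, b_removeWords = TRUE,
                        b_stemDocument = TRUE, b_removeNumbers = FALSE)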
Step 3. Feature Representation and Extraction – 2 AML modules
- Feature Hashing (see the hashing-trick sketch after this list)
Parameters: Hashing bit size, N-grams
- Filter Based Feature Selection
- Properties
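To make the idea concrete, a toy R implementation of the hashing trick; the rolling hash below is our own stand-in, not the hash function the AML module actually uses:

# hashing trick: map every token to one of 2^bits buckets, no vocabulary needed
hash_token <- function(token, bits = 10) {
  # illustrative polynomial rolling hash over the character codes
  Reduce(function(h, ch) (h * 31 + ch) %% 2^bits, utf8ToInt(token), 0)
}

hash_features <- function(text, bits = 10) {
  counts <- numeric(2^bits)                      # one column per bucket
  for (tok in strsplit(tolower(text), "\\s+")[[1]]) {
    idx <- hash_token(tok, bits) + 1             # R vectors are 1-indexed
    counts[idx] <- counts[idx] + 1               # collisions simply add up
  }
  counts
}

x <- hash_features("text classification is an important area in text analytics")
which(x > 0)   # indices of the non-zero hashed features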
Step 4. Train Model
- Many binary and multiclass learners: linear regression, logistic regression, boosted decision tree, SVM, decision forest, ….
Step 5. Evaluate model
- Cross Validate Model
- Score Model
- Evaluate Model
Step 6. Visualization of the results and numerical metrics
- Binary Classifiers – Precision, Recall, F1 Score, AUC, AUC graphics
- Multiclass – confusion table, custom script for precision/recall calculations (sketch after this list)
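A minimal R sketch (with made-up toy labels) of the per-class precision/recall computation such a custom script performs:

# per-class metrics from a confusion matrix (rows: actual, cols: predicted)
actual    <- factor(c("A", "A", "B", "C", "B", "A", "C", "C"))
predicted <- factor(c("A", "B", "B", "C", "B", "A", "A", "C"),
                    levels = levels(actual))
cm <- table(actual, predicted)

precision <- diag(cm) / colSums(cm)   # TP / (TP + FP), per class
recall    <- diag(cm) / rowSums(cm)   # TP / (TP + FN), per class
f1        <- 2 * precision * recall / (precision + recall)
data.frame(precision, recall, f1)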
Spam Detection in answers.microsoft.com forums
Business Scenario:
Automatic spam detection in answers.microsoft.com threads.
Today, many volunteers and MS FTEs spend a lot of time and
effort cleaning spam messages out of the forums. The
solution is automatic spam detection.
Example POC: spam detection in AML, based on the message
content.
Spam Detection Model
Spam Detection Data
Spam Detection Experiment Results – test data
Predicting Products/Issues by SR problem description
Business Scenario:
The Azure support portal (Ibiza) wants to get rid of the user selections for the product and the
problem/issue, because users make mistakes or select “Other” when they are confused about what to select.
This leads to SR mis-routing and hence slows down issue resolution. (We have seen up to 9
SR transfers during an SR life cycle.)
Azure Support Portal
Customer ‘accuracy’ compared
to SE selection:
• ~ 75% - Service (level 0)
• ~ 50% - Feature (level 1)
• ~ 25% - Issue (level 2)
Why?
• Too many topics, customers
cannot discriminate
• Poorly defined topics
• Customers seldom traverse up
the tree to find more relevant
topics
• Customers don’t know how to
classify their symptoms
• Enter anything to talk to assisted
support
Consequence
• Less self-help, more support
volume
• Poor routing, more MPI
Support Topic Taxonomy
Note: current MOP experience; per the POR, it is to be
replaced with a text-input-only ‘Maven’ UI
Current Office 365 online support experience
Note: current MOP experience; per the
POR, it is to be replaced with a
text-input-only ‘Maven’ UI
Predicting O365 Products by Problem Description
Predicting O365 Products by Problem Description - RESULTS
Predicting O365 Products by Problem Description - RESULTS
Predicting O365 Issues by Problem Description - Model
Predicting O365 Issues by Problem Description - RESULTS
Category/Issue Frequencies – result after executing the preprocessing R script
Predicting O365 Issues by Problem Description:
Cross-Validation RESULTS
Predicting O365 Issues by Problem Description – Analysis
1. Accuracy is not high enough to rely on the problem descriptions alone (i.e., to drop the manual selections completely)
2. Idea for the functionality, based on the results from the ML model:
- Sort/rank the products and the problems/issues in the selection list boxes by the probability returned from the
ML model.
Expected result: fewer wrong selections, based on the assumption that the user will find the correct
selection options at the beginning of the list.
This is an example of the usefulness of ML models even when they cannot solve a problem completely.
Operationalization
1. Azure ML automatically creates a REST web service (a hedged call sketch follows this list)
2. Azure ML provides an easy way to deploy the production version of the model to a production environment.
3. Performance – slower than TLC
4. Poor debugging capabilities
5. Poor code instrumentation/troubleshooting capabilities
6. Scalability – deployment on a limited set of machines (16)
Consider all of the above pros/cons when deciding whether to put an AML model in production.
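For illustration only: a minimal R sketch of calling such a request-response endpoint with httr. The URL, API key, input column name, and example text below are placeholders, not the actual service definition; copy the real values from the web service dashboard in AML:

library(httr)
library(jsonlite)

# placeholders: take the real endpoint URL and API key from the AML dashboard
url     <- "https://<region>.services.azureml.net/workspaces/<ws>/services/<svc>/execute?api-version=2.0"
api_key <- "<your-api-key>"

# classic AML request-response services expect an "Inputs" JSON envelope
body <- list(Inputs = list(input1 = list(
  ColumnNames = list("Text"),
  Values      = list(list("my vm cannot connect to the storage account"))
)))

resp <- POST(url,
             add_headers(Authorization = paste("Bearer", api_key)),
             content_type_json(),
             body = toJSON(body, auto_unbox = TRUE))
content(resp)   # scored label and class probabilities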
Thank you

Editor's Notes

  • #17: But the reality is, we ask the customer to select from too many topics, many of which are confused with others. Customers’ ability to reliably select the right symptom falls to 25% when compared with what the Support Engineer would choose. (NOTE: we are moving to PFAs for a better ‘ground truth’.) While the symptom tree (called support topics) is only 3 levels deep, it is very broad and growing as new products and features are introduced to O365. As you can see, the top ‘Service’ level has 19 classes. Each service has between 3 and 23 issue groups (which we call features), and each feature bucket contains anywhere from 4 to 38 issues. Overall, there are 1,300 possible topics to choose from! The dilemma is how to surface a reduced but relevant taxonomy to a customer. What’s the cost? When customers do not properly self-classify, they can’t be provided the best self-help. They consequently submit a service request or call assisted support, where the cost per service request is high. Incorrect symptom self-classification also increases the chance of mis-routes – the wrong team getting the request. Even if one argues that customers self-select at 80% accuracy rather than 25%, the costs, with a volume of 100,000 cases per month, are in the millions per year.
  • #18: Here is a screenshot of the existing customer experience in Office 365, where the customer first selects the top service level, then the feature and symptom level, and then describes their issue.
  • #27: # of apps supporting big data / data lake solutions (COSMOS/HDI etc.); # of apps enabled for near-real-time services; # of apps supporting data insights; # of applications supporting self-service capabilities