Introduction to Text Mining & Support Vector Machines (SVM)

Dr. Anton Heijs
CEO, Treparel

July 2012

Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com
KMX enables information and knowledge professionals
to gain faster, more reliable, and more precise insights into large,
complex unstructured data sets, allowing them to make
better-informed decisions.




                   Treparel is a leading technology solution provider in
                         Big Data Text Analytics & Visualization

Treparel KMX – All rights reserved 2012   www.treparel.com                 2
Topics covered in this presentation


         • Who is Treparel?
         • Introduction to Text Mining
         • What is Automated Classification & Clustering?
         • Introducing Support Vector Machines




Nexus of Forces: Social, Cloud, Mobile, Information
         IT Market shift driving Big Data challenges
                                                                                 Copyright: Gartner, 2011




                 80% of data is Unstructured (Documents, Text, Images, Graphs)



About Treparel

         • Delft, The Netherlands, 2006.
         • Treparel is an innovative technology solution provider in Big Data
           Analytics, Text Mining and Visualization.
         • KMX is an integrated data analysis toolset that provides faster,
           more reliable, intelligent insights into large, complex unstructured data sets,
           allowing companies to make better-informed decisions.
         • Clients: Philips, Bayer, Abbott, European Patent Office, European
           Commission
         • Part of Research Centers and University ecosystem; TU Delft,
           Universities of Paris and Sao Paulo
         • More info: www.treparel.com




Positioning of Treparel’s KMX technology

Text Acquisition & Preparation (‘Seek’)
         External sources: Patents, Legal, Research, Media / Publishers
         Other sources: Documents, Websites, Blogs, Newsfeeds, Email, Application notes,
         Search results, Social networks

Analysis and processing (‘Model’)
         Text preprocessing, Indexing, Clustering, Classification, Semantic Analysis

Output and display (‘Adapt’)
         Reporting & Presentation, Media and publishing databases, Content management
         systems, Line-of-business applications, Research applications, Search engines

Visualization

Information extraction (entities, facts, relationships, concepts, patents)
Management, Development and Configuration
                                                                    Copyright: Gartner, J. Popkin 2010
Getting to know the basics

        PART A: Introduction to Text Mining
        • The Data (text & image) Mining evolution
        • What is Data Mining: inside or outside the database
        • The Data Mining process
        • Two types of Data Mining tasks: Predictive and Descriptive
        • Two modes of Data Mining tasks: Supervised and Unsupervised
        • The most important algorithms per category


        PART B: SVM
        • Machine Learning & Support Vector Machines (SVM)
        • What makes SVM unique
        • When and How to deploy SVM
        • Case Studies & Examples


The Data/Text/Image mining evolution
         The Road ahead

         [Figure: application value (low to high) plotted against usability (hard to use to easy to use).
          Stages shown: Traditional Data Mining; OLAP; Query and Reporting;
          “Easy-to-Use” Data Mining Tools; SVM Predictive Modeling; Analytical Modeling;
          Enterprise Text Analytics.
          Time labels on the chart: 1980’s, 1990’s, 1995 - 2000, Today, Future.]
Knowledge Mining
         Different levels of depth in knowledge discovery

         [Figure: depth of knowledge discovery over time, from Data Collection (Seek) to Visualization (Adapt).
          Levels shown: Filtered data; Meta Data; Models of meta data; Models of data; Models of semantic data.
          Techniques shown: Data Mining, Text Mining, Graph Mining (Knowledge Discovery).]
What is Data Mining?
           Getting to know the basics
        • Most businesses have an enormous amount of data, with a great deal of
          information hidden within it. The data is also growing faster than the knowledge
          currently extracted from it, which leads to a growing gap between
          data and knowledge.
        • Data mining provides a way to automatically extract information buried in the
          data.
        • Data mining creates mathematical models that describe patterns in large,
          complex collections of data.
        • These patterns elude traditional statistical approaches to analysis because of the large
          number of attributes, the complexity of the patterns, or the difficulty of performing
          the analysis.
        • Mining the data directly in the database has advantages:
          less data movement, more data security, and a single source of the
          data.
        • Basically, two types of data exist:
              – Structured (tables & numbers): 20% of data volume
              – Unstructured (text, images): 80% of data volume




The Data & Text Mining process
            Automating the mining steps; adding new features

                    Understanding the knowledge mining value chain

         [Figure: value chain of mining steps, in order:
          1. Data Collection & Understanding
          2. Data Preparation & Cleansing
          3. Algorithm Selection
          4. Model Building & Testing
          5. Model Deployment
          6. Model generation (All models) & coordination
          7. Visualization
          Two brackets label parts of the chain: "Traditional Players" and
          "Treparel's Focus & Core competence".]
Two types of Data Mining functions
         Predictive Data Mining (supervised):
         •    Used to predict a value; requires the specification of a
              target (known outcome).
         •    Targets are either binary attributes (indicating yes/no decisions) or
              multi-class targets indicating a preferred alternative (color of
              sweater, salary range).
         •    Constructs one or more models; these models are used to predict
              outcomes for data sets.
         Descriptive Data Mining (unsupervised):
         •    Used to find the intrinsic structure, relations, or affinities in
              data.
         •    Describes a data set in a concise way and presents interesting
              characteristics of the data.
         •    The functions are clustering, association models, and feature
              extraction.
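
As an illustration of the descriptive (unsupervised) mode, here is a minimal k-means clustering sketch. It is not taken from the presentation or from KMX: the function name, the toy 2-D points, and the choice of starting centroids are all assumptions made for the example. The point is that no target column is supplied; the grouping is derived from the data alone.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return assignment, centroids

# Hypothetical 2-D data: only the points are given, no known outcome.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

A predictive (supervised) task would differ in one respect: each training row would carry a known target, and the model would be judged on how well it reproduces that target on new data.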

How does Automated Classification & Clustering
         work?
         • Consists of dividing the items that make up a collection into
           categories or classes.
         • The goal is to accurately predict the target class for each
           record in new data.
         • Algorithms for classification: different algorithms for
           different problems
                  Naïve Bayes
                  Adaptive Bayes Network
                  Support Vector Machine
                  Decision Tree


            Classification is used in: customer segmentation, sentiment
                analysis, competitive analysis, business modeling, credit
                 analysis, smart content, fraud and terrorist detection,
                        diagnosis support, patent & drug discovery
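
To make the classification task concrete, the following is a minimal sketch of one of the listed algorithms, Naïve Bayes, applied to short texts. It is an illustrative toy, not Treparel's implementation: the function names, the whitespace tokenizer, the add-one smoothing, and the four training sentences are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label). Returns the counts needed for prediction."""
    class_docs = Counter()              # number of training documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior plus the sum of log likelihoods with add-one (Laplace) smoothing.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training records: the label is the target class.
training = [
    ("patent claims a new battery electrode", "patent"),
    ("the invention relates to a battery cell", "patent"),
    ("customers love the new phone", "review"),
    ("terrible battery life, would not buy", "review"),
]
model = train_naive_bayes(training)
print(predict(model, "invention claims electrode"))
print(predict(model, "customers would not buy"))
```

The two calls at the end correspond to the stated goal of predicting the target class for each record in new data.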
Text Mining algorithms and features

         Feature          Naive Bayes       Adaptive Bayes    Support Vector    Decision Tree
                                            Network           Machine
         Speed            Very fast         Fast              Fast with         Fast
                                                              active learning
         Accuracy         Good in many      Good in many      Significant       Good in many
                          domains           domains                             domains
         Transparency     No rules          Rules for         No rules          Rules
                          (black box)                         (black box)
         Missing value    Missing value     Missing value     Sparse data       Missing value
         interpretation
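
The "sparse data" entry for SVM reflects how text is usually represented: each document uses only a small fraction of the vocabulary, so most feature values are zero. The sketch below shows one simple way to hold such vectors and compare them; the function names and sample sentences are assumptions for illustration only, and the whitespace tokenizer stands in for real text preprocessing.

```python
import math
from collections import Counter

def to_sparse_vector(text):
    """Bag-of-words term counts; only words that actually occur are stored."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Iterate over the shorter vector: absent terms contribute zero to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = to_sparse_vector("support vector machines classify text")
d2 = to_sparse_vector("support vector machines classify images")
d3 = to_sparse_vector("quarterly sales report")
print(cosine_similarity(d1, d2))  # high: four of five terms are shared
print(cosine_similarity(d1, d3))  # zero: no terms in common
```

Storing only the non-zero entries keeps memory and dot-product cost proportional to document length rather than to vocabulary size.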




What is Support Vector Machine Learning?
        State-of-the-art algorithm
        • SVM is a state-of-the-art classification and regression algorithm
        • The SVM optimization procedure maximizes predictive accuracy
          while automatically avoiding over-fitting the training data
        • SVM projects the input data into a kernel space, then builds a
          linear model in this kernel space
        • SVM performs well in real-world applications such as
          classifying text, recognizing handwritten characters, classifying
          images, as well as bioinformatics and biosequence analysis.
        • SVMs are standard tools for machine learning and data mining
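
The idea of building a linear model in a kernel space can be illustrated with a small numeric check. The sketch below uses a degree-2 polynomial kernel on 2-D inputs, chosen only because its feature space is small enough to write out; the presentation does not say which kernels KMX uses, and the function names are assumptions. It shows that the kernel value equals an ordinary dot product in the projected space, so the projection never has to be computed explicitly.

```python
import math

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def feature_map(x):
    """Explicit feature space for 2-D inputs whose dot product equals poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 0.5)
implicit = poly_kernel(x, z)  # computed in the original 2-D input space
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))  # computed in 3-D
print(implicit, explicit)     # the two values agree up to floating-point rounding
```

A model that is linear in the three projected features is quadratic in the original two, which is how a linear method can separate classes that are not linearly separable in the input space.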




What is Support Vector Machine Learning?
                 Classical Data Mining vs SVM

         Classical Statistics                     SVM - Support Vector Machines

         Hypothesis on data distribution          Study of the model family:
                                                  the VC dimension

         Large number of dimensions               Number of dimensions can be
         implies large number of model            very high because generalization
         parameters, which leads to               is controlled
         generalization problems

         Modeling seeks to get the best           Modeling seeks to get the best
         fit                                      compromise between fit and
                                                  robustness

         Manual iterations and time               Automation is possible
         are necessary
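
The compromise between fit and robustness in the SVM column can be written down explicitly. A common textbook formulation is the soft-margin SVM; it is given here as background and is not necessarily the exact variant used in KMX. The symbols are the usual ones: w and b define the separating hyperplane, x_i and y_i (with y_i = +1 or -1) are the n training examples, the slack variables \xi_i measure training errors, and C is a user-chosen trade-off constant.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
```

The first term favors a wide margin (robustness), the second penalizes misfit on the training data (fit), and C sets the balance between them. The number of input dimensions does not appear in the trade-off itself, which is one way to read the statement that generalization is controlled even when the dimensionality is very high.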



What makes SVM such a unique technology?
         • Strong theoretical foundation (Vapnik-Chervonenkis theory)
         • There is no upper limit on the number of attributes; the only constraint is
           the hardware
         • Good generalization to novel data
         • SVM is the preferred algorithm for sparse data
         • Algorithm of choice for challenging high-dimensional data
         • SVM supports active learning.
               – SVM models grow as the size of the training set increases, so big data
                 sets would otherwise be difficult to handle.
               – Active learning forces the SVM algorithm to restrict learning to the
                 most informative training examples.
         • SVM automatically selects a kernel
         • You can control both the model quality (accuracy) and the performance
           (build time)
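
One common way to pick the most informative training examples is uncertainty sampling: choose the examples that lie closest to the current decision boundary, since the model is least sure about them. The sketch below assumes a linear model and is only an illustration; the presentation does not state which selection strategy KMX uses, and the weights, bias, and pool of points are made up.

```python
def decision_value(weights, bias, x):
    """Signed score of a linear model; its magnitude grows with distance from the boundary."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def most_informative(weights, bias, pool, batch_size):
    """Return the unlabeled examples whose scores are closest to zero."""
    ranked = sorted(pool, key=lambda x: abs(decision_value(weights, bias, x)))
    return ranked[:batch_size]

# Hypothetical current model and unlabeled pool.
weights, bias = (1.0, -1.0), 0.0
pool = [(5.0, 0.0), (1.0, 0.9), (0.0, 4.0), (2.0, 2.1), (3.0, 1.0)]
selected = most_informative(weights, bias, pool, batch_size=2)
print(selected)
```

Only the selected examples would be labeled and added to the training set before the model is rebuilt, which keeps both the labeling effort and the model size down.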

What makes SVM unique?
         SVM gives you control over the models

         [Figure: robustness (low to high) plotted against quality of fit (low accuracy to high accuracy).
          • Under Fit Model: high robustness, training error = test error
          • Robust Model: low training error, low test error
          • Over Fit Model: low robustness, no training error, high test error]
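
The three regions of this chart can be told apart by comparing the error on the training data with the error on held-out test data. The following sketch does that; the function names and both thresholds (`max_error`, `max_gap`) are hypothetical values picked for illustration, not figures from the presentation.

```python
def error_rate(predict, examples):
    """Fraction of (features, label) pairs that the predict function gets wrong."""
    wrong = sum(1 for features, label in examples if predict(features) != label)
    return wrong / len(examples)

def diagnose(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Map the two error rates onto the three regions of the chart."""
    if test_error - train_error > max_gap:
        return "over fit"   # low robustness: test error well above training error
    if train_error > max_error:
        return "under fit"  # robust but inaccurate: both errors similarly high
    return "robust"         # low training error and low test error

print(diagnose(train_error=0.00, test_error=0.30))
print(diagnose(train_error=0.25, test_error=0.27))
print(diagnose(train_error=0.03, test_error=0.05))
```

In practice the thresholds depend on the application, and the test error should come from data the model never saw during training.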
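What makes SVM unique?
         SVM gives you control over the models

         [Figure: robustness (low to high) plotted against quality (low to high).
          • High robustness, low quality: need more training data (rows)
          • High robustness, high quality: safe to deploy
          • Low robustness, low quality: need more data (rows/columns) or different model type
          • Low robustness, high quality: need more variables (columns) or different model type]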

Treparel is a leading technology solution provider
       in Big Data Text Analytics & Visualization


                                              Treparel
                                           Delftechpark 26
                                            2628 XH Delft
                                          The Netherlands
                                          www.treparel.com


Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012

  • 1. Introduction to Text Mining & Support Vector Machines (SVM) Dr. Anton Heijs CEO Treparel Delftechpark 26 2628 XH Delft July 2012 The Netherlands www.treparel.com
  • 2. KMX enables information and knowledge professionals to gain faster, reliable, more precise insights in large complex unstructured data sets allowing them to make better informed decisions. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization Treparel KMX – All rights reserved 2012 www.treparel.com 2
  • 3. Topics covered in this presentation • Who is Treparel? • Introduction in Text Mining • What is Automated Classification & Clustering? • Introducing Support Vector Machines Treparel KMX – All rights reserved 2012 www.treparel.com 3
  • 4. Nexus of Forces: Social, Cloud, Mobile, Information IT Market shift driving Big Data challenges Copyright: Gartner, 2011 80% of data is Unstructured (Documents, Text, Images, Graphs) Treparel KMX – All rights reserved 2012 www.treparel.com 4
  • 5. About Treparel • Delft, The Netherlands, 2006. • Treparel is an innovative technology solution provider in Big Data Analytics, Text Mining and Visualization. • KMX is an integrated data analysis toolset which provide faster, reliable intelligent insights in large complex unstructured data sets to allow companies to make better informed decisions. • Clients: Philips, Bayer, Abbott, European Patent Office, European Commission • Part of Research Centers and University ecosystem; TU Delft, Universities of Paris and Sao Paulo • More info: www.treparel.com Treparel KMX – All rights reserved 2012 www.treparel.com 5
  • 6. Positioning of Treparel’s KMX technology Text Acquisition & Preparation Analysis and processing Output and display ‘Seek’ ‘Model’ ‘Adapt’ External sources Reporting & Text preprocessing Patents Presentation Legal Media and publishing Research Indexing databases Media / Publishers Content management Other sources Clustering systems Documents Websites Line-of-business Classification applications Blogs Newsfeeds Research applications Email Semantic Analysis Application notes Search engines Search results Social networks Visualization Information extraction (entities, facts, relationships, concepts, patents) Management, Development and Configuration Copyright: Gartner, J. Popkin 2010
  • 7. Getting to know the basics PART A: Intro in Text Mining • The Data (text & image) Mining evolution • What is Data Mining: in or out-side the database • The Data Mining process • Two types of Data Mining tasks: Predictive and Descriptive • Two modes of Data Mining tasks: Supervised and Unsupervised • The most important algorithms per category PART B: SVM • Machine Learning & Support Vector Machines (SVM) • What makes SVM unique • When and How to deploy SVM • Case Studies & Examples Treparel KMX – All rights reserved 2012 www.treparel.com 7
  • 8. The Data/Text/Image mining evolution The Road ahead Future High Enterprise Today Text Analytics Analytical Modeling 1995 - 2000 SVM Predictive Modeling Application Value 1980’s Traditional “Easy-to-Use” Data Mining Data Mining Tools 1980’s 1990’s OLAP Query and Reporting Low Hard to use Easy to Use Usability Treparel KMX – All rights reserved 2012 www.treparel.com 8
  • 9. Knowledge Mining Different levels of depth in knowledge discovery Visualization (Adapt) Models of semantic data Models of data Models of meta data Data Mining Knowledge Filtered data Text Mining Discovery Meta Data Graph Mining Data Collection (Seek) Time Treparel KMX – All rights reserved 2012 www.treparel.com 9
  • 10. What is Data Mining? Getting to know the basics • Most businesses have an enormous amount of data, with a great deal of information hiding within it; The data is also growing faster then the knowledge which is now extracted from the data, which leads to a growing gap between data and knowledge. • Data mining provides a way to automatically extract information buried in the data. • Data Mining creates mathematical models which describe patterns in large, complex collections of data. • Patterns elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of the patterns, or the difficulty to perform the analysis • Mining the data directly in the database has advantages: less data movement, more data security, one source of the data • Basically 2 Types of Data exist: – Structured (tables & numbers) – 20% of data volume – Un-Structured (text, images) - 80% of data volume Treparel KMX – All rights reserved 2012 www.treparel.com 10
  • 11. The Data & Text Mining process Automating the mining steps; adding new features Understanding the knowledge mining value chain Data Model Data Preparation Algorithm Model Model generation & De- (All models) & Visualization Collection & Selection Building Understanding Cleansing & Testing ployment coordination Treparel's Focus & Core competence Traditional Players Treparel KMX – All rights reserved 2012
  • 12. 2 types of Data Mining Functions
      • Predictive Data Mining (supervised):
          – Used to predict a value; requires the specification of a target (known outcome)
          – Targets are either binary attributes (indicating yes/no decisions) or multi-class targets indicating a preferred alternative (color of sweater, salary range)
          – Constructs one or more models; these models are used to predict outcomes for new data sets
      • Descriptive Data Mining (unsupervised):
          – Used to find the intrinsic structure, relations, or affinities in data
          – Describes a data set in a concise way and presents interesting characteristics of the data
          – The functions are: clustering, association models, and feature extraction
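The difference between the two function types can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data (years of experience with a "salary range" target), the function names and the choice of a nearest-mean classifier and a 2-means clustering are hypothetical and not taken from the slides or from KMX.

```python
def fit_centroids(values, targets):
    """Predictive (supervised): learn one mean per known target class."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[t] = sums.get(t, 0.0) + v
        counts[t] = counts.get(t, 0) + 1
    return {t: sums[t] / counts[t] for t in sums}


def predict(centroids, value):
    """Assign a new value to the class whose mean is nearest."""
    return min(centroids, key=lambda t: abs(centroids[t] - value))


def two_means(values, max_iter=100):
    """Descriptive (unsupervised): split values into 2 clusters, no target given."""
    centers = [min(values), max(values)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
        ]
        if new_labels == labels:  # assignments stable, stop
            break
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


# Hypothetical data: years of experience and a known salary range (the target).
years = [1.0, 2.0, 3.0, 10.0, 12.0, 15.0]
salary_range = ["low", "low", "low", "high", "high", "high"]

model = fit_centroids(years, salary_range)   # needs the target
predicted = predict(model, 9.0)              # outcome for a new record
clusters = two_means(years)                  # uses no target at all
```

The predictive part cannot be built without `salary_range`, while the descriptive part finds the same two groups from the values alone.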
  • 13. How do Automated Classification & Clustering work?
      • Classification consists of dividing the items that make up a collection into categories or classes.
      • The goal is to accurately predict the target class for each record in new data.
      • Algorithms for classification (different algorithms for different problems):
          – Naïve Bayes
          – Adaptive Bayes Network
          – Support Vector Machine
          – Decision Tree
      • Classification is used in: customer segmentation, sentiment analysis, competitive analysis, business modeling, credit analysis, smart content, fraud and terrorist detection, diagnosis support, patent & drug discovery
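As a minimal sketch of what "predicting the target class for new data" means for text, the following uses the first algorithm in the list, Naïve Bayes, on a bag-of-words representation with add-one smoothing. The training texts, class names and function names are made up for illustration; this is not the KMX implementation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Returns the counts used for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)           # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:                       # ignore unseen words
                continue
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled documents (e.g. very short patent abstracts).
training = [
    ("battery electrode lithium cell", "energy"),
    ("solar cell panel efficiency", "energy"),
    ("antibody protein binding assay", "pharma"),
    ("drug compound protein inhibitor", "pharma"),
]
model = train_naive_bayes(training)
label_a = classify(model, "lithium battery pack")
label_b = classify(model, "protein inhibitor")
```

The same train-then-classify pattern applies to the other algorithms listed; only the model behind `classify` changes.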
  • 14. Text Mining algorithms and features
      • Speed
          – Naive Bayes: Very fast
          – Adaptive Bayes Network: Fast
          – Support Vector Machine: Fast with active learning
          – Decision Tree: Fast
      • Accuracy
          – Naive Bayes: Good in many domains
          – Adaptive Bayes Network: Good in many domains
          – Support Vector Machine: Significant
          – Decision Tree: Good in many domains
      • Transparency
          – Naive Bayes: No rules (black box)
          – Adaptive Bayes Network: Rules for ...
          – Support Vector Machine: No rules (black box)
          – Decision Tree: Rules
      • Missing value interpretation
          – Naive Bayes: Missing value
          – Adaptive Bayes Network: Missing value
          – Support Vector Machine: Sparse data
          – Decision Tree: Missing value
  • 15. What is Support Vector Machine Learning? State of the Art algorithm
      • SVM is a state-of-the-art classification and regression algorithm.
      • The SVM optimization procedure maximizes predictive accuracy while limiting over-fitting of the training data.
      • SVM maps the input data into a kernel space, then builds a linear model in this kernel space.
      • SVM performs well in real-world applications such as classifying text, recognizing hand-written characters, classifying images, as well as bioinformatics and bio sequence analysis.
      • SVMs are standard tools for machine learning and data mining.
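The idea of building a linear model in a kernel space can be illustrated with a toy example. The sketch below is a kernel perceptron, not an SVM: it has no margin maximization and no bias term, and it is used here only because it fits in a few lines while relying on the same kernel trick. The XOR-style data, the degree-2 polynomial kernel and all names are assumptions chosen for illustration.

```python
def poly_kernel(a, b, degree=2):
    """Polynomial kernel: an inner product in an implicit higher-dimensional space."""
    return (1.0 + sum(x * y for x, y in zip(a, b))) ** degree


def linear_kernel(a, b):
    """Plain inner product: the model stays linear in the input space."""
    return float(sum(x * y for x, y in zip(a, b)))


def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Return one mistake counter per training example."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            score = sum(a * yj * kernel(xj, xi) for a, xj, yj in zip(alpha, X, y))
            if yi * score <= 0:      # wrong side (or on the boundary): update
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:            # every training point classified correctly
            break
    return alpha


def predict(X, y, alpha, kernel, x):
    score = sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, X, y))
    return 1 if score > 0 else -1


# XOR-like data: no straight line in the input space separates the classes.
X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
y = [-1, -1, 1, 1]

alpha_poly = train_kernel_perceptron(X, y, poly_kernel)
pred_poly = [predict(X, y, alpha_poly, poly_kernel, x) for x in X]

alpha_lin = train_kernel_perceptron(X, y, linear_kernel)
pred_lin = [predict(X, y, alpha_lin, linear_kernel, x) for x in X]
```

With the polynomial kernel all four points are classified correctly; with the linear kernel at least one point is always misclassified, because the classes are not linearly separable in the original space.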
  • 16. What is Support Vector Machine Learning? Classical Data Mining vs SVM
      • Classical Statistics
          – Hypothesis on data distribution
          – A large number of dimensions implies a large number of model parameters, which leads to generalization problems
          – Modeling seeks to get the best fit
          – Manual iterations and time are necessary
      • SVM - Support Vector Machines
          – Study of the model family: the VC dimension
          – The number of dimensions can be very high because generalization is controlled
          – Modeling seeks to get the best compromise between fit and robustness
          – Automation is possible
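The compromise between fit and robustness is visible in the standard soft-margin SVM training problem, shown here for reference. The notation is an assumption and does not come from the slides: $w$ and $b$ are the parameters of the linear model, $\phi$ is the map into the kernel space, $\xi_i$ are slack variables and $C > 0$ is a user-chosen trade-off constant.

```latex
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\,\lVert w \rVert^{2} \;+\; C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \bigl( w^{\top} \phi(x_i) + b \bigr) \;\ge\; 1 - \xi_i,
\qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

The first term favours a wide margin (robustness) and the second penalizes training errors (fit); a larger $C$ puts more weight on fitting the training data.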
  • 17. What makes SVM such a unique technology?
      • Strong theoretical foundation (Vapnik-Chervonenkis theory)
      • There is no upper limit on the number of attributes; the only constraint is the hardware
      • Good generalization to novel data
      • SVM is the preferred algorithm for sparse data
      • Algorithm of choice for challenging high-dimensional data
      • SVM supports active learning.
          – SVM models grow as the size of the training set increases, so big data sets would otherwise be difficult to handle.
          – Active learning forces the SVM algorithm to restrict learning to the most informative training examples.
      • SVM automatically selects a kernel
      • You can control both the model quality (accuracy) and the performance (build time)
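One common way to pick "the most informative training examples" is margin-based uncertainty sampling: prefer the examples that lie closest to the current decision boundary. The slides do not say which strategy is used, so the sketch below is an assumption; the linear model, the pool of unlabelled points and the function names are hypothetical.

```python
def decision_score(weights, bias, x):
    """Signed score of a linear model; near 0 means close to the boundary."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def most_informative(weights, bias, pool, k):
    """Return the indices of the k pool examples nearest the decision boundary."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: abs(decision_score(weights, bias, pool[i])),
    )
    return ranked[:k]


# Hypothetical current model and pool of not-yet-labelled examples.
weights, bias = (1.0, -1.0), 0.0
pool = [(3.0, 0.0), (0.25, 0.0), (-2.0, 1.0), (1.0, 1.5)]

to_label = most_informative(weights, bias, pool, k=2)
```

Only the selected examples would be labelled and added to the training set before the model is rebuilt, which keeps the training set, and therefore the model, small.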
  • 18. What makes SVM unique? SVM gives you control over the models
      [Figure: Robustness (Low to High) plotted against Quality of fit (Low accuracy to High accuracy)]
      • Under Fit Model: high robustness; training error = test error
      • Robust Model: high robustness; low training error, low test error
      • Over Fit Model: low robustness; no training error, high test error
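The three cases can be told apart by comparing training error with test error. The sketch below is a rough rule of thumb only; the function name and both thresholds are assumptions made for illustration and would have to be chosen per problem.

```python
def diagnose_fit(train_error, test_error, low_error=0.10, max_gap=0.05):
    """Label a model from its error rates (fractions between 0 and 1).

    low_error and max_gap are illustrative thresholds, not values from the slides.
    """
    if test_error - train_error > max_gap:
        return "over fit"    # fits the training data far better than new data
    if train_error <= low_error:
        return "robust"      # low error on both training and test data
    return "under fit"       # errors agree, but both are high


overfit_case = diagnose_fit(train_error=0.00, test_error=0.35)
robust_case = diagnose_fit(train_error=0.05, test_error=0.07)
underfit_case = diagnose_fit(train_error=0.30, test_error=0.31)
```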
  • 19. What makes SVM unique? SVM gives you control over the models
      [Figure: quadrant chart of Robustness (Low to High) against Quality (Low to High), with these labels:]
      • Safe to Deploy
      • Need more training data (rows)
      • Need more variables (columns) or different model type
      • Need more data (rows/columns) or different model type
  • 20. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization. Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands, www.treparel.com