Value Mining: How Entity Extraction Informs Analysis

June 2012 | Andrew Strite
Agenda
• Big Data and Document Analysis
• Case Study: Federal Agency
  – Problem Definition
  – Open Analytics & Entity Extraction
  – Reporting and Visualization
  – Results Assessment
• Questions
The Big Data Problem
Data is becoming the new raw material of business: an economic input almost on par with capital and labor.

“Every day I wake up and ask, ‘how can I flow data better, manage data better, analyze data better?’”

Rollin Ford, the CIO of Wal-Mart
Solution: Document Analysis
“Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.”
Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the
    form of noun-verb-noun), e.g.:
     •   John Doe lives in Washington, D.C.
     •   John Doe is married to Jane Doe
     •   John Doe is a Virgo
     •   John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and
  Where into a unified data structure that
  supports data analytics and visualization.
  – Who: people, organizations, facilities, company
  – What: events, summaries, facts, themes
  – When: past, present, future dates
  – Where: city, state, country, coordinate
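One way to picture the unified data structure is the minimal sketch below. It is an illustration only, not the Infinit.e schema: the class names (Entity, Association, DocumentRecord), their fields, and the document id are assumptions. The sample values reuse the John Doe example from the previous slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A single extracted entity (hypothetical shape)."""
    name: str
    kind: str  # one of "who", "what", "when", "where"


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun link between two entities."""
    subject: Entity
    verb: str
    obj: Entity


@dataclass
class DocumentRecord:
    """All entities and associations harvested from one document."""
    doc_id: str
    entities: list = field(default_factory=list)
    associations: list = field(default_factory=list)

    def by_kind(self, kind):
        """Return the entities of one kind, e.g. every 'where'."""
        return [e for e in self.entities if e.kind == kind]


john = Entity("John Doe", "who")
mexico = Entity("Mexico", "where")
trip_date = Entity("July 6th, 2011", "when")

record = DocumentRecord(
    doc_id="example-1",  # hypothetical identifier
    entities=[john, mexico, trip_date],
    associations=[Association(john, "traveled to", mexico)],
)
```

Keeping the kind on each entity lets the same record answer Who, When, and Where questions without a separate structure per question.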
Document Analysis
Case Study: Federal Agency
Overview
A Federal client produced reports for other DoD
components and wanted to know:

      “Did our reports meet customer needs?”

First step: assess historical reporting

      “What were teams writing about and when?”
Problem: Unstructured Data
• Plenty of raw data, but no way to get at it
  – 6K+ unstructured documents
  – 15+ file types
  – No standard formats
     • Teams (Who)
     • Dates (When)
     • Topics (What)
  – Some content not relevant
Early Attempts
• Initial client attempts to solve the problem
  mostly involved manual review
  – High document volume = labor intensive
  – Assessing relevance = skilled labor
     • Total process tied up skilled analysts for hundreds
       of man-hours.


• Manual review prone to error
  – Incomplete attempts corrupted data
Solution: Open Analytics
• Process to design and implement
  analytical solutions
• Joins open tools and agile engineering
  techniques
• Goal is to enable organizations to quickly
  deliver smart analysis and enable top line
  growth
Mechanism: Infinit.e
• Infinit.e is a scalable framework for:
  – Collecting
  – Storing
  – Enriching
  – Retrieving
  – Analyzing
  – Visualizing
• Works on unstructured documents & structured records
Infinit.e Concept

• 20% Structured
  – Log files
  – Databases
  – Apps
• 80% Unstructured
  – Documents
  – Presentations
  – Spreadsheets
  – Meeting notes
  – Email
  – IM chats
  – Reports
  – Social
• Unstructured and Structured Data
  – Entities
  – Events
  – Facts
  – Sentiment
  – Geospatial
  – Temporal
  – Themes
Infinit.e Data Model
• Duke and Progress announced merger plans in January 2012
• Bernanke, 57 said in his testimony price increases “have begun to moderate” after a jump in oil costs earlier this year
• Tablet ownership levels hit 18% in China, the UK and US versus 3% in November 2010
• <Incident>
    <uid>20101043423</uid>
    <subject>1 person killed in armed attack by suspected Boko Haram in Maiduguri, Borno, Nigeria</subject>
    <multipleDays>No</multipleDays>
    <eventDate>06/04/2011</eventDate>
  </Incident>

  – Who: people, organizations, facilities, company
  – What: events, summaries, facts, themes
  – When: past, present, future dates
  – Where: city, state, country, coordinate
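As a rough illustration of how a structured record such as the <Incident> sample could be mapped onto the Who/What/When/Where model, here is a minimal sketch using Python's standard XML parser. The function name and output keys are assumptions, not the Infinit.e API. The Who and Where values sit inside the free-text subject and would come from an entity extraction step, so they are left empty here.

```python
import xml.etree.ElementTree as ET

# The sample record from the slide, reproduced as a string.
INCIDENT_XML = """
<Incident>
  <uid>20101043423</uid>
  <subject>1 person killed in armed attack by suspected Boko Haram in Maiduguri, Borno, Nigeria</subject>
  <multipleDays>No</multipleDays>
  <eventDate>06/04/2011</eventDate>
</Incident>
"""


def incident_to_record(xml_text):
    """Map one <Incident> element onto hypothetical who/what/when/where keys."""
    root = ET.fromstring(xml_text.strip())
    return {
        "uid": root.findtext("uid"),
        "what": root.findtext("subject"),
        # Kept as a string: the slide does not say whether this is
        # month/day or day/month, so no date parsing is attempted.
        "when": root.findtext("eventDate"),
        "multiple_days": root.findtext("multipleDays"),
        # Would be filled by entity extraction over the subject text.
        "who": [],
        "where": [],
    }


record = incident_to_record(INCIDENT_XML)
```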
Applying Infinit.e
 “What were teams writing about and when?”

Figure: Open Analytic and Agile Intelligence architecture
Harvested Entities
Reporting and Visualization
• Queries performed on the data, providing
  breakouts by team, topic, and dates
• Flexible visualization
  – Built-in visualization framework
  – Multiple export options
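Once team, topic, and date are available as meta-data, a breakout is a count over one or more of those fields. The sketch below shows the idea with the standard library; the field names and every sample value are hypothetical, and it does not reproduce the Infinit.e query interface.

```python
from collections import Counter

# Hypothetical extracted meta-data, one dict per document.
docs = [
    {"team": "Team A", "topic": "Topic 1", "month": "2011-03"},
    {"team": "Team A", "topic": "Topic 2", "month": "2011-03"},
    {"team": "Team B", "topic": "Topic 1", "month": "2011-04"},
    {"team": "Team A", "topic": "Topic 1", "month": "2011-04"},
]


def breakout(documents, *keys):
    """Count documents for each combination of the given meta-data keys."""
    return Counter(tuple(d[k] for k in keys) for d in documents)


by_team = breakout(docs, "team")
by_team_and_topic = breakout(docs, "team", "topic")
by_month = breakout(docs, "month")
```

Passing more keys gives a finer breakout, which is how "what were teams writing about and when" can be answered from the same records.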
Finding Value
• Over the course of 2.5 weeks, we applied
  the entity-based data model to our client’s
  document analysis problem
• Major advantages to this approach were:
  – Agility
  – Precision
  – Relevance
Agility
• Automation reduced processing time:
  – Manual processing time: ~480 hours
  – Automated processing time: 2-3 hours


• Speed enabled iterative development
  – Extraction adapted alongside analysts’
    understanding of data
  – Positive feedback loop
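The size of the reduction follows directly from the two figures on this slide. A short calculation, treating "~480 hours" as exactly 480 for simplicity:

```python
manual_hours = 480            # "~480 hours" on the slide, taken as exact
automated_hours = (2, 3)      # "2-3 hours" on the slide

# Speedup factor at each end of the automated range.
speedups = [manual_hours / h for h in automated_hours]

# Share of the manual effort that remains after automation, in percent.
remaining_pct = [100 * h / manual_hours for h in automated_hours]
```

That is a speedup of roughly 160x to 240x, with the automated run taking well under 1% of the manual time. A rerun this cheap is what makes iterating on the extraction practical.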
Precision
• Entity definitions created from original data
  – Definitions improved based on feedback
• Automation ensures uniform application
  across data set
Figure: entities (entity1, entity2, entity3) mapped to topics (TOPIC1, TOPIC2)
Relevance
• Entity extraction informs quality control
  – Duplicates identified based on similar entities
  – Exclude documents based on missing entities
  – Minimizes risk of data corruption
  – Reduced need for analyst review

Figure: documents flagged as Duplicates or Missing Meta-Data
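Both quality-control checks can be sketched in a few lines. The slides do not say how similarity was measured, so the Jaccard overlap of entity sets used here is an assumption, as are the threshold, the field names, and the sample documents. The required fields follow the Teams/Dates/Topics list on the problem slide.

```python
from itertools import combinations

REQUIRED_FIELDS = ("team", "date", "topic")  # Who, When, What
DUPLICATE_THRESHOLD = 0.8                    # hypothetical cut-off


def jaccard(a, b):
    """Overlap of two entity sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def missing_fields(doc):
    """Return the required meta-data fields that a document lacks."""
    return {f for f in REQUIRED_FIELDS if not doc.get(f)}


def find_duplicates(documents, threshold=DUPLICATE_THRESHOLD):
    """Return id pairs whose entity sets overlap at or above the threshold."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(documents.items(), 2):
        if jaccard(a["entities"], b["entities"]) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical documents after entity extraction.
docs = {
    "doc1": {"team": "Team A", "date": "2011-03", "topic": "Topic 1",
             "entities": {"e1", "e2", "e3", "e4", "e5"}},
    "doc2": {"team": "Team A", "date": "2011-03", "topic": "Topic 1",
             "entities": {"e1", "e2", "e3", "e4", "e5"}},
    "doc3": {"team": None, "date": "2011-04", "topic": "Topic 2",
             "entities": {"e9"}},
}

duplicates = find_duplicates(docs)
excluded = [doc_id for doc_id, d in docs.items() if missing_fields(d)]
```

Comparing every pair is quadratic in the number of documents, which is acceptable for a sketch but would likely need blocking or indexing on larger collections.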
The Results
• Extracted entities became key meta-data
  6K+ unstructured documents became…
  …3.5K documents with value to the study
The Results
• Our client was able to complete the
  research shortly after final extraction
• Confidence in methodology and results
  bolstered the value of recommendations
• Client is considering similar approaches for future
  projects
Bottom Line
Using document analysis significantly…

  … reduces the time to ingest data.
  … cuts right to relevant information.
  … builds a framework for future analysis.
Thank You!

          Andrew Strite
        www.ikanow.com
       astrite@ikanow.com
          301.513.1384
