Tetherless World Constellation




   Data: Big and Broad
             Jim Hendler
    Tetherless World Constellation
Tetherless World Professor of Computer and Cognitive Science
            Head, Computer Science Department

   Rensselaer Polytechnic Institute
   http://www.cs.rpi.edu/~hendler
         @jahendler (twitter)
Outline (if I stick to it)



• What is big data?
• How big is big?
• What is big data on the Web?
• What is Broad data?
• Got an example?
• What’s the problem?
• What’s going on?
Useful Terms

• Machine-readable Data
   – Information available in a form that is accessible and
     manipulable by computer
   – Accessible ≠ Manipulable
      • e.g., PDF documents can be read in and displayed, but the
        information in the document is not readily available without special
        tooling
• Metadata
   – Information associated with (machine-readable) data that
     provides information about the data set
• Workflow, Provenance, and lots of other terms
   – Useful kinds of metadata covering who created the data,
     when, how it was processed, etc.
• Metadata and related information are most useful when they are
  machine-readable and openly available in commonly agreed-upon
  formats
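A minimal sketch of what machine-readable metadata with provenance might look like, assuming a plain JSON record. Every field name and value below is hypothetical and chosen only for illustration; the slides do not prescribe a particular format.

```python
import json

# Hypothetical metadata record: all field names and values are illustrative,
# not taken from any real catalog or standard.
dataset_metadata = {
    "title": "Example crop yield dataset",
    "creator": "Example Ministry of Agriculture",
    "created": "2012-07-09",
    "format": "text/csv",
    "provenance": {
        "derived_from": "example-survey-2011",
        "processing_steps": ["deduplicated", "converted units to tonnes"],
    },
}

# Assumed minimum set of fields a consumer might insist on.
REQUIRED_FIELDS = ("title", "creator", "created", "format", "provenance")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required metadata fields that are absent or empty."""
    return [name for name in required if not record.get(name)]


# Serializing to JSON is one way to make the record openly exchangeable.
serialized = json.dumps(dataset_metadata, indent=2, sort_keys=True)
print(serialized)
print("missing:", missing_fields(dataset_metadata))
```

A consumer that receives such a record can check who created the data, when, and how it was processed before deciding whether to use it.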
BIG Data is NOT the Web of Data

• The term “Big Data” is widely used
  nowadays to refer to a whole bunch of
  machine-readable data in one accessible
  (to the researcher) place
   – 3 main contexts
    • The large data collections of “big science” projects
       – in traditional data warehouse or database formats
    • The enterprise data of large, non-Web-based
      companies (IBM, TATA, etc.)
       – Generally in multiple
    • The data holdings of a Google, Facebook or other
      large Web company
       – Include large “unstructured” holdings
       – Include “graph” data
Tera, Peta, Zetta
                                            yotta, yotta, yotta…


• World Wide Web data is extremely large
• Extremely well “funded”
  – e.g., Facebook
     • 25 terabytes of logged data per day; valuation $33B (US
       NIH budget ~ $31B)
  – e.g., Google
     • In 2008 it was estimated at 20 petabytes per day (not
       including YouTube); current valuation $190B (about 1/3
       the entire US DoD budget)

• And really, really fascinating stuff
  – Data about people and their relationships
     •   To each other
     •   To products
     •   To activities and actions
     •   …
How BIG is Big?

BIG Data





Google uses their data in many ways
         Search => ads => user
Big Data is becoming different on the Web



• New Work
  – is moving away from traditional relational
   models
     • cf. NoSQL
  – Moving towards third party application and
    extension
     • cf. Mobile apps for local governments
  – Includes a focus on interoperability and
    exchange with “lightweight” semantics
     • Using ideas from the Semantic Web
        – Search: Schema.org
        – Social Networking: OGP
Which in part gives rise to BROAD data



• 4th context: Broad Data
  – The huge amount of freely available, but widely varied,
    Open Data on the World Wide Web (Structured and
    Semi-structured)
     • Example: The extended Facebook OGP graph (the
       part outside Facebook’s datasets)
     • Example: The growing linked open data cloud of
       freely available RDF linked data
     • Example: Hundreds of thousands of datasets that are
       available on the Web free from governments around
       the world
Example: adding “Breadth”





                    April 2010
Facebook’s Open Graph Protocol


• Facebook now allows other sites to extend the graph
• Open Graph Protocol uses RDFa to let web sites contain
  information about the things people “like”
       og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
       og:type - The type of your object, e.g., "movie". Depending on the type you specify, other
       properties may also be required.
       og:image - An image URL which should represent your object within the graph.
       og:url - The canonical URL of your object that will be used as its permanent ID in the graph
       og:description - A one to two sentence description of your object.
       og:site_name - If your object is part of a larger web site, the name which should be
       displayed for the overall site. e.g., "IMDb".




   – Not a traditional “ontology”
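To illustrate how a page exposes these properties, here is a small sketch that pulls `og:` properties out of HTML `<meta>` tags using only Python's standard library. The sample page is an assumption made up for this example: its property values reuse the examples quoted on the slide, and the `example.org` URLs are placeholders.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


# Hypothetical page; the example.org URLs are placeholders.
SAMPLE_PAGE = """
<html prefix="og: https://ogp.me/ns#">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://example.org/the-rock" />
<meta property="og:image" content="http://example.org/the-rock.jpg" />
<meta property="og:site_name" content="IMDb" />
</head>
<body></body>
</html>
"""


def extract_open_graph(html_text):
    """Return a dict mapping og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


for name, value in extract_open_graph(SAMPLE_PAGE).items():
    print(name, "=", value)
```

Because the vocabulary is just a handful of property names, any site can publish it and any crawler can read it, which is part of what makes it "lightweight" rather than a traditional ontology.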
Big Data





Facebook generates terabytes of data per day
          What could be learned from this?
Creates a platform for SW-powered apps

BROAD data challenges



• For broad data the new challenges
  that emerge include
  – (Web-scale) data search
  – “Crowd-sourced” modeling
  – rapid (and potentially ad hoc)
    integration of datasets
  – visualization and analysis of only
    partially modeled datasets
  – policies for data use, reuse and
    combination.
Huh?



“The more I work with data, the more I
realize I need Semantics”

 Huh?

The traditional database community has,
umm, not always been the first to embrace
semantics

What is different here?
Government Data Sharing

The Web of Open
Government Data is Growing
• Analytics based on over 1,000,000 datasets
  from around the world can be seen at
   – http://logd.tw.rpi.edu/iogds_data_analytics
• The examples that follow are from that page
Datasets                 1,028,054
Countries                43
Catalogs                 192
Categories               2,460
Languages                24
          2012 International Open Government Data Conference—Open Gov Data Tutorial
9 July 2012                                                                           17
International




Many others…




                                                   Important note:
                                                   quantity is not really the most
                                                   important issue

Topics (Across All Catalogs)




Topics (Across All Catalogs)




Combining data from different data sharing sites

Data Integration Problems





Head-to-head comparisons show that
burglaries in Avon and Somerset (UK) far
exceed those in Los Angeles, California
(one of the highest crime areas in the US)
The problem is (likely) semantics





                                                        Same or
                                                        different?




Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
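One way to make these questions operational is to compare the metadata of two statistics before comparing their values. The sketch below assumes each figure carries a description of its term definition, collection method, and unit. The class, field names, and all values are hypothetical placeholders, not real crime figures or real legal definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statistic:
    region: str
    value: float
    term_definition: str    # what the publisher counts under this label
    collection_method: str  # how the figures were gathered
    unit: str               # e.g. raw count vs. rate per 1,000 residents


def comparability_issues(a, b):
    """List the metadata fields on which two statistics disagree."""
    issues = []
    for field_name in ("term_definition", "collection_method", "unit"):
        if getattr(a, field_name) != getattr(b, field_name):
            issues.append(field_name)
    return issues


# Placeholder records: values and definitions are invented for illustration.
uk = Statistic(
    region="Avon and Somerset",
    value=100.0,
    term_definition="burglary (publisher A's definition, placeholder)",
    collection_method="police-recorded",
    unit="count",
)
us = Statistic(
    region="Los Angeles",
    value=50.0,
    term_definition="burglary (publisher B's definition, placeholder)",
    collection_method="police-recorded",
    unit="rate per 1,000 residents",
)

problems = comparability_issues(uk, us)
if problems:
    print("Do not compare directly; metadata differs on:", problems)
else:
    print("Values look comparable:", uk.value, "vs", us.value)
```

A check like this cannot decide whether two terms really mean the same thing, but it can flag a head-to-head comparison that should not be trusted without further investigation.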
Example: Water

Example: Water/Kenya

Finding Data





World Bank: Africa     Africover: Agriculture




 Kenya: Agricultural   US Data.gov: Crop
5 Star Data





Broad Data “Integration”
requires simple semantics
Example: any Wikipedia topic!

Arizona

Arizona info (From the previous)

USDA data turns out to be crucial

Metadata is crucial for Broad Data


• Metadata design is crucial to govt data
  sharing
  – Needed for search and federation in large data
    sharing efforts
• International data sharing
  – W3C Govt Linked Data Working Group
  – Need for vocabularies within govt sectors
     • Especially for cross-language use
        – How can we compare health (or legal, or social, or …) data
          between countries like the US, UK, India, and Kenya (English) and
          Norway, China, France, etc.?
        – How can we link local governments (in traditional languages, local
          dialects, etc.) with national data?
Database metadata

Dataset extension to schema.org (pending)

Government Data in the linked open data cloud





    Government Data is
    currently over ½ the cloud in
    size (~17B triples), with tens of
    thousands of links to other
    data (within and without)

http://linkeddata.org/
Research in Govt Data => Broad Data challenges


• Trust
   – Government data is controversial, and potentially biased
       • How do we confirm or dispute?
• Combination
   – When we combine data we need to keep the provenance of
     information (see trust)
       • How do we make policies explicit and sharable
• Scaling
   – Our project has already converted 9.9B triples from only
     a little over 2,000 of the 710,000 government databases we can
     identify (116 catalogs, 32 countries, 16 languages)
       • Cross-catalog
       • Cross-language
• Versioning and updating
• Archiving
• Visualization
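The scaling figures above can be turned into a rough back-of-the-envelope estimate. This sketch takes 2,000 as a lower bound for the ">2,000" converted databases and assumes, purely for illustration, that the remaining databases would yield the same average number of triples. That linear extrapolation is an assumption of this example, not a claim made in the slides.

```python
# Figures quoted on the slide.
converted_triples = 9.9e9     # 9.9B triples converted so far
converted_datasets = 2_000    # slide says ">2,000"; lower bound used here
known_datasets = 710_000      # government databases identified

# Share of identified databases that have been converted.
coverage = converted_datasets / known_datasets

# Average triples per converted database (upper bound, since the true
# number of converted databases is above 2,000).
triples_per_dataset = converted_triples / converted_datasets

# Naive linear extrapolation: assumes unconverted databases look like
# the converted ones, which is an illustrative assumption only.
naive_total = triples_per_dataset * known_datasets

print(f"coverage: {coverage:.2%}")
print(f"average triples per converted database: {triples_per_dataset:,.0f}")
print(f"naive projected total: {naive_total:.2e} triples")
```

Under these assumptions, well under 1% of the identified databases already account for billions of triples, which is why scaling is listed as a challenge in its own right.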
Big Data needs bigger ideas
            for visualization




      (Fox & Hendler, Science, 2/11/10)
A new idea we’re playing with at RPI



• Data as “exhibition”
  – Museums/Performing Arts have explored
    accessibility for real-world artifacts; can
    we extend these ideas to the data web?
• Data via physical
  interaction
  – Using theatre techniques
    we can literally move a
    person through a data landscape. What
    new metaphors does this open up?
Conclusions

• Big data is going Broad
  – World Wide Web trend towards more and more
    varied data
     • In many domains
        – E-commerce, Open Govt, many more (cf.
          Health/Medical care)

• Broad data requires thinking outside the
  “Database” box
  – Including considering access
• Broad data opens exciting possibilities for
  research and innovation
  – And, I hope, will help provide tools for making
    data more accessible

Data Big and Broad (Oxford, 2012)
