Digital Enterprise Research Institute                                         www.deri.ie




                                                  Data Curation at the
                                                   New York Times
                      Edward Curry, Andre Freitas, Seán O'Riain




 ed.curry@deri.org
 http://www.deri.org/
 http://www.EdwardCurry.org/
 Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Speaker Profile



            Research Scientist at the Digital Enterprise Research
             Institute (DERI)
                   Leading international web science research organization
            Researching how the web of data is changing the way
             businesses work and interact with information
                   Projects include studies of enterprise linked data, community-
                    based data curation, semantic data analytics, and semantic
                    search
                   Investigate utilization within the pharmaceutical, oil & gas,
                    financial, advertising, media, manufacturing, health care, ICT,
                    and automotive industries
            Invited speaker at the 2010 MIT Sloan CIO Symposium
             to an audience of more than 600 CIOs
Overview



            Curation Background
                   The Business Need for Curated Data
                   What is Data Curation?
                   Data Quality and Curation
                   How to Curate Data


            New York Times Case Study

            Best Practices from Case Study Learning
The Business Need



               Knowledge workers need:
                   Access to the right information
                   Confidence in that information


               Working with incomplete,
                inaccurate, or wrong
                information can have
                disastrous consequences
The Problems with Data



          Flawed Data
             Affects 25% of critical data in the world's top companies
                 (Gartner)

          Data Quality
             Recent banking crisis (Economist, Dec '09)
             Inaccurate figures made it difficult to manage operations
                 (investment exposure and risk)
                    –   “asset are defined differently in different programs”
                    –   “numbers did not always add up”
                    –   “departments do not trust each other's figures”
                    –   “figures … not worth the pixels they were made of”
What is Data Curation?


        Digital Curation
            Selection, preservation, maintenance, collection, and
                archiving of digital assets

        Data Curation
            Active management of data over its life-cycle

        Data Curators
            Ensure data is trustworthy, discoverable, accessible,
                reusable, and fit for use
                   – Museum cataloguers of the Internet age
What is Data Curation?




            Data Governance
                Convergence of data quality, data management,
                    business process management, and risk
                    management

            Data Curation is a complementary activity
                Part of the overall data governance strategy for the
                    organization

            Data Curator = Data Steward?
                   Overlapping terms between communities
Data Quality and Curation



            What is Data Quality?
                Desirable characteristics for an information resource
                Described as a series of quality dimensions
                       – Discoverability, Accessibility, Timeliness, Completeness,
                         Interpretation, Accuracy, Consistency, Provenance &
                         Reputation

            Data curation can be used to improve these
             quality dimensions
Data Quality and Curation



            Discoverability & Accessibility
                Curate to streamline search by storing and classifying
                    data in an appropriate and consistent manner

            Accuracy
                Curate to ensure data correctly represents the “real-
                    world” values it models

            Consistency
                Curate to ensure data is created and maintained using
                    standardized definitions, calculations, terms, and
                    identifiers
Data Quality and Curation




            Provenance & Reputation
                Curate to track the source of data and determine its reputation
                Curate to include the objectivity of the source/producer
                       – Is the information unbiased, unprejudiced, and impartial?
                       – Or does it come from a reputable but partisan source?




                       Other dimensions discussed in chapter
How to Curate Data




            Data Curation is a large field with sophisticated
             techniques and processes

            This section provides a high-level overview of:
                Should you curate data?
                Types of Curation
                Setting up a curation process


               Additional detail and references available in book
               chapter
Should You Curate Data?




            Curation can have multiple motivations
                Improving accessibility, quality, consistency,…

            Will the data benefit from curation?
                Identify the business case
                Determine if the potential return supports the investment

            Not all enterprise data should be curated
                Suits knowledge-centric data rather than transactional
                    operations data
Types of Data Curation



            Multiple approaches to curate data, no single
             correct way
                Who?
                       – Individual Curators
                       – Curation Departments
                       – Community-based Curation
                How?
                       – Manual Curation
                       – (Semi-)Automated
                       – Sheer Curation
Types of Data Curation – Who?




            Individual Data Curators
                Suitable for a small quantity of infrequently changing
                    data
                       – (<1,000 records)
                       – Minimal curation effort (minutes per record)
Types of Data Curation – Who?


            Curation Departments
                Curation experts working with subject matter experts
                    to curate data within a formal process
                       – Can deal with large curation effort (thousands of records)

            Limitations
                Scalability: Can struggle with large quantities of
                    dynamic data (>million records)
                Availability: Post-hoc nature creates delay in curated
                    data availability
Types of Data Curation - Who?



            Community-Based Data Curation
                Decentralized approach to data curation
                Crowd-sourcing the curation process
                       – Leverages community of users to curate data
                Wisdom of the community (crowd)
                Can scale to millions of records
Types of Data Curation – How?



            Manual Curation
                Curators directly manipulate data
                Can tie users up with low-value-add activities

            (Semi-)Automated Curation
                Algorithms can (semi-)automate curation activities
                    such as data cleansing, record deduplication and
                    classification
                Can be supervised or approved by human curators
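
A minimal sketch of the supervised, semi-automated pattern described above, assuming a simple name-normalization rule for spotting duplicates. The `Record` type, the `normalize` rule, and the `approve` callback are hypothetical names introduced for illustration; they are not taken from the slides or from any particular curation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: int
    name: str


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def propose_duplicates(records):
    """Automated step: pair each record with an earlier one that has the same normalized name."""
    first_seen = {}
    candidates = []
    for rec in records:
        key = normalize(rec.name)
        if key in first_seen:
            candidates.append((first_seen[key], rec))
        else:
            first_seen[key] = rec
    return candidates


def apply_approved(records, candidates, approve):
    """Human step: drop the later record of a pair only when the curator approves the merge."""
    dropped = {dup.record_id for keep, dup in candidates if approve(keep, dup)}
    return [rec for rec in records if rec.record_id not in dropped]
```

The algorithm only proposes candidate pairs; nothing is removed until a curator-supplied `approve` function says so, which keeps the human in the supervising role.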
Types of Data Curation – How?



            Sheer curation, or Curation at Source
                Curation activities integrated into the normal workflow of
                    those creating and managing data
                Can be as simple as vetting or “rating” the results of a
                    curation algorithm
                Results can be available immediately

            Blended Approaches: Best of Both
                Sheer curation + post hoc curation department
                Allows immediate access to curated data
                Ensures quality control with expert curation
Setting up a Curation Process




            5 steps to set up a curation process:
               1 - Identify what data you need to curate
               2 - Identify who will curate the data
               3 - Define the curation workflow
               4 - Identify appropriate data-in & data-out formats
               5 - Identify the artifacts, tools, and processes needed to
                   support the curation process
The New York Times




                             100 Years of Expert Data Curation
The New York Times


            Largest metropolitan and third largest
             newspaper in the United States


            nytimes.com
                    Most popular newspaper
                     website in US

            100-year-old curated
             repository defining its
             participation in the
             emerging Web of Data
The New York Times


       Data curation dates back to 1913
           Publisher/owner Adolph S. Ochs decided to provide a
               set of additions to the newspaper
       New York Times Index
           Organized catalog of article titles and summaries
                  – Containing issue, date and column of article
                  – Categorized by subject and names
                  – Introduced on a quarterly, then annual, basis
       Transitory content of newspaper became
        important source of searchable historical data
           Often used to settle historical debates
The New York Times


              Index Department was created in 1913
                Curation and cataloguing of NYT resources
                       – Since 1851 the NYT had kept a low-quality index for internal use

            Developed a comprehensive catalog using a
             controlled vocabulary
                Covering subjects, personal names, organizations,
                    geographic locations and titles of creative works
                    (books, movies, etc.), linked to articles and their
                    summaries

            Current Index Dept. has ~15 people
The New York Times



            Challenges with consistently and accurately
             classifying news articles over time
                Keywords expressing subjects may show some
                    variance due to cultural or legal constraints
                Identities of some entities, such as organizations and
                    places, changed over time

            Controlled vocabulary grew to hundreds of
             thousands of categories
                Adding complexity to classification process
The New York Times




            Increased importance of Web drove need to
             improve categorization of online content

            Curation carried out by Index Department
                Library-time (days to weeks)
                Print edition can handle next-day index

            Not suitable for real-time online publishing
                nytimes.com needed a same-day index
The New York Times


            Introduced a two-stage curation process
                Editorial staff perform best-effort semi-automated
                    sheer curation at the point of online publication
                       – Several hundred journalists
                Index Department follows up with long-term accurate
                    classification and archiving

            Benefits:
                Non-expert journalist curators provide instant
                    accessibility to online users
                Index Department provides long-term high-quality
                    curation in a “trust but verify” approach
NYT Curation Workflow

   1 - Curation starts when an article leaves the newsroom
   2 - A member of the editorial staff submits the article to a web-based,
       rule-based information extraction system (SAS Teragram)
   3 - Teragram uses linguistic extraction rules based on a subset of the
       Index Department's controlled vocabulary
   4 - Teragram suggests tags from the Index vocabulary that can
       potentially describe the content of the article
   5 - The editorial staff member selects the terms that best describe the
       contents and inserts new tags if necessary
   6 - Taxonomy managers review the tags and give feedback to editorial
       staff on the classification process
   7 - The article is published online at nytimes.com
   8 - At a later stage the article receives second-level curation by the
       Index Department: additional Index tags and a summary
   9 - The article is submitted to the NYT Index
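
The two-stage workflow can be sketched roughly as follows. This is an illustration only, assuming a tiny keyword-matching stand-in for the rule-based extraction step; it does not reproduce Teragram's rules or API, and the `VOCABULARY` entries and function names are hypothetical.

```python
import re

# Hypothetical slice of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Banks and Banking": ["bank", "banking"],
    "Linked Data": ["linked data", "web of data"],
    "Newspapers": ["newspaper"],
}


def suggest_tags(text, vocabulary=VOCABULARY):
    """Automated step: suggest each tag whose trigger phrase occurs as a whole phrase."""
    lowered = text.lower()
    suggestions = []
    for tag, phrases in vocabulary.items():
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
            suggestions.append(tag)
    return sorted(suggestions)


def editorial_review(suggested, accepted, new_tags=()):
    """First-level curation: the editor keeps some suggestions and may add new tags."""
    kept = [tag for tag in suggested if tag in accepted]
    return kept + [tag for tag in new_tags if tag not in kept]


def index_review(tags, extra_tags, summary):
    """Second-level curation: the index team adds tags and a summary after publication."""
    return {"tags": tags + [t for t in extra_tags if t not in tags], "summary": summary}
```

The split mirrors the "trust but verify" idea: the article can be published with the editor's tags straight away, and the fuller index record is produced later without blocking publication.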
The New York Times


           Early adopter of Linked Open Data (June '09)
The New York Times


    Linked Open Data @ data.nytimes.com
        Subset of 10,000 tags from index vocabulary
        Dataset of people, organizations & locations
               – Complemented by search services to consume data
                 about articles, movies, best sellers, Congress votes,
                 real estate,…
    Benefits
        Improves traffic by third party data usage
        Lowers development cost of new applications for
            different verticals inside the website
               – E.g. movies, travel, sports, books
Overview



            Curation Background
                   The Business Need for Curated Data
                   What is Data Curation?
                   Data Quality and Curation
                   How to Curate Data


            Case Study New York Times

            Best Practices from Case Study Learning
Best Practices from Case Study Learning


            Social Best Practices
                Participation
                Engagement
                Incentives
                Community Governance Models

            Technical Best Practices
                Data Representation
                Human and Automated Curation
                Track Provenance
Social Best Practices




            Participation
                Stakeholder involvement of data producers and
                    consumers must occur early in the project
                       – Provides insight into basic questions of what they want
                         to do, for whom, and what it will provide
                White papers are an effective means to present these
                    ideas and solicit opinion from the community
                       – Can be used to establish an informal 'social contract' for
                         the community
Social Best Practices




            Engagement
                Outreach activities essential for promotion and
                    feedback
                Typical consumers-to-contributors ratios of less than
                    5%
                Social communication and networking forums are
                    useful
                       – Majority of community may not communicate using
                         these media
                       – Communication by email still remains important
Social Best Practices




            Incentives
                Sheer curation needs line of sight from the data curating
                    activity to tangible exploitation benefits
                Lack of awareness of value proposition will slow
                    emergence of collaborative contributions
                Recognizing contributing curators through a formal
                    feedback mechanism
                       – Reinforces contribution culture
                       – Directly increases output quality
Social Best Practices




            Community Governance Models
                Effective governance structure is vital to ensure
                    success of community
                Internal communities and consortia perform well
                    when they leverage traditional corporate and
                    democratic governance models
                Open communities need to engage the community
                    within the governance process
                       – Follow less orthodox approaches using meritocratic
                         and autocratic principles
Technical Best Practices

            Data Representation
                Must be robust and standardized to encourage
                    community usage and tools development
                Support for legacy data formats and ability to
                    translate data forward to support new technology and
                    standards
            Human & Automated Curation
                Balancing the two will improve data quality
                Automated curation should always defer to, and never
                    override, human curation edits
                       – Automate validating data deposition and entry
                       – Target community at focused curation tasks
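
One way to read the rule that automated curation defers to human edits is as a merge order. The sketch below assumes curation edits are simple field-to-value mappings; `merge_curation` and its parameter names are hypothetical and stand in for whatever a real curation platform provides.

```python
def merge_curation(record, automated_edits, human_edits):
    """Apply automated edits first, then human edits, so a human value always wins."""
    merged = dict(record)            # copy, so the original record is left untouched
    merged.update(automated_edits)   # automated suggestions fill in or clean up fields
    merged.update(human_edits)       # human edits are applied last and are never overridden
    return merged
```

Re-running the automated step later is safe under this ordering, provided the human edits are stored separately and re-applied after it each time.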
Technical Best Practices



            Track Provenance
                All curation activities should be recorded and
                    maintained as part of the data provenance effort
                       – Especially where human curators are involved
                Users can have different perspectives of provenance
                       – A scientist may need to evaluate the fine-grained
                         experiment description behind the data
                       – For a business analyst the 'brand' of the data provider can
                         be sufficient for determining quality
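
Recording every curation activity could look something like the append-only log below. It is a sketch under stated assumptions: the `ProvenanceEvent` fields, the `ProvenanceLog` class, and the agent identifiers in the usage note are all hypothetical, chosen only to show a fine-grained view (full history) next to a coarse one (who touched the record).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceEvent:
    record_id: str
    field_name: str
    old_value: str
    new_value: str
    agent: str
    agent_type: str  # "human" or "automated"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ProvenanceLog:
    """Append-only log of curation activities."""

    def __init__(self):
        self._events = []

    def record(self, record_id, field_name, old_value, new_value, agent, agent_type):
        event = ProvenanceEvent(record_id, field_name, old_value, new_value, agent, agent_type)
        self._events.append(event)
        return event

    def history(self, record_id):
        """Fine-grained view: every change made to one record, in order."""
        return [e for e in self._events if e.record_id == record_id]

    def agents(self, record_id):
        """Coarse view: the distinct agents that touched the record, in first-seen order."""
        seen = []
        for event in self.history(record_id):
            if event.agent not in seen:
                seen.append(event.agent)
        return seen
```

For example, logging an automated tag suggestion from a `rule-extractor` agent followed by a correction from `editor-1` lets a detailed reviewer walk the full `history`, while someone who only cares about the source can stop at `agents`.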
Conclusions




        Data curation can ensure the quality of data and
         its fitness for use
        Pre-competitive data can be shared without
         conferring a commercial advantage
        Pre-competitive data communities
                Common curation tasks carried out once in the public
                    domain
                Reduces cost, increases quantity and quality
Acknowledgements


        Collaborators Andre Freitas & Seán O'Riain

        Insight from Thought Leaders
               Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product
                Development and Management), and Gregg Fenton (Director Emerging Platforms)
                from the New York Times
               Krista Thomas (Vice President, Marketing & Communications), Tom Tague
                (OpenCalais initiative Lead) from Thomson Reuters
               Antony Williams (VP of Strategic Development) from ChemSpider
               Helen Berman (Director), John Westbrook (Product Development) from the Protein
                Data Bank
               Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.

        The work presented has been funded by Science
         Foundation Ireland under Grant No. SFI/08/CE/I1380
         (Lion-2).
Further Information


The Role of Community-Driven
Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain




  In David Wood (ed.),
  Linking Enterprise Data, Springer, 2010.
  Available free at:
  http://3roundstones.com/led_book/led-curry-et-al.html

More Related Content

PPTX
Wikipedia (DBpedia): Crowdsourced Data Curation
PDF
Developing an Sustainable IT Capability: Lessons From Intel's Journey
PDF
Challenges Ahead for Converging Financial Data
PDF
Approximate Semantic Matching of Heterogeneous Events
PPTX
An Environmental Chargeback for Data Center and Cloud Computing Consumers
PPTX
The Role of Community-Driven Data Curation for Enterprises
PDF
Using Linked Data and the Internet of Things for Energy Management
PDF
Dealing with Semantic Heterogeneity in Real-Time Information
Wikipedia (DBpedia): Crowdsourced Data Curation
Developing an Sustainable IT Capability: Lessons From Intel's Journey
Challenges Ahead for Converging Financial Data
Approximate Semantic Matching of Heterogeneous Events
An Environmental Chargeback for Data Center and Cloud Computing Consumers
The Role of Community-Driven Data Curation for Enterprises
Using Linked Data and the Internet of Things for Energy Management
Dealing with Semantic Heterogeneity in Real-Time Information

What's hot (20)

PPTX
Building Optimisation using Scenario Modeling and Linked Data
PDF
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
PPT
Big Data Public Private Forum (BIG) @ European Data Forum 2013
PDF
Crowdsourcing Approaches to Big Data Curation for Earth Sciences
PPT
Querying Heterogeneous Datasets on the Linked Data Web
PDF
System of Systems Information Interoperability using a Linked Dataspace
PDF
Linked Building (Energy) Data
PDF
Enterprise Energy Management using a Linked Dataspace for Energy Intelligence
PDF
Sustainable IT for Energy Management: Approaches, Challenges, and Trends
PDF
The Big Data Value PPP: A Standardisation Opportunity for Europe
PDF
Transforming the European Data Economy: A Strategic Research and Innovation A...
PDF
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
PDF
SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing
PDF
A Capability Maturity Framework for Sustainable ICT
PDF
Key Technology Trends for Big Data in Europe
PDF
Linked Water Data For Water Information Management
PDF
Interactive Water Services: The Waternomics Approach
PDF
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
PPTX
Crowdsourcing Approaches for Smart City Open Data Management
PDF
Big Data and Big Data Management (BDM) with current Technologies –Review
Building Optimisation using Scenario Modeling and Linked Data
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
Big Data Public Private Forum (BIG) @ European Data Forum 2013
Crowdsourcing Approaches to Big Data Curation for Earth Sciences
Querying Heterogeneous Datasets on the Linked Data Web
System of Systems Information Interoperability using a Linked Dataspace
Linked Building (Energy) Data
Enterprise Energy Management using a Linked Dataspace for Energy Intelligence
Sustainable IT for Energy Management: Approaches, Challenges, and Trends
The Big Data Value PPP: A Standardisation Opportunity for Europe
Transforming the European Data Economy: A Strategic Research and Innovation A...
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing
A Capability Maturity Framework for Sustainable ICT
Key Technology Trends for Big Data in Europe
Linked Water Data For Water Information Management
Interactive Water Services: The Waternomics Approach
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
Crowdsourcing Approaches for Smart City Open Data Management
Big Data and Big Data Management (BDM) with current Technologies –Review
Ad

Viewers also liked (8)

PDF
Influenciencia del mundo emocional en el aprendizaje
PDF
Open Data Innovation in Smart Cities: Challenges and Trends
PDF
Towards a BIG Data Public Private Partnership
PDF
Improving Policy Coherence and Accessibility through Semantic Web Technologie...
PDF
Designing Next Generation Smart City Initiatives: Harnessing Findings And Les...
PDF
Citizen Actuation For Lightweight Energy Management
PDF
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
PDF
Towards Unified and Native Enrichment in Event Processing Systems
Influenciencia del mundo emocional en el aprendizaje
Open Data Innovation in Smart Cities: Challenges and Trends
Towards a BIG Data Public Private Partnership
Improving Policy Coherence and Accessibility through Semantic Web Technologie...
Designing Next Generation Smart City Initiatives: Harnessing Findings And Les...
Citizen Actuation For Lightweight Energy Management
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Towards Unified and Native Enrichment in Event Processing Systems
Ad

Similar to Data Curation at the New York Times (20)

PPTX
Metadata Standards and Organizational Resource Allocation: A Case for the Eff...
PPTX
Towards Expertise Modelling for Routing Data Cleaning Tasks within a Communit...
PPT
Envisioning a discussion dashboard for collective intelligence of web convers...
PDF
Manfred Linking the Real World
PDF
KMWorld Martin Briefing
PPTX
Data2030 Summit Data Megatrends Turner Sept 2022.pptx
PPTX
WikiSym2012 Deletion Discussions in Wikipedia: Decision Factors and Outcomes
PDF
Down to Business: Taking Action Quickly with Linked Data Services
PDF
Digital DNA for Organic Enterprises
PDF
Towards Patient Controlled Privacy
PPTX
Microsoft Purview Data Governance L100 Pitch Deck.PPTX
PPTX
2018 10 igneous
PPTX
Introduction to Open Data
PPTX
basic of data science and big data......
PDF
Externalization Trend
PDF
Data Curation at the New York Times

  • 1. Data Curation at the New York Times
      - Digital Enterprise Research Institute, www.deri.ie
      - Edward Curry, Andre Freitas, Seán O'Riain
      - ed.curry@deri.org
      - http://www.deri.org/
      - http://www.EdwardCurry.org/
      - Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  • 2. Speaker Profile
      - Research Scientist at the Digital Enterprise Research Institute (DERI)
          - Leading international web science research organization
      - Researching how the web of data is changing the way businesses work and interact with information
          - Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
          - Investigates utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
      - Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3. Overview
      - Curation Background
          - The Business Need for Curated Data
          - What is Data Curation?
          - Data Quality and Curation
          - How to Curate Data
      - New York Times Case Study
      - Best Practices from Case Study Learning
  • 4. The Business Need
      - Knowledge workers need:
          - Access to the right information
          - Confidence in that information
      - Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  • 5. The Problems with Data
      - Flawed Data
          - Affects 25% of critical data in the world's top companies (Gartner)
      - Data Quality
          - Recent banking crisis (Economist, Dec '09)
          - Inaccurate figures made it difficult to manage operations (investments exposure and risk)
              - “asset are defined differently in different programs”
              - “numbers did not always add up”
              - “departments do not trust each other's figures”
              - “figures … not worth the pixels they were made of”
  • 6. What is Data Curation?
      - Digital Curation
          - Selection, preservation, maintenance, collection, and archiving of digital assets
      - Data Curation
          - Active management of data over its life-cycle
      - Data Curators
          - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
              - Museum cataloguers of the Internet age
  • 7. What is Data Curation?
      - Data Governance
          - Convergence of data quality, data management, business process management, and risk management
      - Data Curation is a complementary activity
          - Part of the overall data governance strategy for an organization
      - Data Curator = Data Steward?
          - Overlapping terms between communities
  • 8. Data Quality and Curation
      - What is Data Quality?
          - Desirable characteristics for an information resource
          - Described as a series of quality dimensions
              - Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
      - Data curation can be used to improve these quality dimensions
  • 9. Data Quality and Curation
      - Discoverability & Accessibility
          - Curate to streamline search by storing and classifying in an appropriate and consistent manner
      - Accuracy
          - Curate to ensure data correctly represents the “real-world” values it models
      - Consistency
          - Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
  • 10. Data Quality and Curation
      - Provenance & Reputation
          - Curate to track source of data and determine reputation
          - Curate to include the objectivity of the source/producer
              - Is the information unbiased, unprejudiced, and impartial?
              - Or does it come from a reputable but partisan source?
      - Other dimensions discussed in chapter
  • 11. How to Curate Data
      - Data Curation is a large field with sophisticated techniques and processes
      - Section provides high-level overview on:
          - Should you curate data?
          - Types of Curation
          - Setting up a curation process
      - Additional detail and references available in book chapter
  • 12. Should You Curate Data?
      - Curation can have multiple motivations
          - Improving accessibility, quality, consistency, …
      - Will the data benefit from curation?
          - Identify business case
          - Determine if the potential return supports the investment
      - Not all enterprise data should be curated
          - Suits knowledge-centric data rather than transactional operations data
  • 13. Types of Data Curation
      - Multiple approaches to curate data, no single correct way
      - Who?
          - Individual Curators
          - Curation Departments
          - Community-based Curation
      - How?
          - Manual Curation
          - (Semi-)Automated
          - Sheer Curation
  • 14. Types of Data Curation: Who?
      - Individual Data Curators
          - Suitable for infrequently changing small quantity of data
              - (<1,000 records)
              - Minimal curation effort (minutes per record)
  • 15. Types of Data Curation: Who?
      - Curation Departments
          - Curation experts working with subject matter experts to curate data within a formal process
              - Can deal with large curation effort (thousands of records)
      - Limitations
          - Scalability: Can struggle with large quantities of dynamic data (>1 million records)
          - Availability: Post-hoc nature creates delay in curated data availability
  • 16. Types of Data Curation: Who?
      - Community-Based Data Curation
          - Decentralized approach to data curation
          - Crowd-sourcing the curation process
              - Leverages community of users to curate data
          - Wisdom of the community (crowd)
          - Can scale to millions of records
  • 17. Types of Data Curation: How?
      - Manual Curation
          - Curators directly manipulate data
          - Can tie users up with low value-add activities
      - (Semi-)Automated Curation
          - Algorithms can (semi-)automate curation activities such as data cleansing, record duplication and classification
          - Can be supervised or approved by human curators
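A minimal sketch of semi-automated curation with human approval follows. It is an illustration only, not a method from the talk: it reads "record duplication" as duplicate detection (an assumption), and the function names, the normalisation rule, and the sample records are all hypothetical.

```python
from collections import defaultdict


def normalise(name):
    """Crude normalisation (hypothetical rule): lower-case, drop dots and commas."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def candidate_duplicates(records):
    """Automated step: group record ids whose normalised names collide."""
    groups = defaultdict(list)
    for record_id, name in records.items():
        groups[normalise(name)].append(record_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def apply_approved(records, candidates, approve):
    """Human step: merge only the groups a curator approves.

    `approve` is a callable standing in for the curator's decision.
    The first id of each approved group is kept; the rest are dropped.
    """
    kept = dict(records)
    for group in candidates:
        if approve(group):
            for duplicate_id in group[1:]:
                kept.pop(duplicate_id, None)
    return kept


records = {"a": "I.B.M. Corp", "b": "IBM Corp", "c": "Intel"}
candidates = candidate_duplicates(records)
cleaned = apply_approved(records, candidates, approve=lambda group: True)
```

The algorithm only proposes merges; nothing is changed unless the curator's `approve` callback returns true, which mirrors the "supervised or approved by human curators" point above.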
  • 18. Types of Data Curation: How?
      - Sheer curation, or Curation at Source
          - Curation activities integrated in normal workflow of those creating and managing data
          - Can be as simple as vetting or “rating” the results of a curation algorithm
          - Results can be available immediately
      - Blended Approaches: Best of Both
          - Sheer curation + post hoc curation department
          - Allows immediate access to curated data
          - Ensures quality control with expert curation
  • 19. Setting up a Curation Process
      - 5 steps to set up a curation process:
          1 - Identify what data you need to curate
          2 - Identify who will curate the data
          3 - Define the curation workflow
          4 - Identify appropriate data-in & data-out formats
          5 - Identify the artifacts, tools, and processes needed to support the curation process
  • 20. The New York Times
      - 100 Years of Expert Data Curation
  • 21. The New York Times
      - Largest metropolitan and third largest newspaper in the United States
      - nytimes.com
          - Most popular newspaper website in US
      - 100-year-old curated repository defining its participation in the emerging Web of Data
  • 22. The New York Times
      - Data curation dates back to 1913
          - Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
      - New York Times Index
          - Organized catalog of article titles and summaries
              - Containing issue, date and column of article
              - Categorized by subject and names
              - Introduced on quarterly then annual basis
      - Transitory content of newspaper became important source of searchable historical data
          - Often used to settle historical debates
  • 23. The New York Times
      - Index Department was created in 1913
          - Curation and cataloguing of NYT resources
              - Since 1851 NYT had low quality index for internal use
      - Developed a comprehensive catalog using a controlled vocabulary
          - Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries
      - Current Index Dept. has ~15 people
  • 24. The New York Times
      - Challenges with consistently and accurately classifying news articles over time
          - Keywords expressing subjects may show some variance due to cultural or legal constraints
          - Identities of some entities, such as organizations and places, changed over time
      - Controlled vocabulary grew to hundreds of thousands of categories
          - Adding complexity to classification process
  • 25. The New York Times
      - Increased importance of Web drove need to improve categorization of online content
      - Curation carried out by Index Department
          - Library-time (days to weeks)
          - Print edition can handle next-day index
          - Not suitable for real-time online publishing
          - nytimes.com needed a same-day index
  • 26. The New York Times
      - Introduced two-stage curation process
          - Editorial staff performed best-effort semi-automated sheer curation at point of online publication
              - Several hundred journalists
          - Index Department follows up with long-term accurate classification and archiving
      - Benefits:
          - Non-expert journalist curators provide instant accessibility to online users
          - Index Department provides long-term high-quality curation in a “trust but verify” approach
  • 27. NYT Curation Workflow
      - Curation starts with article getting out of the newsroom
  • 28. NYT Curation Workflow
      - Member of editorial staff submits article to web-based, rule-based information extraction system (SAS Teragram)
  • 29. NYT Curation Workflow
      - Teragram uses linguistic extraction rules based on subset of Index Dept.'s controlled vocabulary
  • 30. NYT Curation Workflow
      - Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
  • 31. NYT Curation Workflow
      - Editorial staff member selects terms that best describe the contents and inserts new tags if necessary
  • 32. NYT Curation Workflow
      - Reviewed by the taxonomy managers with feedback to editorial staff on classification process
  • 33. NYT Curation Workflow
      - Article is published online at nytimes.com
  • 34. NYT Curation Workflow
      - At later stage article receives second-level curation by Index Dept.: additional Index tags and a summary
  • 35. NYT Curation Workflow
      - Article is submitted to NYT Index
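The workflow in slides 27 to 35 can be sketched roughly as below. This is a toy model under stated assumptions, not the NYT or SAS Teragram implementation: plain substring matching stands in for the linguistic extraction rules, and the vocabulary entries, class, and function names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical slice of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Banking and Finance": ["bank", "credit"],
    "Elections": ["election", "ballot"],
    "Motion Pictures": ["film", "movie"],
}


@dataclass
class Article:
    text: str
    suggested: list = field(default_factory=list)  # proposed by the rule-based step
    editorial: list = field(default_factory=list)  # chosen by editorial staff
    index: list = field(default_factory=list)      # final tags after expert review
    summary: str = ""


def suggest_tags(article, vocabulary=VOCABULARY):
    """Stage 1a: propose vocabulary tags whose trigger phrases occur in the text."""
    lowered = article.text.lower()
    article.suggested = sorted(
        tag for tag, phrases in vocabulary.items()
        if any(phrase in lowered for phrase in phrases)
    )
    return article.suggested


def editorial_select(article, accepted, new_tags=()):
    """Stage 1b: editor keeps the fitting suggestions and may insert new tags."""
    kept = [tag for tag in article.suggested if tag in accepted]
    article.editorial = kept + list(new_tags)
    return article.editorial


def index_review(article, extra_tags, summary):
    """Stage 2: expert curators add further tags and a summary after publication."""
    additions = [tag for tag in extra_tags if tag not in article.editorial]
    article.index = article.editorial + additions
    article.summary = summary
    return article


article = Article("The bank raised credit limits before the election.")
suggest_tags(article)
editorial_select(article, accepted={"Elections"}, new_tags=["Interest Rates"])
index_review(article, ["Banking and Finance"], "Credit limits rise ahead of vote.")
```

Keeping the suggested, editorial, and index tags in separate fields preserves who contributed what, which is useful for the "trust but verify" review described earlier.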
  • 36. The New York Times
      - Early adopter of Linked Open Data (June '09)
  • 37. The New York Times
      - Linked Open Data @ data.nytimes.com
          - Subset of 10,000 tags from index vocabulary
          - Dataset of people, organizations & locations
              - Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, …
      - Benefits
          - Improves traffic by third party data usage
          - Lowers development cost of new applications for different verticals inside the website
              - E.g. movies, travel, sports, books
  • 38. Overview
      - Curation Background
          - The Business Need for Curated Data
          - What is Data Curation?
          - Data Quality and Curation
          - How to Curate Data
      - Case Study New York Times
      - Best Practices from Case Study Learning
  • 39. Best Practices from Case Study Learning
      - Social Best Practices
          - Participation
          - Engagement
          - Incentives
          - Community Governance Models
      - Technical Best Practices
          - Data Representation
          - Human and Automated Curation
          - Track Provenance
  • 40. Social Best Practices
      - Participation
          - Stakeholder involvement for data producers and consumers must occur early in project
              - Provides insight into basic questions of what they want to do, for whom, and what it will provide
          - White papers are effective means to present these ideas, and solicit opinion from community
              - Can be used to establish informal 'social contract' for community
  • 41. Social Best Practices
      - Engagement
          - Outreach activities essential for promotion and feedback
          - Typical consumers-to-contributors ratios of less than 5%
          - Social communication and networking forums are useful
              - Majority of community may not communicate using these media
              - Communication by email still remains important
  • 42. Social Best Practices
      - Incentives
          - Sheer curation needs line of sight from data curating activity to tangible exploitation benefits
          - Lack of awareness of value proposition will slow emergence of collaborative contributions
          - Recognizing contributing curators through a formal feedback mechanism
              - Reinforces contribution culture
              - Directly increases output quality
  • 43. Social Best Practices
      - Community Governance Models
          - Effective governance structure is vital to ensure success of community
          - Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
          - Open communities need to engage the community within the governance process
              - Follow less orthodox approaches using meritocratic and autocratic principles
  • 44. Technical Best Practices
      - Data Representation
          - Must be robust and standardized to encourage community usage and tools development
          - Support for legacy data formats and ability to translate data forward to support new technology and standards
      - Human & Automated Curation
          - Balancing will improve data quality
          - Automated curation should always defer to, and never override, human curation edits
              - Automate validating data deposition and entry
              - Target community at focused curation tasks
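The rule that automated curation defers to human edits can be illustrated with a small merge function. This is a sketch under assumed data shapes (flat dictionaries of field values); the function name and the sample record are hypothetical.

```python
def merge_curation(record, automated, human):
    """Combine automated suggestions with human edits for one record.

    Automated values are applied only to fields that no human curator
    has edited; human edits are applied last and always win.
    """
    merged = dict(record)
    for key, value in automated.items():
        if key not in human:
            merged[key] = value
    merged.update(human)
    return merged


record = {"name": "NY Times", "city": "new york"}
automated = {"name": "New York Times Co.", "city": "New York"}
human = {"name": "The New York Times"}

merged = merge_curation(record, automated, human)
# merged == {"name": "The New York Times", "city": "New York"}
```

Re-running the automated step later cannot undo a curator's correction, because any field present in `human` is never overwritten.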
  • 45. Technical Best Practices
      - Track Provenance
          - All curation activities should be recorded and maintained as part of the data provenance effort
              - Especially where human curators are involved
          - Users can have different perspectives of provenance
              - A scientist may need to evaluate the fine-grained experiment description behind the data
              - For a business analyst the 'brand' of data provider can be sufficient for determining quality
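One way to record curation activities is an append-only event log, sketched below. The schema is an assumption made for illustration (the talk does not prescribe one), and the class names, field names, and sample identifiers are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CurationEvent:
    record_id: str
    attribute: str
    old_value: str
    new_value: str
    curator: str
    curator_type: str  # "human" or "automated"
    timestamp: str     # ISO 8601, UTC


class ProvenanceLog:
    """Append-only log of curation events (in-memory sketch)."""

    def __init__(self):
        self._events = []

    def record(self, record_id, attribute, old_value, new_value, curator, curator_type):
        event = CurationEvent(
            record_id, attribute, old_value, new_value, curator, curator_type,
            datetime.now(timezone.utc).isoformat(),
        )
        self._events.append(event)
        return event

    def history(self, record_id):
        """Fine-grained view: every change made to a record, in order."""
        return [e for e in self._events if e.record_id == record_id]

    def sources(self, record_id):
        """Coarse view: just the names of who touched the record."""
        return sorted({e.curator for e in self.history(record_id)})


log = ProvenanceLog()
log.record("art-1", "tags", "", "Elections", "editor-42", "human")
log.record("art-1", "tags", "Elections", "Elections; Ballots", "tagger-rules", "automated")
```

The two query methods correspond to the two perspectives on the slide: `history` gives the detailed trail a scientist might audit, while `sources` gives the short list of contributors a business analyst might judge by reputation.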
  • 46. Conclusions
      - Data curation can ensure the quality of data and its fitness for use
      - Pre-competitive data can be shared without conferring a commercial advantage
      - Pre-competitive data communities
          - Common curation tasks carried out once in public domain
          - Reduces cost, increases quantity and quality
  • 47. Acknowledgements
      - Collaborators Andre Freitas & Seán O'Riain
      - Insight from Thought Leaders
          - Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
          - Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
          - Antony Williams (VP of Strategic Development) from ChemSpider
          - Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank
          - Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance
      - The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  • 48. Further Information
      - The Role of Community-Driven Data Curation for Enterprises
      - Edward Curry, Andre Freitas, & Seán O'Riain
      - In David Wood (ed.), Linking Enterprise Data, Springer, 2010
      - Available free at: http://3roundstones.com/led_book/led-curry-et-al.html