Content + Signals
The value of the entire data estate for machine learning
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to
Corey Harper, Çağatay Demiralp, Marieke van Erp
ConTech Live 2021
Outline
• Where I’m coming from
• The Success of Machine Learning
• The Need for Data
• Reducing (Training) Data Acquisition Costs
• Implications + Actions
• A national federation of AI
research labs
• One ICAI head office
• Science Park Amsterdam
• Five ICAI locations
• Currently:
• Amsterdam (2)
• Delft
• Nijmegen
• Utrecht
ING AI for Fintech
Partnering with Industry
MACHINES CAN READ
https://demo.allennlp.org/reading-comprehension
MACHINES CAN READ
gluebenchmark.com
DEEP NEURAL NETWORKS
Adams Wei Yu, David Dohan,
Minh-Thang Luong, Rui Zhao,
Kai Chen, Mohammad Norouzi,
Quoc V. Le: QANet: Combining
Local Convolution with Global
Self-Attention for Reading
Comprehension. ICLR (Poster)
2018
Source: Sharir, Or, Barak Peleg, and
Yoav Shoham. "The Cost of Training
NLP Models: A Concise Overview." arXiv
preprint arXiv:2004.08900 (2020).
THE NEED FOR DATA
Lin, T. Y., Maire, M., Belongie, S.,
Hays, J., Perona, P., Ramanan, D., ...
& Zitnick, C. L. (2014, September).
Microsoft COCO: Common objects in
context. In European conference on
computer vision (pp. 740-755).
Springer, Cham.
THE NEED FOR ANNOTATED DATA
Zhang, Yuhao, et al. "Position-aware attention and supervised data improve slot filling."
Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing. 2017.
Annotation is Expensive
Reduce the Cost of Data?
Reduce the Cost of Data?
use what you have
Reduce the Cost of Annotated Data
http://ai.stanford.edu/blog/weak-supervision/
Transfer Learning
Source: Symeonidou, Anthi, Viachaslau Sazonau, and Paul Groth. "Transfer Learning for
Biomedical Named Entity Recognition with BioBERT." SEMANTICS Posters&Demos. 2019.
Transfer Learning
https://transformer.huggingface.co/doc/arxiv-nlp
Active Learning
prodi.gy
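As a rough illustration of the idea behind this slide, the sketch below shows uncertainty sampling, one common active learning strategy: the model's least confident predictions are sent to the annotator first, so each label is spent where it teaches the model most. The function names and the example scores are hypothetical and are not taken from the talk or from any particular tool's API.

```python
def uncertainty(probability):
    """Return 1.0 when the model is maximally unsure (p = 0.5), 0.0 when certain."""
    return 1.0 - abs(probability - 0.5) * 2.0


def select_for_annotation(scored_examples, budget):
    """Pick the `budget` examples whose predicted probability is closest to 0.5.

    `scored_examples` is a list of (text, probability_of_positive_class) pairs
    produced by whatever model is currently being trained (assumed here).
    """
    ranked = sorted(scored_examples, key=lambda item: uncertainty(item[1]), reverse=True)
    return [text for text, _ in ranked[:budget]]


# Hypothetical model scores for four unlabeled documents.
scored = [("doc a", 0.95), ("doc b", 0.52), ("doc c", 0.10), ("doc d", 0.40)]
to_label = select_for_annotation(scored, budget=2)
```

In a full loop, the newly labeled examples would be added to the training set, the model retrained, and the selection repeated.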
Source:
Stephen H. Bach et al. 2019. Snorkel DryBell: A Case Study in Deploying Weak
Supervision at Industrial Scale. In Proceedings of the 2019 International Conference on
Management of Data (SIGMOD '19). ACM, New York, NY, USA, 362-375. DOI:
https://doi.org/10.1145/3299869.3314036
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
Weak Supervision
The really long tail - smell extraction
Ryan Brate, Paul Groth and Marieke van Erp (2020) Towards
Olfactory Information Extraction from Text: A Case Study on
Detecting Smell Experiences in Novels. LaTeCH-CLfL 2020
Weak Supervision as Data Programming
http://ai.stanford.edu/blog/weak-supervision/
Supervision Sources / Signals
• Heuristics and rules: e.g. existing human-authored rules about the target
domain.
• Topic models, taggers, and classifiers: e.g. machine learning models about
the target domain or a related domain.
• Aggregate statistics: e.g. tracked metrics about the target domain.
• Knowledge or entity graphs: e.g. databases of facts about the target
domain.
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
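The signal types listed above can each be wrapped as a small "labeling function" that votes on an example or abstains, which is the core idea of data programming. The following is a minimal sketch under stated assumptions: the smell-detection task echoes the case study cited earlier, but the lexicon, the rules, and the simple majority vote are all hypothetical. Systems such as Snorkel learn how much to trust each function instead of counting votes equally.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical lexicon, standing in for a "knowledge or entity graph" signal.
SMELL_TERMS = {"stench", "aroma", "fragrance", "odour", "odor", "perfume"}


def lf_keyword(text):
    """Heuristic rule: vote POSITIVE when the text contains 'smell'."""
    return POSITIVE if "smell" in text.lower() else ABSTAIN


def lf_lexicon(text):
    """Lexicon lookup: vote POSITIVE when any word is a known smell term."""
    words = {w.strip(".,;!?").lower() for w in text.split()}
    return POSITIVE if words & SMELL_TERMS else ABSTAIN


def lf_short(text):
    """Assumed statistic-style rule: sentences under four words vote NEGATIVE."""
    return NEGATIVE if len(text.split()) < 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_keyword, lf_lexicon, lf_short]


def weak_label(text):
    """Combine votes by simple majority; abstain on no votes or on a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) > 1:
        return ABSTAIN
    return top_label
```

Calling `weak_label` over an unlabeled corpus yields noisy training labels without hand annotation; examples on which every function abstains are simply left out of the training set.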
Multi-modal Data
Source:
Dunnmon, J. A., Ratner, A. J., Saab, K., Khandwala,
N., Markert, M., Sagreiya, H., ... & Ré, C. (2020).
Cross-modal data programming enables rapid medical
machine learning. Patterns, 100019.
End user data programming
Source:
Data Programming by Demonstration: A
Framework for Interactively Learning
Labeling Functions.
S. Evensen, C. Ge, D. Choi, Ç. Demiralp
Findings of EMNLP (Ruler), 2020.
Supervision with Observation
Source:
Wang, Xin, Nicolas Thome, and Matthieu Cord. "Gaze latent
support vector machine for image classification improved by
weakly supervised region selection." Pattern Recognition 72
(2017): 59-71.
Implications
Premise → Consequence
• Improving ability to use expertise → Expertise is a critical resource
• Improving ability to use more and different signals → Signal capture becomes imperative
• Multiple content sources buttress each other → Understand and use the entire data estate
• Machine learning SOTA is accessible → Problem formulation is fundamental
knowledgescientist.org
Source:
Michael Lauruhn and Paul Groth.
"Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
Action 1: Make a map
https://a16z.com/2019/02/22/humanity-ai-better-together/
Action 2: Problems + Expertise
Conclusion
• Powerful ML models are available today
• Data is the essential driver
• Don’t overlook your resources:
• your content, your expertise, your customer insight
Paul Groth | p.groth@uva.nl | @pgroth | pgroth.com | indelab.org







Editor's Notes

  • #11: 330K images (>200K labeled), 1.5 million object instances