The Science of Data Science
(Data plus Semantics yields Knowledge)
Prof. James Hendler
Tetherless World Constellation Chair of Computer, Web and Cognitive Sciences
Director, The Rensselaer IDEA
1
The Rensselaer Institute for Data
Exploration and Applications
Performance Plan to Budget
Presentation
February 2015
The Rensselaer Institute for Data Exploration and Applications (IDEA) is a
breakthrough initiative that brings together key research areas and advanced
technologies to revolutionize the way we use data in science, engineering,
and virtually every other research and educational discipline. By bridging the
gaps between analytics, modeling, and simulation we continue the
Rensselaer tradition as a leader in applying critical technologies to improving
everyday life and meeting the challenges of the future.
3
The Rensselaer Institute for Data Exploration and Applications
• Business Systems
• Built and Natural Environments
• Cyber-Resiliency
• Policy, Ethics and Stewardship
• Materials Informatics: Data-driven Physical/Life Sciences
• Healthcare Analytics and Mobile Health
• Social Network Analytics
• Agents and Augmented Reality
4
IDEA project examples
• Healthcare in Context: Data mining/analytics to improve public health from a systems perspective at the individual to national scales.
• Data-Centric Engineering Design: Data-driven design and control under uncertainty via data fusion across multiple scales and sources.
• Supply Chain Resilience through Information Visibility: Demonstrate uses of supply chain information visibility for anticipating, mitigating and recovering from disruptive events.
• Accelerated design of functional materials/Material Ontology: Address basic materials processing data-based informatics for complex, multifunctional (often nano) materials.
• Biome-informatics: Develop data aggregation and computational tools to integrate disparate datasets into large ecosystem models using data collected on the microbial communities that inhabit the base of most ecosystems.
• Deducing Structure to Function in Biomedicine: Develop systematic data-resourced methods for discovering and exploiting structure-to-function relationships.
5
KDD Pipeline – as usually presented
Data Storage (Big Data Warehouse)
KDD Pipeline – in the real world
Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control, at increasing rates and scales.
6
[Figure: KDD pipeline with many data stores. Sources: sensors and apps, social media, customer behaviors, Web, partners, open data. Preparation steps: formatting, standards use, data cleansing, data bias analysis, …]
Tough data integration challenges
Enterprise
analytics
Open Data
Integration
Hard
problems!
Closing the loop on (big) data
IDEA is focusing on key data science areas which are revolutionizing engineering, science and business with significant social impact
8
• Predictive Analytics
• Discovery Informatics
• Data Exploration
Theme 1: Predictive Analytics
9
From “what is” to “what if”
Example: Healthcare Data Analytics
The Digital Universe of Data to Better Diagnose and Treat Patients
Courtesy of Eric Schadt, Mount Sinai
Identifying predictive features in data
Each factor must be separately analyzed for its “Predictivity”
• Mutual information measure
The “black art” of predictive analytics is finding the right factors
• Use too few, and the model is weak
• Use too many, and the model becomes slow and dominated by noise
Algorithms are required to do this because the overwhelming number of “weak” factors defies human ability to combine them
• Machine learning identifies key features
• some require “roll ups”
• some require “pull outs”
• Mathematical techniques then reduce the dimensionality
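As a rough illustration of the mutual information measure named above, the sketch below scores each candidate factor against an outcome and ranks the factors by score. It assumes discrete (categorical) factors; the feature names and toy data are hypothetical and are not taken from the healthcare example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(features, labels):
    """Return (name, score) pairs, most informative factor first."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: one factor mirrors the outcome, one is unrelated to it.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "copy_of_label": [0, 0, 1, 1, 0, 0, 1, 1],
    "coin_flip": [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = rank_features(features, labels)
```

Scoring each factor on its own is cheap, but it cannot see factors that are only predictive in combination, which is one reason the later dimensionality-reduction step matters.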
11
12
Predictive analytics in sensors
Extend-o-hand
(Josh Shinavier, PhD)
Classification of the sensor data (via machine-learning) allows predictive recognition
of different gestures (i.e. before the gesture is finished).
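The idea of recognizing a gesture before it finishes can be sketched as follows. This is a nearest-template stand-in rather than the learned classifier used in the project: it compares a partial sensor trace with the same-length prefix of each stored gesture. The gesture names and trace values are hypothetical.

```python
def classify_prefix(partial, templates):
    """Return the label whose template prefix is closest to the partial trace."""
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        k = min(len(partial), len(template))
        if k == 0:
            continue
        # Mean squared difference over the samples seen so far.
        dist = sum((partial[i] - template[i]) ** 2 for i in range(k)) / k
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical one-dimensional sensor traces for two complete gestures.
templates = {
    "wave": [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5],
    "punch": [0.0, 1.5, 3.0, 3.0, 1.5, 0.0, 0.0, 0.0],
}

# Only the first three samples of a new gesture have arrived.
partial = [0.1, 1.4, 2.8]
early_guess = classify_prefix(partial, templates)
```

A trained model would replace the distance computation, but the interface stays the same: a label is produced from an incomplete trace.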
13
Predictive analytics in large scale behaviors
“List clusters at risk for Asian Clams <1 mile Cook’s Bay.”
Machine learning predicts future distributions of invasive species in Lake George based on current distributions and bathymetry similarity.
Predictive Social Network Analytics (with RPI NeST center)
14
Social Networks in Action
Analyzing cascading failures
Modeling (supply chain) networks… and predicting (cascading) network risks.
Modeling network stressors (including human cognitive element)
Understanding network dynamics
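A minimal sketch of a cascading failure, assuming a simple load-redistribution rule: when a node fails, its load is split equally among its surviving neighbors, and any neighbor pushed over capacity fails in turn. This is a deliberately simplified model, not the one used in the research, and the network, loads and capacities are hypothetical.

```python
from collections import deque


def simulate_cascade(neighbors, load, capacity, initial_failure):
    """Return the set of nodes that fail after one initial failure."""
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = set()
    queue = deque([initial_failure])
    while queue:
        node = queue.popleft()
        if node in failed:
            continue
        failed.add(node)
        alive = [n for n in neighbors[node] if n not in failed]
        if not alive:
            continue  # nowhere to shed load; it is simply lost
        share = load[node] / len(alive)
        for n in alive:
            load[n] += share
            if load[n] > capacity[n]:
                queue.append(n)
    return failed


# Hypothetical four-node supply chain: A - B - C - D
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
load = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
capacity = {"A": 2.0, "B": 1.5, "C": 3.0, "D": 3.0}

failed = simulate_cascade(neighbors, load, capacity, "A")
```

Here the failure of A overloads B, while C has enough spare capacity to absorb the combined load and stop the cascade.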
15
Data Science Research Center: tools for data analytics
Theory & Algorithms
• Randomized
• Optimization
• Approximation
• Multilinear Algebra
Applications
Statistics
• Multivariate analysis
• Optimal Experimental Design
Dimension reduction by randomized algorithms for numerical linear algebra, to identify significant components and visualize petabyte-scale data matrices (P. Drineas, CSCI)
Parallel Factor Analysis for tensor systems creates a scalable
solution, on AMOS, for a critical data-processing component of
data analytics for large graphs. (B. Yener, CSCI)
Computational concerns
• Scaling
• Cyber Security for Data
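To give a feel for randomized dimension reduction, the sketch below uses a Gaussian random projection, one well-known randomized technique; it is an illustrative stand-in and not necessarily the algorithm used in the work cited above. The function name, seed and toy data are assumptions.

```python
import math
import random


def random_projection(rows, target_dim, seed=0):
    """Project each row onto target_dim random Gaussian directions."""
    rng = random.Random(seed)
    source_dim = len(rows[0])
    # Scaling by 1/sqrt(target_dim) keeps squared lengths unchanged in expectation.
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
        for _ in range(source_dim)
    ]
    return [
        [sum(row[i] * matrix[i][j] for i in range(source_dim)) for j in range(target_dim)]
        for row in rows
    ]


# Hypothetical data: 5 points in 50 dimensions, reduced to 8 dimensions.
data_rng = random.Random(42)
points = [[data_rng.random() for _ in range(50)] for _ in range(5)]
reduced = random_projection(points, target_dim=8)
```

The projection matrix does not depend on the data, so it can be generated independently on each worker from a shared seed, which is part of what makes this family of methods attractive at scale.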
Adding Semantics: Discovery Informatics
16
From “what if” to “why”
17
Scientific data: Microbiome informatics
Human Biome
Environmental Biome
Built Environment
Data Analytics
Semantic Data Integration
While microbes are among the smallest organisms on the planet, they are also the largest influence on mass and nutrient transport in the biosphere. They are the base of most natural ecosystems, as well as the purveyors of air and water quality. It is also microbes that primarily govern disease transfer and human health in our built environments.
18
Materials Processing Ontology (cMDIS/IDEA)
The materials field has made much progress on systematically understanding materials structure-to-property relationships, but lacks an organized model of processing-to-property relations.
A critical need for systematic development of new materials technologies!
Goal: Create a (machine-readable) ontology for materials processing.
By combining our expertise in data science, materials and manufacturing, we are creating a key missing link in the Materials Genome Initiative.
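What “machine-readable” buys can be sketched with a tiny set of subject-predicate-object statements and a pattern query. Every class, relation and property name below is hypothetical and is not drawn from the actual materials processing ontology; a real ontology would be expressed in a standard such as RDF/OWL rather than Python tuples.

```python
# Hypothetical processing-to-property statements as (subject, predicate, object).
triples = {
    ("Annealing", "is_a", "ProcessingStep"),
    ("Annealing", "has_parameter", "Temperature"),
    ("Annealing", "affects_property", "GrainSize"),
    ("Sintering", "is_a", "ProcessingStep"),
    ("Sintering", "affects_property", "Porosity"),
    ("GrainSize", "is_a", "MaterialProperty"),
    ("Porosity", "is_a", "MaterialProperty"),
}


def match(pattern, store):
    """Return sorted triples matching the pattern; None acts as a wildcard."""
    s, p, o = pattern
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which processing steps are recorded as affecting porosity?
steps_affecting_porosity = [
    subj for subj, _, _ in match((None, "affects_property", "Porosity"), triples)
]

# Everything the store says about annealing.
annealing_facts = match(("Annealing", None, None), triples)
```

Once processing-to-property relations are stated this way, the same query works across datasets from different labs, which is the integration benefit the slide points to.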
Some questions need a qualitative answer
Platform for Experimental Collaborative Ethnography
20
Discovery Informatics Requires Unstructured data
Integration of text analytics, natural language processing, network-based multimedia analysis and structured/unstructured data integration
Requires Unstructured data (real-time feeds/images/video)
DOE SEAB report on HPC: How might a neuromorphic “accelerator” type processor be used to improve the application performance, power consumption and overall system reliability of future exascale systems?
21
Power Consumption (w/IBM)
Network Learning (sensors)
Sparse Distributed Representations
Hybrid Neural/Symbolic Systems
Neuromorphic Computing: software systems that implement models
inspired by neural systems to analyze data tied to perception, motor control,
or multisensory integration.
22
Neuromorphic Computing (CCI/IDEA)
Joint CCI/IDEA project to use a supercomputer to model state-of-the-art neuromorphic processors
Use for improving AMOS energy use (like autonomic control)
Use for exploring inputs from data-sensing systems (extrinsic control)
Neuromorphic Computing requires critical Rensselaer technologies
Integrating data analytics (on the fly) with simulation and modeling
CCI (AMOS) allows us to explore new variants on neuromorphic approaches
IDEA provides learning models and analytics capabilities for evaluation
Together they allow us to attack audio/visual streaming data
Theme 3: Data Exploration
23
From “why” to “what is”
24
From visualization to exploration
… Unfortunately, visualization too often becomes an end product of scientific analysis,
rather than an exploration tool that scientists can use throughout the research life cycle.
However, new database technologies, coupled with emerging Web-based technologies,
may hold the key to lowering the cost of visualization generation and allow it to become
a more integral part of the scientific process.
26
From what is, to what if, to why (and back)
These capabilities are critical in “closing the loop” between data,
simulation and modeling in scientific discovery, engineering
design, and business innovation.
27
A “Data Science” Research Agenda
Multiscale
Sparsity
Abductive
Agent-oriented
• Gathering and representing information from multiple sources
  • topic of CODS talk
• Systematic (and scalable) methods for predictive analytics
  • example: Parallel search for best kernel functions
28
Supporting the Scientific agenda
• New Data Exploration platforms
  • example: Patent pending on new multi-user collaborative device
• Cognitive and immersive platforms
• Data sharing standards
  • Research Data Alliance
  • W3C
The Rensselaer IDEA
Summary
• Data is not just the “oil” of the new generation
  • information is the new power source generated from that “oil”
• Using data for prediction is becoming less of an art, but still needs systematicity
  • Scaling tools beyond MapReduce
  • Better methods for rapid customization
• Turning data into causal or design knowledge is in its early stages
  • Closing the loop from data to design requires new informatics, new mathematics, and new ways of thinking beyond data mining
29


Editor's Notes

  • #12: Ones with numbers are secondary diagnosis indicator variables. * indicates categorical variables. In practice, during modeling there is one “predictivity” index
  • #13: Working with faculty from SoS, SoE, HASS and SoA
  • #15: (in the UCTE power grid network, employing capacity-limited current flows in resistor networks).
  • #20: Put it on a slide: show an example of someone using it. Current version of PECE is running over 4 different projects (Disaster STS Network, The Asthma Files, World PECE, World Academia). Largest site (DSTS Network) has 35 users from universities all over the country and 14 different user groups. “Feature” lists of moveable functionality and modules that can be ported to any Drupal site.
  • #27, #28: This is the data science agenda. Basically, these are the hard problems in closing the loop: how to go from the correlation on one side to the causal on the other. I don’t love the term agent-oriented, but we mean a combination of unstructured, AI, etc. Abductive is usually where I talk about these being hard inverse problems where we don’t know a specific function, but rather are looking for an explanation.