Educating a New Breed of
Data Scientists for Scientific
Data Management
Jian Qin
School of Information Studies
Syracuse University
Microsoft eScience Workshop, Chicago, October 9, 2012

   Talk points
        ›  Data science (DS) and data scientists in the context of
           scientific data
        ›  An iSchool version of the DS curriculum
        ›  Findings and lessons from implementing the DS curriculum
        ›  A new breed of data scientists: the iSchool approach




                               10/9/12     MICROSOFT ESCIENCE WORKSHOP 2012   2
What is data science?

   “An emerging area of work concerned with the collection, presentation, analysis, visualization, management, and preservation of large collections of information.”

   Stanton, J. (2012). Introduction to Data Science. http://ischool.syr.edu/media/documents/2012/3/DataScienceBook1_1.pdf

   Data science and scientific research
        ›  Plan, design, consult for, implement, and evaluate data management projects and services
        ›  Ingest, store, organize, merge, filter, and transform data and create analysis-ready data

What data scientists are expected to do: the job market

   Scientific Data Management Specialist
        •  Design, develop, implement, and manage high-throughput automatic data processing infrastructure for large databases in a mature system
        •  Develop and improve the infrastructure supporting this system
        •  Interface with multiple data providers to design, build, and maintain their customized databases
        •  Clarify requirements, feature requests, and bug reports for software developers and assist in testing code

   Laboratory Data Management Specialist
        •  Administer operational database
        •  Assure the quality of data database content
        •  Interact closely with researchers, lab managers, and platform coordinators
        •  Track deliverables against budget and prepare data reports
        •  Collaborate closely with IT and bioinformatics colleagues
        •  Assist IT in gathering workflow requirements
        •  Test changes and updates in IT systems
        •  Create and maintain app documentation

   Data Modeling/Management Specialist
        •  Work closely with the high performance computing and the IT manager
        •  Develop a data model for complex multi-scale rocks
        •  Design and organize a database and complex queries
        •  Integrate and manage multi-scale rocks subjected to large-scale scientific computing applications

   http://www.bioinformatics.org/forums/forum.php?forum_id=9670
   http://www.ingrainrocks.com/data-management-specialist/

   “We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.”

   Loukides, M. (2011). What is data science? Sebastopol, CA: O’Reilly.

What data scientists are expected to do: the difference from tradition
        ›  Data scientists are more likely to be involved across the
           data lifecycle:
           –  Acquiring new data sets: 33%
           –  Parsing data sets: 29%
           –  Filtering and organizing data: 40%
           –  Mining data for patterns: 30%
           –  Advanced algorithms to solve analytical problems: 29%
           –  Representing data visually: 38%
           –  Telling a story with data: 34%
           –  Interacting with data dynamically: 37%
           –  Making business decisions based on data: 40%
    http://mashable.com/2012/01/13/career-of-the-future-data-scientist-infographic/

How should educational programs address the challenge?
   A case of the CAS in Data Science program at Syracuse iSchool

Story 1: cognitively demanding workflows and data management
›  Domain: Thermochronology and tectonics
›  What’s involved: rock samples from drilling and field observation, sliced and
   grained rock samples
›  Data types: Excel data files (lots of them), spectrum and microscopic images,
   annotations
›  Analysis: modeling and sensemaking by combining data from multiple data
   files with specialized software
›  Bottleneck problem: manually matching, merging, and filtering data is extremely
   cumbersome, and the problem is compounded by the difficulty of finding the right
   data files
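
The matching, merging, and filtering step named in the bottleneck can be sketched in a few lines. This is only an illustration under stated assumptions: the two small tables stand in for spreadsheet exports, and the column names (sample_id, age_ma, depth_m) and the 20 Ma threshold are hypothetical, not taken from the project described in the slide.

```python
import csv
import io

# Hypothetical stand-ins for two of the many spreadsheet exports; the
# column names (sample_id, age_ma, depth_m) are assumptions for illustration.
AGES_CSV = """sample_id,age_ma
S1,12.5
S2,48.0
S3,7.1
"""
DEPTHS_CSV = """sample_id,depth_m
S1,150
S2,900
S4,300
"""


def read_table(text):
    """Parse CSV text into a dict of rows keyed by sample_id."""
    return {row["sample_id"]: row for row in csv.DictReader(io.StringIO(text))}


def merge_tables(left, right):
    """Inner-join two tables on sample_id, combining their columns."""
    merged = []
    for sample_id in sorted(left.keys() & right.keys()):
        record = dict(left[sample_id])
        record.update(right[sample_id])
        merged.append(record)
    return merged


def filter_by_max_age(records, max_age_ma):
    """Keep records whose age (in Ma) is at or below the threshold."""
    return [r for r in records if float(r["age_ma"]) <= max_age_ma]


merged = merge_tables(read_table(AGES_CSV), read_table(DEPTHS_CSV))
young = filter_by_max_age(merged, 20.0)
for record in young:
    print(record["sample_id"], record["age_ma"], record["depth_m"])
```

Samples that appear in only one table (S3 and S4 here) drop out of the inner join, which is the kind of mismatch that is tedious to spot by hand across many files.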
   Story 2: highly automated workflows
    ›  Domain: Astrophysics: gravitational wave detection
    ›  What’s involved: data ingestion from laser interferometers, raw data
       calibration and segmentation, workflow management, provenance

    ›  Data types: streaming data from
       the laser interferometers, images
    ›  Analysis: detection of “events”
    ›  Bottleneck problem: tracking of
       data and processes and the
       relationships between them
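
One way to picture the tracking problem is a small provenance log that records which process produced each data item from which inputs. The sketch below is a toy illustration, not the system used in the gravitational wave workflow; the class name ProvenanceLog and the step and item names (calibrate, segment, detect_events, raw_frame, and so on) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Records which process produced each data item from which inputs."""

    # Maps an output name to (process name, list of input names).
    derivations: dict = field(default_factory=dict)

    def record(self, process, inputs, output):
        """Store one derivation step."""
        self.derivations[output] = (process, list(inputs))

    def lineage(self, item):
        """Return the upstream (output, process, inputs) steps behind an item."""
        steps = []
        stack = [item]
        seen = set()
        while stack:
            current = stack.pop()
            # Skip items already visited and raw inputs with no recorded source.
            if current in seen or current not in self.derivations:
                continue
            seen.add(current)
            process, inputs = self.derivations[current]
            steps.append((current, process, inputs))
            stack.extend(inputs)
        return steps


# Hypothetical step and item names, loosely following the slide's stages.
log = ProvenanceLog()
log.record("calibrate", ["raw_frame"], "calibrated_frame")
log.record("segment", ["calibrated_frame"], "segment_01")
log.record("detect_events", ["segment_01"], "candidate_events")

for output, process, inputs in log.lineage("candidate_events"):
    print(f"{output} <- {process}({', '.join(inputs)})")
```

Walking the log backwards from a detected event answers the question the slide raises: which data and which processes led to this result.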

What is expected of data scientists?
        ›  Ability to use a wide variety of tools for documentation, analysis, and reporting of data
        ›  Knowledge of a subject domain
        ›  Data modeling, database and query design
        ›  OS, programming languages
        ›  Encoding languages
        ›  Content and repository systems
        ›  Collaboration, communication, and co-ordination

Analytical skills: domain modeling
        ›  Requirement analysis
        ›  Workflow analysis
        ›  Data modeling
        ›  Data transformation needs analysis
        ›  Data provenance needs analysis

        ›  Interview skills, analysis and generalization skills
        ›  Ability to capture components and sequences in workflows
        ›  Ability to translate domain analysis into data models
        ›  Ability to envision the data model within the larger system architecture

Analytical skills: from data sources to patterns, relationships, and trends
   [Diagram labels: Analytical tools; “Hacking”; Knowledge; Data products]

Data management skills: data lifecycle and infrastructural services
        ›  Metadata standards
        ›  Encoding language
        ›  Semantic control
        ›  Identify management
        ›  Processed, transformed, derived, calculated, … data
        ›  Common data format
        ›  Image formats
        ›  Matrix formats
        ›  Microarray file formats
        ›  Communication protocols
        ›  Infrastructural services
           •  Data source discovery
           •  Data curation
           •  Data preservation
           •  Data integration and mashup
           •  Data citation, publication, and distribution
           •  Data linking and interoperability
           •  …

Technology skills with excellent communication skills

        TECHNOLOGY SKILLS
        ›  Operating systems
        ›  Repository systems
        ›  Database systems
        ›  Programming languages
        ›  Encoding languages
        ›  Specialized programming

        COMMUNICATION SKILLS
        ›  Interviews
        ›  “Ice breaking”
        ›  Community building
        ›  Institutionalization
        ›  Stakeholder buy-in

No superman model for beginning data scientists
        ›  Data scientists, core: Applied data science; Databases
        ›  Data analytics
        ›  Data storage and management
        ›  General system management
        ›  Data visualization

   The CAS in Data Science program at SU
        ›  Required:
           –  Data Administration Concepts and Database Management
           –  Applied Data Science
        ›  Elective:

           Data Analytics
           •  Data Mining
           •  Basics of Information Retrieval Systems
           •  Natural Language Processing
           •  Advanced Information Analytics
           •  Research Methods
           •  Statistical Methods

           Data Storage and Management
           •  Technologies for Web Content Management
           •  Foundations of Digital Data
           •  Creating, Managing, and Preserving Digital Assets
           •  Data Warehousing
           •  Advanced Database Management

           Data Visualization
           •  Information Architecture for Internet Services
           •  Information Visualization

           General Systems Management
           •  Enterprise Technologies
           •  Managing Information Systems Projects
           •  Information Systems Analysis

What we learned from the program development
        ›  Data science is a moving target with multiple focal points
          –  Versions from statistics, computer science, and library and
             information science
        ›  Skills vs. theories
          –  Students are anxious to learn skills but not so interested in
             theories
          –  Theories help build visions
        ›  Sufficient hands-on time for technologies and tools
        ›  Authentic learning through real-world data management
           projects
   Reconciliation of the two views of data science

   “An emerging area of work concerned with the collection, presentation, analysis, visualization, management, and preservation of large collections of information.”
   Stanton, J. (2012). Introduction to Data Science. http://ischool.syr.edu/media/documents/2012/3/DataScienceBook1_1.pdf

   “We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.”
   Loukides, M. (2011). What is data science? Sebastopol, CA: O’Reilly.

The iSchool’s version of data science education
        ›  Ability to use a wide variety of tools for documentation, analysis, and reporting of data
        ›  Knowledge of a subject domain
        ›  Data modeling, database and query design
        ›  OS, programming languages
        ›  Encoding languages
        ›  Content and repository systems
        ›  Collaboration, communication, and co-ordination

   Eventually the iSchool data science program will build the foundation for super data scientists…

        eScience Librarianship Curriculum Project:
        http://eslib.ischool.syr.edu/

        Science Data Literacy Project:
        http://sdl.syr.edu/

        CAS in Data Science:
        http://ischool.syr.edu/future/cas/datascience.aspx
