What is Big Data in Biomedicine?
Data Types to be considered

Susanna-Assunta Sansone, PhD

@biosharing
@isatools
@scientificdata

B-DEBATE: Big Data in Biomedicine. Challenges and Opportunities, 11 Nov, 2014 
Data Consultant, 
Honorary Academic Editor 
Associate Director, 
Principal Investigator
Let’s not forget the long tail of research data 
• Big science efforts represent only a small proportion
o often featuring homogeneous and well-organized data
• There is a large proportion of small independent research efforts
o a rich variety of specialty data sets
Let’s not forget the long tail of research data 
• Small independent research efforts fall in the long tail of the distribution
o Most of these data (such as siloed databases and null findings) are unpublished
o These dark data hold a potential wealth of knowledge
Plagued by selective reporting of data and methods 
• Over 50% of completed studies in biomedicine do not appear in the published literature
• Instead, they reside in file drawers and on personal hard drives
• Often this is because results do not conform to the authors' hypotheses
“Only half the health-related studies funded by the European Union between 1998 and 2006 - an expenditure of €6 billion - led to identifiable reports”
Role of data papers and data journals 
• Incentive, credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review focus
• Value of data vs. analysis
• Discoverability and reusability
o Complementing community databases
• Narrative/context
Role of data papers and data journals 
• The power of “small data” lies in their aggregation and integration with other datasets
• There is value in all well-curated, validated and reusable data – big and small
Adding value to research articles and data records 
[Diagram labels: Research articles; Data Descriptors; Data records]
• Credit for sharing your data
• Focused on reuse and reproducibility
• Peer reviewed, curated
• Open Access
• Promoting community data and code repositories
Progressively refine guidance to authors and reviewers 
[Figure: databases implementing standards. Counts shown: ~156, ~70, ~334. Source: BioPortal]
Standards named on the slide:
• MIAME, MIAPA, MIRIAM, MIX, MIQAS, MIGEN, MIAPE, CIMR, MIASE, REMARK, MIQE, CONSORT, MISFISHIE…
• MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, CML, GELML, SEDML…, MITAB, ISA-Tab
• AAO, CHEBI, OBI, PATO, ENVO, MOD, TEDDY, BTO, IDO…, XAO, PRO, DO, VO
Mapping the landscape of standards and databases
Help stakeholders to make informed decisions 
• Researchers, developers and curators lack support and guidance on how best to navigate and select content standards, understand their maturity, or find databases that implement them
• Funders, journals and librarians do not have enough information to make informed decisions about which content standards or databases to recommend in policies, fund or implement
Summarizing 
• Selective reporting of data and methods is still an issue
• Let’s not forget the potential value of the long tail of data
• Data papers and journals can provide incentive and credit to share more data - big and small
• Content standards do help - but the current wealth of options is an obstacle

Big data, small data, data papers - short statement for "BDebate on Biomedicine 2014"
