Sharing reusable phylogenetic data:
we're not there yet

Ross Mounce
@rmounce
http://orcid.org/0000-0002-3520-2046
A talk of
two halves
1.) Outlining the extent of the problem
(lack of) sharing, standards, care (?)
2.) What I'm trying to do about it:
Digging data out of PDFs
Re-releasing as
Where's the data?
Just ~4% of published phylogenetic studies in 2010
publicly archived their supporting phylo data in

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, & Vos R. 2012

Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis
BMC Research Notes 10.1186/1756-0500-5-574

Check our data yourself on Dryad here: 10.5061/dryad.h6pf365t
Scientists cannot be relied upon to
share published data upon request
This has been known for a while now
e.g. (in Psychology) Wicherts et al 2006
It has since been confirmed for phylogenetics too:
Drew et al 2013 'Lost Branches in the Tree of Life'
report that just ~16% of researchers contacted supplied
the requested ('published') phylo data.
My own experience tallies with this – I soon stopped bothering to try and
ask people via email for a copy of their published data. It's a waste of time.
The (Single) Supplementary Data File
was a Y2K solution – a dump
Many legacy journal supplementary data systems
bury data and leave it there to decompose
Often not in a re-usable form, e.g. a lazy PDF
Sometimes 'typeset', corrupting the data
A jumble of words & data where the bit you
want is on page 92 (no programmatic access)

Research Data: BURIED and really not very discoverable

Do reviewers even look at it? I think not tbh
I wasted too much of my PhD
trying to get usable data to re-analyze
This is what I felt like...

So I tried to do something
about it...

An open letter in support of
palaeontology data archiving
www.supportpalaeodatarchiving.co.uk

Which was picked up by Nature News
Which, in turn, got me in touch with:
Part 2
Since few will help you to re-use their data
You've got to dig it out
and
make it re-usable yourself
AND
re-release it openly
so no-one else wastes their time doing this
It's not just phylogenetics.
I learned from the Open Knowledge Conference (Berlin 2011)
that a lot of different academic fields also seem to struggle to
make re-usable published data available.

If it's a common, shared-problem...
why not seek a shared, cross-disciplinary solution?
AMI (Amanuensis)
Building upon tools first developed
in computational chemistry by the Murray-Rust lab
e.g.
ChemicalTagger → PhyloTagger (Entity tagging)
(Chem)PubCrawler → (Phylo)PubCrawler
(to get 10,000+ PDFs to work on)

https://bitbucket.org/nickday/pub-crawler
http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger
Open Source
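
To give a feel for what entity tagging means here, below is a minimal sketch in Python. It is not how PhyloTagger or ChemicalTagger work: the regular expression, the `BINOMIAL` name and the `tag_binomials` function are all assumptions made up for illustration. It simply flags "Genus epithet" pairs as candidate taxon names.

```python
import re

# Very rough pattern for a Latin binomial: capitalised genus + lowercase epithet.
# Illustrative assumption only; a real tagger would use dictionaries and context.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s([a-z]{3,})\b")


def tag_binomials(text):
    """Return (genus, epithet, start, end) for each candidate binomial in text."""
    return [
        (m.group(1), m.group(2), m.start(), m.end())
        for m in BINOMIAL.finditer(text)
    ]


sample = "UV vision was sequenced in Turdus iliacus and Sturnus vulgaris."
for genus, epithet, start, end in tag_binomials(sample):
    print(f"{genus} {epithet} [{start}:{end}]")
```

A pattern this naive will also flag ordinary sentence openings such as "The bird", so in practice the candidates would need checking against a list of known names.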
BBSRC grant approved
“PLUTo: Phyloinformatic Literature Unlocking Tools”
Software for making published phyloinformatic
data discoverable, open, and reusable
...I just need to get my PhD viva done & rubber-stamped

Instructions for getting the current working setup here:
(multiple repositories, dependencies & requirements!)
http://rossmounce.co.uk/2013/10/06/setting-up-ami2-on-windows/
PDF → AMI → HTML

Evolution of ultraviolet
vision in the largest avian
radiation - the passerines
Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4

Styles, superscripts
And diåcritics
preserved!
PDF 
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus

Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhamphus x 2
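
The "x 2" style tallies above are easy to reproduce once tip labels are available as text. A minimal Python sketch follows, assuming the genus is simply the first word of each label; the `genus_tally` function and the `tips` list are hypothetical, and only the genus names are taken from the slide.

```python
from collections import Counter


def genus_tally(tip_labels):
    """Count how many tip labels share each genus (first word of the label)."""
    genera = (label.split()[0] for label in tip_labels if label.strip())
    return Counter(genera)


# Hypothetical tip labels for illustration; only the genus names come from the slide.
tips = ["Orthonyx sp. A", "Orthonyx sp. B", "Malurus sp. A", "Turdus iliacus"]
for genus, n in sorted(genus_tally(tips).items()):
    print(genus if n == 1 else f"{genus} x {n}")
```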
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
AMI → NexML, HTML
Posterior probability: 0.84, 0.91, 0.93, 0.95
Branch lengths: 23.12, 34.54, 37.21, 38.55
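
Once branch lengths and support values have been pulled out of a figure, they need to go back into a machine-readable tree. The sketch below uses Newick rather than NexML, purely because it fits in a few lines; it is not AMI's code. The `Node` class, the `to_newick` function and the topology are hypothetical; only the numbers are the example values from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""                    # tip label; empty for internal nodes
    length: Optional[float] = None    # branch length leading to this node
    support: Optional[float] = None   # e.g. posterior probability (internal nodes)
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Serialise a subtree recursively; internal-node support goes in the label slot."""
    if node.children:
        inner = ",".join(to_newick(child) for child in node.children)
        label = "" if node.support is None else f"{node.support:g}"
        out = f"({inner}){label}"
    else:
        out = node.name
    if node.length is not None:
        out += f":{node.length:g}"
    return out


# Hypothetical topology; the numbers are the example values shown on the slide.
clade = Node(support=0.95, length=23.12, children=[
    Node("Malurus", length=34.54),
    Node("Amytornis", length=37.21),
])
root = Node(children=[clade, Node("Acanthisitta", length=38.55)])
print(to_newick(root) + ";")
# prints ((Malurus:34.54,Amytornis:37.21)0.95:23.12,Acanthisitta:38.55);
```

Putting the support value in the internal-node label slot is a common convention, but not every tree reader interprets it the same way, which is one reason a richer format such as NexML is attractive.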

Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
Acknowledgements & Thanks

For the Panton Fellowship,
inspiration and support

To the organisers
of both the session:
Nico, Hilmar, Rutger
and the conference
as a whole!

For travel & accommodation
support, without which I couldn't
possibly attend TDWG

My main collaborators on PLUTo: Matthew Wills and Peter Murray-Rust
