Mining Scientific Diagrams for facts

Mining Scientific Images
Peter Murray-Rust,
Dept of Chemistry and TheContentMine
DAMTP, Cambridge, UK, 2016-01-27
contentmine.org is supported by a grant to PMR as a

The Right to Read is the Right to Mine (Peter Murray-Rust, 2011)
http://contentmine.org

Output of scholarly publishing
586,364 Crossref DOIs per month (2015-07) [1]
2.5 million (papers + supplemental data) per year [citation needed]*
each 3 mm thick
4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

Most publishers destroy structured
information (LaTeX, Word) by flattening it into PDF …
• Characters (NOT words or higher structure)
“WORD” is simply 4 characters; there are no space characters
• Paths (NOT circles, squares …) “Vectors”
… APIs then destroy it further into pixels
(e.g. PNG or JPG)
ContentMine will read 10,000 PNGs a day and
try to recover the science.
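Since a PDF stores only positioned characters, words have to be rebuilt from geometry. The following is a minimal sketch of that idea, not ContentMine's actual code: the `(text, x, width)` tuple layout, the function name `group_into_words` and the `gap_factor` value are all assumptions made for illustration.

```python
def group_into_words(chars, gap_factor=0.3):
    """Group positioned characters on one baseline into words.

    chars: list of (text, x, width) tuples (hypothetical layout).
    A new word starts when the horizontal gap to the previous
    character exceeds gap_factor * current character width.
    """
    words = []
    current = ""
    prev_end = None
    for text, x, width in sorted(chars, key=lambda c: c[1]):
        if prev_end is not None and x - prev_end > gap_factor * width:
            words.append(current)
            current = ""
        current += text
        prev_end = x + width
    if current:
        words.append(current)
    return words


# "WORD" is four separate characters; the gap before "I" implies a space.
chars = [("W", 0, 10), ("O", 10, 10), ("R", 20, 10), ("D", 30, 10),
         ("I", 50, 4), ("S", 54, 8)]
words = group_into_words(chars)
```

A real document would also need baseline grouping, font-size changes and kerning to be handled; this only shows why the missing space characters are recoverable at all.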

What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these

PMR is collaborating with the European Bioinformatics
Institute to liberate all metabolic information from journals

Multisegment diagram
Whitespace “corridors”
Superpixel
Bounding box
Semantic labels

Chemistry in Patents
Obfuscation?

Chemical Computer Vision
Raw Mobile photo; problems:
Shadows, contrast, noise, skew, clipping

BoofCV Operations
Low Level Image Processing
• Blur: different operations for smoothing/blurring images.
• Derivatives: shows the first and second order image derivatives.
• Contour: detects the contour/edges of objects inside an image.
• Denoising: ways to remove noise from images, e.g. wavelet and blur filters.
• Interpolation: shows different interpolation algorithms scaling up an image.
• Binary Operations: different basic binary image operations.
Remove lens distortion
Lines
Orientation
Shape Fitting
Superpixels
boofcv.org: Open Source Java library

Thresholding (Binarization)
https://en.wikipedia.org/wiki/Otsu's_method
https://en.wikipedia.org/wiki/Thresholding_%28image_processing%29
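As a rough illustration of the Otsu method linked on this slide, here is a pure-Python sketch that picks the grey level maximising between-class variance and then binarizes. The function names are invented for this example, and treating dark pixels as foreground (1) is an assumption suited to dark ink on a light page; the slides do not say how the ContentMine tools do this.

```python
def otsu_threshold(pixels, levels=256):
    """Return grey level t that maximises between-class variance.

    pixels: flat list of integer grey values in range(levels).
    Class 0 is values <= t, class 1 is values > t.
    """
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of grey values in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Proportional to between-class variance.
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(image, t):
    """Dark pixels (<= t) become 1 (ink), light pixels become 0."""
    return [[1 if v <= t else 0 for v in row] for row in image]


image = [[10, 12, 200],
         [11, 210, 205]]
t = otsu_threshold([v for row in image for v in row])
binary = binarize(image, t)
```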

Binarization (pixels = 0,1)
Irregular edges

Antialiased Original
Binarization

Colours – antialiasing and
posterisation

Posterisation
Extracted since unique posterized colour
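One way to picture extraction by posterized colour: reduce every channel to a few levels so that antialiased shades collapse onto one colour, then pull out the pixels of that colour as a binary mask. This is a sketch under assumptions (RGB tuples, four levels per channel, hypothetical function names), not the method used in the slides' software.

```python
def posterize(image, levels=4):
    """Quantise each RGB channel to `levels` evenly spaced values."""
    step = 256 // levels

    def quantise(c):
        # Map to the centre of the bucket the channel value falls in.
        return min(c // step, levels - 1) * step + step // 2

    return [[tuple(quantise(c) for c in px) for px in row] for row in image]


def extract_colour(poster, colour):
    """Binary mask of the pixels that have exactly this posterized colour."""
    return [[1 if px == colour else 0 for px in row] for row in poster]


# Two slightly different reds (antialiasing) collapse to one colour.
image = [[(250, 5, 5), (240, 20, 10), (255, 255, 255)],
         [(0, 0, 250), (255, 250, 250), (10, 10, 240)]]
poster = posterize(image)
red_mask = extract_colour(poster, (224, 32, 32))
```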

Erosion and Dilation
• https://en.wikipedia.org/wiki/Mathematical_morphology
Erosion, Dilation, Opening, Closing
Dilation followed by erosion (closing) can remove small breaks, etc.
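The four operations can be sketched for a binary image with a 3x3 square structuring element. The element shape and the choice to treat pixels outside the image as background (0) are assumptions for this example; libraries such as BoofCV offer their own variants.

```python
def _pixel(img, y, x):
    """Pixel value, with everything outside the image treated as 0."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return 0


def _morph(img, rule):
    """Apply `rule` (all or any) over each pixel's 3x3 neighbourhood."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            window = [_pixel(img, y + dy, x + dx)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row.append(1 if rule(window) else 0)
        out.append(row)
    return out


def erode(img):
    return _morph(img, all)   # keep a pixel only if the whole window is set


def dilate(img):
    return _morph(img, any)   # set a pixel if anything in the window is set


def opening(img):
    return dilate(erode(img))  # removes small specks


def closing(img):
    return erode(dilate(img))  # bridges small breaks


# A horizontal line with a one-pixel break at column 4.
line = [[0] * 9,
        [0] * 9,
        [0, 1, 1, 1, 0, 1, 1, 1, 0],
        [0] * 9,
        [0] * 9]
closed = closing(line)
```

Here closing repairs the break, while opening would erase the one-pixel-thick line altogether, which is why the order of the two operations matters.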

Zhang-Suen Thinning
http://homepages.inf.ed.ac.uk/rbf/HIPR2/thin.htm
http://rosettacode.org/wiki/Zhang-Suen_thinning_algorithm
Algorithm
Assume black pixels are one and white pixels zero, and that the input image is a rectangular N by M array of ones and zeroes.
The algorithm operates on all black pixels P1 that can have eight neighbours. The neighbours are, in order, arranged as:
P9 P2 P3
P8 P1 P4
P7 P6 P5
Obviously the boundary pixels of the image cannot have the full eight neighbours.
Define A(P1) = the number of transitions from white to black (0 -> 1) in the sequence P2,P3,P4,P5,P6,P7,P8,P9,P2. (Note the extra P2 at the end: it is circular.)
Define B(P1) = the number of black pixel neighbours of P1 ( = sum(P2 .. P9) ).
Step 1 All pixels are tested and pixels satisfying all the following conditions (simultaneously) are just noted at this stage.
(0) The pixel is black and has eight neighbours
(1) 2 <= B(P1) <= 6
(2) A(P1) = 1
(3) At least one of P2 and P4 and P6 is white
(4) At least one of P4 and P6 and P8 is white
After iterating over the image and collecting all the pixels satisfying all step 1 conditions, all these condition-satisfying pixels are set to white.
Step 2 All pixels are again tested and pixels satisfying all the following conditions are just noted at this stage.
(0) The pixel is black and has eight neighbours
(1) 2 <= B(P1) <= 6
(2) A(P1) = 1
(3) At least one of P2 and P4 and P8 is white
(4) At least one of P2 and P6 and P8 is white
After iterating over the image and collecting all the pixels satisfying all step 2 conditions, all these condition-satisfying pixels are again set to white.
Iteration If any pixels were set in this round of either step 1 or step 2 then all steps are repeated until no image pixels are so changed.
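The two-step procedure can be sketched in pure Python as follows. It follows the Rosetta Code description linked on this slide, including the two directional conditions in each step (P2/P4/P6 and P4/P6/P8 in step 1; P2/P4/P8 and P2/P6/P8 in step 2). The function name and list-of-lists image layout are choices made for this example only.

```python
def zhang_suen(image):
    """Thin a binary image (1 = black, 0 = white) to 1-pixel lines."""
    img = [row[:] for row in image]          # work on a copy
    h, w = len(img), len(img[0])

    def neighbours(y, x):
        # P2..P9, clockwise starting from the pixel directly above.
        return [img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
                img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1]]

    def transitions(n):
        # A(P1): 0 -> 1 transitions in the circular sequence P2..P9,P2.
        seq = n + n[:1]
        return sum(1 for a, b in zip(seq, seq[1:]) if a == 0 and b == 1)

    changed = True
    while changed:
        changed = False
        for step in (1, 2):
            marked = []
            # Only pixels with all eight neighbours are considered.
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    if img[y][x] != 1:
                        continue
                    n = neighbours(y, x)
                    p2, p3, p4, p5, p6, p7, p8, p9 = n
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 1:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        marked.append((y, x))
            # Pixels are only noted above; whiten them all together here.
            for y, x in marked:
                img[y][x] = 0
            changed = changed or bool(marked)
    return img


# A 3-pixel-thick bar inside a white border.
thick = [[0] * 8,
         [0, 1, 1, 1, 1, 1, 1, 0],
         [0, 1, 1, 1, 1, 1, 1, 0],
         [0, 1, 1, 1, 1, 1, 1, 0],
         [0] * 8]
thin = zhang_suen(thick)
```

Note that the result is a shortened centre line: the algorithm erodes line ends as well as thickness, which matters when the skeleton is later measured.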

Thinning: thick lines to 1-pixel width

Vectorization of line segments
Nodes
Non-node
Segmentation of one edge into 4 lines
Douglas-Peucker segmentation algorithm
Fully thinned binary image
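A minimal sketch of Douglas-Peucker segmentation applied to a chain of pixel coordinates from a thinned edge. The tolerance `eps` and the `(x, y)` tuple layout are assumptions for illustration; the slides do not give the parameters used.

```python
import math


def _distance_to_line(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def douglas_peucker(points, eps):
    """Simplify a polyline, keeping points farther than eps from the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord first -> last.
    index, dmax = 0, -1.0
    for i in range(1, len(points) - 1):
        d = _distance_to_line(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= eps:
        return [points[0], points[-1]]
    # Split at the farthest point and simplify both halves.
    left = douglas_peucker(points[:index + 1], eps)
    right = douglas_peucker(points[index:], eps)
    return left[:-1] + right


# An L-shaped pixel chain becomes two straight segments.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
segments = douglas_peucker(chain, eps=0.5)
```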

Chemical Optical Character Recognition
A small alphabet, clean typefaces and clear boundaries make
this relatively tractable. Problem characters are “I”, “O”, etc.

Bitmap Image and Tesseract OCR
[Figure: plot as read by OCR. Y axis “Ln Bacterial load per fly” with ticks 11.5, 11.0, 10.5, 10.0, 9.5, 9.0, 6.5, 6.0; x axis “Days post-infection” with ticks 0 1 2 3 4 5]

http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014

Ross Mounce (Bath), Panton Fellow
• Sharing research data:
http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:
Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)

Note jaggy and broken pixels
NEW Bacteria must have a phylogenetic tree
[Figure labels: Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate]
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick

IJSEM phylotrees
• International Journal of Systematic and Evolutionary Microbiology
• All new microorganisms are expected to be published there
• Consistent (though primitive) approach to trees

OCR (Tesseract)
Norma (imageanalysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
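To show why Newick output counts as computable, here is a small recursive-descent parser sketch. It handles only the subset seen on this slide (nested parentheses, names, and `:length` values) and is not the parser used by Norma. The example string is the innermost clade of the tree above, since the full string contains an elided part.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    Quoted labels and comments are not supported in this sketch.
    """
    text = "".join(text.split())   # drop whitespace and line breaks
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaf_names(node):
    """Names of all leaves below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]


tree = parse_newick(
    "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
    "Synergistes_jonesii:301);"
)
names = leaf_names(tree)
```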

Automatic Open Notebook of computations
Everything is posted to Github before being analyzed

Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens
• [Identifier in Wikidata]
• Missing = not found with Wikidata API
20 commonest organisms (in > 30 papers) in trees from IJSEM*
Half do not appear to be in Wikidata
Can the Wikipedia Scientists comment?
*Int. J. Syst. Evol. Microbiol.

Display your own tree
• Cut and paste…
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182)
,((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218
,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n18
7),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n8
8,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198))
)),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n2
31,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,
((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,
n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n1
58,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163
,n227)),((n53,n131),n159)))))));
• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or
• http://www.trex.uqam.ca/index.php?action=newick&project=trex

Supertree for 924 species
Tree

Supertree created from 4300 papers

To be extracted:
* Symbol(x,y)
* Error bar (y+,y-)
* Line
Y axis
• Extent

Typical PDF with vectors - hyperlink

We can’t turn a hamburger into a cow
But we can now turn PDFs into Science
Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE

UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!!
2000+ points

Dumb PDF
CSV
Semantic Spectrum
2nd Derivative
Smoothing
Gaussian Filter
Automatic extraction
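Once a spectrum has been extracted as numbers, smoothing and differentiation become routine. The sketch below shows a Gaussian filter and a finite-difference second derivative on a list of y values. Sampling on an even grid, the `sigma` value and the edge handling (clamping) are assumptions for this example, not details taken from the slides.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian weights out to about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(y, sigma=1.0):
    """Gaussian-smooth a list of values; edges repeat the end values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(y)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = min(max(i + j - radius, 0), n - 1)   # clamp at the edges
            acc += w * y[k]
        out.append(acc)
    return out


def second_derivative(y):
    """Central second difference; assumes len(y) >= 2, end points set to 0."""
    inner = [y[i - 1] - 2 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]
    return [0.0] + inner + [0.0]


# A single peak: the second derivative is most negative at its centre.
spectrum = [0, 0, 1, 4, 9, 4, 1, 0, 0]
d2 = second_derivative(spectrum)
peak_index = d2.index(min(d2))
```

In practice the raw data would usually be smoothed before differentiating, because the second difference amplifies noise.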

C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark

After AMI2 processing …
… AMI2 has detected a square

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-frame animation
showing how the 12 reactions
lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)

Precision + Recall for ImageAnalysis?
• Chemical Patents (obfuscation) ca 25% PR
• Binomial names from text > 99% PR
• Binomial from images (lookup) 95%+
• Trees from images (pred.)
• Molecules: image ca 90% SVG >
• Analysis massively hampered by Copyright
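For reference, precision and recall for an extraction run can be computed as below. This is a generic sketch: the slides do not say how the figures above were measured, and the species sets are an invented toy example using names that appear elsewhere in this deck.

```python
def precision_recall(extracted, reference):
    """Precision and recall of an extracted set against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical run: one name misread by OCR, one name missed entirely.
reference = {"Bacillus subtilis", "Escherichia coli",
             "Formosa algae", "Pseudomonas aeruginosa"}
extracted = {"Bacillus subtilis", "Escherichia coli", "Formosa alga"}
precision, recall = precision_recall(extracted, reference)
```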

Software Availability and collaboration
• All software OSI-compliant (non-GPL): Apache2, MIT, BSD
• http://bitbucket.org/wwmm (euclid, Jumbo6, svg, pdf2svg)
• http://bitbucket.org/petermr (svgbuilder, xhtml2stm, imageanalysis, diagramanalyzer)
• http://bitbucket.org/AndyHowlett/ami2-poc
• http://github.com/petermr/ami-plugin
• http://github.com/ContentMine
• http://boofcv.org
• collaboration with PDFBox, TabulaPDF, JailbreakingThePDF
• Extracted data CC0

Questions and comments
Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation
PM-R has offered to mentor an MSc project this summer
for anyone interested.
contentmine.org
