Saturn Cloud
Accelerating NLP with Dask on Saturn Cloud
Elsevier Labs Online Lecture
November 2020
1
Hi!
Speakers
Aaron Richter
Senior Data Scientist @ Saturn Cloud
aaron@saturncloud.io
@rikturr
Sujit Pal
Technology Research Director @ Elsevier
sujit.pal@elsevier.com
@palsujit
2
Check out the blog post!
🔗 How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud
3
Saturn Cloud
Data science with Python
4
Data science with Python
5
Saturn Cloud
What is Dask?
6
Dask
● Parallel computing for Python people
● Anaconda, ~2015
● Built in Python; Python API
● Mature, scientific computing communities
● Low-level task library
● High-level libraries for DataFrames, arrays, ML
● Integrates with PyData ecosystem
● Runs on laptop, scales to clusters
https://dask.org/
7
Dask
What does it do?
https://docs.dask.org/en/latest/user-interfaces.html
● Parallel machine learning (scikit-learn)
● Parallel dataframes (pandas)
● Parallel arrays (numpy)
● Parallel anything else
8
What does it do?
Arrays and Dataframes
https://docs.dask.org/en/latest/array.html
https://docs.dask.org/en/latest/dataframe.html
9
What does it do?
Anything else!
https://docs.dask.org/en/latest/delayed.html
10
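For the "anything else" case, a small sketch using dask.delayed. The functions inc and add are illustrative names, not taken from the slides.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


# Nothing runs yet: these calls only record tasks in a graph.
total = add(inc(1), inc(2))

# inc(1) and inc(2) do not depend on each other, so the scheduler may run them in parallel.
result = total.compute()
```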
What does it do?
Anything else!
https://dask.org/
11
Getting up to Speed with Dask
https://youtu.be/S_ncqocDcBA
12
Spark vs. Dask
Spark
● Written in Scala with Python API
● All-in-one tool
○ Requires re-write to migrate from PyData code
● Programming model not suited for complex operations (multi-dim arrays, machine learning)
Dask
● 100% Python
● Built to extend and interact with PyData ecosystem
● High-level interfaces for DataFrames, (multi-dim) Arrays, and ML
● Native integration with RAPIDS for GPU acceleration
14
How can I run Dask clusters?
● Manual setup
● SSH
● HPC: MPI, SLURM, SGE, TORQUE, LSF, DRMAA, PBS
● Kubernetes (Docker, Helm)
● Hadoop/YARN
● Cloud provider: AWS or Azure
🔗 https://docs.dask.org/en/latest/setup.html
15
Saturn Cloud
16
● Fast setup
● Enterprise secure
● Pythonic parallelism
● Rapidly scale PyData
● Multi-GPU computing
● The future of HPC
● Workflow orchestration
● Flow insight and mgmt
Bringing together the fastest hardware + OSS
Saturn Cloud
● Saturn manages all infrastructure
● Hosted: Run within our cloud
● Enterprise: Run within your AWS account
Saturn Cloud to the rescue!
Taking the DevOps out of Data Science
18
● Images
● Jupyter server
● Dask Cluster
● Deployments
Saturn Cloud
Core features
19
20
Saturn Cloud
Extracting Entities from
CORD-19
21
Genesis
25
● Based on one of the many COVID-19 initiatives (COVID-KG).
● Original intent: extract entities from the CORD-19 dataset for relationship mining.
● CORD-19 dataset open-sourced by AllenAI.
● SciSpaCy provided language models trained on biomedical text, and...
● Pre-trained Named Entity Recognition (and linking) models.
● Dask-based distributed computing platform.
● Opportunity to evaluate.
Goals
● Create standoff entity annotations for CORD-19.
● Automated entity recognition using pre-trained SciSpaCy models, where each
model recognizes a different subset of entity classes, e.g. DNA, Gene,
Protein, Chemical, Organism, Disease, etc.
● Output is structured as Parquet files, consumable via Dask or Spark.
● Share output dataset with community.
26
CORD-19 Dataset
● Started mid-March 2020 with ~40k articles, released weekly.
● By Sept/Oct 2020, ~200k articles, released daily and growing every day.
● Each release contains:
○ Metadata file (CSV)
○ Set of articles (JSON)
27
SciSpaCy NER(L) models
● Medium English LM for sentence
splitting.
● 4 NER models
● 5 NERL models using LM’s
candidate entity generator and
trained entity linking models.
28
Full Pipeline
● Read metadata.csv
● Parse each JSON file into paragraphs.
● Split paragraphs into sentences.
● Extract entities from sentences using a NER(L) model.
29
Files to Paragraphs
● Pipeline is embarrassingly parallel.
● Parsing files into paragraphs has no cross-file dependencies (i.e., perfectly parallel).
30
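A rough sketch of this step with dask.bag. It assumes a simplified article layout with a paper_id and a body_text list of objects carrying a text field; the function name, output field names, and the two tiny stand-in files are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import dask.bag as db


def parse_paragraphs(path):
    """Return one record per paragraph found in a single article file."""
    doc = json.loads(Path(path).read_text())
    return [
        {"paper_id": doc["paper_id"], "para_id": i, "text": para["text"]}
        for i, para in enumerate(doc.get("body_text", []))
    ]


# Two tiny stand-in articles so the sketch runs without the real dataset.
tmp_dir = Path(tempfile.mkdtemp())
samples = {"p1": ["First para.", "Second para."], "p2": ["Only para."]}
for pid, texts in samples.items():
    body = [{"text": t} for t in texts]
    (tmp_dir / f"{pid}.json").write_text(json.dumps({"paper_id": pid, "body_text": body}))

paths = sorted(str(p) for p in tmp_dir.glob("*.json"))

# Each file is parsed independently, so the work splits cleanly across partitions.
paragraphs = (
    db.from_sequence(paths, npartitions=2)
    .map(parse_paragraphs)
    .flatten()  # list of paragraphs per file -> one flat stream of paragraphs
    .compute(scheduler="threads")
)
```

Because no task needs another task's output, adding workers should scale this step close to linearly, up to I/O limits.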
Paragraphs to Sentences
● Splitting paragraphs into sentences needs a sentence splitter model, assigned per partition.
● Load only the pipeline components you need, using the disable argument.
31
Sentences to Entities (NER)
● Extracting entities from sentences needs a NER model, assigned lazily to the worker per partition.
● Use nlp.pipe and batching to exploit multithreading.
32
Sentences to Entities (NERL)
● Extracting and linking entities from sentences needs the Language Model, Entity Linker, etc.
● Assign eagerly per worker after cluster creation.
33
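A sketch of eager per-worker assignment using Client.run from dask.distributed, which passes the worker object to functions that take an argument named dask_worker. The load_linker function, the linker attribute, and the concept ID are hypothetical placeholders, and the in-process client is only there to make the sketch self-contained.

```python
from dask.distributed import Client, get_worker


def load_linker():
    """Hypothetical stand-in for loading the language model plus entity linker."""
    return {"ace2": "CONCEPT-1"}


def setup_worker(dask_worker):
    # Client.run fills in the worker object because the argument is named dask_worker.
    dask_worker.linker = load_linker()
    return True


def link(mention):
    linker = get_worker().linker  # already loaded; no per-task load cost
    return linker.get(mention.lower())


# In-process cluster for illustration; a real run would connect to an existing cluster.
with Client(processes=False, n_workers=1, threads_per_worker=2, dashboard_address=None) as client:
    setup_status = client.run(setup_worker)  # eager: runs on every worker right away
    linked = client.gather(client.map(link, ["ACE2", "unknown"]))
```

Eager loading suits heavy resources such as a linker's knowledge base, since every worker will need them anyway. One caveat: workers that join after the client.run call would not have the attribute.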
Incremental Pipeline
● Extracted Entities + new metadata
and JSON files.
● Compute diffs (additions +
deletions)
● Parse added articles to paragraphs,
paragraphs to sentences, and
sentences to entities.
● Remove paragraphs, sentences,
and entities for deleted articles.
● Merge diff and original.
34
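The diff-and-merge steps above can be sketched in pandas as follows. It assumes articles are keyed by a cord_uid column. The data, and the new_entities frame standing in for the pipeline's output on added articles, are made up.

```python
import pandas as pd

old_meta = pd.DataFrame({"cord_uid": ["a1", "b2", "c3"]})
new_meta = pd.DataFrame({"cord_uid": ["b2", "c3", "d4"]})

# An outer merge with indicator=True labels each key as left_only, right_only, or both.
diff = old_meta.merge(new_meta, on="cord_uid", how="outer", indicator=True)
added = diff.loc[diff["_merge"] == "right_only", "cord_uid"].tolist()
deleted = diff.loc[diff["_merge"] == "left_only", "cord_uid"].tolist()

# Drop rows for deleted articles, then append rows produced for the added ones.
old_entities = pd.DataFrame({"cord_uid": ["a1", "b2", "c3"], "entity": ["x", "y", "z"]})
new_entities = pd.DataFrame({"cord_uid": ["d4"], "entity": ["w"]})  # stand-in pipeline output
merged = pd.concat(
    [old_entities[~old_entities["cord_uid"].isin(deleted)], new_entities],
    ignore_index=True,
)
```

Only the added articles go through the expensive NER(L) steps, which is the point of the incremental run.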
Output formats
1 paragraph dataframe (3.4M paragraphs), 1 sentence dataframe (17.1M sentences),
and 9 entity dataframes (total 805.4M entities).
35
Parquet Dask / Spark interop
● Paragraph, sentence, and entity outputs are in Parquet format.
● Things to keep in mind for Spark interoperability when writing from Dask:
○ Column data types must be declared explicitly on the Dask end.
○ Column names should be specified when saving (“hidden” columns visible in Spark).
○ Explicit re-partitioning may be necessary when saving on Dask.
36
Deliverables
● Code
○ Set of Jupyter notebooks deployed on
Saturn Cloud -- sujitpal/saturn-scispacy
● Data
○ Dataframes in Parquet format (approx. 70 GB: 35 GB for Sep 2020, 35 GB for Oct 2020).
○ Publicly available on s3://els-saturn-scispacy/cord19-scispacy-entities (requester pays).
○ Citable as a Mendeley Dataset.
37
Utility
● Generate micro datasets for specific tasks.
● Examples:
○ Human Phenotypes Annotations from HPO co-occurring in same sentence with Disease
Annotations from UMLS or BC5CDR.
○ Gene annotations co-occurring in same sentence with Cancer annotations (both from
BioNLP).
○ Curative Effects of Hydroxychloroquine on COVID-19.
○ Potential of BCG (tuberculosis) vaccine as protection against COVID-19.
● Annotations can be features for Topic Modeling or Categorization.
● Others...
38
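The co-occurrence examples above amount to a join on the sentence key. A small pandas sketch, with hypothetical column names and made-up annotations:

```python
import pandas as pd

# Entity annotations from two different models, keyed by sentence.
phenotypes = pd.DataFrame({"sent_id": [1, 2, 3], "phenotype": ["fever", "cough", "fatigue"]})
diseases = pd.DataFrame({"sent_id": [1, 3, 4], "disease": ["COVID-19", "influenza", "SARS"]})

# An inner join on the sentence key keeps only sentences annotated by both models.
pairs = phenotypes.merge(diseases, on="sent_id", how="inner")

# Count how often each (phenotype, disease) pair co-occurs.
pair_counts = pairs.groupby(["phenotype", "disease"]).size().reset_index(name="n")
```

The same merge works on Dask DataFrames when the entity tables are too large for memory.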
Questions?
39



Editor's Notes

  • #6: Q’s to ask: what other packages do you frequently use? Things that you always import with each notebook/script
  • #8: Goal: flexible parallel computing for the Python ecosystem. Compatible with lots of packages. Generally, if you do something in Python, you can probably make it faster pretty easily with Dask (minimal re-write). Click link, scroll to “powered by Dask”.
  • #13: Ok so that was Dask, we got another tool to talk about
  • #14: Took over the big data world post-Hadoop. Still carries Hadoop residue. There is a machine learning library, but it is pretty slow (and why learn a new library when scikit-learn has more algorithms and you already know it?). I worked with Spark for years, but always dropped back to PyData after I was done cleaning data. Also not great for monitoring/debugging.
  • #16: If you already have a cluster with one of these management tools, you’re good to go. Cloud provider - AWS can’t get GPUs
  • #19: Contrary to popular belief, this is not Zach Galifianakis (it’s Robert Redford). Show Saturn UI.