Accelerating NLP with Dask and Saturn Cloud

Saturn Cloud
Accelerating NLP with Dask on Saturn Cloud
Elsevier Labs Online Lecture
November 2020

Hi!
Speakers
Aaron Richter
Senior Data Scientist @ Saturn Cloud
aaron@saturncloud.io
@rikturr
Sujit Pal
Technology Research Director @ Elsevier
sujit.pal@elsevier.com
@palsujit

Check out the blog post!
🔗 How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud

Saturn Cloud
Data science with Python

Dask
● Parallel computing for Python people
● Anaconda, ~2015
● Built in Python; Python API
● Mature, scientific computing communities
● Low-level task library
● High-level libraries for DataFrames, arrays, ML
● Integrates with PyData ecosystem
● Runs on laptop, scales to clusters
https://dask.org/

Dask
What does it do?
https://docs.dask.org/en/latest/user-interfaces.html
● Parallel machine learning (scikit-learn)
● Parallel dataframes (pandas)
● Parallel arrays (numpy)
● Parallel anything else

What does it do?
Arrays and Dataframes
https://docs.dask.org/en/latest/array.html
https://docs.dask.org/en/latest/dataframe.html

What does it do?
Anything else!
https://docs.dask.org/en/latest/delayed.html
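A small sketch of what "anything else" can look like with `dask.delayed`, which wraps ordinary Python functions into lazy tasks. The functions `inc` and `add` are toy placeholders, not code from the talk.

```python
from dask import delayed

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Each call returns a Delayed object instead of running immediately.
a = delayed(inc)(1)
b = delayed(inc)(2)      # independent of `a`, so the two can run in parallel
c = delayed(add)(a, b)   # depends on both `a` and `b`

result = c.compute()     # executes the three-task graph
```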

What does it do?
Anything else!
https://dask.org/

Getting up to Speed with Dask
https://youtu.be/S_ncqocDcBA

Spark vs. Dask
Spark
● Written in Scala with Python API
● All-in-one tool
○ Requires re-write to migrate from PyData code
● Programming model not suited for complex operations (multi-dim arrays, machine learning)
Dask
● 100% Python
● Built to extend and interact with PyData ecosystem
● High-level interfaces for DataFrames, (multi-dim) Arrays, and ML
● Native integration with RAPIDS for GPU-acceleration

How can I run Dask clusters?
● Manual setup
● SSH
● HPC: MPI, SLURM, SGE, TORQUE, LSF, DRMAA, PBS
● Kubernetes (Docker, Helm)
● Hadoop/YARN
● Cloud provider: AWS or Azure
🔗 https://docs.dask.org/en/latest/setup.html

Saturn Cloud
Bringing together the fastest hardware + OSS
● Fast setup
● Enterprise secure
● Pythonic parallelism
● Rapidly scale PyData
● Multi-GPU computing
● The future of HPC
● Workflow orchestration
● Flow insight and management

Saturn Cloud to the rescue!
Taking the DevOps out of Data Science
● Saturn manages all infrastructure
● Hosted: Run within our cloud
● Enterprise: Run within your AWS account

Saturn Cloud
Core features
● Images
● Jupyter server
● Dask Cluster
● Deployments

Saturn Cloud
Extracting Entities from CORD-19

Genesis
● Based on one of the many COVID-19 initiatives (COVID-KG).
● Original intent: extract entities from CORD-19 dataset for relationship mining.
● CORD-19 dataset open sourced by AllenAI.
● SciSpaCy provided language models trained on biomedical text, and...
● Pre-trained Named Entity Recognition (and linking) models.
● Dask-based distributed computing platform.
● Opportunity to evaluate.

Goals
● Create standoff entity annotations for CORD-19.
● Automated entity recognition using pre-trained SciSpaCy models, where each model recognizes a different subset of entity classes, e.g. DNA, Gene, Protein, Chemical, Organism, Disease, etc.
● Output is structured as Parquet files, consumable via Dask or Spark.
● Share output dataset with community.

CORD-19 Dataset
● Started mid-March 2020 with ~40k articles, released weekly.
● By Sept/Oct 2020, ~200k articles, released daily and growing every day.
● Each release contains:
○ Metadata file (CSV)
○ Set of articles (JSON)

SciSpaCy NER(L) models
● Medium English LM for sentence splitting.
● 4 NER models.
● 5 NERL models using the LM’s candidate entity generator and trained entity linking models.

Full Pipeline
● Read metadata.csv
● Parse each JSON file into paragraphs.
● Split paragraphs into sentences.
● Extract entities from sentences using a NER(L) model.

Files to Paragraphs
● Pipeline is embarrassingly parallel.
● Parsing files into paragraphs has no dependencies (i.e., it is perfectly parallel).
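A sketch of the per-file parse step, assuming the `paper_id`, `abstract`, and `body_text` fields of the public CORD-19 JSON layout. The output field names (`part`, `pseq`, `ptext`) and the sample article are made up for illustration and are not the talk's schema.

```python
import json

def parse_paragraphs(doc):
    """Turn one parsed CORD-19 JSON article into a list of paragraph records."""
    paper_id = doc.get("paper_id")
    records = []
    seq = 0  # running paragraph number within the article
    # Assumed layout: "abstract" and "body_text" are lists of {"text": ...} blocks.
    for part in ("abstract", "body_text"):
        for block in doc.get(part, []):
            text = block.get("text", "").strip()
            if text:  # skip empty or whitespace-only blocks
                records.append(
                    {"paper_id": paper_id, "part": part, "pseq": seq, "ptext": text}
                )
                seq += 1
    return records

# Made-up sample article, standing in for one JSON file on disk.
sample = json.loads("""
{"paper_id": "abc123",
 "abstract": [{"text": "Background text."}],
 "body_text": [{"text": "First paragraph.", "section": "Introduction"},
               {"text": "  ", "section": "Introduction"},
               {"text": "Second paragraph.", "section": "Methods"}]}
""")
paragraphs = parse_paragraphs(sample)
```

Because each file is parsed on its own, the function can be mapped over all files with no coordination, for example with `dask.bag.from_sequence(paths).map(...)` followed by `.flatten()`, where the mapped function (hypothetical here) opens a path and calls `parse_paragraphs`.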

Paragraphs to Sentences
● Splitting paragraphs into sentences needs a sentence splitter model, assigned per partition.
● Load only the pipeline components that you need, using the disable argument.

Sentences to Entities (NER)
● Extracting entities from sentences needs a NER model, assigned lazily to the worker, per partition.
● Use nlp.pipe and batching to exploit multithreading.
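A sketch of lazy, cached model loading combined with batched processing. `ToyNER` is a toy stand-in, not a SciSpaCy model: it only imitates the shape of spaCy's `nlp.pipe(texts, batch_size=...)`, which streams one result per input text in order. The lexicon, model name, and tuple layout are assumptions for illustration.

```python
from functools import lru_cache

class ToyNER:
    """Toy stand-in for a spaCy pipeline; only .pipe() is imitated."""

    LEXICON = {"covid-19": "DISEASE", "remdesivir": "CHEMICAL"}

    def pipe(self, texts, batch_size=1000):
        # The stub ignores batch_size; a real pipeline uses it to process
        # texts in groups rather than one call per text.
        for text in texts:
            tokens = text.replace(".", " ").split()
            yield [
                (tok, self.LEXICON[tok.lower()])
                for tok in tokens
                if tok.lower() in self.LEXICON
            ]

@lru_cache(maxsize=None)
def get_model(name):
    """Load a model on first request in this process, then reuse it."""
    # Assumption: the real pipeline would return spacy.load(name) here.
    return ToyNER()

def sentences_to_entities(sentences, model_name="toy-ner", batch_size=2):
    nlp = get_model(model_name)  # lazy: loaded only when first needed
    out = []
    for sid, ents in enumerate(nlp.pipe(sentences, batch_size=batch_size)):
        out.extend((sid, tok, label) for tok, label in ents)
    return out

entities = sentences_to_entities(
    ["Remdesivir was tested.", "No entity here.", "COVID-19 spread."]
)
```

Called from a per-partition function, `get_model` pays the load cost once per worker process rather than once per partition.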

Sentences to Entities (NERL)
● Extracting and linking entities from sentences needs a Language Model, Entity Linker, etc.
● Assign these eagerly, per worker, after cluster creation.
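One possible way to do eager per-worker setup is `Client.run`, which executes a function once on every worker currently in the cluster. The sketch below uses an in-process local cluster and a toy lookup table in place of the real language model and entity linker; the attribute name `linker` and the concept ID are placeholders, not the talk's code.

```python
from dask.distributed import Client, get_worker

def load_linker(dask_worker):
    """Runs once on each worker and attaches a (stand-in) linker to it."""
    # Assumption: the real version would build the SciSpaCy pipeline with its
    # entity linker here. The table and ID below are placeholders.
    dask_worker.linker = {"covid-19": "ID-001"}
    return True

def link(mention):
    """Task body: reuse the linker already loaded on this worker."""
    linker = get_worker().linker
    return linker.get(mention.lower())

# Small in-process cluster for illustration; a real run would connect to an
# existing cluster instead.
client = Client(
    processes=False, n_workers=1, threads_per_worker=1, dashboard_address=None
)
loaded = client.run(load_linker)                  # {worker_address: True}
concept = client.submit(link, "COVID-19").result()
client.close()
```

`Client.run` only reaches workers that exist at call time, so workers added later would need the same setup, for example through a worker plugin.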

Incremental Pipeline
● Start from the extracted entities plus the new metadata and JSON files.
● Compute diffs (additions + deletions).
● Parse added articles to paragraphs, paragraphs to sentences, and sentences to entities.
● Remove paragraphs, sentences, and entities for deleted articles.
● Merge diff and original.
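The diff-and-merge steps above can be sketched with plain sets, assuming articles are keyed by the `cord_uid` column of the metadata file. The rows and entity values are made up, and re-running the pipeline on added articles is replaced by a placeholder.

```python
def compute_diff(old_ids, new_ids):
    """Return (added, deleted) article IDs between two releases."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return new_ids - old_ids, old_ids - new_ids

def merge_release(old_rows, added_rows, deleted_ids):
    """Drop rows of deleted articles, then append rows for added articles."""
    kept = [row for row in old_rows if row["cord_uid"] not in deleted_ids]
    return kept + added_rows

# Toy data: entity rows from the previous release.
old_rows = [
    {"cord_uid": "a", "entity": "E1"},
    {"cord_uid": "b", "entity": "E2"},
]
added, deleted = compute_diff(["a", "b"], ["b", "c"])

# Placeholder for running files -> paragraphs -> sentences -> entities
# on the added articles only.
added_rows = [{"cord_uid": uid, "entity": "E-new"} for uid in sorted(added)]

merged = merge_release(old_rows, added_rows, deleted)
```

Only the added articles go through the expensive NER(L) steps; everything else is carried over unchanged.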

Output formats
1 paragraph dataframe (3.4M paragraphs), 1 sentence dataframe (17.1M sentences), and 9 entity dataframes (total 805.4M entities).

Parquet Dask / Spark interop
● Paragraph, sentence, and entity outputs are in Parquet format.
● Things to keep in mind for Spark interoperability when writing from Dask.
○ Column data types must be declared explicitly on the Dask end.
○ Column names should be specified when saving (“hidden” columns visible in Spark).
○ Explicit re-partitioning may be necessary when saving on Dask.

Deliverables
● Code
○ Set of Jupyter notebooks deployed on Saturn Cloud -- sujitpal/saturn-scispacy
● Data
○ Dataframes in Parquet format (approx. 70 GB: 35 GB for Sep 2020, 35 GB for Oct 2020).
○ Publicly available on s3://els-saturn-scispacy/cord19-scispacy-entities (requester pays).
○ Citable as a Mendeley Dataset.

Utility
● Generate micro datasets for specific tasks.
● Examples:
○ Human Phenotype annotations from HPO co-occurring in the same sentence with Disease annotations from UMLS or BC5CDR.
○ Gene annotations co-occurring in the same sentence with Cancer annotations (both from BioNLP).
○ Curative effects of Hydroxychloroquine on COVID-19.
○ Potential of BCG (tuberculosis) vaccine as protection against COVID-19.
● Annotations can be features for Topic Modeling or Categorization.
● Others...
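A sketch of how a co-occurrence micro dataset could be derived: join two entity dataframes on the keys that identify a sentence. The column names (`paper_id`, `sid`, `etext`) and all rows are invented for illustration; the published dataframes may use a different schema.

```python
import pandas as pd

# Toy gene annotations: one row per entity mention.
genes = pd.DataFrame(
    {
        "paper_id": ["p1", "p1", "p2"],
        "sid": [0, 1, 0],
        "etext": ["TP53", "BRCA1", "EGFR"],
    }
)

# Toy cancer annotations over the same articles.
cancers = pd.DataFrame(
    {
        "paper_id": ["p1", "p2"],
        "sid": [0, 3],
        "etext": ["carcinoma", "glioma"],
    }
)

# Inner join keeps only sentences that contain both kinds of annotation.
pairs = genes.merge(
    cancers, on=["paper_id", "sid"], suffixes=("_gene", "_cancer")
)
```

The same join works on Dask DataFrames read from the Parquet output, since `merge` follows the pandas API there.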
