Dictionary Based Annotation at scale with Spark, SolrTextTagger and OpenNLP
Sujit Pal, Elsevier Labs
Introduction
• About Me
– Work at Elsevier Labs.
– Interested in Search, NLP and Distributed Processing.
– URL: labs.elsevier.com
– Email: sujit.pal@elsevier.com
– Twitter: @palsujit
• About Elsevier
– World’s largest publisher of STM Books and Journals.
– Uses Data to inform and enable consumers of STM info.
– And like everybody else, we are hiring!
Agenda
• Overview and Background
• Features and API
• Scaling out
• Q&A
Overview/Background
Problem Definition
• What is the problem?
– Annotate millions of documents from different corpora.
• 14M docs from ScienceDirect alone.
• More from other corpora, dependency parsing, etc.
– Critical step for Machine Reading and Knowledge Graph applications.
• Why is this such a big deal?
– Takes advantage of existing linked data.
– No model training for multiple complex STM domains.
– However, it is simple only until done at scale.
Annotation Pipeline
Dictionary Based NE Annotator (SoDA)
• Part of Document Annotation Pipeline.
• Annotates text with Named Entities from external Dictionaries.
• Built with Open Source Components
– Apache Solr – Highly reliable, scalable and fault-tolerant search index.
– SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
– Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
– Apache Spark – Lightning fast, large scale data processing.
• Uses ideas from other Open Source libraries
– FuzzyWuzzy – Fuzzy String Matching like a boss.
• Contributed back to Open Source
– https://github.com/elsevierlabs-os/soda
SoDA Architecture
How does it work (exact/case matching)?
• Uses the Aho-Corasick algorithm – treats the dictionary as an FST and streams text against it, matching all patterns simultaneously. Diagram shows the FST for the vocabulary {“his”, “he”, “her”, “hers”, “she”}.
• Michael McCandless implemented FSTs in Lucene (blog post).
• David Smiley built SolrTextTagger to use Lucene FSTs.
• SoDA uses SolrTextTagger for streaming exact and case-insensitive matching.
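To make the "matches all patterns simultaneously" idea concrete, here is a minimal textbook sketch of Aho-Corasick in Python for the same five-word vocabulary. It is an illustration only: it is not the Lucene FST or SolrTextTagger code, it works on characters for brevity, and the function names build_automaton and find_all are made up for this example.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass to fill in failure links and inherited outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Stream text through the automaton once; return (begin, end, word)."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, i + 1, word))
    return matches


automaton = build_automaton(["his", "he", "her", "hers", "she"])
matches = find_all("ushers", automaton)
```

A single pass over "ushers" reports "she", "he", "her" and "hers", including the overlapping ones, which is the behaviour a dictionary tagger needs.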
How does it work (fuzzy matching)?
• Pre-normalizes each dictionary entry into various forms
– Original – “Astrocytoma, Subependymal Giant Cell”
– Lowercased – “astrocytoma, subependymal giant cell”
– Punctuation – “astrocytoma subependymal giant cell”
– Sorted – “astrocytoma cell giant subependymal”
– Stemmed – “astrocytoma cell giant subependym”
• Uses OpenNLP to parse input text into phrases, normalizes each phrase to the requested normalization level, and matches it against the corresponding field.
• Caller specifies normalization level.
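The normalization ladder above can be sketched in a few lines of Python. This is a rough approximation, not SoDA's implementation: the level names ("exact", "lower", "punct", "sort") are hypothetical labels chosen for this example, and the stemmed form is left out because it needs a real stemmer.

```python
import re


def normalize(phrase, level):
    """Return the phrase normalized to one of: exact, lower, punct, sort."""
    if level == "exact":
        return phrase
    text = phrase.lower()
    if level == "lower":
        return text
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = " ".join(re.sub(r"[^\w\s]", " ", text).split())
    if level == "punct":
        return text
    if level == "sort":
        # Sorting the words makes the match independent of word order.
        return " ".join(sorted(text.split()))
    raise ValueError("unknown level: " + level)


entry = "Astrocytoma, Subependymal Giant Cell"
forms = {level: normalize(entry, level)
         for level in ("exact", "lower", "punct", "sort")}
```

The same function would be applied to each dictionary entry at index time and to each input phrase at query time, so both sides are compared in the same form.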
Features and API
Feature Overview
• Provides JSON over HTTP interface
– Compose request as JSON document
– HTTP POST document to JSON endpoint URL (HTTP GET for URL only requests).
– Receive response as JSON document.
• Language-Agnostic and Cross-Platform.
• API can be used from standalone clients, Spark jobs and Databricks notebooks.
• Examples in Scala and Python
Services
• Status
– index.json – returns a JSON document (suitable for health check monitoring)
• Single Lexicon Services
– annot.json – annotates a block of text in streaming manner. Supports different levels of matching (strict to permissive).
– matchphrase.json – annotates short phrases. Supports same matching levels as annot.json.
• Multi-Lexicon Services
– dicts.json – lists all lexicons available.
– coverage.json – returns number of annotations by lexicon found for text across all available lexicons.
• Indexing Services
– delete.json – deletes entire lexicon from index.
– add.json – adds an entry to the specified lexicon.
Annotation Service I/O
Example annotation request
{
  "lexicon" : "countries",
  "text" : "Institute of Clean Coal Technology, East China University",
  "matching" : "exact"
}
Example annotated response
[
  {
    "id" : "http://www.geonames.org/CHN",
    "lexicon" : "countries",
    "begin" : 41,
    "end" : 46,
    "coveredText" : "China",
    "confidence" : 1.0
  }
]
Calling Annotation Service
• Client originally written in Python, using built-in json and requests libraries.
• For Scala client, SoDA JAR provides classes to mimic json and requests functionality in Scala.
• Inputs to both our (somewhat contrived) examples are (pii: String, affStr: String) tuples as shown.
• Match against a country lexicon to find country names.
Annotation Service – Python Client
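The client code from this slide is not in the transcript, so the following is only a sketch of what such a client might look like. It assumes a hypothetical endpoint URL (SODA_URL), uses urllib from the standard library instead of requests to stay dependency-free, and the helper names build_request, annotate and countries_for are invented for illustration.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the host and path of your SoDA deployment.
SODA_URL = "http://localhost:8080/soda/annot.json"


def build_request(text, lexicon="countries", matching="exact"):
    """Compose the JSON request body for the annotation service."""
    return json.dumps({"lexicon": lexicon, "text": text, "matching": matching})


def annotate(text, lexicon="countries", matching="exact", url=SODA_URL):
    """POST the request and return the decoded list of annotations."""
    req = urllib.request.Request(
        url,
        data=build_request(text, lexicon, matching).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def countries_for(records, annotate_fn=annotate):
    """Map (pii, affStr) tuples to (pii, [matched country names])."""
    return [(pii, [a["coveredText"] for a in annotate_fn(aff)])
            for pii, aff in records]
```

Passing annotate_fn as a parameter keeps the mapping logic testable without a running server; in a Spark job the same function could be applied to each record of an RDD.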
Annotation Service – Scala Client
Annotation Service - Outputs
• Each Annotation result provides:
– Entity ID (not shown)
– Begin position in text
– End position in text
– Matched Text
– Confidence (not shown)
• Zero or more Annotations possible per input text.
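As a sketch of how a caller might consume these fields, the snippet below parses the example response from the Annotation Service I/O slide and checks that begin and end select the matched text. It assumes end is exclusive, which is consistent with that example; the helper name spans is made up here.

```python
import json

text = "Institute of Clean Coal Technology, East China University"
response = json.loads("""
[{"id": "http://www.geonames.org/CHN", "lexicon": "countries",
  "begin": 41, "end": 46, "coveredText": "China", "confidence": 1.0}]
""")


def spans(text, annotations):
    """Return (begin, end, matched text), checking offsets against the text."""
    result = []
    for ann in annotations:
        begin, end = ann["begin"], ann["end"]
        if text[begin:end] != ann["coveredText"]:
            raise ValueError("offsets do not match coveredText")
        result.append((begin, end, ann["coveredText"]))
    return result


result = spans(text, response)
```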
Loading Dictionaries
• Dictionary entries represented by:
– Lexicon Name
– Entry ID (unique across lexicons)
– List of possible synonym terms
• JSON Request to add an entry for MeSH dictionary.
{ "id": "http://id.nlm.nih.gov/mesh/2015/M0021699",
  "lexicon": "mesh",
  "names": ["Baby Tooth", "Dentitions, Primary", "Milk Tooth", ...],
  "commit": false }
• Preferable to commit periodically and at the end of the batch.
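One way to follow the "commit periodically" advice is sketched below. It assumes that setting "commit" to true on an add request triggers the commit, which the slide implies but does not state; the function name add_requests and the commit_every parameter are hypothetical.

```python
import json


def add_requests(lexicon, entries, commit_every=1000):
    """Yield add.json request bodies, committing periodically and at the end."""
    entries = list(entries)
    for i, (entry_id, names) in enumerate(entries, start=1):
        # Commit on every commit_every-th entry and on the final entry.
        commit = (i % commit_every == 0) or (i == len(entries))
        yield json.dumps({"id": entry_id, "lexicon": lexicon,
                          "names": names, "commit": commit})


sample = [("id-1", ["Baby Tooth"]), ("id-2", ["Milk Tooth"]), ("id-3", ["Molar"])]
bodies = [json.loads(b) for b in add_requests("mesh", sample, commit_every=2)]
```

Each yielded body would then be POSTed to add.json; the ids in sample are placeholders, not real MeSH identifiers.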
Loading Dictionaries – Scala Client
Scaling Out
SoDA Performance – Expected
• Test: annotate 14M docs in “reasonable time”.
– Approx. 3s/doc with SoDA+Solr on ec2 r3.large box (15.25GB RAM, 32GB SSD, 2vCPU).
– Total estimated time: 16.2 months!
• Questions
– Can we make the process faster?
– Can we scale out the process?
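The 16.2-month estimate follows directly from the per-document time. A small sketch of the arithmetic, assuming 30-day months and a single sequential client:

```python
def months_to_annotate(num_docs, seconds_per_doc, days_per_month=30):
    """Sequential annotation time in months for a corpus of num_docs."""
    seconds = num_docs * seconds_per_doc
    return seconds / 86400 / days_per_month


estimate = months_to_annotate(14_000_000, 3)
```

With 14M documents at 3 seconds each this comes to about 486 days, or roughly 16.2 months.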
Where is the time being spent?
• Majority of time spent in Solr.
• Some time spent in SoDA (decreases more slowly than Solr time as transactions get shorter).
• Almost no additional time spent in Spark.
Optimization #1: Combine Paragraphs
• Performance measured using 10K random articles.
• Before: annotation done per paragraph, 40 paragraphs/article on average.
– Time to annotate 1 article: Mean 2.9s, Median 2.1s.
• After: reduce HTTP network + parsing overhead by sending the full document.
– Time to annotate 1 article: Mean 1.4s, Median 0.3s.
• 2x (mean) to 7x (median) improvement.
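If annotations still need to be reported per paragraph after sending the full document, the offsets have to be mapped back. The slides do not say how this was done; the following is one possible sketch, with hypothetical helper names combine and locate and a newline assumed as the separator.

```python
from bisect import bisect_right


def combine(paragraphs, sep="\n"):
    """Join paragraphs into one text and record where each one starts."""
    starts, pos = [], 0
    for p in paragraphs:
        starts.append(pos)
        pos += len(p) + len(sep)
    return sep.join(paragraphs), starts


def locate(begin, starts):
    """Map a document-level offset to (paragraph index, offset within it)."""
    idx = bisect_right(starts, begin) - 1
    return idx, begin - starts[idx]


paragraphs = ["East China University", "Paris, France"]
document, starts = combine(paragraphs)
position = locate(document.index("France"), starts)
```

One request per article replaces roughly 40 requests, at the cost of this small bookkeeping step on the client.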
Optimization #2: Tune Solr GC
• Out of the box (OOB), Solr would GC very frequently, slowing down Spark and causing timeouts.
• Current Index Size: 2.1 GB
• Need to size box so approximately 75% of RAM is given to the OS and the remaining 25% is allocated to Solr (Uwe Schindler's Blog Post).
• Heap size should be 3-4x index size (Internal Guideline).
• Current Solr Heap Size = 8 GB
• RAM is 30.5 GB
• CMS (Concurrent Mark-Sweep) Garbage Collection.
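The two sizing rules can be checked against the numbers on this slide with a short calculation. This is just the arithmetic from the bullets; heap_guidelines is a name invented for the sketch.

```python
def heap_guidelines(index_gb, ram_gb):
    """Return ((low, high) heap from the 3-4x index rule, 25% of RAM)."""
    return (3 * index_gb, 4 * index_gb), 0.25 * ram_gb


(low, high), ram_share = heap_guidelines(index_gb=2.1, ram_gb=30.5)
heap_gb = 8
within_index_rule = low <= heap_gb <= high
```

For a 2.1 GB index the 3-4x rule gives roughly 6.3 to 8.4 GB, and 25% of 30.5 GB is about 7.6 GB, so the 8 GB heap is in line with both guidelines.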
Optimization #3: Larger Spark Cluster
• Running on a cluster with a Master + 4 Workers.
• Each worker has 1 Executor with 4 Cores.
• Number of simultaneous Solr clients = 16 (4 workers * 1 executor * 4 cores) – measured with lsof -p in a loop on the Solr server.
• Throughput increases with the number of partitions until about 2x the number of worker cores.
• Best throughput 5 docs/sec with #-partitions=30 for 16 cores.
Optimization #4: Solr Scaleout
• Upgrade to r3.xlarge (30.5GB RAM, 80GB SSD, 4vCPU)
– Throughput 7.9 docs/s
• Upgrade to 2x r3.2xlarge (61GB RAM, 160GB SSD, 8vCPU) with c3.large LB (3.75GB RAM, 32GB Disk, 2vCPU) running HAProxy.

#-workers   #-requests/server   Throughput (docs/sec)
4           8                   8.62
8           16                  17.334
12          24                  20.64
16          32                  26.845
Performance – Did we meet expectations?
• At 26 docs/sec and 14M documents, it will take our current cluster a little over 6 days to annotate against our largest dictionary (8M entries).
• Throughput scales linearly at 1.5 docs/sec per additional worker, as long as Solr servers have enough capacity to serve requests.
• Each Solr box (as configured) can serve sustained loads of up to 30-35 simultaneous requests.
• Number of simultaneous requests approximately equal to number of worker cores.
• Example: annotate 14M documents in 3 days.
– Throughput required: 14M / (3 * 86400) = 54 docs/s
– Number of workers: 54 / 1.5 = 36 workers
– Number of simultaneous requests (4 cores/worker) = 36 * 4 = 144
– Number of Solr servers: 144 / 32 = 4.5, rounded up to 5 servers
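The worked example above can be turned into a small planning function. The sketch below first recovers the roughly 1.5 docs/sec per worker figure from the first and last rows of the scale-out table, then reproduces the 3-day calculation; the function names and default parameters are illustrative, and throughput is rounded to the nearest whole doc/sec as on the slide.

```python
import math

# (workers, docs/sec) pairs from the Solr scale-out measurements.
measured = [(4, 8.62), (8, 17.334), (12, 20.64), (16, 26.845)]


def docs_per_sec_per_worker(points):
    """Average slope between the first and last measurement."""
    (w0, t0), (w1, t1) = points[0], points[-1]
    return (t1 - t0) / (w1 - w0)


def plan(num_docs, days, rate_per_worker=1.5, cores_per_worker=4,
         requests_per_server=32):
    """Return (throughput, workers, simultaneous requests, Solr servers)."""
    throughput = round(num_docs / (days * 86400))
    workers = math.ceil(throughput / rate_per_worker)
    requests = workers * cores_per_worker
    servers = math.ceil(requests / requests_per_server)
    return throughput, workers, requests, servers


slope = docs_per_sec_per_worker(measured)
three_day_plan = plan(14_000_000, 3)
```

Changing the days argument gives the cluster size for other deadlines, under the same assumption that Solr capacity grows with the number of servers.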
Future Work
• More Lexicons
• Investigate Lexicon-Centric scale out.
– Allows more lexicons.
– Not limited to single index.
• Move to Lucene, eliminate network overhead.
– Asynchronous model
– Use Kafka topic with multiple partitions
– Lucene based tagging consumers
– Write output to S3.
Q&A
Thank you for listening!
• Questions?
• SoDA available on GitHub
– https://github.com/elsevierlabs-os/soda
• Contact me
– sujit.pal@elsevier.com
