Improving Organizational Knowledge
with Natural Language Processing
Enriched Data Pipelines
Dataworks Summit, Washington D.C.
May 23, 2019
Introductions
Eric Wolok
● Over $500 million in commercial
real estate investment sales
● Specializing in Multi-Family
within the Chicago market
● ericw@partnersco.com
Jeff Zemerick
● Cloud/big-data consultant
● Apache OpenNLP PMC member
● ASF Member
● Morgantown, WV
● jeff.zemerick@mtnfog.com
2
Information is Everywhere
The answer to a question is often spread across
multiple locations.
In the real estate domain:
3
Others?
Answers are Distributed
Bringing these sources together yields the full answer.
4
Real Estate
Includes structured and unstructured data:
● Property ownership history
● Individual buyer’s purchase history
● News articles mentioning properties
● Property listings
● And so on…
This data can be widely distributed across many
sources.
Let’s bring it together.
5
Key Technologies Used
● Apache Kafka
○ Message broker for data ingest.
● Apache NiFi
○ Orchestrate the flow of data in a pipeline.
● Apache MiNiFi
○ Facilitate ingest from edge locations.
● Apache OpenNLP
○ Process unstructured text.
● Apache Superset
○ Create dashboards from the extracted
data.
● Apache ZooKeeper
○ Cluster coordination for Kafka and NiFi.
● Amazon Web Services
○ Managed by CloudFormation for
infrastructure as code.
● Docker
○ Containerization of NLP microservices.
● Relational database
○ Storage of data extracted by the pipeline.
6
Apache Kafka
● Pub/sub message broker for
streaming data.
● Can handle massive amounts of
messages.
● Allows replay of data.
7
By Ch.ko123 - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=59871096
Apache NiFi and MiNiFi
Apache NiFi
● Provides “directed graphs for data routing,
transformation, and system mediation
logic.”
● Has a drag and drop interface where
“processors” are configured to form the
data flow.
Apache MiNiFi
● A sub-project of NiFi that extends NiFi’s
capabilities to IoT and edge devices.
● Facilitates data collection and ingestion
into a NiFi cluster or other system.
8
Apache OpenNLP
9
● A machine learning-based toolkit for processing natural language text.
● Supports:
○ Tokenization
○ Sentence segmentation
○ Part-of-speech tagging
○ Named entity extraction
○ Document classification
○ Chunking
○ Parsing
○ Language detection
The Unstructured Data Pipeline
[Diagram: pipeline stages labeled Data Ingestion, NLP, Enrichment / Deduplication, Sentiment Analysis, Storage, and Reporting, ending with “A happy user!”]
10
The Tools of the Unstructured Data Pipeline
[Diagram: the same pipeline stages labeled with the tools used: Kafka, MiNiFi, OpenNLP, Docker, Database, and Superset, along with NiFi, ZooKeeper, and Amazon Web Services.]
11
The Architecture
Data Ingestion
● Plain text is published to Kafka topics.
● Each published message is a single
document.
● Originates from:
○ Bash scripts publishing to the Kafka topic.
○ MiNiFi publishing to the Kafka topic from
database tables (QueryDatabaseTable).
13
NLP Microservices
● Microservices wrap OpenNLP functionality:
○ Language detection microservice
○ Sentence extraction microservice
○ Sentence tokenization microservice
○ Named-entity extraction microservice
● Spring Boot applications as runnable jars.
○ No external dependencies other than JRE.
● Provides simple REST-based interfaces to NLP operations.
○ e.g. /api/detect, /api/sentences, /api/tokens, /api/extract
● Available on GitHub under Apache license.
14
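As a rough illustration of calling these services, the sketch below builds HTTP requests for the four endpoints listed above. The endpoint paths come from the slide; the base URL, the use of POST, and the plain-text request body are assumptions made for the example, so the actual contract should be checked against the project itself.

```python
from urllib.request import Request

# Endpoint paths as listed on the slide.
ENDPOINTS = {
    "detect": "/api/detect",
    "sentences": "/api/sentences",
    "tokens": "/api/tokens",
    "extract": "/api/extract",
}


def build_request(base_url: str, operation: str, text: str) -> Request:
    """Build (but do not send) a request for one NLP operation.

    Assumption: the service accepts POST with a UTF-8 plain-text body.
    """
    url = base_url.rstrip("/") + ENDPOINTS[operation]
    return Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


# Hypothetical host and port, for illustration only.
request = build_request(
    "http://localhost:8080", "sentences", "George Washington was president."
)
print(request.get_method(), request.full_url)
```

Because each service is stateless, the same request could be sent to any instance behind a load balancer.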
NLP Microservices Pipelines
Supports independent pipelines for multiple source languages.
[Diagram: Language Detection feeds a separate pipeline per language (eng, deu, others), each running Sentence Extraction, Sentence Tokenization, and Entity Extraction.]
15
NLP Microservices Deployment
● Deployed as containers
○ https://github.com/mtnfog/nlp-building-blocks
● Stateless
○ Each can scale independently.
○ Deployed behind a load balancer.
[Diagram: a load balancer in front of three NLP microservice instances.]
16
NLP Operations - An Example
George Washington was president. He was the first president.
1. Language = eng
2. Sentences = [“George Washington was president.”, “He was
the first president.”]
3. Tokens = {[“George”, “Washington”, “was”, “president”],
[“He”, “was”, “the”, “first”, “president”]}
4. Entities = [“George Washington”]
17
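The steps above can be mimicked with a small Python sketch. It does not use OpenNLP: the sentence splitter, tokenizer, and entity rule below are naive regex and capitalization heuristics chosen only to reproduce the shape of the slide's output, and language detection is omitted (the text is assumed to be English).

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive rule: break after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(sentence: str) -> list[str]:
    # Keep runs of letters and apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z']+", sentence)


def extract_entities(tokens: list[str]) -> list[str]:
    # Toy heuristic: two or more consecutive capitalized tokens form an entity.
    entities, run = [], []
    for token in tokens + [""]:  # the empty sentinel flushes the final run
        if token[:1].isupper():
            run.append(token)
        else:
            if len(run) >= 2:
                entities.append(" ".join(run))
            run = []
    return entities


text = "George Washington was president. He was the first president."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
entities = [e for t in tokens for e in extract_entities(t)]

print(sentences)
print(tokens)
print(entities)
```

A trained model replaces each heuristic in the real pipeline, which is why extracted entities carry confidence values there and not here.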
Managing the NLP Models
NLP models are stored externally and pulled by the containers from S3/HTTP
when first needed.
[Diagram: NLP microservice containers pulling models from an S3 bucket.]
Models organized in storage as: s3://nlp-models/[language]/[model-type]/[model-filename]
18
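A minimal sketch of this layout and of loading a model only when first needed is shown below. The bucket name and path pattern come from the slide; the model type `ner`, the file name `person.bin`, and the caching approach are hypothetical, and the download itself is replaced by a counter so the example stays self-contained.

```python
from functools import lru_cache

MODEL_ROOT = "s3://nlp-models"  # bucket name as shown on the slide

download_count = 0  # counts simulated downloads


def model_uri(language: str, model_type: str, filename: str) -> str:
    """Build the storage location for a model from its three path parts."""
    return f"{MODEL_ROOT}/{language}/{model_type}/{filename}"


@lru_cache(maxsize=None)
def load_model(language: str, model_type: str, filename: str) -> str:
    """Pretend to fetch a model; the cache means each model is fetched once."""
    global download_count
    download_count += 1  # stand-in for the real S3/HTTP download
    return f"<model loaded from {model_uri(language, model_type, filename)}>"


# Hypothetical model type and file name, for illustration only.
load_model("eng", "ner", "person.bin")
load_model("eng", "ner", "person.bin")  # served from the cache
print(model_uri("eng", "ner", "person.bin"), download_count)
```

Keeping models outside the container image is what allows a model to be replaced without rebuilding or redeploying the service.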
Extracted Information
Entity types extracted from natural language text:
● Person names - John Smith (model-based)
● Street addresses - 123 Main St, Anytown, CA (model-based / regex)
● Currency - $250,000, $300000 (regex)
● Phone numbers - (123) 456-7890, 123-456-7890 (regex)
19
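For the regex-based entity types, a sketch along the following lines would match the examples on the slide. The patterns are assumptions written for this illustration (US-style phone numbers and dollar amounts only) and are not taken from the project.

```python
import re

# Dollar amounts with thousands separators ($250,000) or without ($300000),
# optionally followed by cents.
CURRENCY = re.compile(r"\$\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\$\d+(?:\.\d{2})?")

# US-style phone numbers: (123) 456-7890 or 123-456-7890.
PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b")


def extract_regex_entities(text: str) -> dict[str, list[str]]:
    """Return every currency and phone match found in the text."""
    return {
        "currency": CURRENCY.findall(text),
        "phone": PHONE.findall(text),
    }


sample = (
    "Call (123) 456-7890 or 123-456-7890 about the $250,000 listing, "
    "reduced from $300000."
)
found = extract_regex_entities(sample)
print(found)
```

A regex match is either present or absent, so such entities would naturally be assigned a fixed confidence rather than a model-derived one.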
Entity Confidence Thresholds
● Each entity extracted by a model has a confidence value (0.00 to 1.00) that indicates
the model’s confidence that it actually is an entity.
● Varies depending on the text and the model.
○ How well does the actual text represent the training text?
● Monitored thresholds to determine the best minimum cutoff values.
○ Entities with confidence less than a threshold are filtered.
20
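The filtering rule can be sketched as follows. The `Entity` class, the per-type threshold table, and the default cutoff of 0.5 are hypothetical; the slide only states that entities below a threshold are filtered out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    type: str
    confidence: float  # expected to lie between 0.0 and 1.0


def filter_by_confidence(entities, thresholds, default=0.5):
    """Keep entities whose confidence is at least the cutoff for their type."""
    return [
        e for e in entities
        if e.confidence >= thresholds.get(e.type, default)
    ]


# Hypothetical cutoffs; in practice these would be tuned by observation.
thresholds = {"person": 0.8, "address": 0.6}

candidates = [
    Entity("George Washington", "person", 0.9),
    Entity("He", "person", 0.3),
    Entity("123 Main St, Anytown, CA", "address", 0.65),
]
kept = filter_by_confidence(candidates, thresholds)
print([e.text for e in kept])
```

Holding the cutoffs in a table rather than in code makes it easy to adjust them as the monitored confidence values change.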
Storage - Persisting the Extracted Data
● JSON responses from the microservices are converted to SQL.
○ Via NiFi’s ConvertJSONToSQL processor.
● Extracted entities are persisted to a relational database.
○ Via NiFi’s PutSQL processor.
● Extracted entities are persisted to an Elasticsearch index.
○ Via NiFi’s PutElasticsearch processor.
21
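In the pipeline this conversion is done by NiFi processors, not by hand-written code. Purely to illustrate what the JSON-to-SQL step amounts to, the sketch below turns an entity response into parameterized INSERT statements. The JSON shape is an assumption, and an in-memory SQLite database stands in for the relational database.

```python
import json
import sqlite3

# Assumed response shape; the real microservice output may differ.
response = '{"entities": [{"text": "George Washington", "confidence": 0.9}]}'

conn = sqlite3.connect(":memory:")  # stand-in for the relational database
conn.execute("CREATE TABLE entities (text TEXT, confidence REAL)")

# One row per extracted entity; an empty entity list inserts nothing.
rows = [(e["text"], e["confidence"]) for e in json.loads(response)["entities"]]
conn.executemany("INSERT INTO entities (text, confidence) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT text, confidence FROM entities").fetchall()
print(stored)
```

Using bound parameters instead of string concatenation avoids quoting problems when entity text contains apostrophes.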
Reporting
● Dashboards in Apache Superset show views on the data.
22
Entity Counts by Context
● Shows a high-level view of entity counts per context.
23
A context name is
arbitrary, e.g. a
book, a chapter, a
document, etc.
Entities by Type
● Shows entity counts by type of entity.
24
Stopped ingestion
at 1000 person
entities.
Entity Confidence Values
● Shows the distribution of entity confidence values (0.0 to 1.0).
25
Entities with
confidence = 1 are
phone numbers.
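The distribution shown on this dashboard is essentially a histogram of confidence values. A minimal sketch of that aggregation, assuming ten equal-width bins (the bin count is a choice made for this example, not something stated in the slides):

```python
from collections import Counter


def confidence_histogram(confidences, bins=10):
    """Count values per equal-width bin over the range 0.0 to 1.0."""
    counts = Counter()
    for value in confidences:
        # A value of exactly 1.0 is placed in the last bin.
        index = min(int(value * bins), bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(bins)]


# Made-up sample values; the two 1.0 entries mimic regex-matched entities.
values = [1.0, 1.0, 0.9, 0.55, 0.12]
histogram = confidence_histogram(values)
print(histogram)
```

In the deployed system the same grouping would more likely be expressed as a query behind the Superset chart.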
Putting It All Together - The NiFi Pipeline
26
The NiFi Flow
27
[Diagram: the NiFi flow, annotated with the data at each step.]
● Input text: George Washington was president. He was the first president.
● Language attribute: eng
● Tokens: [“George”, “Washington”, “was”, “president”] [“He”, “was”, “the”, “first”, “president”]
● First sentence:
○ Entities: [“George Washington”]
○ JSON: {entity: text: “George Washington”, confidence: 0.9}
○ SQL: INSERT INTO entities (TEXT, CONFIDENCE) VALUES ('George Washington', 0.9)
○ One entity sent to the database.
● Second sentence:
○ Entities: []
○ JSON: {}
○ Nothing sent to database.
Design Challenges
● Repeatable
○ How to deploy it for testing, for different environments, for other clients, …?
● Scalable
○ How can we handle massive amounts of text efficiently?
● Extensible / Updatable
○ How can we ensure that the process can easily be customized or changed?
28
Repeatable
● 100% infrastructure as code.
○ AWS CloudFormation + Bash scripts = Source control in git
● Fully automated.
○ No part of the deployment/configuration is done manually.
○ Entire architecture is created by kicking off a command.
● Apache NiFi flow is version controlled in the NiFi Registry.
○ Registry is automatically connected to NiFi when NiFi is configured.
○ NiFi flow is also under source control.
29
Scalable
● Leveraged autoscaling from the cloud platform where appropriate.
○ NLP microservice containers and underlying hardware have scaling policies set.
● Number of Kafka, NiFi, ZooKeeper instances can be increased as needed.
30
Extensible
● Using Apache NiFi insulates us from a hard-coded solution.
○ No pipeline code to write and manage!
● We can modify the pipeline simply by adding new processors.
● Architecture is layered logically.
○ Layers can be swapped in and out if technology requirements change.
○ Layers consist of cloud-specific networking, ZooKeeper, Kafka, NiFi, etc.
● Data can be replayed through Kafka.
○ Allows for updating the data captured.
31
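The value of replay can be shown with a toy model. The class below is not the Kafka API; it is a hypothetical in-memory, append-only log that only illustrates the idea that messages are retained after being read, so a changed pipeline can reprocess them from the first offset.

```python
class ReplayableLog:
    """Append-only in-memory log; a toy stand-in for a retained topic."""

    def __init__(self):
        self._messages = []

    def publish(self, message: str) -> int:
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset: int = 0) -> list[str]:
        """Return all messages at or after the offset; nothing is removed."""
        return list(self._messages[offset:])


log = ReplayableLog()
log.publish("George Washington was president.")
log.publish("He was the first president.")

# First pass: an early pipeline version that only records document length.
first_pass = [len(doc) for doc in log.read_from(0)]

# Later pass: an updated pipeline replays the same documents from offset 0.
second_pass = [doc.split()[0] for doc in log.read_from(0)]

print(first_pass, second_pass)
```

Both passes see the same two documents, which is what allows the captured data to be updated after the extraction logic changes.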
Summary
We presented a method for extracting information from distributed data.
● This pipeline…
○ Ingests data
○ Processes the data
○ Extracts the entities
○ Stores the entities
○ Visualizes the results
● … in a scalable, loosely-connected, and repeatable fashion.
● We can build systems, such as recommendation engines, around this data.
32
Questions?
Credits and thanks to:
● The open source projects and contributors who make this work possible.
33

Editor's Notes

  • #5: This data may be geographically distributed between different organizations or it may exist in a single organization.
  • #6: Can be distributed inside and outside an organization.
  • #19: This allows updating the models without having to redeploy code.