Lessons Learned with Spark at the US Patent & Trademark Office
Christopher Bradford
Big Data Architect at OpenSource Connections
Christopher Bradford
Twitter: @bradfordcp
GitHub: bradfordcp
OpenSource Connections
Exploring Search Technologies - EST
EST – Technology Stack
EST – Data Loading
CSS Ingestion (CSS2C)
Solr Ingestion (C2S)
EST – C2S Process
Note: some connections are omitted for clarity
EST – C2S Process (Scaled Out)
EST – C2S Review
Did it work?
Why change it?
How could we make it better?
EST – Old C2S Process
EST – Spark C2S Process
How did this work out?
Poorly
Poor Performance
joinedRDD = …
joinedRDD.foreach()
    document = … // build document
    sc = new SolrConnection()
    sc.push(document)
    sc.disconnect()
// Job is done
Poor Performance
sc = new SolrConnection()
sc.push(document)
sc.disconnect()
Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done
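The slides do not say why this single-connection variant was not the final answer. One common obstacle, offered here as an assumption rather than the author's stated reason, is that Spark has to serialize whatever a closure references in order to ship it to the executors, and an object that owns a live network connection usually cannot be serialized. The sketch below uses Python's pickle as an analogy for that shipping step; `SocketBackedConnection` is a hypothetical class, not part of Spark or Solr.

```python
import pickle
import socket


class SocketBackedConnection:
    """Hypothetical connection object that owns a live OS socket."""

    def __init__(self) -> None:
        # Created but never connected; enough to show the serialization problem.
        self.sock = socket.socket()

    def close(self) -> None:
        self.sock.close()


conn = SocketBackedConnection()
try:
    # Analogy for shipping a driver-side object to remote workers.
    pickle.dumps(conn)
    shippable = True
except TypeError:
    # Sockets refuse to be pickled, so the enclosing object cannot be shipped.
    shippable = False
finally:
    conn.close()

print(shippable)  # False
```

Under that assumption, the connection has to be created on the worker side, which is what the per-partition variants do.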
joinedRDD = …
joinedRDD.foreachPartition()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.disconnect()
// Job is done
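To make the difference between the per-record and per-partition variants concrete, here is a minimal sketch that counts how many connections each one opens. It runs without Spark: plain lists stand in for partitions, and `FakeSolrConnection` is a hypothetical stand-in for the slides' `SolrConnection`.

```python
from typing import Callable, List

Partitions = List[List[int]]


class FakeSolrConnection:
    """Hypothetical stand-in for SolrConnection that counts how often it is opened."""

    opened = 0  # class-level counter shared by all instances

    def __init__(self) -> None:
        FakeSolrConnection.opened += 1

    def push(self, document: str) -> None:
        pass  # a real client would send the document to Solr here

    def disconnect(self) -> None:
        pass


def push_per_record(partitions: Partitions) -> None:
    """foreach-style: open and close a connection for every single row."""
    for partition in partitions:
        for row in partition:
            sc = FakeSolrConnection()
            sc.push(f"doc-{row}")
            sc.disconnect()


def push_per_partition(partitions: Partitions) -> None:
    """foreachPartition-style: one connection per partition, reused for all its rows."""
    for partition in partitions:
        sc = FakeSolrConnection()
        try:
            for row in partition:
                sc.push(f"doc-{row}")
        finally:
            sc.disconnect()


def count_connections(strategy: Callable[[Partitions], None], partitions: Partitions) -> int:
    """Reset the counter, run one strategy, and report how many connections it opened."""
    FakeSolrConnection.opened = 0
    strategy(partitions)
    return FakeSolrConnection.opened


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
per_record = count_connections(push_per_record, partitions)
per_partition = count_connections(push_per_partition, partitions)
print(per_record, per_partition)  # 9 3
```

The per-record count grows with the number of rows, while the per-partition count grows only with the number of partitions.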
Almost
The Solution!
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows
.collect()
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows.count
.collect()
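The second mapPartitions variant appears to return only a row count per partition, so the final collect brings back a handful of numbers instead of every row. In Spark, mapPartitions is lazy and an action such as collect is what triggers the work. A minimal sketch of the same shape, again without Spark and with a hypothetical `RecordingConnection` in place of `SolrConnection`:

```python
from typing import Iterable, Iterator, List


class RecordingConnection:
    """Hypothetical stand-in for the slides' SolrConnection."""

    def __init__(self, index: List[str]) -> None:
        self.index = index  # shared list that plays the role of the Solr index

    def push(self, document: str) -> None:
        self.index.append(document)

    def close(self) -> None:
        pass  # a real client would release its network resources here


def index_partition(partition: Iterable[int], index: List[str]) -> Iterator[int]:
    """mapPartitions-style function: push every row, then yield a single count."""
    sc = RecordingConnection(index)
    pushed = 0
    try:
        for row in partition:
            sc.push(f"doc-{row}")  # "build document" is reduced to a string here
            pushed += 1
    finally:
        sc.close()
    yield pushed


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
solr_index: List[str] = []

# Stand-in for .collect(): gather the per-partition results on the "driver".
counts = [n for part in partitions for n in index_partition(part, solr_index)]
print(counts, sum(counts), len(solr_index))  # [3, 2, 4] 9 9
```

Summing the collected counts gives a cheap check that every row was pushed, without moving the documents themselves back to the driver.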
Results?
Solr Indexing
Better Solr Indexing
EST – Spark C2S Process v2
Success?
YUP
5x faster than the original C2S process (with optimizations)
What’s Next?
•  Optimization of the C2S Spark job
•  More Spark jobs
•  Newer version of Spark & DSE
•  Scala Spark jobs instead of Java
