Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

Christopher Bradford presented lessons learned from using Apache Spark at the US Patent and Trademark Office to improve their process of loading data from Cassandra into Solr. The initial Spark implementation performed poorly due to opening and closing a Solr connection for each document. Optimizations like opening a single connection per partition and pushing documents in batches significantly improved performance, resulting in a solution that was 5 times faster than the original process. Future work involves further optimizing this Spark job and exploring additional uses of Spark.

Lessons Learned with
Spark at the US Patent &
Trademark Office
Christopher Bradford
Big Data Architect at OpenSource Connections

Christopher Bradford
Twitter: @bradfordcp
GitHub: bradfordcp

EST – Data Loading
CSS Ingestion (CSS2C)
Solr Ingestion (C2S)

EST – C2S Process
Note: some connections are omitted for clarity

EST – C2S Process (Scaled Out)
Note: some connections are omitted for clarity

EST – C2S Review
Did it work?
Why change it?
How could we make it better?

EST – Old C2S Process
Note: some connections are omitted for clarity

EST – Spark C2S Process
Note: some connections are omitted for clarity

Poor Performance
joinedRDD = …
joinedRDD.foreach()
    document = … // build document
    sc = new SolrConnection()
    sc.push(document)
    sc.disconnect()
// Job is done

Poor Performance
sc = new SolrConnection()
sc.push(document)
sc.disconnect()
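
The cost of this pattern can be seen without a cluster. The sketch below is a plain-Python simulation, not the presenter's Spark job: `FakeSolrConnection` is a hypothetical stand-in for the slide's `SolrConnection`, and the row data is made up. It only counts how many connections get opened when every row opens its own.

```python
class FakeSolrConnection:
    """Hypothetical stand-in for the slide's SolrConnection; counts opens."""

    opened = 0  # class-wide counter of connections opened so far

    def __init__(self):
        FakeSolrConnection.opened += 1
        self.pushed = []

    def push(self, document):
        self.pushed.append(document)

    def disconnect(self):
        pass  # a real client would release its sockets here


def index_per_document(rows):
    """Mirror the slide: open, push, and disconnect once for every row."""
    for row in rows:
        document = {"id": row}  # placeholder for "build document"
        conn = FakeSolrConnection()
        conn.push(document)
        conn.disconnect()


rows = list(range(1000))  # made-up input, 1000 rows
index_per_document(rows)
connections_opened = FakeSolrConnection.opened
print(connections_opened)
```

The number of connections grows linearly with the number of documents, so connection setup and teardown dominate the work as the data set grows.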

Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done

joinedRDD = …
joinedRDD.foreachPartition()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.disconnect()
// Job is done
Almost
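
A rough way to see what the per-partition variant buys: the following plain-Python simulation (again an illustration under assumptions, not the original job) models an RDD as a list of four partitions and opens one connection per partition. `FakeSolrConnection`, the partition count, and the row data are all hypothetical. In PySpark a function shaped like `index_partition` could be passed to `RDD.foreachPartition`, which hands it an iterator over one partition's rows.

```python
class FakeSolrConnection:
    """Hypothetical stand-in for the slide's SolrConnection; counts opens."""

    opened = 0  # class-wide counter of connections opened so far

    def __init__(self):
        FakeSolrConnection.opened += 1
        self.pushed = 0

    def push(self, document):
        self.pushed += 1

    def disconnect(self):
        pass  # a real client would release its sockets here


def index_partition(partition):
    """Open one connection, push every row of the partition, then disconnect."""
    conn = FakeSolrConnection()
    try:
        for row in partition:
            conn.push({"id": row})  # placeholder for "build document"
    finally:
        conn.disconnect()  # runs even if a push fails


rows = list(range(1000))                     # made-up input
partitions = [rows[i::4] for i in range(4)]  # assume 4 partitions
for partition in partitions:
    index_partition(iter(partition))

connections_opened = FakeSolrConnection.opened
print(connections_opened)
```

Here the connection count tracks the number of partitions rather than the number of documents.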

The Solution!
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows
.collect()

joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows.count
.collect()
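
The difference between the two versions is what travels back to the driver on `.collect()`: every row in the first, a single count per partition in the second. A minimal plain-Python sketch of the second version follows. It is a simulation under assumptions: `FakeSolrConnection`, the four-way partitioning, and the row data are hypothetical. In PySpark a generator shaped like `index_partition` could be passed to `RDD.mapPartitions`, followed by `.collect()`.

```python
class FakeSolrConnection:
    """Hypothetical stand-in for the slide's SolrConnection."""

    def __init__(self):
        self.pushed = 0

    def push(self, document):
        self.pushed += 1

    def close(self):
        pass  # a real client would release its sockets here


def index_partition(partition):
    """Index one partition and yield only the number of rows pushed."""
    conn = FakeSolrConnection()
    count = 0
    try:
        for row in partition:
            conn.push({"id": row})  # placeholder for "build document"
            count += 1
    finally:
        conn.close()
    yield count  # one small integer per partition instead of every row


rows = list(range(1000))                     # made-up input
partitions = [rows[i::4] for i in range(4)]  # assume 4 partitions

# Stand-in for mapPartitions(index_partition).collect()
per_partition_counts = [c for p in partitions for c in index_partition(iter(p))]
total_indexed = sum(per_partition_counts)
print(per_partition_counts, total_indexed)
```

Collecting counts keeps the result on the driver small regardless of how many documents were indexed, and the sum gives a quick sanity check against the expected row count.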

Better Solr Indexing
Note: some connections are omitted for clarity
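
The talk summary credits part of the speedup to pushing documents in batches rather than one at a time. The slides do not show how this was done, so the following is only a sketch of the general idea in plain Python: `FakeSolrConnection`, its `push_batch` method, and the batch size of 100 are all assumptions made for illustration.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class FakeSolrConnection:
    """Hypothetical client that counts requests and documents received."""

    def __init__(self):
        self.requests = 0
        self.documents = 0

    def push_batch(self, documents):
        self.requests += 1
        self.documents += len(documents)

    def close(self):
        pass  # a real client would release its sockets here


def index_partition(partition, conn, batch_size=100):
    """Push one partition's rows in batches over an already open connection."""
    for batch in batches(partition, batch_size):
        conn.push_batch([{"id": row} for row in batch])


conn = FakeSolrConnection()
index_partition(range(250), conn, batch_size=100)  # made-up partition of 250 rows
conn.close()
print(conn.requests, conn.documents)
```

With 250 rows and a batch size of 100, the partition is sent in three requests (100, 100, and 50 documents) instead of 250.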

EST – Spark C2S Process v2
Note: some connections are omitted for clarity

Success?
YUP
5x faster than the original C2S process (with optimizations)

What’s Next?
•  Optimization of the C2S Spark job
•  More Spark jobs
•  Newer version of Spark & DSE
•  Scala Spark jobs instead of Java
