Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Speaker: Daniele Antuzi, R&D Software Engineer @ Sease
BERLIN BUZZWORDS 2024 - 11/06/2024
WHO I AM
DANIELE ANTUZI
‣ R&D Search Software Engineer
‣ Master in Computer Science at University of Pisa
‣ Passionate about algorithms and data structures
‣ Food (and sometimes sport) lover
SEArch SErvices
www.sease.io
‣ Headquartered in London, distributed team
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
AGENDA
Combining DB records
Introduction to Map reduce
Combining DB records with Map reduce
Serverless implementation
The Apache Solr indexer
The problem of combining DB records
DB schema: Songs, Composers, Albums, Tags
Solr document:
{
id: 235,
title: "House Of The Rising Sun",
albumName: "The Best of The Animals",
composers: ["The Animals"],
tags: ["1964", "folk", "en"]
},
{
id: 594,
title: "That's All Right",
albumName: "Rock 'n' Roll",
composers: ["Elvis Presley"],
tags: ["1946", "us"]
},
The problem of combining DB records
SELECT *
FROM Songs
JOIN Composers ON …
JOIN Tags ON …
JOIN Albums ON …
…
Simple solution
Not scalable
High number of records
Too much work for DB server
The problem of combining DB records
songs = getDBRecords("SELECT * FROM Songs")
foreach song in songs:
    getDBRecords("SELECT * FROM Composers c WHERE c.songId = " + song.id)
    getDBRecords("SELECT * FROM Tags t WHERE t.songId = " + song.id)
    getDBRecords("SELECT * FROM Albums a WHERE a.songId = " + song.id)
    . . .
More scalable
Simple solution
High Database workload
Too slow
Map Reduce
● Programming model for processing big data stored on a distributed file system
● Introduced by Google in the 2004 paper "MapReduce: Simplified Data Processing on Large Clusters"
● The user only defines the Map and Reduce functions
● Implemented by Apache Hadoop or Apache Spark
Map Reduce - Word Count
● the: 2469493
● quick: 34904
● brown: 45865
● fox: 3547
● jumps: 57843
● over: 29044
● lazy: 346975
● dog: 239685
Map Reduce - Word Count - Split
[Figure: the input is split across Node 1, Node 2 and Node 3]
Map Reduce - Word Count - Map
Each node (Node 1, Node 2, Node 3) emits <word, count> pairs for its own split:
[ <Berlin, 5>, <Buzzwords, 3> ]
[ <Berlin, 1>, <AI, 7> ]
[ <AI, 5>, <Opensource, 4> ]
Map Reduce - Word Count - Shuffle
Map output on the three nodes:
<Berlin, 5>, <Buzzwords, 3>
<Berlin, 1>, <AI, 7>
<AI, 5>, <Opensource, 4>
After the shuffle, all counts for a given word sit on the same node:
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>
Map Reduce - Word Count - Reduce
Input on the three nodes:
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>
Result:
● AI: 12
● Berlin: 6
● Buzzwords: 3
● Opensource: 4
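The three stages above can be sketched in plain Python. This is only an illustration of the pattern, not Hadoop or Spark code; the function names `map_words`, `shuffle` and `reduce_counts` are made up for this example, and the input pairs are the per-node map outputs shown on the slides.

```python
from collections import defaultdict


def map_words(text):
    """Map: emit a <word, 1> pair for every word of one input split."""
    return [(word, 1) for word in text.split()]


def shuffle(mapped_outputs):
    """Shuffle: collect all values emitted for the same key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped):
    """Reduce: sum the values collected for each word."""
    return {word: sum(values) for word, values in grouped.items()}


# Map outputs copied from the slides (counts already aggregated per node).
node_outputs = [
    [("Berlin", 5), ("Buzzwords", 3)],
    [("Berlin", 1), ("AI", 7)],
    [("AI", 5), ("Opensource", 4)],
]

totals = reduce_counts(shuffle(node_outputs))
print(totals)
```

The totals match the Reduce slide: AI 12, Berlin 6, Buzzwords 3, Opensource 4.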
Combining DB records with Map reduce
Songs
{ songID: 235, title: "House Of The Rising Sun", . . . },
{ songID: 345, title: "The Magic Flute", . . . },
{ songID: 594, title: "The Marriage of Figaro", . . . }
Albums + Song ID
{ songID: 235, albumName: "The Best of The Animals", . . . },
{ songID: 345, albumName: "The Best of Mozart", . . . },
{ songID: 594, albumName: "The Best of Mozart", . . . }
Composers + Song ID
{ songID: 235, composerName: "The Animals", . . . },
{ songID: 345, composerName: "Mozart", . . . },
{ songID: 594, composerName: "Mozart", . . . }
Combining DB records with Map reduce
Input records:
{songID: 235, title: "House Of The Rising Sun", . . . }
{songID: 594, title: "The Marriage of Figaro", . . . }
{songID: 235, albumName: "The Best of The Animals"}
{songID: 594, albumName: "The Best of Mozart"}
{songID: 235, composerName: "The Animals", . . . }
{songID: 594, composerName: "Mozart", . . . }
Map output, keyed by song ID:
<235, title:"House Of The Rising Sun">,
<594, title:"The Marriage of Figaro">
<235, composerName:"The Animals">,
<594, composerName:"Mozart">
<235, albumName:"The Best of The Animals">,
<594, albumName:"The Best of Mozart">
Combining DB records with Map reduce
Map output:
<235, title:"House Of The Rising Sun">
<594, title:"The Marriage of Figaro">
<235, composerName:"The Animals">,
<594, composerName:"Mozart">
<235, albumName:"The Best of The Animals">
<594, albumName:"The Best of Mozart">
After the shuffle, all values for a key sit on one node (Node x, Node y):
235
title:"House Of The Rising Sun"
albumName:"The Best of The Animals"
composerName:"The Animals"
594
title:"The Marriage of Figaro"
albumName:"The Best of Mozart"
composerName:"Mozart"
Combining DB records with Map reduce
235
title: "House Of The Rising Sun"
albumName: "The Best of The Animals"
composerName: "The Animals"
. . .
594
title: "The Marriage of Figaro"
albumName: "The Best of Mozart"
composerName: "Mozart"
. . .
{
id: 235,
title: "House Of The Rising Sun",
albumName: "The Best of The Animals",
composers: ["The Animals"],
tags: ["1964", "folk", "en"]
},
{
id: 594,
title: "The Marriage of Figaro",
albumName: "The Best of Mozart",
composers: ["Mozart"],
tags: ["1786", "classic"]
},
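The same join can be sketched in a few lines of Python, under the simplifying assumption that every field is single-valued (tags and multiple composers are left out). The helper names `map_record`, `shuffle` and `reduce_song` are hypothetical; the sample rows are the ones from the slides.

```python
from collections import defaultdict

# Rows exported from three tables; every row carries the song ID as join key.
songs = [
    {"songID": 235, "title": "House Of The Rising Sun"},
    {"songID": 594, "title": "The Marriage of Figaro"},
]
albums = [
    {"songID": 235, "albumName": "The Best of The Animals"},
    {"songID": 594, "albumName": "The Best of Mozart"},
]
composers = [
    {"songID": 235, "composerName": "The Animals"},
    {"songID": 594, "composerName": "Mozart"},
]


def map_record(record):
    """Map: emit <songID, remaining fields> for one exported row."""
    fields = {k: v for k, v in record.items() if k != "songID"}
    return record["songID"], fields


def shuffle(pairs):
    """Shuffle: bring all fragments of the same song together."""
    grouped = defaultdict(list)
    for song_id, fields in pairs:
        grouped[song_id].append(fields)
    return grouped


def reduce_song(song_id, fragments):
    """Reduce: merge the fragments of one song into a single document."""
    doc = {"id": song_id}
    for fragment in fragments:
        doc.update(fragment)  # assumes no field appears twice for a song
    return doc


pairs = [map_record(r) for table in (songs, albums, composers) for r in table]
docs = [reduce_song(k, v) for k, v in sorted(shuffle(pairs).items())]
print(docs)
```

No table is joined inside the database: each table is read once, and the join happens by grouping on the shared key.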
Serverless Implementation - Ingredients
● SQL Database
● AWS Lambda function
● AWS S3 bucket
● AWS Simple Queue Service (SQS)
● AWS DynamoDB
● AWS Step Functions - Distributed Map
● Apache Solr (Elasticsearch, OpenSearch)
Serverless Implementation - Pipeline
Serverless Implementation - Fetch
SELECT SongId, Title, Description, …
FROM Songs
….
Output: Songs_part_000 . . . Songs_part_749
SELECT SongId, ComposerName, …
FROM Composers JOIN …
….
Output: Composers_part_000 . . . Composers_part_194
SELECT SongId, AlbumName, …
FROM Albums JOIN …
….
Output: Albums_part_000 . . . Albums_part_033
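A minimal sketch of the fetch step, assuming each query result is streamed in fixed-size chunks and each chunk is written as one JSON-lines part file. The slides do not state the file format or the part size, so both are assumptions; an in-memory SQLite database and a plain dict stand in for the SQL database and the S3 bucket, and `export_in_parts` is a hypothetical helper.

```python
import json
import sqlite3


def export_in_parts(conn, query, prefix, rows_per_part, store):
    """Run one query and write its rows as <prefix>_part_NNN objects."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    part = 0
    while True:
        rows = cursor.fetchmany(rows_per_part)
        if not rows:
            break
        lines = [json.dumps(dict(zip(columns, row))) for row in rows]
        store[f"{prefix}_part_{part:03d}"] = "\n".join(lines)
        part += 1
    return part


# Tiny stand-in database with three songs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Songs (SongId INTEGER, Title TEXT)")
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?)",
    [
        (235, "House Of The Rising Sun"),
        (345, "The Magic Flute"),
        (594, "The Marriage of Figaro"),
    ],
)

bucket = {}  # stand-in for the S3 bucket: object key -> object body
part_count = export_in_parts(
    conn, "SELECT SongId, Title FROM Songs ORDER BY SongId", "Songs", 2, bucket
)
print(part_count, sorted(bucket))
```

Each table is scanned once with a flat query, and the resulting part files can then be processed independently by the map step.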
Serverless Implementation - Map & Shuffle
Albums_part_063
{songId:13, albumName: "A"},
{songId:13, albumName: "B"},
{songId:15, albumName: "B"},
Tags_part_394
{songId:15, tagName: "X"},
{songId:15, tagName: "Y"},
{songId:65, tagName: "Z"},
Output: Song_13, Song_15, Song_65
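A rough sketch of what one map-and-shuffle worker might do with a part file: group its records by `songId` and append each group to the matching `Song_<id>` object. A dict stands in for the S3 bucket, and `map_and_shuffle` is a hypothetical name. How the real workers append to shared objects is not shown on the slides.

```python
from collections import defaultdict


def map_and_shuffle(part_records, store):
    """Group the records of one part file by song and append them per song."""
    by_song = defaultdict(list)
    for record in part_records:
        by_song[record["songId"]].append(record)
    for song_id, records in by_song.items():
        # Read-modify-write on a per-song object; with several concurrent
        # workers this is the kind of step that needs mutual exclusion.
        store.setdefault(f"Song_{song_id}", []).extend(records)


albums_part_063 = [
    {"songId": 13, "albumName": "A"},
    {"songId": 13, "albumName": "B"},
    {"songId": 15, "albumName": "B"},
]
tags_part_394 = [
    {"songId": 15, "tagName": "X"},
    {"songId": 15, "tagName": "Y"},
    {"songId": 65, "tagName": "Z"},
]

store = {}  # stand-in for the bucket holding the per-song objects
map_and_shuffle(albums_part_063, store)
map_and_shuffle(tags_part_394, store)
print({key: len(value) for key, value in store.items()})
```

Here `Song_15` receives fragments from both part files, which is why two workers may need to write to the same object.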
Serverless Implementation - Distributed lock
ResourceId  LockId      ExpireAt
13          2049668395  1716812546
245         6739294643  1716812536
ddb_table.put_item(
    Item={'ResourceId': resource_id,
          'ExpireAt': now_ms + timeout_ms,
          'LockId': lock_id},
    # The write succeeds only if the resource has no lock or the lock has expired
    ConditionExpression='attribute_not_exists(#ResourceId) OR ExpireAt <= :now',
    ExpressionAttributeNames={"#ResourceId": "ResourceId"},
    ExpressionAttributeValues={":now": now_ms})
Atomic put
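To make the lock semantics concrete, here is an in-memory model of the conditional put above. It is not DynamoDB code: a `threading.Lock` stands in for the atomicity of the conditional write, and the class name `InMemoryLockTable` is made up. The `release` method, which deletes the item only when the `LockId` matches, is an assumption and does not appear on the slide.

```python
import threading
import time
import uuid


class InMemoryLockTable:
    """Toy model of a lock table with ResourceId, LockId and ExpireAt."""

    def __init__(self):
        self._items = {}
        self._mutex = threading.Lock()  # emulates the atomic conditional put

    def acquire(self, resource_id, timeout_ms, now_ms=None):
        """Return a new lock id, or None if a live lock already exists."""
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        lock_id = uuid.uuid4().hex
        with self._mutex:
            current = self._items.get(resource_id)
            # Same condition as the slide: no item, or ExpireAt <= now.
            if current is not None and current["ExpireAt"] > now_ms:
                return None
            self._items[resource_id] = {
                "LockId": lock_id,
                "ExpireAt": now_ms + timeout_ms,
            }
            return lock_id

    def release(self, resource_id, lock_id):
        """Delete the lock only if it is still owned by lock_id (assumed)."""
        with self._mutex:
            current = self._items.get(resource_id)
            if current is not None and current["LockId"] == lock_id:
                del self._items[resource_id]
                return True
            return False


table = InMemoryLockTable()
first = table.acquire(13, timeout_ms=1000, now_ms=0)
blocked = table.acquire(13, timeout_ms=1000, now_ms=500)
second = table.acquire(13, timeout_ms=1000, now_ms=1000)
print(first is not None, blocked is None, second is not None)
```

The expiry time means a worker that crashes while holding a lock cannot block a song forever: once `ExpireAt` has passed, the next worker can take the lock over.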
Serverless Implementation - Reduce
Song_235
{songId:235, albumName: "The Animals"},
{songId:235, tagName: "year:1964"},
{songId:235, composerName: "The Animals"},
{songId:235, title: "House Of The Rising Sun"},
{songId:235, albumName: "The Best of The Animals"},
{
songId:235,
title: "House Of The Rising Sun",
composer: "The Animals",
albumNames: [
"The Animals",
"The Best of The Animals"
],
tags: ["year:1964"]
}
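The reduce step for one song can be sketched as a merge of its fragments, collecting repeated fields into lists. The field mapping (`albumName` to `albumNames`, `tagName` to `tags`, `composerName` to `composer`) is read off the slide; the function name `reduce_song` and the rule that any other field is copied as-is are assumptions.

```python
def reduce_song(fragments):
    """Merge all fragments of one song into a single Solr document."""
    doc = {"albumNames": [], "tags": []}
    for fragment in fragments:
        for key, value in fragment.items():
            if key == "albumName":
                doc["albumNames"].append(value)  # multi-valued field
            elif key == "tagName":
                doc["tags"].append(value)  # multi-valued field
            elif key == "composerName":
                doc["composer"] = value
            else:
                doc[key] = value  # songId, title, ...
    return doc


fragments = [
    {"songId": 235, "albumName": "The Animals"},
    {"songId": 235, "tagName": "year:1964"},
    {"songId": 235, "composerName": "The Animals"},
    {"songId": 235, "title": "House Of The Rising Sun"},
    {"songId": 235, "albumName": "The Best of The Animals"},
]

document = reduce_song(fragments)
print(document)
```

The order in which fragments arrive does not affect which values end up in the document, only the order of the values inside the lists.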
Serverless Implementation - Batch & Push
Batch size = 2
Batch window = 60 seconds
Maximum concurrency = 2
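The three settings above read like the options of an SQS trigger for a Lambda function: up to 2 messages per invocation, at most 60 seconds of waiting to fill a batch, and at most 2 concurrent invocations. A possible shape for the push function is sketched below, assuming each SQS message body is one JSON document ready for Solr. The URL and collection name are placeholders, and the slides do not show the actual handler.

```python
import json
import urllib.request

# Hypothetical endpoint; replace host and collection with real values.
SOLR_UPDATE_URL = "http://localhost:8983/solr/songs/update"


def docs_from_sqs_event(event):
    """Extract one document per SQS message delivered in this batch."""
    return [json.loads(record["body"]) for record in event["Records"]]


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build a POST request that sends the documents as a JSON array."""
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handler(event, context):
    """Lambda entry point: push the whole batch to Solr in one request."""
    docs = docs_from_sqs_event(event)
    with urllib.request.urlopen(build_update_request(docs)) as response:
        return {"indexed": len(docs), "status": response.status}


# Example event with a batch of two messages (batch size = 2).
sample_event = {
    "Records": [
        {"body": json.dumps({"id": 235, "title": "House Of The Rising Sun"})},
        {"body": json.dumps({"id": 594, "title": "The Marriage of Figaro"})},
    ]
}
sample_docs = docs_from_sqs_event(sample_event)
sample_request = build_update_request(sample_docs)
print(sample_request.get_method(), len(sample_docs))
```

Capping concurrency limits how many update requests reach Solr at the same time, so the indexer cannot overwhelm the search cluster.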
The Apache Serverless Solr indexer
● Solution in production since mid-April
● Indexing time reduced from about 150 hours to 5 hours (30 times faster)
● Cost reduced by 90%
THANK YOU!
