Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Speaker: Daniele Antuzi, R&D Software Engineer @ Sease
BERLIN BUZZWORDS 2024 - 11/06/2024

WHO I AM
DANIELE ANTUZI
‣ R&D Search Software Engineer
‣ Master's degree in Computer Science, University of Pisa
‣ Passionate about algorithms and data structures
‣ Food (and sometimes sport) lover

SEArch SErvices
www.sease.io
‣ Headquartered in London, distributed team
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning

AGENDA
Combining DB records
Introduction to MapReduce
Combining DB records with MapReduce
Serverless implementation
The Apache Solr indexer

The problem of combining DB records
DB schema: Songs, Composers, Albums, Tags
Solr document:
{
  id: 235,
  title: "House Of The Rising Sun",
  albumName: "The Best of The Animals",
  composers: ["The Animals"],
  tags: ["1964", "folk", "en"]
},
{
  id: 594,
  title: "That's All Right",
  albumName: "Rock 'n' Roll",
  composers: ["Elvis Presley"],
  tags: ["1946", "us"]
},

SELECT *
FROM Songs
JOIN Composers ON …
JOIN Tags ON …
JOIN Albums ON …
…
Simple solution

Not scalable
High number of records
Too much work for DB server

songs = getDBRecords("SELECT * FROM Songs")
foreach song in songs:
    getDBRecords("SELECT * FROM Composers c WHERE c.songId = " + song.id)
    getDBRecords("SELECT * FROM Tags t WHERE t.songId = " + song.id)
    getDBRecords("SELECT * FROM Albums a WHERE a.songId = " + song.id)
    . . .
More scalable

Simple solution
High Database workload
Too slow
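The per-song lookup pattern above can be sketched against an in-memory SQLite database to make its cost visible. The miniature schema and the `build_documents` helper are hypothetical stand-ins for the tables in the slides, and bound parameters replace the string concatenation of the pseudocode. The interesting number is the query count: one query for the songs plus one per related table for every song.

```python
import sqlite3

# Hypothetical miniature of the schema from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Songs (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Composers (songId INTEGER, name TEXT);
    CREATE TABLE Tags (songId INTEGER, name TEXT);
    INSERT INTO Songs VALUES (235, 'House Of The Rising Sun');
    INSERT INTO Songs VALUES (594, 'That''s All Right');
    INSERT INTO Composers VALUES (235, 'The Animals');
    INSERT INTO Composers VALUES (594, 'Elvis Presley');
    INSERT INTO Tags VALUES (235, '1964');
    INSERT INTO Tags VALUES (235, 'folk');
    INSERT INTO Tags VALUES (594, 'us');
""")


def build_documents(conn):
    """Build one document per song, issuing separate lookups for each song."""
    query_count = 1  # the initial SELECT on Songs
    documents = []
    songs = conn.execute("SELECT id, title FROM Songs ORDER BY id").fetchall()
    for song_id, title in songs:
        composers = conn.execute(
            "SELECT name FROM Composers WHERE songId = ? ORDER BY rowid",
            (song_id,)).fetchall()
        tags = conn.execute(
            "SELECT name FROM Tags WHERE songId = ? ORDER BY rowid",
            (song_id,)).fetchall()
        query_count += 2  # two extra round trips for every song
        documents.append({
            "id": song_id,
            "title": title,
            "composers": [name for (name,) in composers],
            "tags": [name for (name,) in tags],
        })
    return documents, query_count


documents, query_count = build_documents(conn)
print(query_count)  # 1 + 2 * number of songs
```

With millions of songs and more related tables, the same loop issues millions of small queries, which is the database workload the slide warns about.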

Map Reduce
● Programming model for processing large datasets stored on a distributed file system
● Introduced in the paper "MapReduce: Simplified Data Processing on Large Clusters" (2004)
● The user only defines the Map and Reduce functions
● Implemented by frameworks such as Apache Hadoop or Apache Spark

Map Reduce - Word Count
● the: 2469493
● quick: 34904
● brown: 45865
● fox: 3547
● jumps: 57843
● over: 29044
● lazy: 346975
● dog: 239685

Map Reduce - Word Count - Split
Input text split across Node 1, Node 2 and Node 3

Map Reduce - Word Count - Map
Node 1, Node 2, Node 3 (one output list per node):
[ <Berlin, 5>, <Buzzwords, 3> ]
[ <Berlin, 1>, <AI, 7> ]
[ <AI, 5>, <Opensource, 4> ]

Map Reduce - Word Count - Shuffle
Before the shuffle (one line per node):
<Berlin, 5>, <Buzzwords, 3>
<Berlin, 1>, <AI, 7>
<AI, 5>, <Opensource, 4>
After the shuffle (one line per node):
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>

Map Reduce - Word Count - Reduce
Input (one line per node):
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>
Output:
● AI: 12
● Berlin: 6
● Buzzwords: 3
● Opensource: 4
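The split, map, shuffle and reduce slides above can be replayed in a few lines of single-process Python. This is only a sketch of the data flow, not a distributed implementation; the function names `map_split`, `shuffle` and `reduce_counts` are illustrative, and the per-node map outputs are copied from the slides.

```python
from collections import Counter, defaultdict


def map_split(text):
    """Map: emit (word, count) pairs for one input split."""
    return list(Counter(text.split()).items())


def shuffle(mapped_outputs):
    """Shuffle: bring all values that share a key together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped):
    """Reduce: sum the partial counts of every word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Per-node map outputs taken from the slides.
mapped = [
    [("Berlin", 5), ("Buzzwords", 3)],
    [("Berlin", 1), ("AI", 7)],
    [("AI", 5), ("Opensource", 4)],
]
grouped = shuffle(mapped)
totals = reduce_counts(grouped)
```

In a real cluster each `map_split` call runs on the node that holds the split, and the shuffle moves data over the network so that every key ends up on exactly one reducer.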

Songs
{ songID: 235, title: "House Of The Rising Sun", . . . },
{ songID: 345, title: "The Magic Flute", . . . },
{ songID: 594, title: "The Marriage of Figaro", . . . }
Albums + Song ID
{ songID: 235, albumName: "The Best of The Animals", . . . },
{ songID: 345, albumName: "The Best of Mozart", . . . },
{ songID: 594, albumName: "The Best of Mozart", . . . }
Composers + Song ID
{ songID: 235, composerName: "The Animals", . . . },
{ songID: 345, composerName: "Mozart", . . . },
{ songID: 594, composerName: "Mozart", . . . }

Map input (records):
{songID: 235, title: "House Of The Rising Sun", . . . }
{songID: 594, title: "The Marriage of Figaro", . . . }
{songID: 235, albumName: "The Best of The Animals"}
{songID: 594, albumName: "The Best of Mozart"}
{songID: 235, composerName: "The Animals", . . . }
{songID: 594, composerName: "Mozart", . . . }
Map output (<song ID, field> pairs):
<235, title:"House Of The Rising Sun">,
<594, title:"The Marriage of Figaro">
<235, composerName:"The Animals">,
<594, composerName:"Mozart">
<235, albumName:"The Best of The Animals">,
<594, albumName:"The Best of Mozart">

Node x, Node y
<235, title:"House Of The Rising Sun">,
<594, title:"The Marriage of Figaro">
<235, composerName:"The Animals">,
<594, composerName:"Mozart">
<235, albumName:"The Best of The Animals">,
<594, albumName:"The Best of Mozart">
Grouped by key:
235
  title: "House Of The Rising Sun"
  albumName: "The Best of The Animals"
  composerName: "The Animals"
594
  title: "The Marriage of Figaro"
  albumName: "The Best of Mozart"
  composerName: "Mozart"
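The same map and shuffle steps applied to database records can be sketched as follows. The song ID acts as the key, so records from different tables that describe the same song meet in the same group. The helper names `map_record` and `shuffle` are illustrative, and the sample records are the ones from the slides.

```python
from collections import defaultdict

songs = [
    {"songID": 235, "title": "House Of The Rising Sun"},
    {"songID": 594, "title": "The Marriage of Figaro"},
]
composers = [
    {"songID": 235, "composerName": "The Animals"},
    {"songID": 594, "composerName": "Mozart"},
]
albums = [
    {"songID": 235, "albumName": "The Best of The Animals"},
    {"songID": 594, "albumName": "The Best of Mozart"},
]


def map_record(record):
    """Map: emit one (songID, (field, value)) pair per non-key field."""
    song_id = record["songID"]
    return [(song_id, (field, value))
            for field, value in record.items() if field != "songID"]


def shuffle(pairs):
    """Shuffle: group the emitted values by song ID."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


pairs = [pair
         for table in (songs, composers, albums)
         for record in table
         for pair in map_record(record)]
grouped = shuffle(pairs)
```

No SQL join is involved: each table is read independently, and the grouping by key does the combining.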

Reduce input (grouped by song ID):
235
  title: "House Of The Rising Sun"
  albumName: "The Best of The Animals"
  composerName: "The Animals"
  . . .
594
  title: "The Marriage of Figaro"
  albumName: "The Best of Mozart"
  composerName: "Mozart"
  . . .
Reduce output (Solr documents):
{
  id: 235,
  title: "House Of The Rising Sun",
  albumName: "The Best of The Animals",
  composers: ["The Animals"],
  tags: ["1964", "folk", "en"]
},
{
  id: 594,
  title: "The Marriage of Figaro",
  albumName: "The Best of Mozart",
  composers: ["Mozart"],
  tags: ["1786", "classic"]
},

Serverless Implementation - Ingredients
● SQL Database
● AWS Lambda function
● AWS S3 bucket
● AWS Simple Queue Service (SQS)
● AWS DynamoDB
● AWS Step Functions (Distributed Map)
● Apache Solr (or Elasticsearch, OpenSearch)

Serverless Implementation - Pipeline

Serverless Implementation - Fetch
SELECT SongId, Title, Description, …
FROM Songs
….
→ Songs_part_000 . . . Songs_part_749
SELECT SongId, ComposerName, …
FROM Composers JOIN …
….
→ Composers_part_000 . . . Composers_part_194
SELECT SongId, AlbumName, …
FROM Albums JOIN …
….
→ Albums_part_000 . . . Albums_part_033
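One way to picture the fetch step is a helper that cuts the rows returned by each query into numbered parts. The sketch below is an assumption about the mechanics, not the implementation from the talk: `partition_rows`, the part size and the JSON Lines payload format are all hypothetical, and uploading each payload to the S3 bucket is left out.

```python
import json


def partition_rows(table_name, rows, part_size):
    """Yield (part_name, JSON Lines payload) pairs for fixed-size chunks of rows."""
    for part_number, start in enumerate(range(0, len(rows), part_size)):
        chunk = rows[start:start + part_size]
        part_name = f"{table_name}_part_{part_number:03d}"
        payload = "\n".join(json.dumps(row) for row in chunk)
        yield part_name, payload


# Five made-up album rows, two rows per part.
rows = [{"songId": i, "albumName": f"Album {i}"} for i in range(5)]
parts = list(partition_rows("Albums", rows, part_size=2))
names = [name for name, _ in parts]
```

Here `names` follows the `Albums_part_000`, `Albums_part_001`, ... pattern shown on the slide, and every part can later be processed by an independent worker.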

Serverless Implementation - Map & Shuffle
Input parts:
Albums_part_063
  {songId:13, albumName: "A"},
  {songId:13, albumName: "B"},
  {songId:15, albumName: "B"},
Tags_part_394
  {songId:15, tagName: "X"},
  {songId:15, tagName: "Y"},
  {songId:65, tagName: "Z"},
Output (grouped by song ID):
Song_13
Song_15
Song_65

Serverless Implementation - Distributed lock
ResourceId   LockId       ExpireAt
13           2049668395   1716812546
245          6739294643   1716812536
ddb_table.put_item(
    Item={'ResourceId': resource_id,
          'ExpireAt': now_ms + timeout_ms,
          'LockId': lock_id},
    ConditionExpression='attribute_not_exists(#ResourceId) '
                        'OR ExpireAt <= :now',
    ExpressionAttributeNames={"#ResourceId": "ResourceId"},
    ExpressionAttributeValues={":now": now_ms})
Atomic put
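The conditional put above can be wrapped into acquire and release helpers. The sketch below assumes that `ddb_table` is a boto3 DynamoDB Table resource and that `ResourceId` is the table's partition key; the helper names and the release logic are hypothetical additions, not part of the slide. A failed condition surfaces as a client error with the code `ConditionalCheckFailedException`, which the helpers translate into "lock not acquired".

```python
import time
import uuid

CONDITION_FAILED = "ConditionalCheckFailedException"


def _error_code(error):
    """Extract the AWS error code from a botocore ClientError-like exception."""
    return getattr(error, "response", {}).get("Error", {}).get("Code")


def try_acquire_lock(ddb_table, resource_id, timeout_ms, now_ms=None):
    """Return a new lock id, or None if another worker holds a live lock."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    lock_id = uuid.uuid4().hex
    try:
        # Atomic put: succeeds only if no row exists or the old lock expired.
        ddb_table.put_item(
            Item={"ResourceId": resource_id,
                  "ExpireAt": now_ms + timeout_ms,
                  "LockId": lock_id},
            ConditionExpression="attribute_not_exists(#ResourceId) "
                                "OR ExpireAt <= :now",
            ExpressionAttributeNames={"#ResourceId": "ResourceId"},
            ExpressionAttributeValues={":now": now_ms})
    except Exception as error:
        if _error_code(error) == CONDITION_FAILED:
            return None
        raise
    return lock_id


def release_lock(ddb_table, resource_id, lock_id):
    """Delete the lock row, but only if this worker still owns it."""
    try:
        ddb_table.delete_item(
            Key={"ResourceId": resource_id},
            ConditionExpression="LockId = :lock_id",
            ExpressionAttributeValues={":lock_id": lock_id})
    except Exception as error:
        if _error_code(error) != CONDITION_FAILED:
            raise
```

The expiry timestamp keeps a crashed worker from blocking a resource forever, and the `LockId` check on release keeps a slow worker from deleting a lock that has since been taken over by someone else.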

Serverless Implementation - Reduce
Input: Song_34
{songId:235, albumName: "The Animals"},
{songId:235, tagName: "year:1964"},
{songId:235, composerName: "The Animals"},
{songId:235, title: "House Of The Rising Sun"},
{songId:235, albumName: "The Best of The Animals"},
Output:
{
  songId: 235,
  title: "House Of The Rising Sun",
  composer: "The Animals",
  albumNames: [
    "The Animals",
    "The Best of The Animals"
  ],
  tags: ["year:1964"]
}
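The reduce step for one song can be sketched as a function that folds all partial records into a single document. The field rules below (single-valued `title` and `composer`, multi-valued `albumNames` and `tags`) are read off the example on the slide and are an assumption about the real mapping; `reduce_song` is an illustrative name.

```python
def reduce_song(records):
    """Merge all partial records that share a songId into one document."""
    document = {"albumNames": [], "tags": []}
    for record in records:
        document["songId"] = record["songId"]
        if "title" in record:
            document["title"] = record["title"]
        if "composerName" in record:
            document["composer"] = record["composerName"]
        if "albumName" in record:
            document["albumNames"].append(record["albumName"])
        if "tagName" in record:
            document["tags"].append(record["tagName"])
    return document


# Partial records for song 235, as listed on the slide.
records = [
    {"songId": 235, "albumName": "The Animals"},
    {"songId": 235, "tagName": "year:1964"},
    {"songId": 235, "composerName": "The Animals"},
    {"songId": 235, "title": "House Of The Rising Sun"},
    {"songId": 235, "albumName": "The Best of The Animals"},
]
document = reduce_song(records)
```

Because the records arrive in no particular order, the function never relies on the title or composer coming first; it only accumulates what it sees.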

Serverless Implementation - Batch & Push
Batch size = 2
Batch window = 60 seconds
Maximum concurrency = 2
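The settings above correspond to the options of a Lambda event source mapping on an SQS queue: each invocation receives up to one batch of messages, and the concurrency cap bounds how many batches hit the search engine at once. A minimal handler sketch follows, assuming every message body is one JSON document; the Solr URL, the `post_to_solr` helper and the injectable `send` parameter are hypothetical, and commits are assumed to be handled by Solr's own configuration.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the real Solr collection URL.
SOLR_UPDATE_URL = "http://localhost:8983/solr/songs/update"


def post_to_solr(documents, url=SOLR_UPDATE_URL):
    """POST a JSON array of documents to Solr's update handler."""
    request = urllib.request.Request(
        url,
        data=json.dumps(documents).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


def handler(event, context=None, send=post_to_solr):
    """Lambda entry point: one invocation receives one SQS batch."""
    documents = [json.loads(record["body"]) for record in event["Records"]]
    if documents:
        send(documents)  # one request per batch instead of one per document
    return {"indexed": len(documents)}
```

Passing a different `send` callable makes the handler easy to exercise locally without a running Solr instance.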

The Serverless Apache Solr indexer
● In production since mid-April
● Indexing time cut from about 150 hours to 5 hours (30 times faster)
● Cost reduced by 90%
