Automated Cluster Management and Recovery for Large Scale Multi-Tenant Search Infrastructure: Presented by Nitin Sharma & Li Ding, BloomReach

OCTOBER 13-16, 2016 • AUSTIN, TX

Nitin Sharma
Bloomreach
Li Ding
Bloomreach

Large Scale Cluster Management & Recovery for Multi-DC SolrCloud

Abstract
Cluster Management & Recovery for an enterprise grade global search Infrastructure is non-trivial.
•  Serving Hundreds of Millions of Documents & Queries
•  Multi-Tenant
•  Geographically distributed Data Centers
•  Custom SolrCloud Components - Analysis, Ranking and Faceting
•  Dynamic Ranking Elements – Collection/Cluster Level
At BloomReach, we have built an innovative search architecture aimed at reliable Cluster Management and
Recovery.
•  The Infrastructure is data center based.
•  Discovery service: DC Metadata, roles, tenants.
•  Real-time Active Monitoring: Robust failure detection.
•  Recovery Service: One step Recovery, Rollback and Backup.
The presentation will describe the infrastructure in detail and show how it achieves availability and
performance while keeping platform management simple.

About us
BloomReach is a Cloud Marketing Platform. We have developed a personalized discovery platform that
features applications which analyze big data to make our customers’ digital content more discoverable,
relevant and profitable.
Nitin: I work on Scaling the Search Platform for Bloomreach’s big data. My relevant experience and
background includes scaling real-time services for latency sensitive applications and building performance
and search-quality metrics infrastructure for personalization platforms.
Li: I am a member of technical staff on BloomReach's platform team. My background includes working on
virtualization management platforms, building search performance infrastructure and scaling distributed
services.

BloomReach’s Applications
Content understanding
Organic Search
•  What it does: Content optimization, management and measurement
•  Benefit: Enhanced discoverability and customer acquisition in organic search
SNAP
•  What it does: Personalized onsite search and navigation across devices
•  Benefit: Relevant and consistent onsite experiences for new and known users
Compass
•  What it does: Merchandising tool that understands products and identifies opportunities
•  Benefit: Prioritize and optimize online merchandising

Agenda
•  History - Elastic Search Infrastructure
•  Real Time Serving - Scaling & Availability Challenges.
•  Highly Available Multi DC/Multi-Tenant Architecture.
•  Cluster Management Suite
•  Replication & Ranking Config Mgmnt Service
•  Deployment & Recovery Service
•  Active Monitor Service
•  Auto Recovery Service

History - Elastic Solr Cloud Infrastructure

SC2: Solr Compute Cloud Infrastructure – Large Scale Pipelines
[Diagram components: Zookeeper, Backend Solr, four Elastic Clusters, Indexing Pipeline, Analysis Pipeline, HAFT, Serving Solr. Arrow labels: provision, replicate, read/write.]

SC2 HAFT – Open Source
github.com/bloomreach/solrcloud-haft

ZK Replicator – Open Source
github.com/bloomreach/zk-replicator


Real Time Serving - Scaling & Availability Challenges

Real Time Serving – Dynamic Elements
[Diagram components: Load Balancer, Serving SolrCloud, Zookeeper]
•  Global Entities: Custom SolrCloud Components & Analyzers, Global Ranking Configurations
•  Collection Level Entities: Ranking Files, Ranking configs
Global Entities
[Diagram components: Zookeeper, Serving SolrCloud]
Custom SolrCloud Components & Analyzers
•  Query Parsers, Analyzers
•  Ranking & Scoring Components
•  Non Optimal Performance (Latency, Mem usage)
•  Memory & File Handle Leaks
•  Stale Searchers Left open
Global Ranking Configurations
•  Non-conforming Configurations
•  Size Limit
•  Solr Startup Issues

Collection Level Entities
[Diagram components: Zookeeper, Serving SolrCloud]
Ranking Files
•  Ranking Elements Loaded once per core
•  Collection Reloads
•  Non Optimal Performance (Latency, Mem usage)
•  Files not versioned to support roll back
•  External files not sharded.
Ranking configs
•  Loaded when the core initializes. Misconfiguration crashes cores.
•  Hot swap of configs requires dynamic loading.

Real Time Serving – Recovery Challenges
[Diagram components: Load Balancer, Serving SolrCloud, Zookeeper, Global Entities, Collection Level Entities]
Bad jar deployed
•  The cluster goes down
•  Restoring an older release takes time. Restarting 1000s of collections is unstable and could take hours.
•  Serving is affected
Large ranking file
•  Large ranking file is unsharded, which increases the per-core memory requirement.
•  Auto rollback to the previous version is non-trivial (if a new ranking file is produced by pipelines).
•  Serving is affected. No longer highly available.

Real Time Serving – Multi Tenancy Challenges
[Diagram components: Load Balancer, Serving SolrCloud hosting Tenant 1, Tenant 2, ... Tenant n, Zookeeper]
What is a tenant?
•  A tenant is a unique <app, collection> pair in SolrCloud.
•  Unique collection type per app. [<Zkconfig, Ranking, Query Patterns>]
•  Index, Config, Cluster Management strategies vary drastically
Recovery:
Tenant 1: No dynamic config, static large index
Tenant 2: External Ranking files, aggressive index refresh & customer generated data.
Tenant N: …

Real Time Serving – Multi DC Challenges
[Diagram: Multi DC (EU, East, West) with a common Zookeeper]
•  Every data center hosts only part of the tenants, based on geo.
•  Adding a new geo-based DC requires a shared ZK ensemble
•  Selective Collection Placement is not possible
•  HA and Latency Guarantees are non-trivial


Multi DC /Multi-Tenant Architecture

Terminology and Definitions
•  Solr Data Center: A logical group of Solr nodes with metadata. The metadata contains placement, role, replication factor, apps etc.
•  Solr Cluster: Logical grouping of Solr Data Centers.
•  Replication API: Replicates the index from Elastic Clusters onto all data centers.
•  Ranking File Management API: Uploads ranking files to serving data centers.

Cluster Topology / Data Center Definition
•  Where does the DC live?
•  How many nodes?
•  Name, Type of DC.
•  Role of DC
   •  Serve – Behind LB for API requests
   •  Replicate – Gets indexing updates
•  LB End point
•  Tenants/apps
•  …
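
The questions above map naturally onto a small metadata record per data center. The sketch below is only an illustration: the class name `DataCenterDefinition`, every field name, and the example values (`us-east`, `lb.example.internal`) are assumptions made for this example, not the schema BloomReach actually uses. Only the two roles, serve and replicate, come from the slide.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# The two roles named on the slide.
ALLOWED_ROLES = {"serve", "replicate"}


@dataclass
class DataCenterDefinition:
    """Hypothetical metadata record for one Solr data center."""
    name: str
    dc_type: str
    location: str
    node_count: int
    roles: List[str]
    lb_endpoint: str
    tenants: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # Reject roles the cluster does not know about and empty data centers.
        unknown = set(self.roles) - ALLOWED_ROLES
        if unknown:
            raise ValueError(f"unknown roles: {sorted(unknown)}")
        if self.node_count <= 0:
            raise ValueError("node_count must be positive")


# Illustrative values only.
serving1 = DataCenterDefinition(
    name="serving1",
    dc_type="serving",
    location="us-east",
    node_count=4,
    roles=["serve", "replicate"],
    lb_endpoint="lb.example.internal",
    tenants=["app1"],
)
serving1.validate()
definition_json = json.dumps(asdict(serving1), indent=2)
```

Serializing the record to JSON keeps it in the same form the deployment slides describe for datacenter definition data.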

Automated Cluster Management Suite
[Diagram components: LoadBalancer in front of Solr Serving 1 and Solr Serving 2; Solr Backup; Cluster Metadata; Cluster Ops (R/W); HA Mode Setup. APIs: Replication API, Ranking Mgmnt API, Cluster Management API, Deployment/Recovery API.]


Replication Management API
[Diagram: Elastic Indexers hand the index to the Replication API. The Replication API queries the Cluster Management API (Cluster Metadata) with "operation: Replicate?, App: app1". Targets shown: Solr Serving 1 and Solr Serving 2 behind the LoadBalancer, and Solr Backup.]
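
The metadata query on this slide can be read as a filter over data center records: given an operation and an app, return the data centers that have that role and host that app. A minimal sketch follows, assuming a list-of-dicts metadata layout; the record fields and the function name `select_datacenters` are hypothetical, not the real Cluster Management API.

```python
from typing import Dict, List

# Hypothetical cluster metadata: one record per data center.
CLUSTER_METADATA: List[Dict] = [
    {"name": "serving1", "roles": ["serve", "replicate"], "apps": ["app1", "app2"]},
    {"name": "serving2", "roles": ["serve", "replicate"], "apps": ["app1"]},
    {"name": "backup", "roles": ["replicate"], "apps": ["app1", "app2"]},
]


def select_datacenters(metadata: List[Dict], operation: str, app: str) -> List[str]:
    """Return names of data centers that have the given role and host the app."""
    return [
        dc["name"]
        for dc in metadata
        if operation in dc["roles"] and app in dc["apps"]
    ]


# "operation: Replicate?, App: app1"
replicate_targets = select_datacenters(CLUSTER_METADATA, "replicate", "app1")
# The same lookup works for other roles, for example serving app2.
serve_targets = select_datacenters(CLUSTER_METADATA, "serve", "app2")
```

Keeping the lookup independent of the operation means a caller such as a replication or ranking upload job only needs to name the role it cares about.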

Ranking File Management API
[Diagram: Ranking Pipelines produce ranking files. The Ranking Mgmnt API versions the files and stores them in S3, and queries the Cluster Management API (Cluster Metadata) with "operation: Serve?, App: app2". Data centers shown: Solr Serving 1 and Solr Serving 2 behind the LoadBalancer, and Solr Backup.]
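
Versioning ranking files before storing them is what makes a rollback possible later. The sketch below shows one way that could work, assuming versions are timestamp strings so that lexical order matches chronological order. The key layout and both function names are hypothetical; the slide only states that files are versioned and stored in S3.

```python
from typing import List, Optional


def versioned_key(tenant: str, filename: str, version: str) -> str:
    """Build a storage key that embeds the version (hypothetical layout)."""
    return f"{tenant}/ranking/{version}/{filename}"


def rollback_target(versions: List[str], current: str) -> Optional[str]:
    """Return the newest version older than `current`, or None if there is none."""
    older = sorted(v for v in versions if v < current)
    return older[-1] if older else None


stored_versions = ["20151006120000", "20151007120000", "20151008120000"]
previous = rollback_target(stored_versions, "20151008120000")
previous_key = versioned_key("tenant1", "ranking.dat", previous)
```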


Deployment/Recovery Service (Launch New Datacenter)
•  Adding new DC to the cluster (Geo based)
•  Adding temporary capacity for increased traffic.
•  Expanding cluster capacity permanently.

[Diagram: the Deployment/Recovery Service takes the Datacenter Definition Data (JSON) and runs a multi-threaded installation of Zookeeper (ZK) and Solr nodes for the new data center Serving3 (app1). Other labels: Index, Store the config (Cluster Metadata). The SolrCloud Production Cluster behind the Load Balancer contains Serving1 (app1), Serving2 (app1), Backup (app1, app2) and others.]
Smoke Test:
1)  Every collection is queryable
2)  Every collection has the config it is supposed to have
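
The two smoke test conditions can be expressed as a single pass over the expected collections. This is a sketch under assumptions: `is_queryable` and `deployed_config` are hypothetical callables standing in for real Solr and Zookeeper lookups, and the collection and config names are made up for the example.

```python
from typing import Callable, Dict, List


def smoke_test(
    expected_configs: Dict[str, str],
    is_queryable: Callable[[str], bool],
    deployed_config: Callable[[str], str],
) -> List[str]:
    """Return one failure message per collection that fails either check."""
    failures = []
    for collection, expected in expected_configs.items():
        if not is_queryable(collection):
            # Check 1: every collection is queryable.
            failures.append(f"{collection}: not queryable")
        elif deployed_config(collection) != expected:
            # Check 2: every collection has the config it is supposed to have.
            failures.append(f"{collection}: config mismatch")
    return failures


# Stubbed lookups for illustration.
expected = {"app1_collection": "config_v2", "app2_collection": "config_v1"}
live = {"app1_collection": "config_v2", "app2_collection": "config_v0"}
failures = smoke_test(expected, lambda name: True, lambda name: live[name])
```

Returning the full list of failures, rather than stopping at the first one, gives an operator the complete picture of a bad deployment in one run.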

Example of additional DC
•  Where does the DC live?
•  How many nodes?
•  Name, Type of DC.
•  Role of DC
   •  Serve – Behind LB for API requests
   •  Replicate – Gets indexing updates
•  LB End point
•  Tenants/apps
•  …

Deployment/Recovery Service (Hard Recovery)
•  One or more hosts in a datacenter are down
•  There are network issues with one datacenter
•  AWS decides to retire some instances in a datacenter

[Diagram components: Load Balancer, Serving1 DC, Serving2 (app1), Cluster Metadata. Steps shown:]
•  Retrieve config of Serving2 from Cluster Metadata
•  Provision hosts using Serving2's config, the same as creating a new DC
•  Smoke test the new Serving2
•  Update config of new Serving2
•  Add back to LB

Deployment/Recovery Service (Soft Recovery)
•  One or more datacenters have high memory or CPU usage and do not respond to requests
•  Several collections are down in a DC due to Zookeeper state or other issues
•  Deploy code. Our customized components need a restart of Solr to take effect

Snapshots
•  A snapshot service takes a snapshot of a serving DC every 24 hours
•  The snapshot contains global files (customized jar, Zookeeper configs) and per-tenant files such as
ranking files, synonyms etc. This is done through the HAFT API
•  The index is never snapshotted
•  The snapshot is timestamped and stored in S3
[Diagram: HAFT is used to take snapshots of Serving1 (app1) and Serving2 (app1) behind the Load Balancer and to store each snapshot in S3 with a timestamp.]
Base: s3://cluster/production/20151008155637
Global Files (Jar, ZK config):
s3://cluster/production/20151008/jar
s3://cluster/production/20151008/zkconfig
Per Tenant Files:
s3://cluster/production/20151008/tenant1/ranking
…
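
The base path on the slide ends in a 14-digit timestamp, which matches the `%Y%m%d%H%M%S` format. The sketch below builds a snapshot layout from a base path and a timestamp. It assumes every snapshot file lives under the full timestamp prefix, which the slide does not confirm (its file paths show a shorter date), and the function name `snapshot_layout` is hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def snapshot_layout(base: str, taken_at: datetime, tenants: List[str]) -> Dict[str, str]:
    """Map each snapshot component to its location under a timestamped root."""
    root = f"{base}/{taken_at.strftime('%Y%m%d%H%M%S')}"
    layout = {
        "root": root,
        # Global files.
        "jar": f"{root}/jar",
        "zkconfig": f"{root}/zkconfig",
    }
    # Per-tenant files.
    for tenant in tenants:
        layout[f"{tenant}/ranking"] = f"{root}/{tenant}/ranking"
    return layout


layout = snapshot_layout(
    "s3://cluster/production", datetime(2015, 10, 8, 15, 56, 37), ["tenant1"]
)
```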

Revert to Snapshots
When reverting a DC back to a snapshot point, we use HAFT to replicate the index from the backup datacenter
and to restore all external files, global config files and per-tenant files from the snapshot S3 location.

Code Deploy Mode – Soft Recovery
[Diagram components: Load Balancer, Serving1 DC, Serving2 DC, Cluster Metadata. Steps shown:]
•  Take a global lock of that DC
•  Get DC’s config, ZK, Solr hosts etc. from Cluster Metadata
•  Take out of LB
•  Deploy code
•  Release configs
•  Post Deployment Tests
•  Release Lock
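
The lock in this flow keeps two operations from working on the same data center at once. A minimal single-process sketch is shown below: `threading.Lock` stands in for whatever global lock the real service uses, the helper names `dc_lock` and `deploy_code` are hypothetical, and the final "add back to LB" step is an assumption that does not appear on the slide.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, List

_dc_locks = {}
_registry_guard = threading.Lock()


@contextmanager
def dc_lock(dc_name: str) -> Iterator[None]:
    """Hold an exclusive lock on one data center, failing fast if it is taken."""
    with _registry_guard:
        lock = _dc_locks.setdefault(dc_name, threading.Lock())
    if not lock.acquire(blocking=False):
        raise RuntimeError(f"{dc_name} is already locked")
    try:
        yield
    finally:
        # The lock is released even if a deployment step raises.
        lock.release()


def deploy_code(dc_name: str, steps: List[Callable[[str], None]]) -> None:
    """Run the deployment steps in order while holding the DC lock."""
    with dc_lock(dc_name):
        for step in steps:
            step(dc_name)


log: List[str] = []
deploy_code("serving2", [
    lambda dc: log.append(f"take {dc} out of LB"),
    lambda dc: log.append(f"deploy code to {dc}"),
    lambda dc: log.append(f"run post-deployment tests on {dc}"),
    lambda dc: log.append(f"add {dc} back to LB"),  # assumption, not on the slide
])
```

Releasing the lock in a `finally` block matters here: a failed deployment should not leave the data center locked against the recovery that follows it.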

Disaster Recovery Mode – Soft Recovery
[Diagram components: Load Balancer, Serving1 (app1), Serving2 (app1). Steps shown:]
•  Get the global lock
•  Take DC out of LB
•  Rolling restart of Solr
•  Check host and collection health
•  Pass? Add to LB
•  Otherwise: delete all files of Zookeeper and Solr, wipe out everything
•  Install Zookeeper and Solr
•  Set up global files using current files or snapshot files
•  Use HAFT to replicate all collections with current or snapshot per-tenant files and indexes from backup
•  Smoke Test
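
The flow above has one decision point: if the data center is healthy after a rolling restart it goes back into service, otherwise it is rebuilt from scratch and smoke tested. The sketch below captures only that branch. All four callables are hypothetical stand-ins for the real operations, and the return labels are invented for the example.

```python
from typing import Callable


def soft_recover(
    rolling_restart: Callable[[], None],
    is_healthy: Callable[[], bool],
    rebuild_from_backup: Callable[[], None],
    smoke_test: Callable[[], bool],
) -> str:
    """Try a restart first and fall back to a full rebuild if health checks fail."""
    rolling_restart()
    if is_healthy():
        return "restarted"
    # Wipe, reinstall and replicate from backup (all hidden behind one callable here).
    rebuild_from_backup()
    if not smoke_test():
        raise RuntimeError("smoke test failed after rebuild")
    return "rebuilt"


outcome = soft_recover(
    rolling_restart=lambda: None,
    is_healthy=lambda: False,
    rebuild_from_backup=lambda: None,
    smoke_test=lambda: True,
)
```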


Active Monitor Service
Node level monitor
•  Runs every five minutes
•  Checks whether each Zookeeper node is accessible through the HAFT API.
If more than half of the Zookeeper nodes are down, page
•  Checks whether every Solr node is accessible and every collection on
that node can be queried. If a node is unhealthy, page
•  Checks all data centers
•  We also use SPM (Sematext) to monitor JVM usage and CPU load on each Solr host
[Diagram components: SolrCloud Production Cluster with Load Balancer, Serving1 DC, Serving2 DC and others; Auto Recovery Service]
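
The two paging rules of the node level monitor reduce to simple predicates once node status has been collected. In this sketch the status dictionaries are assumed inputs (how they are gathered through HAFT is not shown), a Solr node mapped to `None` means the node itself was unreachable, and both function names are hypothetical.

```python
from typing import Dict, List, Optional


def zk_needs_page(zk_up: Dict[str, bool]) -> bool:
    """Page when more than half of the Zookeeper nodes are down."""
    down = sum(1 for up in zk_up.values() if not up)
    return down * 2 > len(zk_up)


def unhealthy_solr_nodes(status: Dict[str, Optional[Dict[str, bool]]]) -> List[str]:
    """Return nodes that are unreachable or host a collection that cannot be queried."""
    return sorted(
        node
        for node, collections in status.items()
        if collections is None or not all(collections.values())
    )


zk_page = zk_needs_page({"zk1": False, "zk2": False, "zk3": True})
bad_nodes = unhealthy_solr_nodes({
    "solr1": {"tenant1": True},
    "solr2": None,
    "solr3": {"tenant1": True, "tenant2": False},
})
```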

Cluster level monitor
•  Runs every five minutes
•  Checks whether serving data centers with the same app have the same config files
•  Checks whether all datacenters have the same index
•  If either check fails, we are paged
[Diagram: HAFT reads the cluster state from the ZooKeeper of two SolrCloud Production Cluster data centers hosting app1 (both labeled Serving1 on the slide). The slide shows the same state for both:]
"test-tenant":{
  "shards":{"shard1":{
    "range":"80000000-7fffffff",
    "state":"active",
    "replicas":{"core_node1":{
      "state":"active",
      "base_url":"http://10.99.99.99:8983/solr",
      "core":"test-tenant_shard1_replica1",
      "node_name":"10.99.99.99:8983_solr",
      "leader":"true"}}}},
  "maxShardsPerNode":"500",
  "router":{"name":"compositeId"},
  "replicationFactor":"1"},

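
The cluster level checks compare data centers against each other rather than inspecting one node at a time. The sketch below is one possible shape for that comparison, under two assumptions: document count is used as a cheap proxy for "same index", and config files are compared through a SHA-256 fingerprint. The input layout and function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def config_fingerprint(files: Dict[str, bytes]) -> str:
    """Hash file names and contents in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()


def cluster_mismatches(datacenters: Dict[str, Dict]) -> List[str]:
    """Return which of the two checks ("index", "config") disagree across DCs."""
    problems = []
    if len({dc["doc_count"] for dc in datacenters.values()}) > 1:
        problems.append("index")
    if len({config_fingerprint(dc["config"]) for dc in datacenters.values()}) > 1:
        problems.append("config")
    return problems


mismatches = cluster_mismatches({
    "serving1": {"doc_count": 100, "config": {"solrconfig.xml": b"<config v='1'/>"}},
    "serving2": {"doc_count": 100, "config": {"solrconfig.xml": b"<config v='2'/>"}},
})
```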

Auto Recovery Service (In Progress)
Problem → Automated action
•  One ZK node is down → Restart ZK, check if ZK is accessible
•  One Solr node is down → Restart Solr
•  Serving1 and Serving2 have different numbers of documents → Replicate from backup datacenter
•  Serving1 and Serving2 have different versions of config files → Use versioned config files and the HAFT API to recreate the files
•  Serving1 JVM usage is high → Soft recovery or rollback
•  Serving1’s machines are not accessible → Hard recovery
Page us only when automation fails
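
The problem-to-action pairing, together with the rule of paging only when automation fails, can be sketched as a dispatch table. The problem and action identifiers below are made-up labels for the six cases on the slide, and `run_action` and `page` are hypothetical callables supplied by the caller.

```python
from typing import Callable, Dict

# Hypothetical identifiers for the six problem/action pairs on the slide.
RECOVERY_ACTIONS: Dict[str, str] = {
    "zk_node_down": "restart_zk",
    "solr_node_down": "restart_solr",
    "doc_count_mismatch": "replicate_from_backup",
    "config_version_mismatch": "recreate_config_files",
    "high_jvm_usage": "soft_recovery_or_rollback",
    "hosts_unreachable": "hard_recovery",
}


def auto_recover(
    problem: str,
    run_action: Callable[[str], bool],
    page: Callable[[str], None],
) -> bool:
    """Run the mapped action and page only if there is none or it fails."""
    action = RECOVERY_ACTIONS.get(problem)
    if action is not None and run_action(action):
        return True
    page(problem)
    return False


paged = []
recovered = auto_recover("solr_node_down", lambda action: True, paged.append)
```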
