Deploying and managing Solr at scale

Who am I?
• Anshum Gupta, Apache Lucene/Solr committer, Lucidworks employee.
• Interested in search and related stuff.
• Working with Apache Lucene since 2006 and Solr since 2010.

Apache Solr has a huge install base and tremendous momentum
• The most widely used search solution on the planet.
• 8M+ total downloads.
• 250,000+ monthly downloads: Solr is both established and growing.
• Tens of thousands of applications in production.
• 2,500+ open Solr jobs.
• You use Solr every day.
Activity Summary (via https://www.openhub.net/p/solr)
• 30-day summary (Dec 06, 2014 - Jan 05, 2015): 135 commits, 17 contributors
• 12-month summary (Jan 5, 2014 - Jan 5, 2015): 1363 commits, 30 contributors

Getting started with Solr
• Download
• Untar/Unzip
• bin/solr start -e cloud -noprompt
• open http://localhost:8983/solr

Recent usability improvements
• Start scripts
• Schema APIs
• Config API - register custom handlers using the API
• Status APIs and more

SolrCloud Architecture
[Diagram: Shard 1 and Shard 2, each with a leader and followers, coordinated by a ZooKeeper ensemble]
Multiple Nodes = Need for Coordination

Production scale?
• ZooKeeper ensemble, NOT the embedded ZooKeeper
• Multiple nodes
• Manually run (or script) the 4 getting-started steps on each node?

Solr Scale Toolkit
• Open Source!
• Fabric (Python) toolset for deploying and managing SolrCloud clusters in the cloud
• Code to support benchmark tests (Pig script for data generation / indexing, JMeter samplers)
• EC2 for now, more cloud providers coming soon via Apache libcloud
• No *need* to know Python!

The building blocks: A lot of Python!
• boto – Python API for AWS (EC2, S3, etc.)
• Fabric – Python-based tool for automating system admin tasks over SSH
• pysolr – Python library for Solr (sending commits, queries, ...)
• kazoo – Python client tools for ZooKeeper
• Supporting Cast:
• JMeter – run tests, generate reports
• collectd – system monitoring
• Logstash4Solr – log aggregation
• JConsole/VisualVM – monitor JVM during indexing / queries

Overview of features:
• Provisioning N machine instances in EC2
• Configuring / starting ZooKeeper (1 to N servers)
• Configuring / starting N Solr instances in cloud mode (M x N nodes)
• Integrating with Logstash4Solr and other supporting services, e.g. collectd
• Day-to-day operations on an existing cluster

Architecture
[Diagram: Solr-Scale-Toolkit provisions M machines from a custom AMI, each running Solr nodes 1..N (ports 8983 to 89xx) with several cores per node, for N x M SolrCloud nodes in total. A ZooKeeper ensemble (ZK hosts 1..N) coordinates the cluster. A meta node runs SiLK and system monitoring of the M machines with collectd and JMX.]

Provisioning cluster nodes
• Custom built AMI (one for PV instances and one for HVM instances) – Amazon Linux
• Dedicated disk per Solr node
• Launch and then poll status until they are live
• Verify SSH connectivity
• Tag each instance with a cluster ID and username
fab new_ec2_instances:test1,n=3,instance_type=m3.xlarge
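
The "launch and then poll status until they are live" step can be sketched as a generic polling loop. This is a minimal sketch and not the toolkit's actual code: `wait_until_live`, `get_state`, and the `"running"` state string are hypothetical names, and in practice `get_state` would wrap whatever instance-status call your cloud client provides.

```python
import time


def wait_until_live(instance_ids, get_state, timeout_secs=300, poll_secs=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until every instance reports 'running' or the timeout expires.

    get_state is a caller-supplied function (hypothetical here) that maps an
    instance id to a state string such as 'pending' or 'running'.
    Returns the set of instance ids that never became live.
    """
    pending = set(instance_ids)
    deadline = clock() + timeout_secs
    while pending:
        # Keep only the instances that have not reached 'running' yet.
        pending = {i for i in pending if get_state(i) != "running"}
        if not pending or clock() >= deadline:
            break
        sleep(poll_secs)
    return pending


if __name__ == "__main__":
    # Fake state source: each instance becomes 'running' on its third poll.
    polls = {}

    def fake_state(instance_id):
        polls[instance_id] = polls.get(instance_id, 0) + 1
        return "running" if polls[instance_id] >= 3 else "pending"

    not_live = wait_until_live(["i-1", "i-2", "i-3"], fake_state,
                               poll_secs=0, sleep=lambda s: None)
    print("not live:", sorted(not_live))
```

Returning the stragglers instead of raising lets the caller decide whether to retry, terminate them, or continue with a smaller cluster.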

Deploy ZooKeeper ensemble
• Two options for the ensemble:
• Provision 1 to N nodes when you launch the Solr cluster
• Use an existing named ensemble
• The Fabric command simply creates the myid files and the zoo.cfg file for the ensemble, plus some cron scripts for managing snapshots
• Basic health checking of ZooKeeper status:
• echo srvr | nc localhost 2181
fab new_zk_ensemble:zk1,n=3
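
To make the myid / zoo.cfg step concrete, here is a minimal sketch of what generating those files for an ensemble could look like. It is not the toolkit's code: the function names, the host names, the data directory, and the timing values are illustrative assumptions, while the `server.N=host:2888:3888` layout follows the standard ZooKeeper configuration format.

```python
def build_zoo_cfg(hosts, data_dir="/var/lib/zookeeper", client_port=2181):
    """Return the zoo.cfg text for an ensemble made of the given hosts.

    data_dir and the tick/limit values are illustrative defaults.
    """
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        "dataDir=%s" % data_dir,
        "clientPort=%d" % client_port,
    ]
    # One server.N line per host; 2888 is the quorum port, 3888 the
    # leader-election port.
    for myid, host in enumerate(hosts, start=1):
        lines.append("server.%d=%s:2888:3888" % (myid, host))
    return "\n".join(lines) + "\n"


def build_myid_files(hosts):
    """Map each host to the content of its myid file (the N in server.N)."""
    return {host: str(myid) for myid, host in enumerate(hosts, start=1)}


if __name__ == "__main__":
    ensemble = ["zk1-a", "zk1-b", "zk1-c"]  # hypothetical host names
    print(build_zoo_cfg(ensemble))
    print(build_myid_files(ensemble))
```

Every server gets the same zoo.cfg; only the myid file differs per host, which is why the two are generated separately.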

Deploy SolrCloud cluster
• Uses bin/solr in Solr 4.10 to control Solr nodes
• Set system props: jetty.port, host, zkHost, JVM opts
• One or more Solr nodes per machine
• JVM memory opts depend on instance type and number of Solr nodes per instance
• Optionally configure log4j.properties to append messages to RabbitMQ for SiLK integration
fab new_solrcloud:test1,zk=zk1,nodesPerHost=2
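
As a rough illustration of running several Solr nodes per machine, the sketch below builds one `bin/solr start` command line per node. The function name, host names, and the fixed heap size are assumptions made for the example; how the toolkit actually derives JVM memory settings from the instance type is not shown here.

```python
def solr_start_commands(host, zk_host, nodes_per_host=2, base_port=8983,
                        heap="4g"):
    """Build one `bin/solr start` command per Solr node on a host.

    Ports are assigned sequentially from base_port (8983, 8984, ...).
    The heap size is supplied by the caller; choosing it per instance
    type is left out of this sketch.
    """
    commands = []
    for n in range(nodes_per_host):
        commands.append([
            "bin/solr", "start", "-cloud",
            "-h", host,                # host name this node advertises
            "-p", str(base_port + n),  # Jetty port for this node
            "-z", zk_host,             # ZooKeeper connection string
            "-m", heap,                # JVM min and max heap size
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for cmd in solr_start_commands("solr1.example.com",
                                   "zk1:2181,zk2:2181,zk3:2181"):
        print(" ".join(cmd))
```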

Demo
• Launch ZooKeeper Ensemble
• 3 nodes to establish quorum
• Launch SolrCloud cluster
• Create new collection and index some docs
• Run a healthcheck on the collection

Other useful stuff
• Patch from a local build
• fab mine: See clusters I’m running (or for other users too)
• fab kill_mine: Terminate all instances I’m running
• fab ssh_to: Quick way to SSH to one of the nodes in a cluster
• fab stop/recover/kill: Basic commands for controlling specific Solr nodes in the cluster
• fab jmeter: Execute a JMeter test plan against your cluster
• Example test plan and Java sampler is included with the source

Testing Methodology
• Transparent, repeatable results
• Ideally hoping for something owned by the community
• Synthetic docs ~ 1K each on disk, mix of field types
• Data set created using code borrowed from PigMix
• English text fields generated using a Zipfian distribution
• Java 1.7u67, Amazon Linux, r3.2xlarge nodes
• enhanced networking enabled, placement group, same AZ
• Stock Solr (cloud) 4.10
• Using custom GC tuning parameters and auto-commit settings
• Use Elastic MapReduce to generate indexing load
• As many nodes as I need to drive Solr!
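
The Zipfian text generation mentioned above can be illustrated in a few lines. This is a sketch of the general technique only, not the PigMix-derived generator used for the benchmark; the function name, the vocabulary, and the exponent `s` are assumptions.

```python
import random


def zipf_text(vocabulary, num_words, s=1.0, seed=None):
    """Draw num_words from vocabulary with Zipfian frequencies.

    The word at rank r (1-based) is chosen with weight 1 / r**s, so the
    first words in the vocabulary dominate the generated text.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(vocabulary) + 1)]
    return " ".join(rng.choices(vocabulary, weights=weights, k=num_words))


if __name__ == "__main__":
    vocab = ["the", "of", "and", "search", "index", "shard", "replica"]
    print(zipf_text(vocab, 20, seed=7))
```

A skewed term distribution matters because it produces a few very long posting lists and many short ones, which is closer to real text than uniformly random words.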

Indexing performance
Cluster Size   # of Shards   # of Replicas   Reducers   Time (secs)   Docs / sec
10             10            1               48         1762          73,780
10             10            2               34         3727          34,881
10             20            1               48         1282          101,404
10             20            2               34         3207          40,536
10             30            1               72         1070          121,495
10             30            2               60         3159          41,152
15             15            1               60         1106          117,541
15             15            2               42         2465          52,738
15             30            1               60         827           157,195
15             30            2               42         2129          61,062
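
One way to read these results is to compare each two-replica run with the single-replica run on the same cluster size and shard count. The sketch below does that arithmetic on the rows of the table; the helper name is hypothetical and the ratios are derived here, not reported in the original slides.

```python
# (cluster size, shards, replicas, reducers, time in secs, docs/sec),
# copied from the indexing performance table.
RESULTS = [
    (10, 10, 1, 48, 1762, 73780),
    (10, 10, 2, 34, 3727, 34881),
    (10, 20, 1, 48, 1282, 101404),
    (10, 20, 2, 34, 3207, 40536),
    (10, 30, 1, 72, 1070, 121495),
    (10, 30, 2, 60, 3159, 41152),
    (15, 15, 1, 60, 1106, 117541),
    (15, 15, 2, 42, 2465, 52738),
    (15, 30, 1, 60, 827, 157195),
    (15, 30, 2, 42, 2129, 61062),
]


def replica_slowdown(results):
    """For each (cluster size, shards) pair, return docs/sec with
    2 replicas divided by docs/sec with 1 replica."""
    rate = {(c, s, r): dps for c, s, r, _, _, dps in results}
    ratios = {}
    for (c, s, r), dps in rate.items():
        if r == 2 and (c, s, 1) in rate:
            ratios[(c, s)] = dps / rate[(c, s, 1)]
    return ratios


if __name__ == "__main__":
    for (cluster, shards), ratio in sorted(replica_slowdown(RESULTS).items()):
        print("cluster=%d shards=%d: %.0f%% of single-replica throughput"
              % (cluster, shards, ratio * 100))
```

On these numbers, adding a second replica leaves between roughly a third and a half of the single-replica indexing throughput. The two-replica runs also used fewer reducers, so the comparison is indicative only.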

Indexing performance lessons
• Solr has no built-in throttling support; it will accept work until it falls over, so you need to build throttling into your indexing application logic
• Oversharding helps parallelize indexing work and gives you an easy way to add more hardware to your cluster
• GC tuning is critical
• Auto-hard commit to keep transaction logs manageable
• Auto soft-commit to see docs as they are indexed
• Replication is expensive! (Work in progress, SOLR-6816)
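
Since throttling has to live in the indexing application, here is a minimal sketch of one common approach: cap the number of batches in flight so the producer blocks when Solr is saturated. `index_with_throttle` and `send_batch` are hypothetical names; `send_batch` stands for whatever function posts one batch of documents to Solr in your application.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def index_with_throttle(batches, send_batch, max_in_flight=4):
    """Send batches concurrently, never exceeding max_in_flight at once.

    send_batch is a caller-supplied function (hypothetical here) that posts
    one batch of documents to Solr and returns when Solr has accepted it.
    Results are returned in the order the batches were submitted.
    """
    slots = threading.BoundedSemaphore(max_in_flight)

    def guarded(batch):
        try:
            return send_batch(batch)
        finally:
            slots.release()  # free the slot even if the send failed

    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for batch in batches:
            slots.acquire()  # blocks the producer while all slots are busy
            futures.append(pool.submit(guarded, batch))
    return [f.result() for f in futures]


if __name__ == "__main__":
    sent = index_with_throttle([[1, 2], [3, 4, 5]], len, max_in_flight=2)
    print(sent)
```

The semaphore is acquired before a batch is submitted, so `batches` can be a lazy generator: the next batch is not read from the source until a slot frees up, which keeps memory bounded as well as load on Solr.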

Query Performance
• Still a work in progress!
• Metrics: sustained QPS and 99th-percentile execution time
• Stable: ~5,000 QPS with the 99th percentile at 300 ms while indexing ~10,000 docs / sec
• Using the TermsComponent to build queries based on the terms in each field
• Harder to accurately simulate user queries over synthetic data
• Need a mix of faceting, paging, sorting, grouping, boolean clauses, range queries, boosting, filters (some cached, some not), etc.
• Start with one server (1 shard) to determine baseline query performance.
• Look for inefficiencies in your schema and other config settings

More on query performance…
• Higher risk of full GC pauses (facets, filters, sorting)
• Use optimized data structures (DocValues) for facet / sort fields, Trie-based numeric fields for range queries, facet.method=enum for low cardinality fields
• Add more replicas; load-balance
• -Dhttp.maxConnections=## (default = 5, increase to accommodate more threads sending queries)
• Avoid increasing the ZooKeeper client timeout; ~15000 (15 seconds) is about right
• Don’t just keep throwing more memory at Java! -Xmx128G

Roadmap
• Not just AWS
• No need for custom AMI; configurable download paths and versions.

References
• Solr scale toolkit
• Blog: http://lucidworks.com/blog/introducing-the-solr-scale-toolkit/
• Podcast: http://solrcluster.podbean.com/e/tim-potter-on-the-solr-scale-toolkit/
• GitHub: https://github.com/LucidWorks/solr-scale-tk

Connect @
http://www.twitter.com/anshumgupta
http://www.linkedin.com/in/anshumgupta/
anshum@apache.org
