Running Solr in the Cloud at Memory Speed with Alluxio

Running Solr at Memory Speed with Alluxio
Timothy Potter
Lucidworks

Agenda
• Overview of Alluxio
• Running Solr on Alluxio
• Interesting Use Cases
• Futures
• Questions?

Cool things I’ve learned about Alluxio …
• Fastest growing open source project in big data
space
• Baidu reported having an Alluxio cluster with
1000 workers and 50TB of RAM … in Feb 2016!
• Brings cloud-storage into the compute layer; data
access at memory speed
• No need to move / migrate data into Alluxio; just
mount the under storage!
• Apache 2.0 licensed but also has a commercial
offering with support if needed

Alluxio Basics
• Hadoop FileSystem API: alluxio://…
• Supports single node up to massive
clusters
• Uses ZK for HA stuff; master/worker
model
• Supports many popular storage
systems: HDFS, S3, Azure Blob store,
GCS, GlusterFS …
• Alluxio FUSE to mount as FS on Linux
[Diagram caption: memory-centric virtual distributed storage system]

Configure Solr to use Alluxio
• mkdir or mount Solr root dir in Alluxio
bin/alluxio fs mkdir /solr
• Set start-up options in bin/solr.in.sh:
solr.directoryFactory=HdfsDirectoryFactory
solr.lock.type=hdfs
solr.hdfs.home=alluxio://master:19998/solr
solr.hdfs.confdir=/path/hadoop-conf
• Add a core-site.xml to set:
fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem
fs.alluxio.impl.disable.cache=true
alluxio.user.file.writetype.default=CACHE_THROUGH
• Add alluxio client JAR to Solr classpath
Copy alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar to
server/solr-webapp/webapp/WEB-INF/lib/
• Upconfig alluxio configset to ZK
bin/solr zk upconfig -n alluxio -d server/solr/configsets/alluxio/conf
see: http://bit.ly/2y33wQs
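
The settings on this slide can be sketched as a small shell script that prints a core-site.xml and the matching lines for bin/solr.in.sh. This is a hedged sketch, not the exact files used in the talk: it assumes the Solr properties are passed as -D system properties through SOLR_OPTS, and the function names, the CONF_DIR default, and the "master" host are placeholders.

```shell
#!/bin/sh
# Sketch: print the Hadoop client config and Solr start-up options from the slide.
set -eu

# Placeholders (assumptions): adjust to your deployment.
CONF_DIR="${CONF_DIR:-/path/hadoop-conf}"
SOLR_HOME_URI="${SOLR_HOME_URI:-alluxio://master:19998/solr}"

# Emit one <property> element for core-site.xml.
property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Emit a core-site.xml carrying the three properties listed on the slide.
core_site_xml() {
  printf '<?xml version="1.0"?>\n<configuration>\n'
  property fs.AbstractFileSystem.alluxio.impl alluxio.hadoop.AlluxioFileSystem
  property fs.alluxio.impl.disable.cache true
  property alluxio.user.file.writetype.default CACHE_THROUGH
  printf '</configuration>\n'
}

# Emit lines for bin/solr.in.sh (assumes options are passed as -D system properties).
solr_in_sh_options() {
  for opt in \
    "solr.directoryFactory=HdfsDirectoryFactory" \
    "solr.lock.type=hdfs" \
    "solr.hdfs.home=$SOLR_HOME_URI" \
    "solr.hdfs.confdir=$CONF_DIR"
  do
    printf 'SOLR_OPTS="$SOLR_OPTS -D%s"\n' "$opt"
  done
}

core_site_xml
solr_in_sh_options
```

To use it, redirect the output of core_site_xml to core-site.xml in the Hadoop conf directory and append the output of solr_in_sh_options to bin/solr.in.sh.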

Solr on Alluxio Tips & Tricks
• Run an Alluxio worker on each Solr node
• Write mode should be CACHE_THROUGH to ensure Solr files get
persisted to the under storage, e.g. S3
• Admin can “pin” an index directory to ensure it stays cached in
memory
• Set TTL on index directories that can be freed from memory after a
given timeframe
• Load command moves data from the under storage into Alluxio, such
as after restoring an index from backup
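
The pin and load tips above map onto the Alluxio shell's fs pin and fs load commands. The sketch below only prints the commands unless DRY_RUN=0 is set; the index paths, the collection name restored1, and the ALLUXIO_BIN location are assumptions for illustration, not values from the talk.

```shell
#!/bin/sh
# Sketch: pin a hot index and reload a restored one. Prints commands unless DRY_RUN=0.
set -eu

ALLUXIO_BIN="${ALLUXIO_BIN:-bin/alluxio}"   # assumption: run from the Alluxio home directory
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical index directories under the /solr root.
HOT_INDEX="${HOT_INDEX:-/solr/alluxio1/core_node1/data/index}"
RESTORED_INDEX="${RESTORED_INDEX:-/solr/restored1/core_node1/data/index}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Keep an index directory cached in Alluxio storage.
pin_index()  { run "$ALLUXIO_BIN" fs pin "$1"; }

# Pull an index directory from the under storage into Alluxio.
load_index() { run "$ALLUXIO_BIN" fs load "$1"; }

pin_index "$HOT_INDEX"
load_index "$RESTORED_INDEX"
```

Review the printed commands first, then rerun with DRY_RUN=0 on a node where the Alluxio client is installed.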

Use Case 1: Replace the OS cache with Local under FS
• Index performance
~5M docs, ~4K docs/sec, <1% difference from local FS, 8GB index on disk
• Query performance (9GB index, 5M docs, r4.xlarge)
* NOTE: ymmv! Utterly un-scientific experiments to get a feel for the technology
Metric          Alluxio    MMap/SSD   HDFS
QPS             36         42         20
Max QTime       2212 ms    1789 ms    5612 ms
Stddev QTime    335 ms     353 ms     609 ms
Median QTime    70 ms      9 ms       187 ms
75% QTime       372 ms     383 ms     754 ms
95% QTime       972 ms     996 ms     1723 ms
99% QTime       1426 ms    1349 ms    2599 ms

Use Case 2: Use cloud storage as under FS (S3, GCS, Azure)
• Indexing rate: ~3,650 docs/sec to S3 vs. ~4,000 docs/sec on local
• As expected, query perf metrics nearly identical
• Mount the cloud storage system to a directory in Alluxio
bin/alluxio fs mount alluxio://ec2-34-196-176-70.compute-1.amazonaws.com:19998/s3 s3a://sstk-dev/alluxio
• Deploy cloud instances with lots of memory, e.g. r4’s in EC2
• Use tiered storage to take advantage of the ephemeral disks
(fast SSDs)
• “pin” specific indexes for better performance guarantees
[Diagram: S3 or GCS and Alluxio (memory), with link speeds labeled 10 to 100 Gbps and 100 Mbps to 10 Gbps]
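
The tiered-storage bullet above corresponds to worker settings in alluxio-site.properties. The sketch below prints a two-tier (memory plus SSD) fragment. The property names follow the Alluxio 1.x tiered storage documentation as best recalled and should be checked against your release; the directories and quotas are hypothetical.

```shell
#!/bin/sh
# Sketch: print a two-tier (MEM + SSD) worker config for alluxio-site.properties.
set -eu

# Hypothetical paths and sizes (assumptions); size them to the instance type.
MEM_DIR="${MEM_DIR:-/mnt/ramdisk}"
MEM_QUOTA="${MEM_QUOTA:-16GB}"
SSD_DIR="${SSD_DIR:-/mnt/ssd}"
SSD_QUOTA="${SSD_QUOTA:-100GB}"

# Level 0 is the fastest tier; level 1 holds blocks evicted from memory.
tiered_store_properties() {
  cat <<EOF
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=$MEM_DIR
alluxio.worker.tieredstore.level0.dirs.quota=$MEM_QUOTA
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=$SSD_DIR
alluxio.worker.tieredstore.level1.dirs.quota=$SSD_QUOTA
EOF
}

tiered_store_properties
```

Append the output to conf/alluxio-site.properties on each worker and restart the worker for the tiers to take effect.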

Use Case 3: Time-based Partitioning
• Fits nicely with write-once indexes: signals, logs
• Use Alluxio’s TTL feature to “free” indexes on
aged out partitions
• Tiered storage also allows you to have hot
(memory), warm (SSD), cool (HDD), and cold
(S3) partitions
• Allocators and evictors to re-arrange blocks
between tiers; easy to plug-in advanced
strategies
[Diagram: Solr partitions 9-15, 9-14, and 9-13 placed across Alluxio (memory), Alluxio (SSD), and S3 or GCS]
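
The slide relies on Alluxio's TTL feature to age out partitions. As a rough illustration of the same retention idea, the sketch below swaps in an explicit fs free, for example from a scheduled job, which releases the cached copy and leaves the persisted copy in the under storage. The partition names, the /solr/<collection> layout, and KEEP_IN_MEMORY are assumptions; the script only prints the commands.

```shell
#!/bin/sh
# Sketch: print `fs free` commands for all but the newest time partitions.
set -eu

ALLUXIO_BIN="${ALLUXIO_BIN:-bin/alluxio}"   # assumption: run from the Alluxio home directory
KEEP_IN_MEMORY="${KEEP_IN_MEMORY:-2}"       # number of newest partitions to keep cached

# Reads partition (collection) names on stdin, one per line. Names must sort
# chronologically, e.g. logs_2017_09_13 (a hypothetical naming scheme).
free_commands() {
  sort | awk -v keep="$KEEP_IN_MEMORY" -v bin="$ALLUXIO_BIN" '
    { names[NR] = $0 }
    END {
      # Everything older than the newest "keep" partitions gets freed.
      for (i = 1; i <= NR - keep; i++)
        printf "%s fs free /solr/%s\n", bin, names[i]
    }'
}

printf '%s\n' logs_2017_09_15 logs_2017_09_14 logs_2017_09_13 | free_commands
```

With the defaults, only the oldest of the three example partitions is selected for freeing; pipe the printed lines to sh once they look right.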

Use Case 4: Cloud-based Recovery
• Solr auto-add replica (have to use
the HdfsUpdateLog)
<updateLog class="solr.HdfsUpdateLog"> …
• Alluxio will pull the files from memory
on another worker if they’re available
or go back to under FS storage
• Wise to have some auto-warming
queries / caches configured so that
replicas don’t get marked as active in
the cluster until they are warmed up
… thanks Shalin! SOLR-6086
[Diagram: Node 1 (us-east-1d) and Node 2 (us-east-1c), each with a Solr replica and Alluxio (memory); the Solr overseer issues Add Replica; S3 or GCS]
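
Auto-add replica is switched on per collection through the autoAddReplicas parameter of the Collections API CREATE action. The sketch below builds such a request and prints the curl command instead of sending it. The Solr URL and the shard and replica counts are assumptions; the collection name alluxio1 and the alluxio configset echo names that appear elsewhere in this deck.

```shell
#!/bin/sh
# Sketch: build a Collections API CREATE request with autoAddReplicas enabled.
set -eu

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"   # assumption: local Solr node
COLLECTION="${COLLECTION:-alluxio1}"
CONFIGSET="${CONFIGSET:-alluxio}"

# Arguments: collection name, number of shards, replication factor.
create_collection_url() {
  printf '%s/admin/collections?action=CREATE&name=%s&collection.configName=%s&numShards=%s&replicationFactor=%s&autoAddReplicas=true\n' \
    "$SOLR_URL" "$1" "$CONFIGSET" "$2" "$3"
}

url="$(create_collection_url "$COLLECTION" 1 1)"

# Printed rather than executed so the request can be reviewed first.
echo "curl '$url'"
```

The single quotes around the URL keep the shell from interpreting the & characters when the printed command is run.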

Synergy with Analytics & Machine Learning
• Solr streaming expressions power analytics jobs that may
require massive result sets at once
• Hybrid solutions that mix Solr with compute frameworks
like Spark and Flink
• Alluxio speeds up SparkSQL and ML jobs
• Fusion SQL ~ Keeping expensive views in Alluxio for
analytics dashboards (complex queries against data
loaded from Solr)

Work in progress …
• ALLUXIO-2995: Perf issue (fixed in 1.6.0)
Work-around is: alluxio.user.file.cache.partially.read.block=false
• Orphaned write.lock prevents core initialization after crash, SOLR-8335 and SOLR-8169
bin/alluxio fs rm /solr/alluxio1/core_node1/data/index/write.lock
• SOLR-11335: Closing FileSystem object retrieved from get()
fs.alluxio.impl.disable.cache = true (in core-site.xml)
• SOLR-6237: Shared replicas
• SOLR-9515: Couldn’t get Solr running with s3a w/o Alluxio;
classpath issues 
• Test ASYNC_THROUGH write mode with Solr

FAQ
• Does Alluxio support running in HA mode?
• How does data locality work with Solr & Alluxio?
• What block size do you recommend for Solr?
• What’s the overhead of CACHE_THROUGH
during indexing?
• What about Solr’s block cache?
• Does Alluxio work with Solr 7?
