Running Solr at Memory Speed with Alluxio
Timothy Potter
Lucidworks
Agenda
• Overview of Alluxio
• Running Solr on Alluxio
• Interesting Use Cases
• Futures
• Questions?
Cool things I’ve learned about Alluxio …
• Fastest growing open source project in the big data space
• Baidu reported having an Alluxio cluster with 1000 workers and 50TB of RAM … in Feb 2016!
• Brings cloud storage into the compute layer; data access at memory speed
• No need to move / migrate data into Alluxio; just mount the under storage!
• Apache 2.0 licensed, but also has a commercial offering with support if needed
Alluxio Basics
• Memory-centric virtual distributed storage system
• Hadoop FileSystem API: alluxio://…
• Supports single node up to massive clusters
• Master/worker model; uses ZooKeeper (ZK) for high availability
• Supports many popular storage systems: HDFS, S3, Azure Blob store, GCS, GlusterFS …
• Alluxio FUSE to mount as a file system on Linux
Configure Solr to use Alluxio
• mkdir or mount Solr root dir in Alluxio
bin/alluxio fs mkdir /solr
• Set start-up options in bin/solr.in.sh:
solr.directoryFactory=HdfsDirectoryFactory
solr.lock.type=hdfs
solr.hdfs.home=alluxio://master:19998/solr
solr.hdfs.confdir=/path/hadoop-conf
• Add a core-site.xml to set:
fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem
fs.alluxio.impl.disable.cache=true
alluxio.user.file.writetype.default=CACHE_THROUGH
• Add alluxio client JAR to Solr classpath
Copy alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar to
server/solr-webapp/webapp/WEB-INF/lib/
• Upconfig alluxio configset to ZK
bin/solr zk upconfig -n alluxio -d server/solr/configsets/alluxio/conf
see: http://bit.ly/2y33wQs
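The settings above might be captured in two files. This is a minimal sketch, assuming the start-up options are passed as -D system properties through SOLR_OPTS in solr.in.sh. OUT_DIR, ALLUXIO_MASTER and the .snippet file name are hypothetical names for this example; the script writes to a scratch directory rather than a live install, so paths need adapting.

```shell
#!/bin/sh
# Sketch only: renders the slide's settings into a scratch directory.
# OUT_DIR and the *.snippet file name are hypothetical, chosen for this example.
set -eu

OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
mkdir -p "$OUT_DIR"
ALLUXIO_MASTER="${ALLUXIO_MASTER:-master:19998}"  # host:port as shown on the slide

# Lines to append to bin/solr.in.sh; each setting is passed as a -D system property.
cat > "$OUT_DIR/solr.in.sh.snippet" <<EOF
SOLR_OPTS="\$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory"
SOLR_OPTS="\$SOLR_OPTS -Dsolr.lock.type=hdfs"
SOLR_OPTS="\$SOLR_OPTS -Dsolr.hdfs.home=alluxio://${ALLUXIO_MASTER}/solr"
SOLR_OPTS="\$SOLR_OPTS -Dsolr.hdfs.confdir=/path/hadoop-conf"
EOF

# core-site.xml for the directory that solr.hdfs.confdir points to.
cat > "$OUT_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.AbstractFileSystem.alluxio.impl</name>
    <value>alluxio.hadoop.AlluxioFileSystem</value>
  </property>
  <property>
    <name>fs.alluxio.impl.disable.cache</name>
    <value>true</value>
  </property>
  <property>
    <name>alluxio.user.file.writetype.default</name>
    <value>CACHE_THROUGH</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/solr.in.sh.snippet and $OUT_DIR/core-site.xml"
```

Review the generated files, then append the snippet to bin/solr.in.sh and copy core-site.xml into the Hadoop conf directory by hand.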
Solr on Alluxio Tips & Tricks
• Run an Alluxio worker on each Solr node
• Write mode should be CACHE_THROUGH to ensure Solr files get persisted to the under storage, e.g. S3
• Admin can “pin” an index directory to ensure it stays cached in memory
• Set a TTL on index directories that can be freed from memory after a given timeframe
• The load command moves data from the under storage into Alluxio, e.g. after restoring an index from backup
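The pin, TTL and load tips above map onto Alluxio shell commands roughly as follows. This is a sketch with assumptions: the three index directories are hypothetical, and the --action free flag on setTtl is recalled from the Alluxio 1.x docs, so verify it for your version. TTL expiry may default to deleting the file rather than freeing it, so check before pointing this at a live index. The script is a dry run by default and only prints the commands.

```shell
#!/bin/sh
# Sketch only. Dry run by default: each command is printed, not executed.
# Set ALLUXIO_CMD="bin/alluxio" to run against a real cluster.
set -eu

ALLUXIO_CMD="${ALLUXIO_CMD:-echo bin/alluxio}"

# Hypothetical index directories under the /solr root.
HOT_INDEX="/solr/hot/core_node1/data/index"
AGING_INDEX="/solr/aging/core_node1/data/index"
RESTORED_INDEX="/solr/restored/core_node1/data/index"

ONE_DAY_MS=$((24 * 60 * 60 * 1000))  # TTL values are given in milliseconds

cache_policy() {
  # Keep the hot index cached in memory.
  $ALLUXIO_CMD fs pin "$HOT_INDEX"
  # Free (not delete) the aging index from Alluxio storage after one day.
  # Assumption: --action free is supported by the installed Alluxio version.
  $ALLUXIO_CMD fs setTtl --action free "$AGING_INDEX" "$ONE_DAY_MS"
  # Pull a restored index from the under storage into Alluxio.
  $ALLUXIO_CMD fs load "$RESTORED_INDEX"
}

cache_policy
```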
Use Case 1: Replace the OS cache with Local under FS
• Index performance
~5M docs, ~4K docs/sec, <1% difference from local FS, 8GB index on disk
• Query performance (9GB index, 5M docs, r4.xlarge)
* NOTE: YMMV! Utterly unscientific experiments to get a feel for the technology
Metric          Alluxio    MMap/SSD   HDFS
QPS             36         42         20
Max QTime       2212 ms    1789 ms    5612 ms
Stddev QTime    335 ms     353 ms     609 ms
Median QTime    70 ms      9 ms       187 ms
75%             372 ms     383 ms     754 ms
95%             972 ms     996 ms     1723 ms
99%             1426 ms    1349 ms    2599 ms
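To put the table in relative terms, the differences can be worked out directly from its numbers. A small sketch; pct_diff is a helper name made up for this example, and the inputs are copied from the table, so the same caveat about unscientific measurements applies.

```shell
#!/bin/sh
# Sketch only: relative differences computed from the table's numbers.
set -eu

# pct_diff A B prints how far A is above (+) or below (-) B, in percent.
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'
}

echo "QPS, Alluxio vs MMap/SSD:       $(pct_diff 36 42)"
echo "QPS, Alluxio vs HDFS:           $(pct_diff 36 20)"
echo "99% QTime, Alluxio vs MMap/SSD: $(pct_diff 1426 1349)"
echo "99% QTime, Alluxio vs HDFS:     $(pct_diff 1426 2599)"
```

In this run Alluxio trails MMap/SSD on throughput by roughly 14% and leads HDFS by 80%.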
Use Case 2: Use cloud storage as under FS (S3, GCS, Azure)
• Indexing rate: ~3,650 docs/sec to S3 vs. 4,000 docs/sec on local
• As expected, query perf metrics nearly identical :)
• Mount the cloud storage system to a directory in Alluxio
bin/alluxio fs mount alluxio://ec2-34-196-176-70.compute-1.amazonaws.com:19998/s3 s3a://sstk-dev/alluxio
• Deploy cloud instances with lots of memory, e.g. r4’s in EC2
• Use tiered storage to take advantage of the ephemeral disks (fast SSDs)
• “Pin” specific indexes for better performance guarantees
[Figure: Alluxio (memory) in front of S3 or GCS, with links labeled 10 to 100 Gbps and 100 Mbps to 10 Gbps]
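The mount and tiered-storage bullets above could look roughly like this in practice. A sketch under assumptions: the tier paths, quotas and bucket name are placeholders, and the alluxio.worker.tieredstore.* property names follow the Alluxio 1.x tiered storage docs as I recall them. S3 credentials are not covered here. The mount is printed as a dry run rather than executed.

```shell
#!/bin/sh
# Sketch only: writes an example alluxio-site.properties fragment to a scratch
# directory and prints (dry run) the mount command. Paths, quotas and the
# bucket name are placeholders.
set -eu

OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
mkdir -p "$OUT_DIR"
ALLUXIO_CMD="${ALLUXIO_CMD:-echo bin/alluxio}"
ALLUXIO_MASTER="${ALLUXIO_MASTER:-master:19998}"

# Two tiers on each worker: memory on top, local SSD below.
cat > "$OUT_DIR/alluxio-site.properties.snippet" <<'EOF'
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=20GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
alluxio.worker.tieredstore.level1.dirs.quota=200GB
EOF

mount_s3() {
  # Mount an S3 path (hypothetical bucket) at /s3 in the Alluxio namespace.
  $ALLUXIO_CMD fs mount "alluxio://${ALLUXIO_MASTER}/s3" s3a://example-bucket/alluxio
}

mount_s3
```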
Use Case 3: Time-based Partitioning
• Fits nicely with write-once indexes: signals, logs
• Use Alluxio’s TTL feature to “free” indexes on aged out partitions
• Tiered storage also allows you to have hot (memory), warm (SSD), cool (HDD), and cold (S3) partitions
• Allocators and evictors to re-arrange blocks between tiers; easy to plug in advanced strategies
[Figure: Solr partitions 9-15, 9-14 and 9-13 spread across Alluxio (memory), Alluxio (SSD), and S3 or GCS]
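One way the TTL idea above might be applied per partition is sketched below. Everything specific here is an assumption: the /solr/logs_<date> directory layout is hypothetical, the partition names are borrowed from the figure labels, and --action free on setTtl should be verified against your Alluxio version. The newest partition is pinned, while older ones get a TTL after which they are freed from Alluxio storage. Dry run by default.

```shell
#!/bin/sh
# Sketch only. Dry run by default: commands are printed, not executed.
# Set ALLUXIO_CMD="bin/alluxio" to run against a real cluster.
set -eu

ALLUXIO_CMD="${ALLUXIO_CMD:-echo bin/alluxio}"
TTL_MS=$((24 * 60 * 60 * 1000))  # one day, in milliseconds
NEWEST="9-15"                    # partition currently receiving writes

apply_partition_policy() {
  for p in 9-15 9-14 9-13; do
    dir="/solr/logs_${p}"        # hypothetical directory layout
    if [ "$p" = "$NEWEST" ]; then
      # Hot partition: keep it cached in memory.
      $ALLUXIO_CMD fs pin "$dir"
    else
      # Aged partition: free it from Alluxio storage once the TTL expires.
      $ALLUXIO_CMD fs setTtl --action free "$dir" "$TTL_MS"
    fi
  done
}

apply_partition_policy
```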
Use Case 4: Cloud-based Recovery
• Solr auto-add replica (have to use the HdfsUpdateLog)
<updateLog class="solr.HdfsUpdateLog"> …
• Alluxio will pull the files from memory on another worker if they’re available, or go back to the under FS storage
• Wise to have some auto-warming queries / caches configured so that replicas don’t get marked as active in the cluster until they are warmed up … thanks Shalin! SOLR-6086
[Figure: Solr replicas on Node 1 (us-east-1d) and Node 2 (us-east-1c), each with Alluxio (memory) over S3 or GCS; the Solr overseer issues Add Replica]
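The update log and warming advice above might translate into a solrconfig.xml fragment like the one below. A sketch, assuming the element names of a stock solrconfig.xml as I recall them; the warming query is a placeholder, and OUT_DIR and the file name are hypothetical. The script only writes the fragment to a scratch directory for review.

```shell
#!/bin/sh
# Sketch only: writes an example solrconfig.xml fragment to a scratch directory.
# Merge it by hand into the configset before running upconfig.
set -eu

OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
mkdir -p "$OUT_DIR"

cat > "$OUT_DIR/solrconfig.snippet.xml" <<'EOF'
<!-- Update log stored through the Hadoop FileSystem API (here: Alluxio). -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.HdfsUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>

<query>
  <!-- Warming query run when the first searcher opens; placeholder query. -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="rows">10</str></lst>
    </arr>
  </listener>
</query>
EOF

echo "Wrote $OUT_DIR/solrconfig.snippet.xml"
```

Realistic warming queries should resemble production traffic so that the caches and the Alluxio blocks they touch are actually the hot ones.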
Synergy with Analytics & Machine Learning
• Solr streaming expressions power analytics jobs that may require massive result sets at once
• Hybrid solutions that mix Solr with compute frameworks like Spark and Flink
• Alluxio speeds up SparkSQL and ML jobs
• Fusion SQL ~ Keeping expensive views in Alluxio for analytics dashboards (complex queries against data loaded from Solr)
Work in progress …
• ALLUXIO-2995: Perf issue (fixed in 1.6.0)
Work-around is: alluxio.user.file.cache.partially.read.block=false
• Orphaned write.lock prevents core initialization after a crash, SOLR-8335 and SOLR-8169
bin/alluxio fs rm /solr/alluxio1/core_node1/data/index/write.lock
• SOLR-11335: Closing FileSystem object retrieved from get()
fs.alluxio.impl.disable.cache = true (in core-site.xml)
• SOLR-6237: Shared replicas
• SOLR-9515: Couldn’t get Solr running with s3a w/o Alluxio; classpath issues :(
• Test ASYNC_THROUGH write mode with Solr
FAQ
• Does Alluxio support running in HA mode?
• How does data locality work with Solr & Alluxio?
• What block size do you recommend for Solr?
• What’s the overhead of CACHE_THROUGH during indexing?
• What about Solr’s block cache?
• Does Alluxio work with Solr 7?
Thank You

Editor's Notes

  • #3: In this talk, I introduce Alluxio, the fastest growing open source project in the big data ecosystem, and show how to leverage it for optimizing Solr performance. I'll begin with a brief introduction about how Alluxio works and why it's interesting for the Solr community. Next, I describe how to run Solr on Alluxio and cover basic integration scenarios. Lastly, I provide some performance comparisons between running Solr on Alluxio vs. a local FS and HDFS. Attendees will come away with a new toolset to help them use Solr to tackle a wide array of big data problems.
  • #4: Apache Zeppelin interpreter to execute FS shell commands, e.g. ls /mnt/solr. Another benefit is that you can try this out quickly on EC2.
  • #6: See: http://lucene.apache.org/solr/guide/6_6/running-solr-on-hdfs.html#running-solr-on-hdfs
  • #8: Test setup: r4.xlarge with 4 CPUs, 5M docs, 10K random queries, 16 concurrent users (JMeter). It still might be useful to “pin” specific indexes to help ensure performance. Overall, using Alluxio was slower for queries, which is expected, as MMap is faster than reading from Alluxio even though the files are in memory. However, Alluxio beat HDFS. I probably could have done some BlockCache tuning, but it seems complicated.
  • #9: Accelerates remote storage I/O. Since indexes are in S3, you could run Spark jobs that read the full index w/o impacting search performance. Avoids cloud vendor lock-in, as Solr doesn’t know anything about the underlying cloud FS. Important: I could not get Solr to work against S3 w/o Alluxio due to Hadoop classpath issues and an issue with HttpClient 4.3 (documented at https://community.plm.automation.siemens.com/t5/The-Big-Data-Blog/Running-Solr-on-S3/ba-p/388004 ). However, this is another example of using Alluxio to hide under FS issues from Solr!
  • #10: What happens when an old partition is queried? Does Alluxio pull it into the cache and evict other data, or something else? How can this be controlled?
  • #13: Solr on S3A w/o Alluxio issues: https://community.plm.automation.siemens.com/t5/The-Big-Data-Blog/Running-Solr-on-S3/ba-p/388004
  • #14: Data locality: you’ll want an Alluxio worker on every node where you plan to run Solr replicas. Be careful with smaller block sizes and merging / optimize. CACHE_THROUGH didn’t show much overhead, <1% difference.