www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
Large Scale Search with Open Source Technologies
Building Search Engines
What do we do?
Streamline, Organize & Unify
Business Information
Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/SolR
• Custom Parser - Written in Scala
Challenge – Why does this matter?
Knowledge
Project Information | Client Service Information | Corporate Guides | Collaborative Documents | Assets & Files | Corporate Resources
Appleseed Framework (Portal, Base, Search)
G Drive | Delta | Dropbox | Nutshell | Freshbooks | G Sites (KB) | Workflowy | Evernote | OwnCloud | Pocket | Leaves | AIC (WP) | Anant (WP)
Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data.
If you put everything you find in your index, you are going to
spend a lot of time telling people how to search.
Lucene – More than meets the eye
Who
Next?
Think of it as a "NoSQL" database that has great indexing...
everywhere.
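To make the "database with great indexing" idea concrete, here is a minimal sketch of an inverted index, the kind of structure a search library is built around. It is a toy written for illustration, assuming simple whitespace/punctuation tokenization; the class name TinyIndex and its methods are hypothetical and are not Lucene's API.

```java
import java.util.*;

// Toy inverted index: maps each lowercase term to the IDs of the documents containing it.
// Illustrative only; this is not Lucene's API or its storage format.
public class TinyIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Split the text on non-alphanumeric characters and record docId under every term.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the IDs of documents that contain every term in the query (implicit AND).
    public SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) continue;
            SortedSet<Integer> hits = postings.getOrDefault(term, Collections.emptySortedSet());
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Apache Lucene is a search library");
        index.add(2, "Apache Cassandra is a database");
        System.out.println(index.search("apache search")); // prints [1]
    }
}
```

Every term becomes a lookup key, which is why a query on any word in any document is fast. A real search library adds analysis, scoring and on-disk segments on top of this idea.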
Cassandra – NoSQL With Structure
Who
Next?
Think of it as a "NoSQL" database that has structure. Using
CQL, you can insert into and select from tables... you just can't join.
Spark – Way Better MapReduce
Who
Next?
Think of it as MapReduce, had MapReduce been written in Scala instead of
Java and built around streams. It can also be up to 100 times faster when the data fits in memory.
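The "MapReduce with streams" analogy can be sketched on a single machine with plain Java streams. This is not Spark and uses no Spark API; it only shows the map-then-reduce-by-key shape of a word count, and the class name WordCount is hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// Single-machine word count in map/reduce style using java.util.stream (not Spark).
public class WordCount {
    // "Map" each line to its words, then "reduce" by key into a sorted word -> count table.
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\W+")))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, search=1, to=3}
        System.out.println(count(Arrays.asList("to be or not to be", "to search")));
    }
}
```

A distributed engine runs the same two stages, but partitions the lines across machines and shuffles the words by key before counting.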
Configuring - SolR - 1/3
SolR is like an eighteen wheel truck you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Configuration - Schema
–Data Types
–Pre-Processing
–Collection Definitions
–Managed vs. Unmanaged
• Configuration - ZooKeeper
–Synchronize Configurations
–Distribute Shards
–Manage Replicas
–Elect Leaders
• Configuration - SolrConfig
–Handlers
–Components
–Indexing Configurations
–Memory / Cache
–File System
• Lessons Learned
–Try to use out of the box
–Try to configure your way
–Make sure to upgrade
–Not everything can be configured
Configuring - SolR - 2/3
• Before Docker
–Set up ZooKeeper
•Customize zoo.cfg
•Set up ZooKeeper servers
–Set up SolR Distro
•Download SolR
•Clean up SolR
•Customize schema.xml
•Customize solrconfig.xml
•Set up different Solr servers
–Start the Cloud
•Custom Start Scripts
• Today w/ Docker
– docker run --name zookeeper \
    -p 127.0.0.1:2181:2181 \
    -p 127.0.0.1:2888:2888 \
    -p 127.0.0.1:3888:3888 \
    jplock/zookeeper
– docker run --link zookeeper:ZK -i \
    -p 127.0.0.1:8983:8983 \
    -t dockerimages/docker-solr \
    /bin/bash -c '
      cd /opt/solr/example;
      java -jar \
        -Dbootstrap_confdir=./solr/collection1/conf \
        -Dcollection.configName=myconf \
        -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT \
        -DnumShards=2 \
        start.jar';
https://hub.docker.com/r/dockerimages/docker-solr/
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
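Once the containers above are running, a quick check is to query Solr's /select handler on the published port. The sketch below only builds the request URL. It assumes the collection is named collection1 (taken from the bootstrap_confdir path) and that the schema has a title field, both of which are assumptions; SolrQueryUrl is a hypothetical helper, not part of Solr or SolrJ. It needs Java 10 or later for the URLEncoder overload used.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a URL for Solr's standard /select request handler.
public class SolrQueryUrl {
    // baseUrl: scheme, host and port; collection: collection or core name;
    // query: value of the q parameter; rows: maximum number of results to return.
    public static String select(String baseUrl, String collection, String query, int rows) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8); // Java 10+ overload
        return baseUrl + "/solr/" + collection + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // "collection1" and the "title" field are assumptions about the example configuration.
        System.out.println(select("http://127.0.0.1:8983", "collection1", "title:search engine", 10));
        // prints http://127.0.0.1:8983/solr/collection1/select?q=title%3Asearch+engine&rows=10&wt=json
    }
}
```

The resulting URL can be opened in a browser or fetched with any HTTP client to confirm that the node answers queries.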
Configuring - SolR - 3/3
• SolrConfig - Example
https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml
• Schema - Example
https://wiki.apache.org/solr/SchemaXml
SolR Cloud / Zookeeper
User Interface - Super Advanced
Customizing - SolR - 1/3
SolR is like an eighteen wheel truck you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Customization - Parsing
–Need Specialized Syntax?
–Java or Scala Based
–Open Plugin Structure
–Several Examples
• Customization - Highlighting
–Need Special Coloring?
–Specialized Syntax Aware
–Open Plugin Structure
–Several Examples
• Customization - Term Counts
–Need Specific Information?
–Create the Logic
–Register the Component
–Complicated Examples
• Lessons Learned
–Major version upgrades = pain
–Newer classes can be extended better
–Long term investment
Customizing - SolR - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
Customizing - SolR - 3/3
Creating a Custom Parser with Scala
Building a parser in Scala wasn’t my first choice, but creating it
in Scala made me see how much better the language is.
• Why a Specialized Syntax?
–Legacy Syntax
–Boolean AND Proximity Queries
–Specialized Fielded Expressions
–Ranges / Classifications
• Why not ANTLR or JavaCC?
–Old Parser was in Parboiled(1)
–Parboiled2 was in Scala
–No need to learn a separate syntax for creating syntax
• Lessons Learned
–Parboiled2 Documentation = bad
–Understand the syntax
–Interactive REPL in Scala = good
–Write tons of unit tests
–Long term investment
• Customizing SolR with Scala
–Found a good Virtual Mentor
–Learned Scala (not for Spark)
–Started from the ground up
–Reduced from ~1k to 400 LOC
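To give a feel for what a parser for Boolean, proximity and fielded expressions involves, here is a minimal hand-written recursive-descent sketch. It is not parboiled2 and not the parser described in these slides. The syntax is hypothetical: AND and OR keywords, a W/n proximity operator, a "field:" prefix and parentheses. The output is a printable tree rather than Lucene Query objects.

```java
import java.util.*;
import java.util.regex.*;

// Hand-written recursive-descent parser for a hypothetical legacy query syntax.
// Grammar (OR binds loosest, proximity tightest):
//   Or   := And ("OR" And)*
//   And  := Prox ("AND" Prox)*
//   Prox := Atom ("W/"n Atom)?
//   Atom := "(" Or ")" | field":" Atom | term
public class LegacyQueryParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private LegacyQueryParser(String input) {
        // Tokens are parentheses or runs of characters without whitespace or parentheses.
        Matcher m = Pattern.compile("\\(|\\)|[^\\s()]+").matcher(input);
        while (m.find()) tokens.add(m.group());
    }

    // Parse the whole input and return the tree in prefix notation.
    public static String parse(String input) {
        LegacyQueryParser p = new LegacyQueryParser(input);
        String tree = p.or();
        if (p.pos < p.tokens.size()) throw new IllegalArgumentException("Unexpected token: " + p.peek());
        return tree;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalArgumentException("Unexpected end of query");
        return tokens.get(pos++);
    }

    private String or() {
        String left = and();
        while ("OR".equals(peek())) { next(); left = "OR(" + left + ", " + and() + ")"; }
        return left;
    }

    private String and() {
        String left = prox();
        while ("AND".equals(peek())) { next(); left = "AND(" + left + ", " + prox() + ")"; }
        return left;
    }

    private String prox() {
        String left = atom();
        String op = peek();
        if (op != null && op.matches("W/\\d+")) {
            next();
            return "NEAR/" + op.substring(2) + "(" + left + ", " + atom() + ")";
        }
        return left;
    }

    private String atom() {
        String t = next();
        if (t.equals(")") || t.equals("AND") || t.equals("OR")) {
            throw new IllegalArgumentException("Unexpected token: " + t);
        }
        if (t.equals("(")) {
            String inner = or();
            if (!")".equals(peek())) throw new IllegalArgumentException("Missing ')'");
            next();
            return inner;
        }
        if (t.endsWith(":")) return t + "[" + atom() + "]"; // field prefix applied to a group
        return t; // plain term, possibly already in field:term form
    }

    public static void main(String[] args) {
        System.out.println(parse("title:(search W/3 engine) AND (solr OR lucene)"));
        // prints AND(title:[NEAR/3(search, engine)], OR(solr, lucene))
    }
}
```

Each grammar rule is one method, which is roughly the shape a parser-combinator or PEG library lets you write more compactly. Unit tests over many sample queries, as the lessons learned suggest, are what keep a grammar like this honest.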
JavaCC vs. parboiled2 (Scala)
• JavaCC - SurroundQuery.jj
• Scala based Parboiled2
Questions & Contact
www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
@anantcorp
facebook.com/anantCorp
linkedin.com/company/anant
rahul@anant.us
linkedin.com/in/xingh
Rahul Singh
CEO & Founder
Questions & Contact
• Brown Bag Session or Meetup?
• Modern Enterprise
• Mastering Services in the Service of Others
• Hybrid Agile Project Management
• Building Search Engines
• CICD / DevOps
• Connecting Internet Software
Streamlined Data
Integration / Data Pipelines
Organized Knowledge
Search / Data Warehouses
Unified Interfaces
Portals / Dashboards / Mobile

Building Enterprise Search Engines using Open Source Technologies
