Samuel Folasayo
Building a Search App with Apache Solr and Python
A guide to building a full-text search app using FastAPI and Solr
Joe Nyirenda
Learning Objectives
● Grasp the basics of full-text search and its applications
● Set up Apache Solr and integrate it with Python using pysolr
● Index, add, and retrieve data effectively in Solr
● Manage index configuration
● Perform basic and advanced search queries with filters and sorting
● Troubleshoot common issues for optimized data retrieval
Introduction to Apache Solr
What is Solr?
A powerful open-source search platform
Built on Apache Lucene
Supports full-text search, faceted search, and analytics
Why Solr?
High performance
Scalable and extensible
Proven technology
eBay: Product search, auto-suggestions, and filtering for millions of listings
Netflix: Content search (movies, TV shows), personalized recommendations,
scalability
Adobe: Document, tutorial, and support content search with faceted filtering
Best Buy: Fast, relevant product search with customizable ranking
LinkedIn: People, jobs, and company search with advanced filters
Top Websites Using Apache Solr
Content Management Systems: Searching for articles or blogs by exact titles or
tags
E-commerce Platforms: Locating products by exact names or categories
Knowledge Bases: Retrieving FAQs or documentation by specific keywords or
phrases
Example Use Cases
The average UK salary for a Solr administrator is £65,000 per year!
Apache Solr Jobs
Scalable and extensible search functionality
Easy-to-integrate Python API for querying Solr
Customizable and user-friendly UI
Benefits of This Implementation
Prerequisites:
Java (JDK 8 or higher)
Apache Solr downloaded and installed
Steps:
Download and extract Solr from the official Solr site.
Start Solr with bin/solr start.
Create a core (or collection) with bin/solr create -c <core_name>.
Setting Up Apache Solr
CPU: At least 2 CPU cores (more cores for high query volume and indexing)
Memory: Minimum 8GB RAM (16GB or more recommended for large datasets and
high traffic)
Disk Space: At least 10GB free disk space (more depending on index size)
Disk Type: SSDs recommended for faster indexing and query performance
Hardware Requirements:
Operating System: Linux, macOS, or Windows (Linux is the most common for
production).
Java: JDK 8 or higher.
Network: Solr instances require access to port 8983 (Solr's default) and may need
additional ports for replication or clustering.
System Requirements:
Considerations:
Data Volume: Estimate data to be indexed (affects memory & disk usage)
Query Load: Plan for expected queries per second (affects CPU & RAM)
Indexing Load: Allocate resources for frequent indexing without slowing down
queries
Redundancy & High Availability: Use SolrCloud for distributed search and
failover (requires more resources)
Best Practices:
Memory Allocation: Allocate ~50% of system RAM to Solr (max 64GB, but leave
room for other apps)
Disk I/O: Use SSD storage for better performance with large datasets
Replication: Set up SolrCloud replication for load distribution and improved
availability
Sizing an Apache Solr Instance:
Definition:
A Solr schema is a configuration file that defines the structure of the data Solr indexes and
searches
It describes the fields, their types, how data is indexed and stored, and how queries are handled
Key Components:
Fields: Specifies the data attributes (e.g., title, content, date)
Field Types: Defines the data type (e.g., string, text, integer, date) and how they are indexed
Analyzers: Determines how text fields are processed (e.g., tokenization, stemming)
Copy Fields: Allows combining multiple fields for efficient searching
What is a Solr Schema?
Prerequisites:
File: managed-schema.xml
Purpose: Defines field types and the structure of the Solr index
Define Fields in managed-schema.xml:
Setting Up Solr Schema
<field name="id" type="string" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
indexed="true": The field will be searchable
stored="true": The field’s value will be stored and retrievable (useful for
returning field values in search results)
How to Create a Schema in the Solr Admin UI
Access the Schema Tab:
Open the Solr Admin UI (e.g., http://localhost:8983/solr), select the core you want to modify, and
click on the Schema tab.
Add Fields and Field Types:
Under the Fields section, click Add Field to define the field name, type, and attributes like indexing.
Optionally, add new Field Types under the Field Types section.
Apply Changes:
After making changes, go to the Core Admin tab and click Reload to apply your schema updates.
Setting Up Solr Schema
Key Field Types:
text_general:
Tokenized: Breaks text into individual
words/tokens
Supports text analysis: Useful for
full-text search (e.g., search by
keywords in a document)
Impact on Performance:
Pros: Optimized for search relevance and
flexibility
Cons: Requires more CPU and memory
for indexing and querying due to text
analysis and tokenization
Impact on Index Size:
Larger index size due to tokenization and
additional metadata for search analysis
Setting Up Solr Schema
Key Field Types:
string:
Exact Matches: No tokenization,
stores the entire string as a single
value.
Not tokenized: Ideal for fields that
require exact matches (e.g., IDs,
usernames).
Impact on Performance:
Pros: Faster for exact matches (e.g.,
filtering, sorting).
Cons: Less flexible for full-text search
operations.
Impact on Index Size:
Smaller index size compared to
text_general, as no additional
processing is required.
Impact of Field Types on Performance and
Index Size
Performance Impact:
text_general requires more
processing (tokenization, text
analysis), which can impact
indexing speed and query
response time
string offers faster exact
matching but is not useful for
full-text search
Index Size Impact:
text_general creates larger
indexes due to tokenization and
additional metadata
string typically results in
smaller index sizes since it
doesn’t require tokenization
Optimizes Search Performance
Ensures Accurate Data Representation
Customizes Indexing Behavior
Supports Complex Queries
Scalability and Flexibility
Benefits of Proper Solr Schema Configuration
Common Solr Field Types and Their Use Cases
Field Type | Purpose | Use Case
-----------|---------|---------
text_general | Tokenized, analyzed text | Full-text search (e.g., articles)
string | Exact match, no analysis | Identifiers (e.g., IDs, codes)
text_en | Tokenized, analyzed text with English-specific analysis | Full-text search for English-language content (e.g., blog posts, product descriptions)
int | Integer values, no analysis | Storing and querying numerical values like product prices, user ages, or ratings
date | Date and time values, no analysis | Filtering or sorting by date fields, such as creation or modification time
Deciding on Field Types
Application Use Case:
Use text_general for title and content to enable full-text search
Use string for id as an exact identifier
Numerical Fields: Use int for whole numbers and float for decimal values (e.g.,
prices, ratings)
Date Fields: Use date or pdate when you need to filter or range-query dates (e.g.,
transaction timestamps)
Booleans: Use boolean for binary flags (e.g., active users)
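The decisions above can be sketched as a managed-schema fragment. This is a hypothetical example: only title, content, and id come from this deck, and the price, rating, created_at, and is_active fields are illustrative. Note that recent Solr releases ship point-based numeric and date types (pint, pfloat, pdate) in place of the legacy int/float/date names.

```xml
<!-- Full-text fields: tokenized and analyzed for keyword search -->
<field name="title"   type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>

<!-- Exact-match identifier -->
<field name="id" type="string" indexed="true" stored="true"/>

<!-- Illustrative extras: numeric, date, and boolean fields -->
<field name="price"      type="pfloat"  indexed="true" stored="true"/>
<field name="rating"     type="pint"    indexed="true" stored="true"/>
<field name="created_at" type="pdate"   indexed="true" stored="true"/>
<field name="is_active"  type="boolean" indexed="true" stored="true"/>
```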
Install Dependencies: pip install "fastapi[standard]" pysolr
Why FastAPI?
This setup leverages FastAPI for fast, asynchronous web requests and PySolr to
integrate Solr with Python, optimizing performance and ease of use
Search App with FastAPI and PySolr
Connecting to Solr
Initialize Solr Client:
import pysolr
solr = pysolr.Solr('http://localhost:8983/solr/my_core', always_commit=True)
Root Endpoint:
Defines the root endpoint (/) with an HTML response using FastAPI
Returns HTML content when the root URL is accessed, typically for landing pages or
documentation.
Asynchronous function ensures efficient handling of multiple requests concurrently
Creating the API: Root Endpoint
@app.get("/", response_class=HTMLResponse)
async def read_root():
return """
<html>...</html>
"""
Handles user queries with search term (query) and pagination (page).
Uses Solr to fetch and return results.
Dynamically generates search results in HTML.
Essential for building interactive search features.
Search Endpoint
@app.get("/search", response_class=HTMLResponse)
async def search(query: str = Query(...), page: int = Query(1)):
# Solr query logic here
return results_html
Example Query:
Query Parameters in Solr
query_params = {
"q": f"title:{query} content:{query}",
"hl": "true",
"start": start,
"rows": results_per_page,
}
results = solr.search(**query_params)
query_params defines fields, search term, highlighting, and pagination.
solr.search(**query_params) fetches results from Solr.
Proper structure ensures effective search and pagination.
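As a runnable sketch, the pagination arithmetic and the parameter dict can be factored into a small helper. The function name build_query_params is hypothetical; the keys mirror the slide's example, and results_per_page defaults to an assumed 10.

```python
def build_query_params(query: str, page: int, results_per_page: int = 10) -> dict:
    """Build the Solr query parameters for a paginated, highlighted search."""
    start = (page - 1) * results_per_page  # Solr's start offset is zero-based
    return {
        "q": f"title:{query} content:{query}",  # search both fields
        "hl": "true",                           # enable highlighting
        "hl.fl": "title,content",               # fields to highlight
        "start": start,
        "rows": results_per_page,
    }

params = build_query_params("solr", page=2)
# With a live core this would then be passed as: results = solr.search(**params)
```

Keeping the arithmetic in one place makes it easy to test page offsets without a running Solr instance.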
Basic Search Form:
HTML UI for the Search Engine
<form action="/search" method="get">
<input type="text" name="query" placeholder="Search..."
required/>
<input type="submit" value="Search"/>
</form>
Enable Highlighting:
Highlighting Search Results
"hl": "true",
"hl.fl": "title,content",
Example Highlighted Result:
Highlighted text appears as <em>Highlighted Text</em> in the search results
to show where matches occur.
Highlighting improves the user experience by visually identifying search term
matches in results.
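pysolr exposes the highlighting payload on the result object (results.highlighting), a dict keyed by document id that maps field names to snippet lists. A minimal sketch of merging those snippets into a result listing, using a hand-built dict in place of a live response; the exact response shape should be verified against your Solr version:

```python
def render_snippets(highlighting: dict, doc_id: str, field: str = "content") -> str:
    """Join the highlighted snippets for one document, or return an empty string."""
    snippets = highlighting.get(doc_id, {}).get(field, [])
    return " … ".join(snippets)

# Simulated payload shaped like results.highlighting from pysolr
highlighting = {
    "doc1": {"content": ["match on <em>Solr</em> here", "and <em>Solr</em> again"]},
}

print(render_snippets(highlighting, "doc1"))
# match on <em>Solr</em> here … and <em>Solr</em> again
```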
Handle multiple pages of results:
prev_page = page - 1 if page > 1 else None (for previous page)
next_page = page + 1 if len(results.docs) == results_per_page else
None (for next page)
Pagination
prev_page = page - 1 if page > 1 else None
next_page = page + 1 if len(results.docs) == results_per_page else None
<a href="?page={prev_page}">Previous</a>
<a href="?page={next_page}">Next</a>
Navigation Links:
<a href="?page={prev_page}">Previous</a> (link to previous page)
<a href="?page={next_page}">Next</a> (link to next page)
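The page placeholders above have to be interpolated when the HTML is generated. A hedged f-string sketch (render_nav is a hypothetical helper, and the query parameter is carried along so the links stay on the same search):

```python
def render_nav(query: str, prev_page, next_page) -> str:
    """Render Previous/Next links, omitting whichever page does not exist."""
    parts = []
    if prev_page is not None:
        parts.append(f'<a href="?query={query}&page={prev_page}">Previous</a>')
    if next_page is not None:
        parts.append(f'<a href="?query={query}&page={next_page}">Next</a>')
    return " ".join(parts)

print(render_nav("solr", prev_page=1, next_page=3))
# <a href="?query=solr&page=1">Previous</a> <a href="?query=solr&page=3">Next</a>
```

On page 1 prev_page is None, so only the Next link is emitted, matching the conditionals on the previous slide.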
Benefits of Pagination
Improved Performance: Loads smaller, manageable chunks of data, reducing
server load and improving speed.
Better User Experience: Allows users to easily navigate through large datasets
without overwhelming them with too many results at once.
Scalability: Handles large amounts of data efficiently, making it easier to scale
applications without performance issues.
Clear Solr Index:
Data Cleanup Commands
curl "http://localhost:8983/solr/search_core/update?commit=true" -d
'<delete><query>*:*</query></delete>'
Deletes all documents in the specified Solr core (search_core) with the query *:*
commit=true ensures immediate changes
Benefits:
Clears outdated or irrelevant documents
Prevents index bloat, improving search performance
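The same cleanup can be issued from Python with the standard library instead of curl. This is a sketch: it assumes the search_core name and local address from the slide, and it only builds the request rather than sending it to a live server.

```python
import urllib.request

def build_clear_request(base_url: str, core: str) -> urllib.request.Request:
    """Build (but do not send) a delete-all update request for a Solr core."""
    url = f"{base_url}/solr/{core}/update?commit=true"
    body = b"<delete><query>*:*</query></delete>"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml"},  # Solr's XML update format
        method="POST",
    )

req = build_clear_request("http://localhost:8983", "search_core")
# To actually execute it against a running Solr: urllib.request.urlopen(req)
print(req.full_url)
```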
What is a Solr Core?
A Solr core is an independent instance that contains its own index, schema, and
configuration, allowing multiple cores to run on a Solr server, each managing separate
datasets.
Explanation:
Base URL: localhost:8983 (Solr server)
Endpoint: /select (used for fetching search
results)
Parameters:
indent=true: Formats the response for
readability
q.op=OR: Default operator for query
terms is "OR"
q=*%3A*: Retrieves all documents (query
*:* encoded)
useParams: Applies a predefined parameter set (from params.json); left empty here
Solr Query URL
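The URL on this slide can be assembled with the standard library, which also shows where the percent-encoding of *:* comes from. A sketch assuming the default local Solr address and the my_core name used earlier; note that urlencode escapes the asterisks too (%2A), whereas a browser typically escapes only the colon, giving the *%3A* form shown above. Solr accepts both.

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/my_core/select"
params = {"indent": "true", "q.op": "OR", "q": "*:*"}  # *:* matches all documents

url = f"{base}?{urlencode(params)}"
print(url)
# http://localhost:8983/solr/my_core/select?indent=true&q.op=OR&q=%2A%3A%2A
```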
Viewing Search Results
The screenshot shows the Solr search query result as displayed in the browser.
Conclusion
Full-text search is crucial for applications requiring precision and accuracy
Apache Solr offers a reliable platform for implementing robust, full-text search functionality
Integrating Solr with Python and FastAPI ensures flexibility, scalability, and ease of
development


Editor's Notes

  • #8: 1. Prerequisites:

Java (JDK 8 or higher): Apache Solr requires Java to run, so ensure the Java Development Kit (JDK) 8 or a later version is installed on your machine. Check your Java version with `java -version`; if you need to install it, download it from the official Oracle or OpenJDK websites.

Apache Solr downloaded and installed: Download the latest version of Solr from the official Apache Solr website (https://solr.apache.org/downloads.html) and extract the files to the directory on your system where you want to run Solr.

2. Steps for setup:

Download and extract Solr from the official Solr site: After downloading the Solr tarball or zip file, extract it to a folder on your machine. The Solr directory contains all the files and binaries you need to run and configure Solr. Example extraction command (Linux or macOS): tar xvf solr-<version>.tgz

Start Solr with bin/solr start: Solr includes a script that starts the Solr server. Navigate to the directory where Solr is extracted and run bin/solr start. This starts the Solr server on the default port (8983); once it is running, you can open the Admin UI at `http://localhost:8983/solr/` to interact with Solr and monitor your cores.

Create a core (or collection) with bin/solr create -c <core_name>: Solr organizes its data into "cores" or "collections." A core is a single index and can be thought of as a container for your data and schema. To create a new core, run bin/solr create -c <core_name>, replacing <core_name> with the name you want to assign; this creates a new core in the Solr instance, initialized with default configurations. Example: bin/solr create -c mycore

Once the core is created, you can interact with it through Solr's APIs or the Admin UI to upload documents, run searches, and manage your Solr instance. These steps set up a basic Solr instance on your local machine, which is the first step to using Solr for full-text search. The next steps involve configuring and indexing data, integrating it with Python (using `pysolr`), and running queries to test the setup.
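Once a core exists, the deck integrates it with Python via `pysolr`. A minimal sketch of indexing and querying, assuming Solr is running locally and a core named mycore was created as above; the sample documents and the `make_doc` helper are illustrative, not from the slides:

```python
def make_doc(doc_id, title, content):
    """Build a document dict in the shape Solr's JSON update handler expects."""
    return {"id": doc_id, "title": title, "content": content}


def index_and_search(core_url="http://localhost:8983/solr/mycore"):
    """Index two sample documents and run a match-all query.

    Requires a running Solr instance; pysolr (pip install pysolr) is
    imported lazily so make_doc stays usable without it.
    """
    import pysolr

    solr = pysolr.Solr(core_url, always_commit=True, timeout=10)
    solr.add([
        make_doc("1", "Intro to Solr", "Solr is built on Apache Lucene."),
        make_doc("2", "FastAPI basics", "FastAPI is a Python web framework."),
    ])
    results = solr.search("*:*")  # match-all query, as used later in the deck
    return results.hits           # numFound reported by Solr
```

Call `index_and_search()` only with Solr up at localhost:8983; `make_doc` can be reused wherever documents are prepared for indexing.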
  • #14: To create a schema in the Solr Admin UI, follow these steps:

Access the Schema tab: Open the Solr Admin UI (e.g., http://localhost:8983/solr), select the core you want to modify, and click the Schema tab in the top navigation bar.

Add fields: In the Fields section, click the Add Field button. Define the field's name, type, and other attributes (e.g., multi-valued, indexed), then click Add to save.

Add field types (optional): In the Field Types section, click Add Field Type. Define the type name, class, and any necessary settings, then click Add to save.

Reload the core: Go to the Core Admin tab and click Reload to apply the schema changes.

These steps allow you to modify and manage the schema using the Solr Admin UI.
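The same field additions can be scripted against Solr's Schema API instead of clicking through the Admin UI. A hedged sketch, assuming a local core named mycore; the field name and type below are illustrative:

```python
import json
from urllib import request


def add_field_command(name, field_type, stored=True, indexed=True,
                      multi_valued=False):
    """Build the JSON body for a Schema API add-field command."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
            "multiValued": multi_valued,
        }
    }


def post_schema_change(command, core_url="http://localhost:8983/solr/mycore"):
    """POST a schema command to a running Solr instance (requires Solr up)."""
    req = request.Request(
        core_url + "/schema",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Example: `post_schema_change(add_field_command("title", "text_general"))` adds a `title` field, after which the core reflects the change without a manual reload of the managed schema.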
  • #25: "HTML UI for the Search Engine" — basic search form:

<form action="/search" method="get"> sends the search request to the /search endpoint.
<input type="text" name="query" placeholder="Search..." required/> allows users to input their query.
<input type="submit" value="Search"/> submits the form to trigger the search.

Together, these elements provide a simple and effective UI for users to search within the application.
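On the server side, the /search endpoint ultimately turns the form's `query` parameter into a request to Solr's select handler. A dependency-free sketch of that translation, assuming a local core named mycore (the helper name is illustrative):

```python
from urllib.parse import urlencode


def build_select_url(query, core_url="http://localhost:8983/solr/mycore",
                     start=0, rows=10):
    """Turn the form's `query` value into a Solr /select URL.

    urlencode escapes user input, so spaces and special characters in the
    submitted query are safe to pass through.
    """
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": "json"})
    return f"{core_url}/select?{params}"
```

A /search handler (in FastAPI or any framework) would fetch this URL and render the returned documents back into the HTML page.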
  • #28: Why paginate search results?

Improved performance: loading smaller, manageable chunks of data reduces server load and improves speed.
Better user experience: users can easily navigate large datasets without being overwhelmed by too many results at once.
Scalability: large amounts of data are handled efficiently, making it easier to scale applications without performance issues.
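Pagination in Solr maps page numbers onto the start and rows query parameters. A small helper illustrating the arithmetic (the function name is illustrative):

```python
def page_params(page, per_page=10):
    """Translate a 1-based page number into Solr's start/rows parameters.

    start is the zero-based offset of the first result to return;
    rows is the number of results per page.
    """
    if page < 1:
        raise ValueError("page numbers are 1-based")
    return {"start": (page - 1) * per_page, "rows": per_page}
```

For example, page 3 with 20 results per page yields start=40 and rows=20, which can be merged into the select request's query string.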
  • #30: What is a Solr core?

A Solr core is a self-contained instance of Solr that holds its own index, schema, and configuration. Multiple cores can run on the same Solr server, each managing a separate dataset, which allows isolated configurations and indexing for different types of data.

Reading the example query and response:

The query "q": "*:*" retrieves all documents in the index using a wildcard search.
The result returns 21 documents (numFound: 21), with fields like id, title, and content; the documents cover topics such as Python, Django, FastAPI, and Solr.
"responseHeader" shows the status (0 = success), query time (QTime: 59 ms), and other search parameters.
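The response fields described above can be extracted programmatically. A sketch that summarises a Solr select response; the sample payload mirrors the slide's numbers (status 0, QTime 59, numFound 21) but its docs list is abbreviated and invented for illustration:

```python
def summarise_response(resp):
    """Pull status, query time, and hit count out of a Solr select response."""
    header = resp["responseHeader"]
    body = resp["response"]
    return {
        "ok": header["status"] == 0,      # 0 means success
        "qtime_ms": header["QTime"],      # server-side query time
        "num_found": body["numFound"],    # total matches, not just this page
        "titles": [doc.get("title") for doc in body["docs"]],
    }


sample = {
    "responseHeader": {"status": 0, "QTime": 59, "params": {"q": "*:*"}},
    "response": {
        "numFound": 21,
        "start": 0,
        "docs": [{"id": "1", "title": "Python"}, {"id": "2", "title": "Solr"}],
    },
}
```

Note that numFound reports the total match count across the index, while docs contains only the rows returned for the current page.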