Basics on
Elasticsearch
Ruby Shrestha
Overview Session
Elasticsearch: An Introduction
 Written in Java, open source, based on Apache Lucene
 https://github.com/elastic/elasticsearch
 Document storage
 Format: JSON
 Full-text search engine
 Full-text search?
 Every doc, every word
 Searches large datasets in a few seconds
 How?
 Via Inverted Index, Distributed Nature
 Analytics Platform
 Aggregations and analysis
Use Cases Where ES
Overshadows DB
 Full-text search is more efficient in ES
due to flexible indexing.
 Relevance based searching
Use Cases Where ES
Overshadows DB
 Searching even when the entered spelling is
wrong
 Synonym based search
 Phonetic based search
 Use of distributed architecture
 Works well with unstructured data
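Spelling-tolerant search can be sketched as a query body. The snippet below builds a `match` query with the `fuzziness` option of the Elasticsearch query DSL. The field name `name` and the misspelled search text are assumptions for illustration, and the body is only printed, not sent to a cluster.

```python
import json


def fuzzy_match_query(field, text, fuzziness="AUTO"):
    """Build a match query that tolerates small spelling mistakes."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


# Hypothetical field and deliberately misspelled search text.
body = fuzzy_match_query("name", "elastik serch")
print(json.dumps(body, indent=2))
```

The printed JSON could be used as the body of a search request against an index that has a `name` field.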
How does Elasticsearch Work?
 Data stored as document
 Format: JSON
How does Elasticsearch Work?
 Querying Document
 Via JSON Based REST API
HTTP request methods: GET, PUT, POST, DELETE
[Diagram: REST Client (e.g., Insomnia) <-> REST API <-> Elasticsearch. JSON requests go in, JSON responses come back.]
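The request/response flow can be sketched with Python's standard library in place of a REST client such as Insomnia. The helper name `build_request` is hypothetical, and the host `http://localhost:9200` assumes a local node on the default port. The request is only constructed here; the commented lines show how it could be sent to a running node.

```python
import json
import urllib.request


def build_request(method, path, body=None, host="http://localhost:9200"):
    """Prepare an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(host + path, data=data, headers=headers, method=method)


req = build_request("GET", "/_cluster/health")
print(req.get_method(), req.full_url)

# With a node running locally, the JSON response could be read like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```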
All in All
 Easy to get started with
 Complex technology when used to its
full potential
 One of the most popular search engines on the
market, used by a huge community
Used by a huge
community
Elastic Stack
When Not To Use ES: Use
Cases
 Data Storage
 No/Rare/Simple Analysis
 Analysis on single-value text fields
(usernames, zip codes), value lookups
 Huge computations (extensive
preprocessing and transformations)
Conceptual Details
Types of Scaling
 Vertical Scaling (Scaling Up): increasing the size of a machine. Has limits.
 Horizontal Scaling (Scaling Out): having multiple machines. The real power of a distributed system comes from here.
Architecture of Elasticsearch
 Cluster
Architecture of Elasticsearch
 Nodes
 Can carry out indexing and searching
 Every node is aware of every other node
 Every node can forward requests to any other node in the cluster.
 Every node can accept HTTP requests from REST clients.
 Every node has its own unique name (UUID).
 First seven characters used as node id. Persists even after restart.
 A node is a running instance of Elasticsearch
 Categories of Dedicated Nodes:
 Master Node
 Data Node
 Ingest Node
 Coordinating Node
 By default, a node is master eligible, data and ingest node
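A rough way to picture dedicated nodes is as combinations of three on/off roles. The sketch below assumes the ES 6.x settings `node.master`, `node.data` and `node.ingest`, all of which default to true; the function name `node_roles` is hypothetical. A node with all three switched off acts as a coordinating-only node.

```python
def node_roles(master=True, data=True, ingest=True):
    """Return the roles a node plays for the given role flags."""
    flags = (("master-eligible", master), ("data", data), ("ingest", ingest))
    roles = [name for name, enabled in flags if enabled]
    # With every role disabled, the node only routes requests and merges results.
    return roles or ["coordinating only"]


print(node_roles())                                   # default node
print(node_roles(master=True, data=False, ingest=False))   # dedicated master
print(node_roles(master=False, data=False, ingest=False))  # coordinating only
```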
Architecture of Elasticsearch
 Indices and Types
Parallel concepts between Databases and Elasticsearch
(changed as of the latest ES version, 6.5)
Earlier versions: Database -> Index, Table -> Type
ES 6.x: Table -> Index (one type per index)
Index name, type name and
field name rules
 Lowercase only
 Cannot include \ , / , * , ? , " , < , > , | ,
space (the character, not the word), comma ( , ), #
 Indices prior to 7.0 could contain a colon
( : ), but that's been deprecated and won't
be supported in 7.0+
 Cannot start with - , _ , +
 Cannot be . or ..
 Cannot be longer than 255 characters.
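The rules above can be collected into a small checker. This is only a sketch of the listed rules, not the validation code Elasticsearch itself runs. The function name `index_name_problems` is hypothetical, the colon is treated as forbidden (the 7.0+ behaviour), and the length is measured on the UTF-8 encoding, which equals the character count for ASCII names.

```python
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:')   # colon included, as in 7.0+
FORBIDDEN_PREFIXES = ("-", "_", "+")


def index_name_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not name:
        problems.append("cannot be empty")
    if name != name.lower():
        problems.append("must be lowercase")
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("contains forbidden characters: " + " ".join(repr(c) for c in bad))
    if name.startswith(FORBIDDEN_PREFIXES):
        problems.append("cannot start with -, _ or +")
    if name in (".", ".."):
        problems.append("cannot be . or ..")
    if len(name.encode("utf-8")) > 255:
        problems.append("cannot be longer than 255")
    return problems


for candidate in ("courses", "My Index", "_hidden", ".."):
    print(repr(candidate), index_name_problems(candidate))
```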
Sharding
 The size of a single index may exceed the physical
capacity of the available nodes
 Example:
 Each Node: 512 MB
 Size of Index: 1 TB
 Sharding comes to the rescue during
such cases of bottleneck.
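As a back-of-the-envelope check on the example above, the smallest number of equally sized pieces that each fit on one node can be computed directly. The calculation assumes binary units (1 TB = 1024 * 1024 MB) and ignores replicas and any storage overhead.

```python
import math

node_capacity_mb = 512
index_size_mb = 1 * 1024 * 1024  # 1 TB expressed in MB (binary units assumed)

# Each shard must fit on a single node, so round up.
min_shards = math.ceil(index_size_mb / node_capacity_mb)
print(min_shards)  # 2048
```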
Sharding
 Advantages:
 Lets an index scale with a growing amount of data
 Better throughput in cases where shards are distributed to multiple nodes
 Parallel execution of queries across nodes possible
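How documents end up spread across shards can be sketched with the routing rule from the Elasticsearch documentation, shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The hash below is a stand-in (CRC32 from the standard library), not the hash function Elasticsearch uses, and the document IDs are hypothetical.

```python
import zlib
from collections import Counter


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (the document ID by default) to a shard number."""
    # Stand-in hash for illustration only.
    hashed = zlib.crc32(str(routing_value).encode("utf-8"))
    return hashed % number_of_primary_shards


# Spread 1,000 hypothetical document IDs over 5 primary shards.
distribution = Counter(pick_shard(doc_id, 5) for doc_id in range(1000))
for shard, count in sorted(distribution.items()):
    print(f"shard {shard}: {count} documents")
```

Because the shard number depends on the number of primary shards, changing that number would send existing document IDs to different shards.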
Replication
 What if a node fails?
 Is there any fault tolerance mechanism in ES?
 YES, via Replication
 Replication means duplicating available shards
 For high availability/ fault tolerance
 For better throughput (provided hardware is available)
 Shard that is replicated -> Primary Shard
 Replicated version of a shard -> Replica Shard
 Replication Group = Primary Shard + its Replicas
Defaults
 Cluster Name: elasticsearch
 Number of shards per index: 5
 Number of replicas: 1 for each shard
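With these defaults, the total number of shard copies in an index follows from simple arithmetic: every primary shard plus its replicas. The function name `total_shards` is hypothetical.

```python
def total_shards(primaries, replicas_per_primary):
    """Count all shard copies (primaries plus replicas) in one index."""
    return primaries * (1 + replicas_per_primary)


print(total_shards(5, 1))  # defaults: 5 primaries + 5 replicas = 10
```

A replica is never allocated to the same node as its primary, so on a single-node cluster the replica shards of such an index stay unassigned.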
Keeping Replicas in Sync
Complete Architecture
Characteristics of ES
 Near Real-Time Searching
 Indexing
 Distributed Nature
 Multi-Tenancy
Indexing in Elasticsearch
{
"statement": "Winter is coming"
}
{
"statement": "Ours is the fury"
}
{
"statement": "The choice is yours"
}
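The three documents above can be turned into a toy inverted index to show why full-text lookups are fast: each term maps directly to the documents that contain it. This is a simplified sketch (lowercasing and splitting on whitespace only), not the analysis Lucene performs, and the document IDs 1 to 3 are assumptions.

```python
from collections import defaultdict

docs = {
    1: "Winter is coming",
    2: "Ours is the fury",
    3: "The choice is yours",
}


def build_inverted_index(documents):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
print(sorted(search(index, "the is")))
```

A search never scans the documents themselves; it only looks up terms and intersects the ID sets.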
Let’s get started practically!
Monitoring Cluster Health
 localhost:9200/_cluster/health
Status   Reason
Green    All the shards are properly assigned/allocated to nodes.
Yellow   Some or all of the replica shards are unassigned.
Red      A specific primary shard is unassigned/unallocated.
Status is determined at the shard level:
Index Health: Worst Shard Status
Cluster Health: Worst Index Status
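The "worst status wins" rule can be sketched in a few lines. The index names and shard states below are hypothetical, and the code only mirrors the rules in the table above rather than how Elasticsearch computes health internally.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def shard_health(primary_assigned, unassigned_replicas):
    """Status of one replication group (a primary shard and its replicas)."""
    if not primary_assigned:
        return "red"
    if unassigned_replicas > 0:
        return "yellow"
    return "green"


def worst(statuses):
    """Pick the most severe status from a non-empty collection."""
    return max(statuses, key=lambda status: SEVERITY[status])


# Hypothetical cluster: each index lists (primary_assigned, unassigned_replicas) per shard.
cluster = {
    "courses": [(True, 0), (True, 0)],
    "logs": [(True, 1), (True, 0)],
}
index_health = {
    name: worst(shard_health(*shard) for shard in shards)
    for name, shards in cluster.items()
}
cluster_health = worst(index_health.values())
print(index_health, cluster_health)
```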
Cluster State
 localhost:9200/_cluster/state
Document Management
 Simple Index Creation
 PUT /<index-name>
 Similar to creating a table in a database (as of
ES 6.x)
 Creating an Index with Settings
 PUT /<index-name>
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 2
}
}
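The index-creation request can also be assembled programmatically. The sketch below prints it in the same `PUT /<index-name>` notation used on the slide; the index name `courses` and the function name `create_index_request` are hypothetical, and nothing is sent to a cluster.

```python
import json


def create_index_request(index_name, shards, replicas):
    """Render an index-creation request as console-style text."""
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("the replica count cannot be negative")
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return f"PUT /{index_name}\n{json.dumps(body, indent=2)}"


request_text = create_index_request("courses", 3, 2)
print(request_text)
```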
File Directory Structure
 The first time you install ES and run it,
you are running an instance of ES, i.e., a
node.
 data
   Elasticsearch
     Nodes
       0
         _state
           global-<version>.st (contains node/cluster settings)
         node.lock (so that only one ES instance writes to
the directory at a time)
Index Creation Leads To
 Inside the node, a new indices folder
appears.
 indices
   <index-name>/<uuid> (you can find this
uuid inside localhost:9200/_cluster/state
-> metadata key -> indices key)
     0 … 4 (one folder per shard; 5 shards by default)
     _state
       state-<version>.st (the index's
metadata/settings)
Document Management
 Creating/Indexing/Inserting a new document
 PUT /<index-name>/_doc/1
{"name": "Basics of Elastic Stack",
"course": "Searching and Analytics",
"price": 500}
 POST /<index-name>/_doc
{
"name": "Umagi",
"course": "Fiction",
"price": 2000
}
What actually happens when we create a
new document?
[Diagram: In-Memory Indexing Buffer + Transaction Log -> File System Cache -> Disk]
• Refresh Rate (Default 1 sec)
{"settings": {"refresh_interval": "30s"}}
• File System Cache: Segment Creation
• Disk: Segments flushed into commit point
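The write path can be modelled with a toy class to show when a document becomes searchable and when it becomes durable on disk. This is a conceptual sketch with hypothetical names (`ToyShard` and its attributes), not how Elasticsearch or Lucene is implemented: a refresh turns the buffer into a searchable segment, and a flush commits segments and clears the transaction log.

```python
class ToyShard:
    """Conceptual model of the indexing buffer, translog, refresh and flush."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer, not searchable yet
        self.translog = []   # transaction log of operations not yet committed
        self.segments = []   # segments in the file system cache, searchable
        self.committed = []  # segments on disk, covered by a commit point

    def index(self, doc):
        # A new document goes to the buffer and is recorded in the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Runs every second by default: the buffer becomes a new segment.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Segments are written to disk and the translog is cleared.
        self.refresh()
        self.committed.extend(self.segments)
        self.segments.clear()
        self.translog.clear()

    def search(self):
        return [doc for seg in self.committed + self.segments for doc in seg]


shard = ToyShard()
shard.index({"name": "Umagi", "course": "Fiction", "price": 2000})
print(len(shard.search()))   # not visible before a refresh
shard.refresh()
print(len(shard.search()))   # visible after a refresh
shard.flush()
print(len(shard.translog))   # translog is empty after a flush
```

Raising `refresh_interval` to 30s, as in the setting above, means new documents can take up to 30 seconds to appear in search results.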
