RUNNING & SCALING
LARGE ELASTICSEARCH
CLUSTERS
FRED DE VILLAMIL, DIRECTOR OF
INFRASTRUCTURE
@FDEVILLAMIL
OCTOBER 2017
BACKGROUND
• FRED DE VILLAMIL, 39 YEARS OLD, TEAM
COFFEE @SYNTHESIO,
• LINUX / (FREE)BSD USER SINCE 1996,
• OPEN SOURCE CONTRIBUTOR SINCE
1998,
• LOVES TENNIS, PHOTOGRAPHY, CUTE
OTTERS, INAPPROPRIATE HUMOR AND
ELASTICSEARCH CLUSTERS OF UNUSUAL
SIZE.
WRITES ABOUT ES AT
HTTPS://THOUGHTS.T37.NET
ELASTICSEARCH @SYNTHESIO
• 8 production clusters, 600 hosts, 1.7PB storage, 37.5TB
RAM, average 15k writes/s, 800 searches/s, some inputs >
200MB.
• Data nodes: 6-core Xeon E5v3, 64GB RAM, 4*800GB SSD
RAID0. Sometimes dual Xeon E5-2687Wv4 12-core (160
watts!!!).
• We aggregate data from various cold storage systems and
make it searchable in a jiffy.
AN ELASTICSEARCH CLUSTER OF UNUSUAL
SIZE
ENSURING HIGH
AVAILABILITY
NEVER GONNA GIVE YOU UP
• NEVER GONNA LET YOU DOWN,
• NEVER GONNA RUN AROUND AND
DESERT YOU,
• NEVER GONNA MAKE YOU CRY,
• NEVER GONNA SAY GOODBYE,
• NEVER GONNA TELL A LIE & HURT YOU.
AVOIDING DOWNTIME & SPLIT BRAINS
• RUN AT LEAST 3 MASTER NODES IN 3 DIFFERENT
LOCATIONS.
• NEVER RUN BULK QUERIES ON THE MASTER
NODES.
• ACTUALLY, NEVER RUN ANYTHING BUT
ADMINISTRATIVE TASKS ON THE MASTER NODES.
• SPREAD YOUR DATA ACROSS 2 DIFFERENT LOCATIONS
WITH AT LEAST A REPLICATION FACTOR OF 1 (1
PRIMARY, 1 REPLICA).
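On ES 2.x/5.x, the master-node rules above translate into an elasticsearch.yml along these lines (a sketch; `discovery.zen.minimum_master_nodes` was removed in later major versions):

```yaml
# Dedicated master: eligible for election, holds no data,
# serves no search or bulk traffic.
node.master: true
node.data: false
# With 3 master-eligible nodes, require a majority of 2
# to elect a master and avoid split brains.
discovery.zen.minimum_master_nodes: 2
```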
RACK AWARENESS
ALLOCATE A
RACK_ID TO THE
DATA NODES FOR
EVEN REPLICATION.
RESTART A WHOLE
DATA CENTER
@ONCE WITHOUT
DOWNTIME.
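A sketch of the corresponding settings (the attribute name `rack_id` and its value are site-specific; on 2.x the node attribute is spelled `node.rack_id`, on 5.x `node.attr.rack_id`):

```yaml
# On each data node: tag its physical location.
node.attr.rack_id: rack_one
# On all nodes: make shard allocation rack-aware, so a primary
# and its replica never land in the same rack.
cluster.routing.allocation.awareness.attributes: rack_id
```

With forced awareness (`cluster.routing.allocation.awareness.force.rack_id.values`), ES also refuses to pile every copy onto the surviving rack when one rack goes down.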
RACK AWARENESS + QUERY NODES ==
MAGIC
ES FAVORS THE
DATA NODES WITH
THE SAME RACK_ID
AS THE QUERY NODE.
REDUCES LATENCY
AND BALANCES THE
LOAD.
RACK AWARENESS + QUERY NODES + ZONE ==
MAGIC + FUN
ADD ZONES INTO THE
SAME RACK FOR
EVEN REPLICATION
WITH HIGHER
FACTOR.
USING ZONES FOR FUN & PROFIT
ALLOWING EVEN REPLICATION
WITH A HIGHER FACTOR WITHIN
THE SAME RACK.
ALLOWING MORE RESOURCES
TO THE MOST FREQUENTLY
ACCESSED INDEXES.
…
AVOIDING MEMORY
NIGHTMARE
HOW ELASTICSEARCH USES THE MEMORY
• Starts by allocating memory for the Java heap.
• The Java heap contains all Elasticsearch buffers
and caches + a few other things.
• Each Java thread maps to a system thread: +128kB
off heap.
• The elected master uses 250kB of heap per shard to
store shard information for the cluster.
ALLOCATING MEMORY
• Never allocate more than 31GB
heap to avoid the compressed
pointers issue.
• Use 1/2 of your memory up to
31GB.
• Feed your master and query
nodes, the more the better
(including CPU).
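As a sketch, the heap rule maps to two JVM flags (in jvm.options on ES 5.x; older versions set ES_HEAP_SIZE instead):

```
# Identical min/max so the heap never resizes at runtime,
# kept below the ~32GB compressed-oops cutoff.
-Xms31g
-Xmx31g
```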
MEMORY LOCK
• Use memory_lock: true at
startup.
• Requires ulimit -l
unlimited.
• Allocates the whole heap at once.
• Uses contiguous memory regions.
• Avoids swapping (you should
disable swap anyway).
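In configuration terms (the setting name varies: `bootstrap.mlockall` on 2.x, `bootstrap.memory_lock` on 5.x), a sketch:

```yaml
# elasticsearch.yml
bootstrap.memory_lock: true
```

The elasticsearch user also needs `ulimit -l unlimited` (or `LimitMEMLOCK=infinity` in a systemd unit), otherwise locking the heap fails at startup.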
CHOOSING THE RIGHT GARBAGE COLLECTOR
• ES runs with CMS as the default
garbage collector.
• CMS was designed for heaps <
4GB.
• Stop-the-world garbage
collections last too long & block
the cluster.
• Solution: switching to G1GC
(default in Java 9, unsupported by Elastic).
CMS VS G1GC
• CMS: SHARES CPU TIME WITH THE APPLICATION.
“STOPS THE WORLD” WHEN THERE IS TOO MUCH MEMORY
TO CLEAN, UNTIL IT THROWS AN OUTOFMEMORYERROR.
• G1GC: SHORTER, MORE FREQUENT PAUSES. WON’T
STOP A NODE UNTIL IT LEAVES THE CLUSTER.
• ELASTIC SAYS: DON’T USE G1GC, FOR REASONS,
SO READ THE DOC.
G1GC OPTIONS
+USEG1GC: ACTIVATES G1GC.
MAXGCPAUSEMILLIS: TARGET FOR MAX GC PAUSE
TIME.
GCPAUSEINTERVALMILLIS: TARGET INTERVAL BETWEEN
COLLECTIONS.
INITIATINGHEAPOCCUPANCYPERCENT: WHEN TO START
COLLECTING.
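In jvm.options form, a hedged starting point (the values are illustrative, not tuned recommendations; remember Elastic does not support G1GC on these versions):

```
# Replace the default CMS flags with:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:GCPauseIntervalMillis=1000
-XX:InitiatingHeapOccupancyPercent=35
```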
CHOOSING THE RIGHT STORAGE
• MMAPFS : MAPS LUCENE FILES INTO
VIRTUAL MEMORY USING
MMAP. NEEDS AS MUCH MEMORY
AS THE FILES BEING MAPPED TO
AVOID ISSUES.
• NIOFS : APPLIES A SHARED LOCK
ON LUCENE FILES AND RELIES
ON VFS CACHE.
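The store type can be chosen per index at creation time; a sketch (`my_index` is a placeholder):

```
PUT /my_index
{
  "settings": {
    "index.store.type": "niofs"
  }
}
```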
BUFFERS AND CACHES
• ELASTICSEARCH HAS MULTIPLE CACHES & BUFFERS, EACH WITH DEFAULT VALUES;
KNOW THEM!
• BUFFERS + CACHES MUST BE < TOTAL JAVA HEAP (OBVIOUS BUT…).
• AUTOMATED EVICTION ON THE CACHE, BUT FORCING IT CAN SAVE YOUR LIFE
WITH A SMALL OVERHEAD.
• IF YOU HAVE OOM ISSUES, DISABLE THE CACHES!
• FROM A USER POV, CIRCUIT BREAKERS ARE A NO GO!
MANAGING LARGE INDEXES
INDEX DESIGN
• VERSION YOUR INDEX BY MAPPING: 1_*, 2_*, ETC.
• THE MORE SHARDS, THE BETTER THE ELASTICITY, BUT
THE MORE CPU AND MEMORY USED ON THE
MASTERS.
• PROVISIONING 10GB PER SHARD ALLOWS
FASTER RECOVERY & REALLOCATION.
REPLICATION TRICKS
• NUMBER OF REPLICAS MUST BE 0 OR ODD.
CONSISTENCY QUORUM: INT( (PRIMARY +
NUMBER_OF_REPLICAS) / 2 ) + 1.
• RAISE THE REPLICATION FACTOR TO SCALE READS,
UP TO 100% OF THE DATA ON EACH DATA NODE.
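The quorum arithmetic above can be checked with a tiny helper (illustrative only, not an Elasticsearch API):

```python
def write_quorum(number_of_replicas: int) -> int:
    """Consistency quorum over one primary and its replicas:
    int((primary + number_of_replicas) / 2) + 1, with primary == 1."""
    return (1 + number_of_replicas) // 2 + 1

# 0 replicas: the primary alone is a quorum.
# 2 replicas (3 copies): 2 acknowledgements are needed,
# so a write survives losing one copy.
print(write_quorum(0), write_quorum(2))
```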
ALIASES
• ACCESS MULTIPLE INDICES AT ONCE.
• READ MULTIPLE, WRITE ONLY ONE.
EXAMPLE ON TIMESTAMPED INDICES:
"18_20171020": { "aliases": { "2017": {}, "201710": {}, "20171020": {} } }
"18_20171021": { "aliases": { "2017": {}, "201710": {}, "20171021": {} } }
Queries:
/2017/_search
/201710/_search
AFTER A MAPPING CHANGE & REINDEX, SWITCH THE ALIAS TO THE NEW INDEX.
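The alias switch can be done atomically with a single _aliases call (index names are illustrative, following the versioned naming above, with 19_* as the hypothetical new mapping version):

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "18_20171020", "alias": "20171020" } },
    { "add":    { "index": "19_20171020", "alias": "20171020" } }
  ]
}
```

Because both actions run in one request, readers never see a moment where the alias points at neither index.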
ROLLOVER
• CREATE A NEW INDEX WHEN THE CURRENT ONE IS
TOO OLD OR TOO BIG.
• SUPPORTS DATE MATH: DAILY INDEX
CREATION.
• USE ALIASES TO QUERY ALL
ROLLED-OVER INDEXES.
PUT /logs-000001 { "aliases": { "logs": {} } }
POST /logs/_rollover { "conditions": { "max_docs": 10000000 } }
DAILY OPERATIONS
CONFIGURATION CHANGES
• PREFER CONFIGURATION FILE UPDATES TO API CALL FOR
PERMANENT CHANGES.
• VERSION YOUR CONFIGURATION CHANGES SO YOU CAN
ROLLBACK, ES REQUIRES LOTS OF FINE TUNING.
• WHEN USING _SETTINGS API, PREFER TRANSIENT TO
PERSISTENT, THEY’RE EASIER TO GET RID OF.
RECONFIGURING THE WHOLE CLUSTER
LOCK SHARD REALLOCATION & RECOVERY:
"cluster.routing.allocation.enable" : "none"
OPTIMIZE FOR RECOVERY:
"cluster.routing.allocation.node_initial_primaries_recoveries": 50
"indices.recovery.max_bytes_per_sec": "2048mb"
RESTART A FULL RACK, WAIT FOR THE NODES TO COME BACK, THEN RE-ENABLE ALLOCATION.
THE REINDEX API
• IN-CLUSTER AND CLUSTER-TO-CLUSTER REINDEX API.
• ALLOWS CROSS-VERSION REINDEXING: 1.7 TO 5.1…
• SLICED SCROLLS ONLY AVAILABLE STARTING 6.0.
• ACCEPTS ES QUERIES TO FILTER THE DATA TO REINDEX.
• MERGES MULTIPLE INDEXES INTO 1.
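A minimal _reindex body combining two of the points above, filtering the source with a query (index names and the term filter are placeholders; a cluster-to-cluster run would add a "remote" block under "source"):

```
POST /_reindex
{
  "source": {
    "index": "1_20171020",
    "query": { "term": { "lang": "en" } }
  },
  "dest": { "index": "2_20171020" }
}
```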
BULK INDEXING TRICKS
LIMIT REBALANCE:
"cluster.routing.allocation.cluster_concurrent_rebalance": 1
"cluster.routing.allocation.balance.shard": "0.15f"
"cluster.routing.allocation.balance.threshold": "10.0f"
DISABLE REFRESH:
"index.refresh_interval": "-1"
NO REPLICA:
"index.number_of_replicas": 0 // a replica indexes the data again in Lucene; adding it after the bulk load just "rsyncs" the segment files.
ALLOCATE ON DEDICATED HARDWARE:
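One way to pin an index to dedicated hardware is allocation filtering on a node attribute (the attribute name `tag`, its value, and `bulk_index` are illustrative):

```
# On the dedicated nodes, in elasticsearch.yml:
#   node.attr.tag: bulk
PUT /bulk_index/_settings
{
  "index.routing.allocation.require.tag": "bulk"
}
```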
OPTIMIZING FOR SPACE & PERFORMANCES
• LUCENE SEGMENTS ARE IMMUTABLE, THE MORE YOU
WRITE, THE MORE SEGMENTS YOU GET.
• DELETING DOCUMENTS DOES COPY ON WRITE SO NO
REAL DELETE.
index.merge.scheduler.max_thread_count: defaults to min(4, CPU/2)
POST /_forcemerge?only_expunge_deletes=true: faster, only merges segments with deleted documents
POST /_forcemerge?max_num_segments=N: don’t use on indexes you still write to!
WARNING: _FORCEMERGE HAS A COST IN CPU AND I/OS.
MINOR VERSION UPGRADES
• CHECK YOUR PLUGINS COMPATIBILITY, PLUGINS
MUST BE COMPILED FOR YOUR MINOR VERSION.
• START UPGRADING THE MASTER NODES.
• UPGRADE THE DATA NODES ON A WHOLE RACK AT
ONCE.
OS LEVEL UPGRADES
• ENSURE THE WHOLE CLUSTER RUNS THE SAME JAVA
VERSION.
• WHEN UPGRADING JAVA, CHECK WHETHER YOU ALSO
HAVE TO UPGRADE THE KERNEL.
• PER-NODE JAVA / KERNEL VERSIONS ARE AVAILABLE IN
THE _NODES API.
MONITORING
CAPTAIN OBVIOUS, YOU’RE MY ONLY HOPE!
• GOOD MONITORING IS BUSINESS ORIENTED MONITORING.
• GOOD ALERTING IS ACTIONABLE ALERTING.
• DON’T MONITOR ONLY THE CLUSTER, BUT THE WHOLE PROCESSING
CHAIN.
• USELESS METRICS ARE USELESS.
• LOSING A DATACENTER: OK. LOSING DATA: NOT OK!
MONITORING TOOLING
• ELASTICSEARCH X-
PACK,
• GRAFANA…
LIFE, DEATH & _CLUSTER/HEALTH
• A RED CLUSTER MEANS AT LEAST 1
INDEX HAS MISSING DATA. DON’T
PANIC!
• USING LEVEL={INDEX,SHARD} AND AN
INDEX ID PROVIDES SPECIFIC
INFORMATION.
• LOTS OF PENDING TASKS MEANS YOUR
CLUSTER IS UNDER HEAVY LOAD AND
SOME NODES CAN’T PROCESS THEM
FAST ENOUGH.
• LONG WAITING TASKS MEANS YOU
HAVE A CAPACITY PLANNING PROBLEM.
USE THE _CAT API
• PROVIDES GENERAL INFORMATION
ABOUT YOUR NODES, SHARDS,
INDICES AND THREAD POOLS.
• IT HITS THE WHOLE CLUSTER;
WHEN IT TIMES OUT, YOU
PROBABLY HAVE A NODE STUCK IN
GARBAGE COLLECTION.
MONITORING AT THE CLUSTER LEVEL
• USE THE _STATS API FOR PRECISE INFORMATION.
• MONITOR THE SHARD REALLOCATIONS; TOO MANY
MEANS A DESIGN PROBLEM.
• MONITOR THE WRITES CLUSTER-WIDE; IF THEY
FALL TO 0 AND IT’S UNUSUAL, A NODE IS STUCK IN
GC.
MONITORING AT THE NODE LEVEL
• USE THE _NODES/{NODE}, _NODES/{NODE}/STATS
AND _CAT/THREAD_POOL API.
• THE GARBAGE COLLECTION DURATION &
FREQUENCY IS A GOOD METRIC OF YOUR NODE
HEALTH.
• CACHES AND BUFFERS ARE MONITORED AT THE NODE
LEVEL.
• MONITORING I/OS, SPACE, OPEN FILES & CPU IS
CRITICAL.
MONITORING AT THE INDEX LEVEL
• USE THE {INDEX}/_STATS API.
• MONITOR THE DOCUMENTS / SHARD RATIO.
• MONITOR THE MERGES AND THE QUERY TIME.
• TOO MANY EVICTIONS MEANS YOU HAVE A CACHE
CONFIGURATION PROBLEM.
TROUBLESHOOTING
WHAT’S REALLY GOING ON IN YOUR CLUSTER?
• THE _NODES/{NODE}/HOT_THREADS API TELLS YOU WHAT HAPPENS ON
THE HOST.
• THE ELECTED MASTER’S LOGS TELL YOU MOST THINGS YOU NEED
TO KNOW.
• ENABLE THE SLOW LOGS TO UNDERSTAND YOUR BOTTLENECK &
OPTIMIZE THE QUERIES. DISABLE THE SLOW LOGS WHEN YOU’RE
DONE!!!
• WHEN NOT ENOUGH, MEMORY PROFILING IS YOUR FRIEND.
MEMORY PROFILING
• LIVE MEMORY OR HPROF FILE AFTER A CRASH.
• ALLOWS YOU TO KNOW WHAT IS / WAS IN YOUR
BUFFERS AND CACHES.
• YOURKIT JAVA PROFILER IS A GOOD TOOL FOR THIS.
TRACING
• KNOW WHAT’S REALLY HAPPENING IN YOUR JVM.
• LINUX 4.X PROVIDES GREAT PERF TOOLS, LINUX 4.9 EVEN
BETTER:
• LINUX-PERF,
• JAVA PERF MAP.
• VECTOR BY NETFLIX (NOT VEKTOR THE THRASH METAL
BAND).
QUESTIONS
?
@FDEVILLAMIL
@SYNTHESIO
SLIDES: HTTP://BIT.DO/ELASTICSEARCH-SYSADMIN-201
