HA PostgreSQL with Patroni
Oleksii Kliukin, Zalando SE
@alexeyklyukin
FOSDEM PGDay 2016
January 29th, 2016, Brussels
What happens if the master is down?
● Built-in streaming replication is great!
● Only one writable node (primary, master)
● Multiple read-only standbys (replicas)
● Manual failover
pg_ctl promote -D /home/postgres/data
Re-joining the former master
Before 9.3:
rm -rf /home/postgres/data && pg_basebackup …
Before 9.5:
git clone -b PGREWIND1_0_0_PG9_4 --depth 1 https://github.com/vmware/pg_rewind.git && cd pg_rewind && apt-get source postgresql-9.4 -y && USE_PGXS=1 make top_srcdir=$(find . -name "postgresql*" -type d) install
pg_rewind in 9.5 and above
● pg_rewind available in contrib (apt-get install postgresql-contrib-9.5)
● wal_log_hints = 'on' or enable data checksums
● rewind your former master to be able to follow the current one:
pg_rewind -D /home/postgres/data --source-server='host=localhost port=5433 sslmode=prefer'
● requires superuser access
No fixed address
● Pgbouncer
● Pgpool
● HAProxy
● Floating IP/DNS
[Architecture diagram: CLIENTS connect through a connection router to the MASTER; a REPLICA follows the master via streaming replication; the FORMER MASTER rejoins with pg_rewind; WAL storage is written by archive_command and read back via restore_command.]
How much downtime can you tolerate?
Automatic failover
[Diagram: the master fails; the replica is promoted and becomes the new master.]
Network issues
[Diagram: the master is merely cut off by a network issue; the replica promotes itself, leaving two masters. Split brain?]
What about an arbiter?
[Diagram: an arbiter pings both nodes; when two nodes claim to be master, the arbiter's vote decides which one stays master and which becomes a replica.]
Do we need distributed consensus?
Master election
The consensus problem requires a number of processes (or agents) to agree on a single data value.
● leader (master) value defines the current master
● no leader - which node takes the master key
● leader is present - should be the same for all nodes
● leader has disappeared - should be the same for all nodes
Third party to enforce consensus
● etcd from CoreOS
● distributed key-value storage
● directory-tree like
● implements Raft
● talks REST
● key expiration with TTL and test-and-set operations
Raft
● Distributed consensus algorithm (like Paxos)
● Achieves consensus by directing all changes to the leader
● Only commit the change if it's acknowledged by the majority of nodes
● 2 stages
○ leader election
○ log replication
● Implemented in etcd and Consul
http://thesecretlivesofdata.com/raft/
Patroni
● Manages a single PostgreSQL node
● Commonly runs on the same host as PostgreSQL
● Talks to etcd
● Promotes/demotes the managed node depending on the leader key
PostgreSQL master election
[Diagram: all three nodes race to set the leader lock in etcd.]
● every node tries to set the leader lock (key)
● the leader lock can only be set when it’s not present
● once the leader lock is set - no one else can obtain it
PostgreSQL master election
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql0" ttl=30
HTTP/1.1 201 Created
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2045
X-Raft-Index: 13006
X-Raft-Term: 2
{
"action": "create",
"node": {
"createdIndex": 2045,
"expiration": "2016-01-28T13:38:19.717822356Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2045,
"ttl": 30,
"value": "postgresql0"
}
}
ELECTED
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql1" ttl=30
HTTP/1.1 412 Precondition Failed
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2047
{
"cause": "/service/fosdem/leader",
"errorCode": 105,
"index": 2047,
"message": "Key already exists"
}
Only one leader at a time
PostgreSQL master election
[Diagram: the node holding the lock announces "I'm the leader with the lock"; the other two nodes ("I'm the member") follow it via streaming replication.]
How do you know the leader is alive?
● leader updates its key periodically (by default every 10 seconds)
● only the leader is allowed to update the key (via compare and swap)
● if the key is not updated in 30 seconds - it expires (via TTL)
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="bar" value="bar"
HTTP/1.1 412 Precondition Failed
Content-Length: 89
Content-Type: application/json
Date: Thu, 28 Jan 2016 13:45:27 GMT
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2090
{
"cause": "[bar != postgresql0]",
"errorCode": 101,
"index": 2090,
"message": "Compare failed"
}
Only the leader can update the lock
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="postgresql0" value="postgresql0" ttl=30
{
"action": "compareAndSwap",
"node": {
"createdIndex": 2052,
"expiration": "2016-01-28T13:47:05.38531821Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2119,
"ttl": 30,
"value": "postgresql0"
},
"prevNode": {
"createdIndex": 2052,
"expiration": "2016-01-28T13:47:05.226784451Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2116,
"ttl": 22,
"value": "postgresql0"
}
}
How do you know where to connect?
$ etcdctl ls --recursive /service/fosdem
/service/fosdem/members
/service/fosdem/members/postgresql0
/service/fosdem/members/postgresql1
/service/fosdem/initialize
/service/fosdem/leader
/service/fosdem/optime
/service/fosdem/optime/leader
$ http http://127.0.0.1:2379/v2/keys/service/fosdem/members/postgresql0
HTTP/1.1 200 OK
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 3114
X-Raft-Index: 20102
X-Raft-Term: 2
{
"action": "get",
"node": {
"createdIndex": 3111,
"expiration": "2016-01-28T14:28:25.221011955Z",
"key": "/service/fosdem/members/postgresql0",
"modifiedIndex": 3111,
"ttl": 22,
"value": "{"conn_url":"postgres://replicator:rep-pass@127.0.0.1:5432/postgres","
api_url":"http://127.0.0.1:8008/patroni","tags":{"nofailover":false,"noloadbalance":false,
"clonefrom":false},"state":"running","role":"master","xlog_location":234881568}"
}
}
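A client or connection router can extract the master's connection URL from the member key, for instance like this (a sketch using jq; it assumes the etcd v2 API and the key layout shown above):
$ http http://127.0.0.1:2379/v2/keys/service/fosdem/members/postgresql0 | jq -r '.node.value | fromjson | .conn_url'
postgres://replicator:rep-pass@127.0.0.1:5432/postgres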
Avoiding the split brain
Worst case scenario
Streaming replication in 140 characters
Patroni configuration parameters
● YAML file with sections
● general parameters
○ ttl: time to live for the leader and member keys
○ loop_wait: minimum time one iteration of the event loop takes
○ scope: name of the cluster to run
○ auth: 'username:password' string for the REST API
● postgresql section
○ name: name of the postgresql member (should be unique)
○ listen: address:port to listen on (or multiple, e.g. 127.0.0.1,127.0.0.2:5432)
○ connect_address: address:port to advertise to other members (only one, e.g. 127.0.0.5:5432)
○ data_dir: PGDATA (does not have to be empty initially)
○ maximum_lag_on_failover: do not fail over if the replica is more than this number of bytes behind
○ use_slots: whether to use replication slots (9.4 and above)
postgresql subsections
● initdb: section to specify initdb options (e.g. encoding, default auth mode)
● pg_rewind: section with username/password for the user used by pg_rewind
● pg_hba: entries to be added to pg_hba.conf
● replication: replication user, password, and network (for pg_hba.conf)
● superuser: username/password for the superuser account (to be created)
● admin: username/password for the user with createdb/createrole permissions
● create_replica_methods: list of methods to create replicas from the master
● recovery.conf: parameters put into recovery.conf (primary_conninfo is written automatically)
● parameters: postgresql.conf parameters (e.g. wal_log_hints or shared_buffers)
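Put together, a minimal configuration file could look like the following sketch. Only option names from the two slides above are used; the etcd section and the concrete values are assumptions matching the earlier examples, and the exact layout varies between Patroni versions:
ttl: 30
loop_wait: 10
scope: fosdem
auth: 'admin:secret'            # hypothetical REST API credentials
etcd:
  host: 127.0.0.1:2379          # assumed etcd endpoint, as in the examples above
postgresql:
  name: postgresql0
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432
  data_dir: /home/postgres/data
  maximum_lag_on_failover: 1048576
  use_slots: true
  replication:
    username: replicator
    password: rep-pass          # matches the conn_url seen in the member key
    network: 127.0.0.1/32
  parameters:
    wal_log_hints: 'on'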
tags (patroni configuration)
tags modify behavior of the node they are applied to
● nofailover: the node should not participate in elections or ever become the master
● noloadbalance: the node should be excluded from the load balancer (TODO)
● clonefrom: new replicas should be bootstrapped from this node (TODO)
● replicatefrom: the node this one should do streaming replication from (pull request)
REST API
● command and control interface
● GET /master and /replica endpoints for the load balancer
● GET /patroni in order to get system information
● POST /restart in order to restart the node
● POST /reinitialize in order to remove the data directory and reinitialize from
the master
● POST /failover with leader and optional member names in order to do a
controlled failover
● patronictl to do it in a more user-friendly way
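For example, a controlled failover from postgresql0 to postgresql1 could be requested like this (a sketch; the JSON field names are an assumption based on the endpoint description above):
$ http POST http://127.0.0.1:8008/failover leader=postgresql0 member=postgresql1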
REST API (master)
$ http http://127.0.0.1:8008/master
HTTP/1.0 200 OK
...
Server: BaseHTTP/0.3 Python/2.7.10
{
"postmaster_start_time": "2016-01-27 23:23:21.873 CET",
"role": "master",
"state": "running",
"tags": {
"clonefrom": false,
"nofailover": false,
"noloadbalance": false
},
"xlog": {
"location": 301990984
}
}
REST API (replica)
http http://127.0.0.1:8009/master
HTTP/1.0 503 Service Unavailable
...
Server: BaseHTTP/0.3 Python/2.7.10
{
"postmaster_start_time": "2016-01-27 23:23:24.367 CET",
"role": "replica",
"state": "running",
"tags": {
"clonefrom": false,
"nofailover": false,
"noloadbalance": false
},
"xlog": {
"paused": false,
"received_location": 301990984,
"replayed_location": 301990984
}
}
Configuring HAProxy for Patroni
global
maxconn 100
defaults
log global
mode tcp
retries 2
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s
frontend ft_postgresql
bind *:5000
default_backend bk_db
backend bk_db
option httpchk
server postgresql_127.0.0.1_5432 127.0.0.1:5432 maxconn 100 check port 8008
server postgresql_127.0.0.1_5433 127.0.0.1:5433 maxconn 100 check port 8009
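With this configuration Patroni's REST API doubles as the health check: the HTTP check against ports 8008/8009 succeeds only on the node that currently holds the master role, while a replica answers 503 (see the REST API slides above), so HAProxy sends every connection on port 5000 to the master alone.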
Implementation details
● Separate nodes for etcd and Patroni
● Multi-threading to avoid blocking the event loop
● Use synchronous_standby_names = '*' for synchronous replication
● Use etcd/ZooKeeper watches to speed up the failover
Callbacks
Call monitoring code or perform application-specific actions (e.g. change the pgbouncer configuration) with user-defined scripts set in the configuration file:
● on start
● on stop
● on restart
● on change role
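In the YAML file this could look like the sketch below; the on_* key names and the argument convention are assumptions about the Patroni version of the time, and the script paths are hypothetical:
postgresql:
  callbacks:
    # each script is assumed to be invoked as: <script> <action> <role> <cluster name>
    on_start: /usr/local/bin/notify.sh
    on_stop: /usr/local/bin/notify.sh
    on_restart: /usr/local/bin/notify.sh
    on_role_change: /usr/local/bin/update_pgbouncer.sh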
pg_rewind support
● remove recovery.conf if present
● run a checkpoint on the promoted master (needed because of the fast promote)
● remove the archive status files to avoid losing WAL segments that would otherwise be removed
● start in single-user mode with archive_command set to false
● stop to produce a clean shutdown
● only if data checksums are enabled or wal_log_hints is set (checked via pg_controldata)
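These preconditions can be checked by hand against the data directory (a sketch; field names as printed by pg_controldata in 9.4/9.5, where a checksum version other than 0 means data checksums are enabled):
$ pg_controldata /home/postgres/data | grep -E 'wal_log_hints|checksum'
wal_log_hints setting:                on
Data page checksum version:           0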
ZooKeeper support
● Many installations already have ZooKeeper running
● No TTL
● Session-specific (ephemeral) keys
● No dynamic nodes (use Exhibitor)
Spilo: Patroni on AWS
Up next
● scheduled failovers
● full support for cascading replication
● Consul joins etcd and ZooKeeper
● manage BDR nodes
Thank you!
Feedback: @alexeyklyukin
alexk@hintbits.com
Links
● github.com/zalando/patroni
● spilo.readthedocs.org
● coreos.com/etcd/docs/latest/getting-started-with-etcd.html
● raft.github.io