Scalable Uniques in Postgres
Craig Kerstiens
Heroku Postgres
Postgresql-HLL
Truviso
• Extended Postgres to do streaming
• Various markets
• Ad space
• Wanted unique impressions
• Sort of wanted unique impressions
SELECT count(*)
Approx Top K
Compressed Bitmap
HyperLogLog
HyperLogLog
• KMV - K minimum values
• Bit-pattern observables
• Stochastic averaging
• Harmonic averaging
• Implemented by Aggregate Knowledge
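The last three technique bullets combine into a single estimator. As a sketch, following the standard published formulation of HyperLogLog rather than anything stated on the slides: each value is hashed, the first b bits of the hash pick one of m = 2^b registers (stochastic averaging), each register keeps the largest position of the leftmost 1-bit seen in the remaining bits (bit-pattern observables), and the registers are combined with a harmonic mean:

```latex
E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1},
\qquad
\alpha_m \approx \frac{0.7213}{1 + 1.079/m} \quad (m \ge 128)
```

Here M_j is the value held in register j and \alpha_m is a bias-correction constant. The symbols are the usual ones from the literature, not the slides' own notation, and the formula leaves out the small-range and large-range corrections that real implementations apply.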
Craig Kerstiens - Scalable Uniques in Postgres @ Postgres Open
HyperLogLog
Probabilistic uniques with small footprint
Close enough distinct with small footprint
Use cases
• Semi distinct count
• Think pg_stat_statements
• Ad networks
• Web traffic
• With rollups/groupings
Digging in
CREATE EXTENSION hll;

CREATE TABLE helloworld (
    id   integer,
    set  hll
);
Inserting data
UPDATE helloworld
SET set = hll_add(set, hll_hash_integer(12345))
WHERE id = 1;

UPDATE helloworld
SET set = hll_add(set, hll_hash_text('hello world'))
WHERE id = 1;
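The UPDATE statements assume a row with id = 1 already exists. A minimal sketch of the surrounding steps, assuming the extension's hll_empty() and hll_cardinality() functions: seed the row with an empty set before the updates, then read the estimate back afterwards.

```sql
-- Run before the UPDATEs: seed the row they modify with an empty set.
INSERT INTO helloworld (id, set) VALUES (1, hll_empty());

-- Run after the UPDATEs: estimated number of distinct values in the set.
SELECT hll_cardinality(set) FROM helloworld WHERE id = 1;
```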
Real world
CREATE TABLE daily_uniques (
    date   date UNIQUE,
    users  hll
);
Real world
INSERT INTO daily_uniques(date, users)
    SELECT date, hll_add_agg(hll_hash_integer(user_id))
    FROM users
    GROUP BY 1;
Real world
SELECT
    EXTRACT(MONTH FROM date) AS month,
    hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01' AND
      date <  '2013-01-01'
GROUP BY 1;
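Because stored hll values can be unioned, the same daily rows can answer other questions without returning to the raw data. As a sketch, a trailing seven-day unique count can use hll_union_agg as a window aggregate. The alias uniques_7d is illustrative, and the window assumes daily_uniques holds exactly one row per calendar day.

```sql
-- Estimated uniques over a trailing seven-day window, one output row per day.
-- ROWS 6 PRECEDING counts rows, not days, so gaps in the dates would skew it.
SELECT
    date,
    hll_cardinality(hll_union_agg(users) OVER seven_days) AS uniques_7d
FROM daily_uniques
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);
```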
Good practices
• Adding a value is an UPDATE of the stored hll
• Load in batches in most cases
• Tune the configuration
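One way to follow the batching advice, sketched under assumptions: aggregate a whole batch with hll_add_agg and merge it into the stored set with hll_union, so each day's row is updated once per batch rather than once per value. The staging table new_events(date, user_id) is hypothetical, and ON CONFLICT requires PostgreSQL 9.5 or later, which is newer than this talk.

```sql
-- new_events is a hypothetical staging table holding one batch of raw rows.
INSERT INTO daily_uniques (date, users)
    SELECT date, hll_add_agg(hll_hash_integer(user_id))
    FROM new_events
    GROUP BY 1
-- daily_uniques.date is UNIQUE, so an existing day is merged, not duplicated.
ON CONFLICT (date)
DO UPDATE SET users = hll_union(daily_uniques.users, EXCLUDED.users);
```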
Tuning Parameters
• log2m - log base 2 of the number of registers
• Between 4 and 17
• Each increase of 1 doubles storage
• regwidth - bits per register
• expthresh - cardinality threshold for promoting explicit to sparse
• sparseon - on/off for the sparse representation
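A sketch of where these parameters can be supplied, assuming the extension's type-modifier syntax hll(log2m, regwidth, expthresh, sparseon), the optional extra arguments to hll_add_agg, and hll_set_defaults. The table name tuned_uniques and the values are illustrative, not recommendations. Check the argument order against the README of the installed version.

```sql
-- Per-column settings via type modifiers (here expthresh = -1 is assumed to
-- mean "choose automatically" and sparseon = 1 to mean "enabled").
CREATE TABLE tuned_uniques (
    date   date UNIQUE,
    users  hll(12, 5, -1, 1)
);

-- Per-aggregate setting: build the set with log2m = 12 instead of the default.
SELECT hll_cardinality(hll_add_agg(hll_hash_integer(user_id), 12))
FROM users;

-- Change the defaults used when no parameters are given.
SELECT hll_set_defaults(12, 5, -1, 1);
```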
Is it better?
1280 bytes
Estimates counts in the tens of billions
With only a few percent error
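These figures can be checked with a little arithmetic, assuming the extension's default settings of log2m = 11 and regwidth = 5 (the defaults are not stated on the slides) and the usual HyperLogLog standard-error approximation:

```latex
m = 2^{\text{log2m}} = 2^{11} = 2048
\qquad
\text{storage} = \frac{m \cdot \text{regwidth}}{8}
               = \frac{2048 \cdot 5}{8} = 1280 \text{ bytes}
\qquad
\sigma \approx \frac{1.04}{\sqrt{m}} = \frac{1.04}{\sqrt{2048}} \approx 0.023
```

A relative standard error of roughly 2.3% is consistent with "a few percent". Raising log2m by 1 doubles m, which doubles storage and shrinks the error by a factor of about \sqrt{2}.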
Resources
• https://github.com/aggregateknowledge/postgresql-hll
• http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/
• http://tapoueh.org/blog/2013/02/25-postgresql-hyperloglog
Questions
