Honey, I Shrunk the Database
For Test and Development Environments
Vanessa Hurst, Paperless Post
Postgres Open, September 2011

Static Data
User Data

Why Shrink?
  Accuracy: You don’t truly know how your app will behave in production unless you use real data. Production data is the ultimate in accuracy.
  Freshness: New data should be available regularly. Full database refreshes should be timely.
  Resource Limitations: Staging and developer machines cannot handle production load.
  Data Protection: Limit the spread of sensitive user or client data.

Case Study: Paperless Post
  Requirements
    Freshness: daily, and on command for non-developers
    Shrinkage: slices, mutations
  Resources
    Source: extra disk space, RAM, and CPUs
    Destination: limited, often entirely un-optimized
    Development: constrained DBA resources

Shrink Strategies
  Copies: Restored backups or live replicas of the entire production database
  Slices: Select portions of exact data
  Mutations: Sanitized, anonymized, or otherwise changed data
  Assumptions: Seed databases, fixtures, test data

Slices
  Vertical Slice: Difficult to obtain a valid, useful subset of data.
    Example: Include some entire tables, exclude others
  Horizontal Slice: Difficult to write and maintain.
    Example: SQL or application code to determine the subset of data

PG Tools: Vertical Slice
Flexibility at Source (Production): pg_dump
  Include data only          [-a, --data-only]
  Include table schema only  [-s, --schema-only]
  Select tables              [-t table, --table=table; repeat the switch once per table]
  Select schemas             [-n schema, --schema=schema]
  Exclude schemas            [-N schema, --exclude-schema=schema]
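
The pg_dump switches above can be combined in a small wrapper. What follows is a minimal sketch, not the script used at Paperless Post: `db-01` and `cards` appear later in the deck, while the function names and the `envelopes` table are hypothetical. It illustrates that selecting several tables means repeating `-t` once per table.

```shell
#!/bin/sh
# Sketch only. Function names and the "envelopes" table are hypothetical.

# Dump DDL only (no rows) for one schema of a database.
#   $1 = database name, $2 = schema name
dump_schema_only() {
    pg_dump --schema-only --schema="$2" "$1"
}

# Dump rows only for a list of whole tables.
#   $1 = database name, remaining arguments = table names
dump_static_tables() {
    db=$1
    shift
    # Rewrite the argument list "a b" into "-t a -t b":
    # drop each name from the front and append it again with its own -t.
    for t in "$@"; do
        shift
        set -- "$@" -t "$t"
    done
    pg_dump --data-only "$@" "$db"
}
```

Under these assumptions, `dump_static_tables db-01 cards envelopes >> slice.sql` would run `pg_dump --data-only -t cards -t envelopes db-01` and append the result to the slice file.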

PG Tools: Vertical Slice
Flexibility at Destination (Staging, Development): pg_restore
  Include data only   [-a, --data-only]
  Select indexes      [-I index, --index=index]
  Tune processing     [-j number-of-jobs, --jobs=number-of-jobs]
  Select schemas      [-n schema, --schema=schema]
  Select triggers     [-T trigger, --trigger=trigger]
  Exclude privileges  [-x, --no-privileges, --no-acl]
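
pg_restore reads archive-format dumps rather than plain SQL files, and parallel jobs need the custom or directory format. The sketch below assumes a custom-format archive; the function names and the file name `db-01.dump` are hypothetical, and `--no-owner` is added on the assumption that staging runs as a different user, as the speaker notes suggest.

```shell
#!/bin/sh
# Sketch only. Function names and file names are hypothetical.

# Write a custom-format archive that pg_restore can filter later.
#   $1 = source database, $2 = archive file
dump_archive() {
    pg_dump --format=custom --file="$2" "$1"
}

# Restore only the public schema, skipping ownership and GRANT/REVOKE
# commands, using several parallel jobs.
#   $1 = target database, $2 = archive file, $3 = job count (default 2)
restore_slice() {
    pg_restore --dbname="$1" --schema=public \
        --no-owner --no-privileges --jobs="${3:-2}" "$2"
}
```

The filtering happens at the destination here, so one archive taken from production can feed differently shaped staging and development databases.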

Static Data

Mutations
  External Data Protection
    HIPAA Regulations
    PCI Compliance
    API Terms of Use
  Internal Data Protection
    Protecting your users’ personal data
    Protecting your users from accidents, e.g. staging emails
    Your Terms of Service
User Data

Case Study: Paperless Post
Composite Slice including
  Vertical Slice: All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  Vertical Slice: Entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  Horizontal Slice: Subset of users and their data
  Mutation: Changed user email addresses

Case Study: Paperless Post
  CREATE SCHEMA staging;

Case Study: Paperless Post
Horizontal Slice
  Custom SQL
    SELECT * INTO staging.users
    FROM users
    WHERE EXISTS (subset of users);
  Dynamic relative to full data set or newly created slice
    SELECT * INTO staging.stuff
    FROM stuff
    WHERE EXISTS (stuff per staging.users);
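
The two placeholder queries can be made concrete as follows. This is an illustrative sketch only: the selection rule (admins plus every 100th account), the columns `is_admin`, `id`, and `user_id`, and the function names are assumptions, since the slides do not show the real predicates. It assumes the `staging` schema from the previous slide already exists.

```shell
#!/bin/sh
# Sketch only. Column names and the selection rule are hypothetical.

# Emit the SQL for a two-step horizontal slice.
horizontal_slice_sql() {
    cat <<'SQL'
-- Step 1: pick a subset of users from the full data set.
SELECT * INTO staging.users
FROM users u
WHERE u.is_admin OR u.id % 100 = 0;

-- Step 2: pick dependent rows relative to the newly created slice,
-- so every copied row still points at a user that was copied.
SELECT * INTO staging.stuff
FROM stuff s
WHERE EXISTS (SELECT 1 FROM staging.users su WHERE su.id = s.user_id);
SQL
}

# Run the slice against the source database, stopping at the first error.
#   $1 = source database
build_horizontal_slice() {
    horizontal_slice_sql | psql -v ON_ERROR_STOP=1 "$1"
}
```

Selecting step 2 against `staging.users` rather than repeating the user predicate keeps the dependent tables consistent with whatever subset step 1 produced.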

Case Study: Paperless Post
  Horizontal Slice
    Custom SQL
    Dynamic relative to full data set or newly created slice
  Mutations
    Email Addresses
      Use regular expressions to clean non-admin addresses
      e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
    Cached Data
      Clear cached short link from link-shortening API
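
One way to read the email example is: strip every non-alphanumeric character from the address, then wrap the result in a staging mailbox. The sketch below implements that reading; it is an interpretation of the single example on the slide, not the actual Paperless Post rule. The function names are hypothetical, and treating any `@paperlesspost.com` address as an admin address is an assumption.

```shell
#!/bin/sh
# Sketch only. The admin test and function names are assumptions.

# Shell mirror of the rule, handy for spot-checking a single address.
mutate_email() {
    cleaned=$(printf '%s' "$1" | tr -cd 'A-Za-z0-9')
    printf 'staging+%s@paperlesspost.com\n' "$cleaned"
}

# The same rule as SQL, applied to the sliced copy only.
mutate_emails_sql() {
    cat <<'SQL'
UPDATE staging.users
SET email = 'staging+'
            || regexp_replace(email, '[^A-Za-z0-9]', '', 'g')
            || '@paperlesspost.com'
WHERE email NOT LIKE '%@paperlesspost.com';
SQL
}
```

Because every rewritten address lands on a domain the team controls, mail sent by accident from staging never reaches a customer.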

Case Study: Paperless Post
Composite Slice including
  Vertical Slice: All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  Vertical Slice: Entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  Horizontal Slice: Subset of users and their data
  Mutation: Changed user email addresses
    pg_dump --data-only --schema staging db-01 >> slice.sql

Case Study: Paperless Post
  Rebuild
    Prepare new database as standby
    Gracefully close connections
    Rotate by renaming databases
  Security
    Dedicated database build user
    Membership in application user role
    Application user role & privileges remain
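
The rotation step might look like the sketch below. The slides do not show how connections were closed or what the databases were called, so everything here is an assumption: the names passed in are hypothetical, and `pg_terminate_backend` is a blunt stand-in for whatever "graceful" mechanism was really used (pausing a connection pooler is a common alternative). The `pid` column applies to PostgreSQL 9.2 and later.

```shell
#!/bin/sh
# Sketch only. Database names and the disconnect method are assumptions.

# Emit SQL that swaps a freshly built database into place.
#   $1 = live name, $2 = newly built database, $3 = name for the retired copy
rotate_sql() {
    live=$1
    new=$2
    old=$3
    cat <<SQL
-- Remove the copy retired by the previous rotation, if any.
DROP DATABASE IF EXISTS "$old";

-- A database cannot be renamed while sessions are connected to it.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$live' AND pid <> pg_backend_pid();

ALTER DATABASE "$live" RENAME TO "$old";
ALTER DATABASE "$new" RENAME TO "$live";
SQL
}

# Run the rotation while connected to a third database ("postgres"),
# because the current database cannot be renamed.
rotate_databases() {
    rotate_sql "$@" | psql -v ON_ERROR_STOP=1 postgres
}
```

Building under a temporary name and renaming at the end keeps the slow load out of the window in which staging is unavailable.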

Case Study: Paperless Post
  Rebuild
    $ bzcat slice.sql.bz2 | psql db-new
    Staging schema has not been created, so all data loads to default schema

Case Study: Paperless Post
  We hacked our rebuild by importing across schemas!
  Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
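
Secret Weapon: updates all serial sequences, for ID columns only

  -- The slide shows only the function body. The CREATE FUNCTION wrapper,
  -- DECLARE section, and void return type are reconstructed assumptions;
  -- the name update_id_sequences() comes from the rebuild slide.
  CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
  DECLARE
    table_record RECORD;
    table_name   text;
  BEGIN
    -- Visit every ordinary table that has a column named "id".
    FOR table_record IN
      SELECT pc.relname
      FROM pg_class pc
      WHERE pc.relkind = 'r'
        AND EXISTS (SELECT 1 FROM pg_attribute pa
                    WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
    LOOP
      table_name := table_record.relname::text;
      -- Move the table's id sequence up to the current MAX(id).
      EXECUTE 'SELECT setval(pg_get_serial_sequence('
        || quote_literal(table_name) || ', ' || quote_literal('id')
        || '), MAX(id)) FROM ' || quote_ident(table_name)
        || ' WHERE EXISTS (SELECT 1 FROM ' || quote_ident(table_name) || ')';
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;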

Case Study: Paperless Post
  Rebuild
    $ bzcat slice.sql.bz2 | psql db-new
    Staging schema has not been created, so all data loads to default schema
    echo "select 1 from update_id_sequences();" >> slice.sql
    Vacuum
    Reindex
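
Put together, the rebuild steps on this slide could be scripted roughly as below. This is a sketch under assumptions: the function names are hypothetical, `vacuumdb` and `reindexdb` are used as one possible way to perform the "Vacuum" and "Reindex" steps, and `update_id_sequences()` is assumed to exist already in the target database (for example because it was dumped along with the public schema).

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical.

# Append the sequence fix so it runs last, after all rows have loaded.
#   $1 = uncompressed slice file
append_sequence_fix() {
    echo "SELECT 1 FROM update_id_sequences();" >> "$1"
}

# Load the compressed slice, then vacuum/analyze and reindex the new database.
#   $1 = compressed slice file, $2 = target database
load_slice() {
    bzcat "$1" | psql -v ON_ERROR_STOP=1 "$2" &&
        vacuumdb --analyze "$2" &&
        reindexdb "$2"
}
```

The sequence fix has to run after the data load because the slice data was copied across schemas, which leaves each sequence behind the highest id actually present.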

Case Study: Paperless Post
  Security
    Database build user
      CREATEDB privilege
      Member of application user role
    Application user remains database owner
    Application user privileges remain limited
    Build only works in predetermined environments
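
The security arrangement can be sketched as follows. The role names `staging_build` and `app_user`, the environment names, and the function names are all hypothetical; the slides give no identifiers. The pairing of CREATEDB with role membership matters because PostgreSQL only lets a role hand a database to a new owner if it owns the database, is a member of the new owning role, and has CREATEDB.

```shell
#!/bin/sh
# Sketch only. Role names and environment names are hypothetical.

# SQL to create the dedicated build user (run once by a superuser).
build_user_sql() {
    cat <<'SQL'
-- May create databases; inherits the application role's privileges.
CREATE ROLE staging_build LOGIN CREATEDB IN ROLE app_user;
SQL
}

# SQL to hand a freshly built database back to the application role.
#   $1 = database name
handoff_sql() {
    printf 'ALTER DATABASE "%s" OWNER TO app_user;\n' "$1"
}

# Refuse to run anywhere except an allowed environment.
#   $1 = environment name
require_allowed_env() {
    case "$1" in
        staging|development) return 0 ;;
        *) echo "refusing to build in environment: $1" >&2; return 1 ;;
    esac
}
```

A build script would call `require_allowed_env "$BUILD_ENV" || exit 1` before doing anything else, where `BUILD_ENV` is again a hypothetical variable.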

Questions?
Vanessa Hurst, Paperless Post, @DBNess
Postgres Open, September 2011

More Tools
  Copies: LVM Snapshots
    See talk by Jon Erdman at PG Conf EU
    Great for all reads
    Data stays virtualized & doesn’t take up space until changed
    Ideal for DDL changes without actual data changes

More Tools
  Copies, Slices: pg_staging by dimitri
    http://github.com/dimitri/pg_staging
    Simple: pauses pgbouncer & restores backup
    Efficient: leverages bulk loading
    Flexible: supports varying psql files
    Custom: limited
  Slices: replicate by rtomayko of GitHub
    http://github.com/rtomayko/replicate
    Simple: preserves object relations via ActiveRecord
    Inefficient: creates text-based .dump
    Inflexible: corrupts id sequences on data insert
    Custom: highly


Editor's Notes

  • #2: I am Vanessa Hurst and I lead Data and Analytics at Paperless Post, a customizable online stationery startup in New York. I studied Computer Science and Systems and Information Engineering at the University of Virginia. I have experience in databases ranging from CMSes of a few hundred megabytes for non-profits to terabytes of financial data and high-traffic consumer websites. I've worked in data processing, product development, and business intelligence. I am a happy open-source convert and lone data wrangler in a land of web developers using Ruby on Rails.
  • #3: Static Data
  • #8: This may include external, legal regulations or internal regulations such as terms of service. Data protection can also include mitigating risk or proactively screening before data is even available. Examples: HIPAA regulations, PCI compliance, API terms of use.
  • #9: Any other reasons?
  • #10, #11: Requirements
      Slice: significantly less space, power, & memory in staging and dev environments, so a smaller data set is needed
      Mutation: protect user data in highly personal communications, prevent staging use of customer emails
      Daily refresh
    Resources
      Source: production server w/ample space, power, & memory
      Destination: weak, shared staging infrastructure across several servers, local machine development infrastructure
      Expertise: flexible infrastructure automation tools, many application developers, limited DBA time
  • #12: Quick vocabulary. Backup & restore, trigger-based replication: there are plenty of options that are all straightforward, but they don’t give you a lot of leeway on resources.
  • #13: Most common case
  • #16: If you’re doing Business Intelligence, you need a copy of your production database. Figure it out.
  • #17, #18: Vertical: difficult to keep data valid & usable, because valid units of space are not always valid in an application, e.g. WAL logs, pages 1-16 => smaller, finite size, not usable.
    Horizontal: requires application logic, highly customized but usable, e.g. users with ids 1-50, users who joined before July 4, users who are admins, any SQL logic.
  • #19, #20: http://www.postgresql.org/docs/current/static/app-pgdump.html
    Options to:
      Dump OIDs in case your app uses them
      Leave out ownership commands (if staging environments run as different users)
  • #21: Static Data
  • #29: Dedicated schema preserves all table, index, sequence names, etc
  • #34, #35, #38, #39: Only the build process is staging-specific; all other privileges and settings match production.
  • #43: http://github.com/rtomayko/replicate