MySQL and Search at Craigslist Jeremy Zawodny [email_address] http://craigslist.org/ [email_address] http://jeremy.zawodny.com/blog/
Who Am I? Creator and co-author of High Performance MySQL Creator of mytop Perl Hacker MySQL Geek Craigslist Engineer (as of July, 2008) MySQL, Data, Search, Perl Ex-Yahoo (Perl, MySQL, Search, Web Services)
What is Craigslist?
What is Craigslist? Local Classifieds Jobs, Housing, Autos, Goods, Services ~500 cities world-wide Free Except for jobs in ~18 cities and brokered apartments in NYC Over 20B pageviews/month 50M monthly users 50+ countries, multiple languages 40+M ads/month, 10+M images
What is Craigslist? Forums 100M posts 100s of forums
Technical and other Challenges High ad churn rate Post half-life can be short Growth High traffic volume Back-end tools and data analysis needs Growth Need to archive postings... forever! 100s of millions, searchable Internationalization and UTF-8
Technical and other Challenges Small Team Fires take priority Infrastructure gets creaky Organic code and schema growth over years Growth Lack of abstractions Too much embedded SQL in code Documentation vs. Institutional Knowledge “Why do we have things configured like this?”
Goals Use Open Source Keep infrastructure small and simple Lower power is good! Efficiency all around Do more with less Keep site easy and approachable Don't overload with features People are easily confused
Craigslist Internals Overview (diagram) Components: Load Balancer, Read Proxy Array, Write Proxy Array, Web Read Array, Object Cache, Read DB Cluster, Search Cluster Technology labels in the diagram: Perl + memcached (twice), Apache 1.3 + mod_perl, MySQL 5.0.xx, Sphinx Not included: user db, image db; async tasks, email; accounting, internal tools; and more!
Vertical Partitioning: Roles Users Classifieds Users Classifieds Forums Stats Archive Write Read Long Trash
Vertical Partitioning Different roles have different access patterns Sub-roles based on query type Easier to manage and scale Logical, self-contained data Servers may not need to be as big/fast/expensive Difficult to do retroactively Various named db “handles” in code
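The named db "handles" mentioned above can be pictured as a lookup from a role and sub-role to a connection target. This is only a minimal sketch: the role and sub-role names are taken from the preceding diagram slide, while the host names, the `DB_HANDLES` table, and the `handle_for` function are hypothetical and not from the talk.

```python
from typing import Dict, Tuple

# (role, sub_role) -> connection target.
# Host names are made up for illustration; the talk does not list them.
DB_HANDLES: Dict[Tuple[str, str], str] = {
    ("classifieds", "write"): "classifieds-master.db.example",
    ("classifieds", "read"): "classifieds-read.db.example",
    ("classifieds", "long"): "classifieds-long.db.example",
    ("forums", "read"): "forums-read.db.example",
    ("archive", "read"): "archive-read.db.example",
}


def handle_for(role: str, sub_role: str) -> str:
    """Return the connection target for a named handle (case-insensitive)."""
    try:
        return DB_HANDLES[(role.lower(), sub_role.lower())]
    except KeyError:
        raise KeyError(f"no db handle named {role}/{sub_role}") from None
```

Code that asks for a handle by name never needs to know which physical server answers, which is what would let each role be sized and scaled separately.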
Horizontal Partitioning: Hydra cluster_01 cluster_02 cluster_03 cluster_N ... client
Horizontal Partitioning: Hydra Need to retrofit a lot of code Need non-blocking Perl MySQL client Wrapped http://code.google.com/p/perl-mysql-async/ Eventually can size DB boxes based on price/power and adjust mapping function(s) Choose hardware first Make the db “fit” Archiving lets us age a cluster instead of migrating its data to a new one.
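The slides do not say what the Hydra mapping function(s) look like. One plausible shape, assuming postings have numeric IDs and each cluster owns a contiguous ID range (so that older clusters simply age), is sketched below. The ID boundaries and the `cluster_for` function are hypothetical; only the `cluster_NN` naming comes from the slide.

```python
import bisect

# Hypothetical mapping: each cluster owns a contiguous range of posting IDs.
# Upper bounds are exclusive and purely illustrative.
CLUSTER_UPPER_BOUNDS = [1_000_000, 2_500_000, 5_000_000]
CLUSTER_NAMES = ["cluster_01", "cluster_02", "cluster_03"]


def cluster_for(posting_id: int) -> str:
    """Return the name of the cluster assumed to hold posting_id."""
    if posting_id < 0:
        raise ValueError("posting_id must be non-negative")
    # bisect_right finds the first range whose upper bound exceeds the ID.
    idx = bisect.bisect_right(CLUSTER_UPPER_BOUNDS, posting_id)
    if idx >= len(CLUSTER_NAMES):
        raise KeyError(f"no cluster mapped for posting_id {posting_id}")
    return CLUSTER_NAMES[idx]
```

With a range-based table like this, adding a new cluster means appending one boundary and one name, and differently sized boxes can be given differently sized ranges.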
Search Evolution Problem:  Users want to find stuff. Solution:  Use MySQL Full Text. ...time passes... Problem:  MySQL Full Text Doesn't Scale! Solution:  Use Sphinx. ...time passes... Problem:  Sphinx doesn't scale! Solution:  Patch Sphinx.
MySQL Full-Text Problems Hitting invisible limits CPU not pegged, Memory available Disk I/O not  unreasonable Locking / Mutex contention? Probably. MyISAM has occasional crashing / corruption 5 clusters of 5 machines Partitioning based on city and category All “hand balanced” and high-maintenance ~30M queries/day Close to limits
Sphinx: My First CL Project Sphinx is designed for text search Fast and lean C++ code Forking model scales well on multi-core Control over indexing, weighting, etc. Also spent some time looking at Apache Solr
Search Implementation Details Partitioning based on cities (each has a numeric id) Attributes vs. Keywords Persistent Connections Custom client and server modifications Minimal stopword List Partition into 2 clusters (1 master, 4 slaves)
Sphinx Incremental  Indexing Re-index every N minutes Use main + delta strategy Adopted as: index + today + delta One set per city (~500 * 3) Slaves handle live queries, update via rsync Need lots of FDs Use all 4 cores to index Every night, perform “daily merge” Generate config files via Perl
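The "one set per city" layout and the generated config files can be illustrated roughly as follows. The talk says the real generator is written in Perl; this Python stand-in is only a sketch. The index naming scheme, the data directory, and the function names are assumptions, and each stanza is reduced to the `source` and `path` directives of a Sphinx `index` block.

```python
from typing import Iterable, List

# The three indexes kept per city, as named on the slide.
PARTS = ("index", "today", "delta")


def index_names(city_id: int) -> List[str]:
    """Names of the three indexes for one city (naming scheme is hypothetical)."""
    return [f"city_{city_id}_{part}" for part in PARTS]


def config_stanza(name: str, data_dir: str = "/var/sphinx/data") -> str:
    """Render a minimal index block; data_dir is an illustrative path."""
    return (
        f"index {name}\n"
        "{\n"
        f"    source = {name}\n"
        f"    path   = {data_dir}/{name}\n"
        "}\n"
    )


def generate_config(city_ids: Iterable[int]) -> str:
    """Concatenate the index blocks for every city."""
    return "\n".join(
        config_stanza(name) for cid in city_ids for name in index_names(cid)
    )
```

For roughly 500 cities this produces about 1,500 index blocks, which matches the "~500 * 3" on the slide and helps explain the need for lots of file descriptors.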
Sphinx Incremental Indexing
Sphinx Issues Merge bugs [fixed] File descriptor corruption [fixed] Persistent connections [fixed] Overhead of fork() was substantial in our testing 200 queries/sec vs. 1,000 queries/sec per box Missing attribute updates [unreported] Bogus docids in responses We need to upgrade to latest Sphinx soon Andrew and team have been excellent!
Search Project Results From 25 MySQL Boxes to 10 Sphinx Lots more headroom! New Features Nearby Search No seizing or locking issues 1,000+ qps during peak w/room to grow 50M queries per day w/steady growth Cluster partitioning built but not needed (yet?) Better separation of code
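A rough back-of-the-envelope comparison of the numbers quoted in the talk (~30M queries/day on 25 MySQL boxes before, 50M queries/day on 10 Sphinx boxes after). It assumes load is spread evenly over all boxes, which the slides do not claim, so treat the per-box figures as averages only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_box_daily(queries_per_day: int, boxes: int) -> float:
    """Average queries per day handled by each box, assuming an even spread."""
    return queries_per_day / boxes


def average_qps(queries_per_day: int) -> float:
    """Average queries per second over a whole day."""
    return queries_per_day / SECONDS_PER_DAY


mysql_per_box = per_box_daily(30_000_000, 25)   # 1.2M queries/day per box
sphinx_per_box = per_box_daily(50_000_000, 10)  # 5M queries/day per box
sphinx_avg_qps = average_qps(50_000_000)        # about 579 qps on average
```

The daily average of roughly 579 qps sits well below the 1,000+ qps peak quoted on the slide, which is consistent with traffic being concentrated in busy hours.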
Sphinx Wishlist Efficient delete handling  (kill lists) Non-fatal “missing” indexes Index dump tool Live document add/change/delete Built-in replication Stats and counters Text attributes Protocol checksum
Data Archiving, Replication, Indexes Problem: We want to keep everything. Solution: Archive to an archive cluster. Problem: Archiving is too painful.  Index updates are expensive!  Slaves affected. Solution: Archive with home-grown eventually consistent replication.
Data Archiving: OOB Replication Eventual Consistency Master process SET SQL_LOG_BIN=0 Select expired IDs Export records from live master Import records into archive master Delete expired from live master Add IDs to list
Data Archiving: OOB Replication Slave process One per MySQL slave Throttled to minimize impact State kept on slave Clone friendly Simple logic Select expired IDs added since my sequence number Delete expired records Update local “last seen” sequence number
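The master and slave steps above can be simulated in memory to show the control flow. This is a sketch under stated assumptions, not the Craigslist implementation: plain dicts stand in for MySQL tables, a Python list stands in for the expired-ID list (its positions act as sequence numbers), and all class and function names are hypothetical. In the real setup the master process runs with `SET SQL_LOG_BIN=0`, so its deletes are not written to the binary log; here that is modeled by the slave holding its own copy of the rows.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Server:
    """Stand-in for one MySQL server: posting id -> record."""
    rows: Dict[int, dict] = field(default_factory=dict)
    last_seen: int = 0  # slave-local "last seen" sequence number


def master_archive(live: Server, archive: Server, expired_log: List[int],
                   is_expired: Callable[[dict], bool]) -> List[int]:
    """Master process: move expired rows to the archive and log their IDs."""
    expired_ids = [pid for pid, rec in live.rows.items() if is_expired(rec)]
    for pid in expired_ids:
        archive.rows[pid] = live.rows.pop(pid)  # import, then delete from live
        expired_log.append(pid)                 # add ID to the shared list
    return expired_ids


def slave_catch_up(slave: Server, expired_log: List[int],
                   batch_size: int = 100) -> int:
    """Slave process: delete IDs logged since last_seen, one throttled batch."""
    batch = expired_log[slave.last_seen:slave.last_seen + batch_size]
    for pid in batch:
        slave.rows.pop(pid, None)  # tolerate rows that are already gone
    slave.last_seen += len(batch)
    return len(batch)
```

Because each slave keeps its own `last_seen` value, a cloned slave resumes from wherever the clone left off, which fits the "clone friendly" point on the slide. A small `batch_size` plays the role of the throttle.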
Long Term Data Archiving Schema coupling is bad ALTER TABLE takes forever Lots of NULLs flying around CouchDB or similar long-term? Schema-free feels like a good fit Tested some home grown solutions already Separate storage and indexing? Indexing with Sphinx?
Drizzle, XtraDB, Future Stuff CouchDB looks very interesting.  Maybe for archive? XtraDB / InnoDB plugin Better concurrency Better tuning of InnoDB internals libdrizzle + Perl DBI/DBD may not fit an async model well Can talk to both MySQL and Drizzle! Oracle buying Sun?!?!
We're Hiring! Work in San Francisco Flexible, Small Company Excellent Benefits Help Millions of People Every Week We Need Perl/MySQL Hackers Come Help us Scale and Grow
Questions?
