Scaling Magento
Reid Parham, Aaron Edmonds, and Kyle Terry
Public distribution: sensitive information omitted.
www.Copiousinc.com
COPIOUS
● User-Centered Digital Experience Agency
● Strategy
● Experience
● Engineering
● http://copio.us/
Scale Your Code
A.K.A. Magento is hard
Code Management
● Magento is big!
o Our project has over 820,000 lines of PHP
● Multi-lingual, multi-currency, multi-store
● Classes can have complex names
o *cough*
Enterprise_Reward_Block_Adminhtml_Customer_Edit_Tab_Reward_History_Grid_Column_Renderer_Reason
*cough*
Code Management (cont.)
● Configuration is driven by XML
● The dreaded EAV
● Magento Indices
● Event-Observer
Code Management (Tools)
Good tools make the job easier!
● A good IDE
o Magicento
● Commerce Bug 2
● n98-magerun
Code Management
● NEVER modify core files
o Magento’s forum never helped
● NEVER* add files to app/code/local/Mage
o Magento was built to be modular**
● Test your code with flat catalog enabled
and disabled
● Before overwriting classes, check for events
Code Optimization (Quick Wins)
Caching Magento Blocks
● DIY! Event to add cache data:
core_block_abstract_to_html_before
● OR use a module
https://github.com/aligent/CacheObserver
Code Optimization (Quick Wins)
Mage::getModel('catalog/product')->load($_product->getId());
● This is bad in templates and when looping
over product collections
● Load with initial data select
o used_in_product_listing attribute option
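The cost of loading each product inside a loop can be shown with a toy model. This is a minimal sketch in Python, not Magento code: `FakeCatalog`, `load_many`, and the query counter are hypothetical stand-ins that only count round trips to the database.

```python
from dataclasses import dataclass


@dataclass
class FakeCatalog:
    """Hypothetical in-memory stand-in for a product table; counts queries."""
    rows: dict
    queries: int = 0

    def load(self, product_id):
        # One query per call, like a full model load inside a loop.
        self.queries += 1
        return dict(self.rows[product_id])

    def load_many(self, product_ids, attributes):
        # One query for the whole list, selecting only the needed attributes.
        self.queries += 1
        return [
            {name: self.rows[pid][name] for name in attributes}
            for pid in product_ids
        ]


def render_per_item(catalog, product_ids):
    # Anti-pattern: a separate load for every product in the listing.
    return [catalog.load(pid)["name"] for pid in product_ids]


def render_batched(catalog, product_ids):
    # Preferred: the attribute is part of the initial data select.
    return [row["name"] for row in catalog.load_many(product_ids, ["name"])]
```

Both functions return the same names; the per-item version issues one query per product, while the batched version issues a single query for the whole listing.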
Code Optimization
Make efficient use of Magento indices
● Example: Catalog URL Rewrites
o Includes all products by default (including products
marked as “Not Visible Individually”)
o Do you need SEO friendly URLs for products that
will never be seen???
o Reduce your index size by up to 95%
o Mage_Catalog_Model_Resource_Url::_getProducts
Code Optimization (Quick Wins)
Mage_Catalog_Model_Resource_Product_Type_Configurable_Product_Collection::isEnabledFlat?
FALSE
Systems
● Hardware Profile
● Cluster Design
● Scaling
Hardware Profile (overview)
● 2 racks of hardware and dozens of servers
● Highest quality available (and compatible)
chipsets and memory
● Buffered DDR3; 1 channel per CPU
● 126 kW of stable, reliable, redundant, and
backed up power
● Minor kernel tweaks
Hardware Profile (network)
● NetScaler for load balancing
○ Vserver pools
○ Balances web, database, admin, and Endeca
○ Monitors will remove downed hosts
● Redundant Network Infrastructure
○ Backplane uses LACP (link aggregation) for
redundancy, load balancing and failover
○ HA pairing of configurations
Hardware Profile (network)
Dynamic port forwarding for browsing:
kyle@localhost $ ssh -L 2221:127.0.0.1:2221 whitelistedhost.example.com
kyle@whitelistedhost $ ssh -D 2221 cluster.example.com
Static port forwarding for Navicat SSH tunneling (tunneling through a tunnel):
kyle@localhost $ ssh -L 2222:127.0.0.1:2222 whitelistedhost.example.com
kyle@whitelistedhost $ ssh -L 2222:127.0.0.1:22 cluster.example.com
Hardware Profile (web)
● Dual Intel Xeon E3-1230 @ 3.30GHz
● 32 GB RAM
● Dozens of servers
● nginx and PHP5-FPM
● 6:1 ratio of PHP processes to CPU cores
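The 6:1 ratio can be turned into a rough sizing check. This is a minimal sketch: the ratio and the 32 GB figure come from the slide, while the core count passed in and the memory reserved for the OS and nginx (`reserved_gb`) are assumptions to replace with measured values.

```python
def fpm_worker_budget(cpu_cores, ratio=6, ram_gb=32, reserved_gb=4):
    """Return (worker_count, MB of RAM available per worker).

    ratio and ram_gb come from the slide; cpu_cores and reserved_gb are
    assumptions supplied by the caller.
    """
    if cpu_cores <= 0 or ratio <= 0:
        raise ValueError("cpu_cores and ratio must be positive")
    if reserved_gb >= ram_gb:
        raise ValueError("reserved_gb must be smaller than ram_gb")
    workers = cpu_cores * ratio
    per_worker_mb = (ram_gb - reserved_gb) * 1024 / workers
    return workers, per_worker_mb


# Hypothetical host with 8 cores: 48 workers share the unreserved memory.
workers, per_worker_mb = fpm_worker_budget(cpu_cores=8)
```

If the per-worker budget falls below the PHP memory limit actually observed under load, either the ratio or the installed RAM has to change.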
Hardware Profile (database)
● Redundant database hosts
● MySQL 5.6 chosen for scaling capability
● tcmalloc further improves throughput
● Master/slave replication
● Standby hosts for warm failover
● Failure point: > 4,000 checkouts/hour
Hardware Profile (database)
● Quad Intel Xeon E7-2860
○ 10 cores + hyperthreading each, totaling 80 threads
● 128 GB of RAM
● RAID10 SSDs for data
○ writeback cache; noatime,noexec mount options
● RAID1 HDDs for OS
Oops!
Hardware Profile (cache)
● Powering discrete instances of Redis
○ Sessions
○ Full page cache
○ Magento back end cache
○ Background processing queues
● Discrete instances are for threading, differing
memory limits, differing backup rules, and
multi-db deprecation
Hardware Profile (cache)
● Content is compressed with LZF
○ LZF compresses and decompresses faster than
gzip, so it is a good fit for cache content
● Decreased utilization of network capacity
● Sentinel for failover (soon)
● RDB BGSAVE: prime number intervals
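Why prime save intervals help can be checked with a little arithmetic: two periodic jobs collide every least common multiple of their intervals, and distinct primes share no common factor. A minimal sketch, assuming four instances with hypothetical intervals in seconds (the talk does not give the real values):

```python
from math import gcd


def overlap_period(interval_a, interval_b):
    """Seconds between moments when both schedules fire together (their LCM)."""
    return interval_a * interval_b // gcd(interval_a, interval_b)


def coincidences(intervals, horizon):
    """Count seconds in (0, horizon] at which two or more schedules fire at once."""
    return sum(
        1
        for t in range(1, horizon + 1)
        if sum(t % i == 0 for i in intervals) >= 2
    )


DAY = 24 * 60 * 60
round_intervals = [300, 600, 900, 1200]   # hypothetical "tidy" schedule
prime_intervals = [293, 599, 907, 1201]   # hypothetical prime schedule

round_hits = coincidences(round_intervals, DAY)
prime_hits = coincidences(prime_intervals, DAY)
```

With these assumed values the tidy schedule has simultaneous saves many times a day, while the closest pair of prime intervals does not coincide within a single day.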
Compression Outcomes
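The trade described here (spend CPU to save RAM and network) can be sketched in a few lines. Python's standard library has no LZF codec, so this example swaps in zlib at a low compression level purely as a stand-in; the payload is a made-up HTML fragment, and the ratios it produces say nothing about LZF itself.

```python
import zlib


def compress_entry(payload: bytes, level: int = 1) -> bytes:
    """Compress a cache payload; a low level favours speed over ratio."""
    return zlib.compress(payload, level)


def decompress_entry(blob: bytes) -> bytes:
    """Restore the original payload when the cache entry is read."""
    return zlib.decompress(blob)


def savings(payload: bytes, blob: bytes) -> float:
    """Fraction of bytes saved on the wire and in RAM."""
    return 1 - len(blob) / len(payload)


# Hypothetical cached block: repetitive markup compresses well.
payload = b'<li class="item"><a href="/product/123">Name</a></li>' * 200
blob = compress_entry(payload)
```

Cached HTML and serialized configuration tend to be repetitive, which is why compressing them before writing to Redis reduces both memory use and network traffic.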
Hardware Profile (cache)
● Quad Intel Xeon E5-2620 @ 2.00GHz
● 128 GB of RAM
● 4 bonded network interfaces
○ Prevents saturation of private network
○ 4 Gb/s
○ Bonding mode 5 (balance-tlb)
■ No special switch support
■ Nice when the colo manages the switch
Hardware Profile (utility)
● Cron and systems jobs
● Scripts
● Deploys
● Chef Server 10 for deploy and configuration
● Tests
○ Database test suite in Perl (Test::DatabaseRow)
● Backups (and copies)
Cluster Overview
● Production
○ Most hardware serves production
● Staging
○ Some data promoted to production nightly
● Preview{1..n}
○ Instances for testing and previewing new features,
bug fixes and design changes.
Production Uptime
● Aggregate hardware availability exceeds
six nines (99.9999%)
● Software availability is ~99.999%
● Software, including deployments: 99.98%
● Software, including maintenance: 99.9%
● Non-recoverable human errors: 98%
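Availability percentages are easier to compare as downtime. A minimal sketch, assuming a 365-day year and that each figure is measured over a full year (the talk does not state the measurement window):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent):
    """Expected unavailable seconds in a 365-day year."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for label, pct in [
    ("hardware", 99.9999),
    ("software", 99.999),
    ("software incl. deployments", 99.98),
    ("software incl. maintenance", 99.9),
]:
    print(f"{label}: {downtime_seconds_per_year(pct) / 60:.1f} minutes/year")
```

Under these assumptions, six nines corresponds to roughly 31.5 seconds of downtime per year, and 99.9% to roughly 8.76 hours.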
Scale Your Team
Team Profile
● 16 committers; 8.25 FTE
● 4 Project Managers
● 5 departments
● 31 vendors
● 5 time zones
Team Values
● State your needs; respect others’
● Respect is given, then adjusted
● Process can always change and improve
● Work/life balance
● Mature and non-aggressive; mediate conflict
● Honesty and transparency
Team Mantras
● Trust (relevant) data; make things visible
● Measurable, repeatable, falsifiable
(scientific method)
● Redundancy reduces risks (if documented)
● Set expectations (timing, contents, formats)
and deliver on them
Team Mantras
● Automate what is repeated
● Use known patterns and
proven architectures
● Grow talent from within
● Compartmentalization of some data,
code, and knowledge
10 Integrated Vendors
Adobe, Akamai, tax calculation,
legacy software, eBay, gift cards,
ERP (fulfillment and inventory), Oracle,
Tierpoint (Dallas, Seattle, Spokane),
Endeca provider
21 Accessory Vendors
advertising, application analytics, email,
hardware analysis and functionality, maps,
offsite storage, promotions, payment gateways,
remarketing, shipping estimates, SMS,
social networks, uptime
Effective Communication
● Group emails: avoid general questions,
assign actions to people, minimize
distribution lists
● Identify urgency of requests
● Use email filters
● Coach and mentor
Effective Communication
● Daily phone calls: only while needed
● Set an agenda; keep to a schedule
● Encourage people to skip calls
or to leave early
● End the call when completed
Tools
● GitHub
● Google Docs
● Pivotal Tracker
● Conference calls, Skype, and IM
● BugHerd
Launch Day
Release Day
QA Preparation
Productive!
Off-hours chaos
Build Knowledge
● Document the “obvious”
● 1000-line README
● Capture failures and solutions
● What happens when?
● Which database and server?
Automation Schedule
“This is how we work.”
Example Git Workflow
Learn from previous failures.
Code Review
● Standardize pull request structures
● Constructive feedback; ask questions
● emoji-cheat-sheet.com
Code Review
Pull requests can also be workspaces
Releases and Git flow: rhythm, ownership, and pride.
Deployments
● Monday through Thursday only!
● Communication: tickets, cross references,
pull requests, QA status, and releases
● Set expectations: timings for outages,
maintenance, and degraded functionality
● Are we done, yet?
● Explain outcomes and options
Community Participation
● Patches submitted
o Redis
o Cm_RedisSession
o Cm_Cache_Backend_Redis
o https://github.com/magento/magento2
● Modules improved
o CacheObserver
o VF_CustomMenu
Community Participation
● http://magento.stackexchange.com/
● http://stackoverflow.com/
● phpredis bug(s)
Collaboration Texts
● Spence, Muneera U. Collaborative
Processes lecture. 13 Apr. 2006.
● Marks, Andrea. "The Role of Writing in a
Design Curriculum." AIGA: Design Education
(2004).
● Katzenbach, Jon R., and Douglas K. Smith.
The Wisdom of Teams. HarperCollins, 2003.
Collaboration Texts
● Bennis, Warren, and Patricia W. Biederman.
Organizing Genius. Perseus, 1997.
● Marcum, James W. After the Information
Age. Peter Lang, 2006.
● https://en.wikipedia.org/wiki/Collaboration
(and collaborative method)
See Also
GitHub (and Gist)
@parhamr
@kyleterry
@aedmonds
Questions?

Editor's Notes

  • #2: How would you build the world’s largest, fastest, most complex Magento ecommerce store? Join three COPIOUS engineers as they share their approaches to this problem. This one-hour presentation will include the best practices, code samples, and system configurations necessary to scale Magento up to 100,000 daily orders with a catalog of 100,000 products.
    Client is publicly traded, so we’re constrained by federal regulations on some details.
    US retail sector; busiest periods, in order:
    1. Cyber Monday
    2. Pre-Christmas
    3. Post-Christmas
    4. Back to School
    5. Spring Break
    Site-wide average response time: 282 ms
  • #3: Founded in early 2000s.
    ● Native iOS and Android
    ● Ecommerce clusters
    ● Product configurations
    ● Complex integrations
    ● Marketing and content strategy
    We are hiring:
    ● Business Development Director
    ● Sr. Software Engineer / Engineering Manager
    ● Studio Manager
    ● DevOps Engineer
    ● Sr. Ruby on Rails Developer
    ● Sr. Strategist
    ● Mobile Engineer
  • #6: Keep things specific to Magento, not basic
  • #8: These are more like ground rules. If you’re modifying core files, you’re doing it wrong! It is all too common to see Magento forum recommendations telling people to just modify app/code/core/…
    Events: 406 events fired for the homepage, 663 for a category page, 1,038 for a PDP, and 836 for the cart.
  • #9: Blocks are where the rubber meets the road for Magento: the last piece in the chain of getting data to the end user. Many blocks are not cached (some rightly so, because they depend on the customer session). For instance, Magento CMS blocks go through the rendering process on every page where they are displayed. Many modules are available for this, including open source options.
    Production cache host has ~1 million keys for the back end cache:
    ● Commands per second: > 3,000
    ● Expirations per second: ~100
    ● Hit rate: ~85%
  • #10: It is common to see this on product listing pages, the cart page, and checkout review. Magento has accounted for this!
  • #11: Optimizing what is included in the indexes can be difficult, but it can provide some big payoffs if you have a large catalog.
    Rewrite Mage_Catalog_Model_Resource_Url::_getProducts
    Current runtime for catalog_url index: ~30 minutes
  • #12: What does this method return? This method is called in product list blocks as well as PDPs and other small pages like the cart and each step of the checkout via collectTotals.
  • #15: We don’t actually use 126 kW of power :)
    Sandy Bridge: not the latest and greatest but still good.
    Kernel tweaks include: socket limits, shared memory limits, open file limits, larger queues for networking, and IPv4 stability/security/capacity.
    IPv6 ignored at nginx layer.
  • #16: MySQL and HTTP monitors will remove hosts from the pool that go down. Maximum period between failure and pool removal is 7 seconds. Scripts try to recover downed instances by restarting services. Outcomes from outages are emailed to the group.
    See ARP table corruption? Is it every 4 hours? Do you have Cisco switches? This is the ARP cache lifetime :)
    Important NetScaler configs:
    ● Services: -cip ENABLED X-Forwarded-For -cltTimeout 30 -svrTimeout 120 -CKA YES
    ● Virtual server, port 80: -persistenceType NONE
    ● Virtual server, port 443: -persistenceType SSLSESSION
  • #17: SKIP if time is an issue.
    A locked-down network with no VPN means you need to get creative when working from home.
  • #18: These CPUs are fast enough and a great value; not *extreme* power. $240 each.
    ● Average daily load average of 0.7–1.2: sustained normal
    ● Load average of 5: target maximum load (35% performance degradation)
    ● Load average of 7+: “failure” load
    NGINX and PHP5-FPM: we are targeting a comfortable performance level. The ratio of PHP processes to CPU cores was found through trial and error; this is the lowest process count we could deploy without socket resets under crush loads. This quantity of PHP processes is possible with 32 GB of RAM in each web host.
    Several boxes were shipped with extra/junk/mismatched RAM (1 GB sticks) and review was necessary.
  • #19: https://github.com/blog/1422-tcmalloc-and-mysql MySQL 5.6 versus 5.5: https://dev.mysql.com/tech-resources/articles/mysql-5.6-rc.html TODO: What makes 5.6 scale better? • Better linear performance and scale on systems supporting multi-processors and high CPU thread concurrency • InnoDB has been re-factored to minimize legacy mutex contentions and bottlenecks. Better multi-processor support. We are interested in Percona and MariaDB but do not have the operational capacity to use either (discover, tune, configure, automate, etc.). Failure is defined as connection timeouts and socket resets for ~3 percent of users. It took us nearly a month to produce enough load to cause minor failures. The real/hard failure point is higher than this, but that’s the best we’ve been able to do! :P Configs: thread_cache_size = 512 (possibly too low!); table_open_cache = 12288; tmp_table_size = 512M; query_cache_type = 1 (on); query_cache_limit = 4M (supports SOAP and REST API integrations); query_cache_size = 512M (larger than this is problematic; it’s typically ~60% full); innodb_buffer_pool_size = 32G; innodb_log_buffer_size = 2G; innodb_log_file_size = 512M; innodb_file_per_table. Statistics: 42 TB of transmitted data; 23 TB of InnoDB writes in 90 days; 8.4 TB of InnoDB log churn; 100% thread cache hit rate; 99.996% table cache hit rate (large number of open tables possibly related to MySQL bugs #16244691 and #65384); 99.9999993% of table locks are immediate (most are nightly processes); Innodb_buffer_pool_wait_free: 0; Innodb_log_waits: 0; 85% query cache hit rate (this doesn’t really mean anything with such a high churn rate of data); 94.5% of temp tables are in memory (only nightly processes require disk tables); 99.97% of queries are faster than 200 ms, so we’ve reached a plateau of optimization. The average row lock is a bit slow as a consequence of Magento indexing architecture and our background processing queues. Moderate rate of random and sequential reads (table scans!), but we can absorb that overhead with hardware and focus on improving PHP code.
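Hit rates like those in note #19 are usually derived from `SHOW GLOBAL STATUS` counters. The sketch below uses commonly cited approximations; it is an assumption that the presenters computed their figures this way, and the sample counter values are made up for illustration.

```python
def ratio(part, whole):
    """Return part/whole, or 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


def mysql_hit_rates(status):
    """Approximate hit rates from MySQL global status counters.

    `status` maps counter names (as shown by SHOW GLOBAL STATUS) to integers.
    """
    return {
        # A new thread is created only when the thread cache cannot serve a connection.
        "thread_cache": 1.0 - ratio(status["Threads_created"], status["Connections"]),
        # Share of internal temporary tables that stayed in memory.
        "tmp_tables_in_memory": 1.0 - ratio(
            status["Created_tmp_disk_tables"], status["Created_tmp_tables"]
        ),
        # Com_select is not incremented for queries answered from the query cache.
        "query_cache": ratio(
            status["Qcache_hits"], status["Qcache_hits"] + status["Com_select"]
        ),
    }


# Hypothetical counters chosen only to illustrate the arithmetic.
sample_status = {
    "Threads_created": 5,
    "Connections": 1000,
    "Created_tmp_disk_tables": 55,
    "Created_tmp_tables": 1000,
    "Qcache_hits": 850,
    "Com_select": 150,
}
rates = mysql_hit_rates(sample_status)
```

As the note says, a high query cache hit rate means little when the cached data churns quickly, so these ratios are best read alongside write volume.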
  • #20: YES Lots of memory for DB cache. For things like placing orders, SSDs provide fast write speeds. MAYBE We have just over 10,000 write IOPS capacity (and sustain ~125 IOPS). YES Remote management, configuration, and validation of hardware RAID can be difficult; push the colocation facility to assign knowledgeable technicians. Partition mount configurations assume power will never be lost; optimal throughput and security. (RAID controllers do not have battery backup units installed.) These CPUs are about $2,000 each!
  • #21: We now know what a load average of 400 looks like! o.O (Ugly SQL that wanted a temp table of 2.4 quadrillion rows.) An Endeca query was trying to create the temp table.
  • #22: SKIP this if needed YES 3,000 commands per second; 0.7 ms average response time. MAYBE All Redis instances persist to disk with RDB. Only sessions’ RDB files are backed up off-server. (Sessions point to carts! We see a lot of anonymous users.) YES Downside of multiple instances: quadrupling of file descriptors and socket connections required some kernel ulimit tweaks. NO PHP Redis libraries are not quite mature enough to support persistent connections with PHP5-FPM: https://github.com/nicolasff/phpredis/issues/70 Zend disables pconnect.
  • #24: Compressing cache contents also increases storage capacity! We can afford the increased CPU overhead to improve RAM and network capability. Network throughput went from 500 Mbit/sec to 125 Mbit/sec. Sustained disk IO went from 80% to 12% utilization. Prime number intervals on the RDB BGSAVE reduce contention over disk IO, as write activity is less likely to overlap.
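The prime-interval idea in note #24 can be checked with a little arithmetic. This is a minimal sketch under simplifying assumptions: every instance starts its timer at the same moment, each save is treated as instantaneous, and the interval values (300 seconds versus the primes 293, 307, 311, 313) are hypothetical, since the talk does not give the real ones.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def seconds_between_full_overlaps(intervals):
    """Seconds until every instance saves at the same moment again."""
    return reduce(lcm, intervals)


def overlapping_moments(intervals, horizon):
    """Count the seconds within `horizon` when two or more instances save together."""
    count = 0
    for t in range(1, horizon + 1):
        saving = sum(1 for interval in intervals if t % interval == 0)
        if saving >= 2:
            count += 1
    return count


ONE_DAY = 86400
identical = [300, 300, 300, 300]   # four instances on the same schedule
primes = [293, 307, 311, 313]      # four distinct prime intervals

identical_overlaps = overlapping_moments(identical, ONE_DAY)
prime_overlaps = overlapping_moments(primes, ONE_DAY)
```

With identical intervals the four instances collide on every save; with distinct primes, any two schedules only coincide at a multiple of their product, which for these values is longer than a day. Real BGSAVE runs take time and depend on change thresholds, so this only illustrates the trend.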
  • #25: Several boxes were shipped with extra/junk RAM (2 GB sticks) and review was necessary. Balances transmitted data by changing the MAC address on the outgoing packets. No special switch support; 32 Gbps backplane :) Nice when the colo runs the switch. “The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave.” - Linux Foundation. CPUs are $450 each.
  • #26: SKIP this if needed Daily offsite backups; verified functional :) Bash, Ruby, Perl, Python, PHP Chef *really* wants to run every 30 minutes; prevent that with the `--once` argument. Test failures catch human errors; emails sent from failures are intentionally obnoxious
  • #28: Ubuntu, nginx, php-fpm and MySQL are reliable, predictable, and scriptable. Deployments take about 10 minutes because we’re cautious about database schema and large caches take time to clear. Magento’s architecture for indexing in 1.12 greatly constrains our uptime. We’ve written automation, adjusted sources of authority, and standardized communication workflows to prevent most human error. Human errors will quickly drag uptime down to 94% if response time is slow and mitigation/recovery plans are not documented.
  • #30: This is slightly above the size of most effective teams; managed through limiting scope of engagements for timing and components. We also cluster around sprints and product/feature development teams. We hire smart people who work like craftspersons. They enjoy building things that delight other people. Expertise is not required, but the ability to learn is. Client had some turnover; we have to be careful to not be perceived as a threat. Regions: England (vendor), East Coast US, Midwest, West Coast US, East Coast Australia.
  • #31: QUICK slide This is intended for peer groups with moderate homogeneity and similar cultural backgrounds. (Not a license for monoculture, though.) Give full respect up front; these are your peers. If a person is disrespectful or burdensome, provide suggestions, and then gently reduce respect. Some people worked *some* long weeks; not always. Many people even took vacations! Admitting mistakes is better for all.
  • #32: QUICK slide Scientific method Measure everything up front and develop your questions later (Borrowed from Big Data™, NoSQL, etc)
  • #33: QUICK slide Humans make errors; machines are made for repetition. We do not test in production! :) Standard POSIX process signaling and Ubuntu init scripts. Generally: Systems engineers need to know a moderate amount about a lot. Software engineers need to know a lot about a little. Project managers need to know a little about a lot. Operations engineers need to know a little about a lot.
  • #34: Easy mantra/value: nourish people with free beverages and comfortable, low-distraction environments. Feed your team! We worked many lunch hours and a few late nights. Bosses bought food and accommodated dietary needs.
  • #35: Introduction to vendors: names, titles, emails, time zones (and business hours), escalation procedures. Optimal scenario: “we speak for [client] and [vendor] deals directly with us.” Approach with a gentle demeanor (not here to take over and rule how everything goes). Small talk provides something to relate to; people seem quite affable toward the PNW. Share design goals, objectives, and values. Push the vendors to deliver; this should be done by company superiors.
  • #36: SKIP if needed Trust, and verify. Some documentation was wrong or missing.
  • #37: Urgencies have varying levels and definitions; find what works! Skype and instant messages are on a need-to-know basis. Cell phone contact should be rare and with explicit boundaries.
  • #38: SKIP if needed Phone calls are hard!
  • #39: Contact lists, issues triage, process documentation, collaborative editing, task delegation, history and context BOUNDARIES: I check in with others when I see their timestamps are outside of business hours. Documents with sensitive info are marked CONFIDENTIAL and shared with a minimal group. Some documents are internal only.
  • #40: As proof of our decent work/life balance, we see most commits are made during business hours in local time. (Committers are in 2 time zones and have flexible schedules.) Some late nights and weekends were had, but typically for specific sprints, maintenance, deployments, and chores.
  • #41: Cowboy coding in production: help or go away
  • #42: Launch day! June 28/29 The state of the codebase was compatible with release because we were making limited and deliberate changes.
  • #43: Release day. Features passing UAT are accepted before release.
  • #44: Some releases have pretty complex preparations. A team familiar with Git is an effective team. (It shouldn’t get in the way!)
  • #45: SKIP if needed. The network graph can be quite elegant with such a large and effective team.
  • #46: SKIP if needed. Off hours maintenance is sometimes a frenzied, guess and check process. We mitigate these risks with supporting people available and our commit/deploy rigor for safeguards.
  • #47: Some of my flurries of commits are documentation updates. I’m a risk in that I understand nearly everything; others hold me accountable by requesting documentation. The preview environment has a Chef HTML template that lists tickets, branches, URLs, known issues, and general notes. I’ve provided links and listings of which files determine application states and integrations. (Single sources of authority preferred)
  • #52: Standardize what gets included in a pull request. “What changed” (list) and “How to test” (steps, outcomes, caveats) are my favorites. Opportunity to teach and learn; see how others do things, and provide DRYing or refactoring advice. It’s a vulnerable moment that deserves to be uplifting and positive.
  • #53: Pull requests can provide advance notice to the group at large.
  • #56: Friday and weekend deployments eat up budgets and human capital by requiring people to be available. Educate people and standardize language regarding ticketing systems and GitHub flow. Quality assurance: always include steps to reproduce and expected outcome. Define what failure is; consider releasing incremental fixes. Do you coordinate releases by dates? Version numbers? Names? Standardize and schedule! Build routines. Explain whether issues have been solved or when they’re expected to be fixed. Build and grow trust by defining risks, mitigation options, rollback criteria, and recovery steps and timings.
  • #59: Reid’s 2007 undergraduate thesis was a survey of modern and contemporary literature with analysis for educational settings.
  • #62: Expand upon the “dropping people from email CC if they’re difficult” advice, please? Our most common habit is taking email threads internal to determine our preferred response. We explicitly state [internal thread] in the first line when this has been done. Some people are very knowledgeable but tend to provide advice or input past their job titles. It’s nice and well intentioned, but slows things down. Our habit has been to only ask questions of those people for subjects specifically under their job titles. What issues have you seen with splitting MySQL reads and writes? Checkout theoretically can experience problems with replication delay, but we haven’t seen it happen. Slaves will stop on any foreign key errors, and we’ve seen some. Features most frequently causing that problem have been reports. We’ve disabled reports because the client uses Analytics products for business intelligence. Database unit tests and validation tests have helped us catch human errors that would cause slaves to stop. Why use real hardware? “Walk before we can run.” Previous Magento partners could only optimize the site to a point where it required 64 web servers and very high IOPS. That wasn’t going to be easy on the cloud. The current application state would probably run pretty well on the cloud, but we’re risk averse and want full control. We’re planning to eventually get to the cloud, which should occur when this hardware is out of date. That will be an opportunity for full investigation of a custom ecommerce product (service oriented architecture; omnichannel integration).