The document describes the response to a major incident involving the web2kafka service that occurred on January 6, 2017, in San Francisco and Toronto. It defines the key roles of incident commander, deputy, scribe, and communications liaison, and details their responsibilities during the incident. The incident was triggered by increased traffic, which led to memory issues and system failures. The root cause was identified as increased memory usage that caused the Linux oom-killer to intervene; the fix was a redeployment with adjusted settings to prevent recurrence.