
Building Secure and Reliable Systems

by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield

March 2020

Intermediate to advanced

555 pages

16h 29m

English

O'Reilly Media, Inc.

Related skills

Site Reliability Engineering (SRE)

Associated roles

DevOps engineer
SRE

Why We Wrote This Book; Who This Book Is For; A Note About Culture; How to Read This Book; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
On Passwords and Power Drills; Reliability Versus Security: Design Considerations; Confidentiality, Integrity, Availability; Confidentiality; Integrity; Availability; Reliability and Security: Commonalities; Invisibility; Assessment; Simplicity; Evolution; Resilience; From Design to Production; Investigating Systems and Logging; Crisis Response; Recovery; Conclusion
Attacker Motivations; Attacker Profiles; Hobbyists; Vulnerability Researchers; Governments and Law Enforcement; Activists; Criminal Actors; Automation and Artificial Intelligence; Insiders; Attacker Methods; Threat Intelligence; Cyber Kill Chains™; Tactics, Techniques, and Procedures; Risk Assessment Considerations; Conclusion
Safe Proxies in Production Environments; Google Tool Proxy; Conclusion
Design Objectives and Requirements; Feature Requirements; Nonfunctional Requirements; Features Versus Emergent Properties; Example: Google Design Document; Balancing Requirements; Example: Payment Processing; Managing Tensions and Aligning Goals; Example: Microservices and the Google Web Application Framework; Aligning Emergent-Property Requirements; Initial Velocity Versus Sustained Velocity; Conclusion
Concepts and Terminology; Least Privilege; Zero Trust Networking; Zero Touch; Classifying Access Based on Risk; Best Practices; Small Functional APIs; Breakglass; Auditing; Testing and Least Privilege; Diagnosing Access Denials; Graceful Failure and Breakglass Mechanisms; Worked Example: Configuration Distribution; POSIX API via OpenSSH; Software Update API; Custom OpenSSH ForceCommand; Custom HTTP Receiver (Sidecar); Custom HTTP Receiver (In-Process); Tradeoffs; A Policy Framework for Authentication and Authorization Decisions; Using Advanced Authorization Controls; Investing in a Widely Used Authorization Framework; Avoiding Potential Pitfalls; Advanced Controls; Multi-Party Authorization (MPA); Three-Factor Authorization (3FA); Business Justifications; Temporary Access; Proxies; Tradeoffs and Tensions; Increased Security Complexity; Impact on Collaboration and Company Culture; Quality Data and Systems That Impact Security; Impact on User Productivity; Impact on Developer Complexity; Conclusion

Why Is Understandability Important?; System Invariants; Analyzing Invariants; Mental Models; Designing Understandable Systems; Complexity Versus Understandability; Breaking Down Complexity; Centralized Responsibility for Security and Reliability Requirements; System Architecture; Understandable Interface Specifications; Understandable Identities, Authentication, and Access Control; Security Boundaries; Software Design; Using Application Frameworks for Service-Wide Requirements; Understanding Complex Data Flows; Considering API Usability; Conclusion
Types of Security Changes; Designing Your Change; Architecture Decisions to Make Changes Easier; Keep Dependencies Up to Date and Rebuild Frequently; Release Frequently Using Automated Testing; Use Containers; Use Microservices; Different Changes: Different Speeds, Different Timelines; Short-Term Change: Zero-Day Vulnerability; Medium-Term Change: Improvement to Security Posture; Long-Term Change: External Demand; Complications: When Plans Change; Example: Growing Scope—Heartbleed; Conclusion
Design Principles for Resilience; Defense in Depth; The Trojan Horse; Google App Engine Analysis; Controlling Degradation; Differentiate Costs of Failures; Deploy Response Mechanisms; Automate Responsibly; Controlling the Blast Radius; Role Separation; Location Separation; Time Separation; Failure Domains and Redundancies; Failure Domains; Component Types; Controlling Redundancies; Continuous Validation; Validation Focus Areas; Validation in Practice; Practical Advice: Where to Begin; Conclusion
What Are We Recovering From?; Random Errors; Accidental Errors; Software Errors; Malicious Actions; Design Principles for Recovery; Design to Go as Quickly as Possible (Guarded by Policy); Limit Your Dependencies on External Notions of Time; Rollbacks Represent a Tradeoff Between Security and Reliability; Use an Explicit Revocation Mechanism; Know Your Intended State, Down to the Bytes; Design for Testing and Continuous Validation; Emergency Access; Access Controls; Communications; Responder Habits; Unexpected Benefits; Conclusion
Strategies for Attack and Defense; Attacker’s Strategy; Defender’s Strategy; Designing for Defense; Defendable Architecture; Defendable Services; Mitigating Attacks; Monitoring and Alerting; Graceful Degradation; A DoS Mitigation System; Strategic Response; Dealing with Self-Inflicted Attacks; User Behavior; Client Retry Behavior; Conclusion
Background on Publicly Trusted Certificate Authorities; Why Did We Need a Publicly Trusted CA?; The Build or Buy Decision; Design, Implementation, and Maintenance Considerations; Programming Language Choice; Complexity Versus Understandability; Securing Third-Party and Open Source Components; Testing; Resiliency for the CA Key Material; Data Validation; Conclusion
Frameworks to Enforce Security and Reliability; Benefits of Using Frameworks; Example: Framework for RPC Backends; Common Security Vulnerabilities; SQL Injection Vulnerabilities: TrustedSqlString; Preventing XSS: SafeHtml; Lessons for Evaluating and Building Frameworks; Simple, Safe, Reliable Libraries for Common Tasks; Rollout Strategy; Simplicity Leads to Secure and Reliable Code; Avoid Multilevel Nesting; Eliminate YAGNI Smells; Repay Technical Debt; Refactoring; Security and Reliability by Default; Choose the Right Tools; Use Strong Types; Sanitize Your Code; Conclusion
Unit Testing; Writing Effective Unit Tests; When to Write Unit Tests; How Unit Testing Affects Code; Integration Testing; Writing Effective Integration Tests; Dynamic Program Analysis; Fuzz Testing; How Fuzz Engines Work; Writing Effective Fuzz Drivers; An Example Fuzzer; Continuous Fuzzing; Static Program Analysis; Automated Code Inspection Tools; Integration of Static Analysis in the Developer Workflow; Abstract Interpretation; Formal Methods; Conclusion
Concepts and Terminology; Threat Model; Best Practices; Require Code Reviews; Rely on Automation; Verify Artifacts, Not Just People; Treat Configuration as Code; Securing Against the Threat Model; Advanced Mitigation Strategies; Binary Provenance; Provenance-Based Deployment Policies; Verifiable Builds; Deployment Choke Points; Post-Deployment Verification; Practical Advice; Take It One Step at a Time; Provide Actionable Error Messages; Ensure Unambiguous Provenance; Create Unambiguous Policies; Include a Deployment Breakglass; Securing Against the Threat Model, Revisited; Conclusion
From Debugging to Investigation; Example: Temporary Files; Debugging Techniques; What to Do When You’re Stuck; Collaborative Debugging: A Way to Teach; How Security Investigations and Debugging Differ; Collect Appropriate and Useful Logs; Design Your Logging to Be Immutable; Take Privacy into Consideration; Determine Which Security Logs to Retain; Budget for Logging; Robust, Secure Debugging Access; Reliability; Security; Conclusion
Defining “Disaster”; Dynamic Disaster Response Strategies; Disaster Risk Analysis; Setting Up an Incident Response Team; Identify Team Members and Roles; Establish a Team Charter; Establish Severity and Priority Models; Define Operating Parameters for Engaging the IR Team; Develop Response Plans; Create Detailed Playbooks; Ensure Access and Update Mechanisms Are in Place; Prestaging Systems and People Before an Incident; Configuring Systems; Training; Processes and Procedures; Testing Systems and Response Plans; Auditing Automated Systems; Conducting Nonintrusive Tabletops; Testing Response in Production Environments; Red Team Testing; Evaluating Responses; Google Examples; Test with Global Impact; DiRT Exercise Testing Emergency Access; Industry-Wide Vulnerabilities; Conclusion
Is It a Crisis or Not?; Triaging the Incident; Compromises Versus Bugs; Taking Command of Your Incident; The First Step: Don’t Panic!; Beginning Your Response; Establishing Your Incident Team; Operational Security; Trading Good OpSec for the Greater Good; The Investigative Process; Keeping Control of the Incident; Parallelizing the Incident; Handovers; Morale; Communications; Misunderstandings; Hedging; Meetings; Keeping the Right People Informed with the Right Levels of Detail; Putting It All Together; Triage; Declaring an Incident; Communications and Operational Security; Beginning the Incident; Handover; Handing Back the Incident; Preparing Communications and Remediation; Closure; Conclusion
Recovery Logistics; Recovery Timeline; Planning the Recovery; Scoping the Recovery; Recovery Considerations; Recovery Checklists; Initiating the Recovery; Isolating Assets (Quarantine); System Rebuilds and Software Upgrades; Data Sanitization; Recovery Data; Credential and Secret Rotation; After the Recovery; Postmortems; Examples; Compromised Cloud Instances; Large-Scale Phishing Attack; Targeted Attack Requiring Complex Recovery; Conclusion
Background and Team Evolution; Security Is a Team Responsibility; Help Users Safely Navigate the Web; Speed Matters; Design for Defense in Depth; Be Transparent and Engage the Community; Conclusion
Who Is Responsible for Security and Reliability?; The Roles of Specialists; Understanding Security Expertise; Certifications and Academia; Integrating Security into the Organization; Embedding Security Specialists and Security Teams; Example: Embedding Security at Google; Special Teams: Blue and Red Teams; External Researchers; Conclusion
Defining a Healthy Security and Reliability Culture; Culture of Security and Reliability by Default; Culture of Review; Culture of Awareness; Culture of Yes; Culture of Inevitably; Culture of Sustainability; Changing Culture Through Good Practice; Align Project Goals and Participant Incentives; Reduce Fear with Risk-Reduction Mechanisms; Make Safety Nets the Norm; Increase Productivity and Usability; Overcommunicate and Be Transparent; Build Empathy; Convincing Leadership; Understand the Decision-Making Process; Build a Case for Change; Pick Your Battles; Escalations and Problem Resolution; Conclusion

Content preview from Building Secure and Reliable Systems

Foreword by Michael Wildpaner

At their core, both Site Reliability Engineering and Security Engineering are concerned with keeping a system usable. Issues like broken releases, capacity shortages, and misconfigurations can make a system unusable (at least temporarily). Security or privacy incidents that break the trust of users also undermine the usefulness of a system. Consequently, system security is top of mind for SREs.

On the design level, security has become a highly dynamic property of distributed systems. We’ve come a long way from passwordless accounts on early Unix-based telephony switches (nobody had a modem to dial into them, or so people thought), static username/password combinations, and static firewall rules. These days, we instead use time-limited access tokens and high-dimensional risk assessment at millions of requests per second. Granular cryptography of data in flight and at rest, combined with frequent key rotation, makes key management an additional dependency of any networking, processing, or storage system that deals with sensitive information. Building and operating these infrastructure security software systems requires close collaboration between the original system designers, security engineers, and SREs.

The security of distributed systems has an additional, more personal, meaning for me. From my university days until I joined Google, I had a side career in offensive security with a focus on network penetration testing. I learned a lot about the fragility ...


