dealing with large-scale failures takes a qualitatively different approach
a set of design principles



AWS
Amazon Simple Storage Service (S3)
Scoping the failure

core tenets
no long-term contracts
available on demand
elastic

Intro to AWS
EC2
EBS
VPC
S3
SQS
SimpleDB
CDN (CloudFront)
EMR
RDS


CAP
consistency, availability, or partition tolerance, but not all three

Capacity
82 million objects
100,000 requests per second


Failures will occur in the data center

  • expect drives to fail
  • expect network connections to fail
  • expect a single machine to go out
failure scenarios
must think through all possible failures and decide which ones are important
manifestations take on different forms
  • corruption of stored and transmitted data
  • losing one machine in fleet
  • losing an entire data center
  • losing entire data center and one machine in another datacenter
causes of failure
  • human error
    - network configuration
    - pulled cords
    - forgetting to expose a load balancer to external traffic
    - DNS black holes
  • software bugs
  • acts of nature
    - flooding
    - heat waves
    - lightning (has happened 5 times to Amazon and caused a partial outage once)
  • entropy
    - drive failures
    - a rack switch failure makes half the hosts in the rack unreachable

  • beyond scale
some dimensions of scale are easy to manage
- amount of free space in the system
- precise measurements of when you could run out
- no ambiguity
- acquisition of components from multiple suppliers

some dimensions of scale are more difficult
- request rate
- ultimate manifestation: a DDoS attack

Timely failure detection
propagation of failure information must handle or avoid
- scaling bottlenecks of its own
- failure of centralized failure-detection units
- asymmetric routes

S3's gossip approach to failure detection
gossip, or epidemic, protocols are useful tools when probabilistic consistency can be used

basic idea
- applications and components heartbeat their existence (see sketch below)

not easy: data changes at different rates, and there is a network overlay
can't exchange all gossip state
the network overlay must be taken into consideration
doesn't handle the bootstrap case
doesn't address the issue of application lifecycle
not all state transitions in the lifecycle should be performed automatically; for some, human intervention may be required
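
A minimal sketch of the heartbeat-and-gossip idea in Python; the fan-out, suspicion timeout, and class/field names are illustrative assumptions, not S3's actual protocol:

    import random
    import time

    FANOUT = 3            # peers contacted per gossip round (assumed)
    SUSPECT_AFTER = 10.0  # seconds without a fresh heartbeat before a node is suspected (assumed)

    class GossipNode:
        def __init__(self, name, peers):
            self.name = name
            self.peers = peers                  # list of other GossipNode objects
            self.state = {name: time.time()}    # node name -> freshest heartbeat timestamp seen

        def heartbeat(self):
            # each component periodically refreshes its own entry
            self.state[self.name] = time.time()

        def gossip_round(self):
            # epidemic propagation: push our view of the world to a few random peers
            for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
                peer.receive(self.state)

        def receive(self, remote_state):
            # merge: keep the freshest heartbeat observed for each node
            for node, ts in remote_state.items():
                if ts > self.state.get(node, 0.0):
                    self.state[node] = ts

        def suspected_failures(self):
            now = time.time()
            return [n for n, ts in self.state.items() if now - ts > SUSPECT_AFTER]

This only covers the happy path; as the notes say, it does not handle bootstrap, lifecycle transitions, or an overlay-aware choice of peers.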

DESIGN PRINCIPLES (to help the system be resilient)
- service relationships should be tolerant
- decoupling functionality into multiple services has a standard set of advantages
need to protect yourself from upstream service dependencies when they haze you
(e.g., leases granting permission to call only a certain number of times)
protect yourself from downstream service dependencies when they fail
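
One common way (not necessarily S3's) to bound the damage in both directions is a lease/quota on callers and a circuit breaker around callees; a rough Python sketch with assumed names and limits:

    import time

    class Lease:
        # caps how many calls an upstream dependant may make in a time window (assumed policy)
        def __init__(self, max_calls, window_secs):
            self.max_calls, self.window = max_calls, window_secs
            self.calls, self.window_start = 0, time.time()

        def allow(self):
            now = time.time()
            if now - self.window_start > self.window:     # new window, reset the budget
                self.calls, self.window_start = 0, now
            self.calls += 1
            return self.calls <= self.max_calls

    class CircuitBreaker:
        # stops calling a failing downstream dependency for a cool-off period (assumed policy)
        def __init__(self, max_failures=5, cool_off_secs=30.0):
            self.max_failures, self.cool_off = max_failures, cool_off_secs
            self.failures, self.opened_at = 0, None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None and time.time() - self.opened_at < self.cool_off:
                raise RuntimeError("circuit open: downstream dependency cooling off")
            try:
                result = fn(*args, **kwargs)
                self.failures, self.opened_at = 0, None   # success closes the circuit
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()
                raise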

- code for large failures
- some failures you suppress entirely
  example: replication of entities (data)
- some systems must choose different behaviors based on the unit of failure
- anticipate data corruption (the end-to-end check includes the customer)
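
An end-to-end check that includes the customer means the client computes a digest over the exact bytes it intends to store and the service verifies it before acknowledging (S3 exposes this via the Content-MD5 request header); a minimal illustration of the idea:

    import base64
    import hashlib

    def put_with_checksum(payload: bytes):
        # client side: digest computed over exactly the bytes we intend to store
        digest = base64.b64encode(hashlib.md5(payload).digest()).decode()
        return payload, digest                   # both travel with the request

    def server_store(payload: bytes, claimed_digest: str):
        # server side: recompute and compare before acknowledging the write, so corruption
        # anywhere along the path (client buffers, proxies, NICs, anything that slipped
        # past the TCP checksum) is rejected rather than durably stored
        actual = base64.b64encode(hashlib.md5(payload).digest()).decode()
        if actual != claimed_digest:
            raise ValueError("checksum mismatch: refusing to store corrupted data")
        return "stored"

    body, md5 = put_with_checksum(b"hello, durable world")
    print(server_store(body, md5))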

- code for elasticity
the dimensions of elasticity
- need infinite elasticity for cloud storage
- quick elasticity for recovery from large-scale failures
introducing new capacity to a fleet (see the sketch below)
- ideally you can introduce more resources into the system and its capabilities increase
- all load balancing systems (hardware and software)
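
One standard way to let a fleet absorb new capacity with minimal data movement is consistent hashing (an illustrative technique here, not a claim about S3's internals); host names and virtual-node counts are assumptions:

    import bisect
    import hashlib

    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=100):
            self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
            self.hashes = [h for h, _ in self.ring]

        def add(self, node, vnodes=100):
            # adding capacity only moves the keys that now hash to the new node
            for i in range(vnodes):
                h = _h(f"{node}#{i}")
                pos = bisect.bisect(self.hashes, h)
                self.hashes.insert(pos, h)
                self.ring.insert(pos, (h, node))

        def lookup(self, key):
            pos = bisect.bisect(self.hashes, _h(key)) % len(self.hashes)
            return self.ring[pos][1]

    ring = ConsistentHashRing(["host-a", "host-b", "host-c"])   # assumed host names
    ring.add("host-d")                                          # elastic growth of the fleet
    print(ring.lookup("some-object-key"))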

- monitor, extrapolate, and react
- modeling (to determine choke points that need to be monitored)
- alarming
- reacting
- feedback loops (take what you observe and feed it back into the model of where to spend time, e.g. assessing durability in real time; toy example below)
- keeping ahead of failures
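
A toy version of the extrapolate-and-alarm loop, fitting a line to recent free-space samples so the alarm fires well before the fleet actually runs out (data and threshold below are illustrative assumptions):

    def hours_until_exhaustion(samples):
        # samples: list of (hour, free_terabytes) observations, oldest first
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_f = sum(f for _, f in samples) / n
        # least-squares slope of free space over time (TB per hour)
        slope = (sum((t - mean_t) * (f - mean_f) for t, f in samples) /
                 sum((t - mean_t) ** 2 for t, _ in samples))
        if slope >= 0:
            return float("inf")                # free space is not shrinking
        return -samples[-1][1] / slope         # hours until the fit crosses zero

    observations = [(0, 500.0), (24, 480.0), (48, 455.0), (72, 430.0)]   # illustrative data
    if hours_until_exhaustion(observations) < 30 * 24:                   # assumed 30-day horizon
        print("ALARM: projected to run out of capacity within 30 days")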

Code for frequent single-machine failures
- most common failure manifestation: a single box
- for persistent state, use quorum (sketch below)
- advantages:
  - does not require all ops to succeed
  - hides underlying failures
  - hides poor latency
- disadvantages:
  - increased aggregate load on the system for some ops
  - more complex
  - difficult to scale

- all ops have a "set size"
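
A bare-bones quorum sketch in Python showing the "set size" idea: each operation targets N replicas but needs only W (or R) successes, which hides a single dead or slow box as long as W + R > N. Replica counts and set sizes here are assumptions for illustration:

    N, W, R = 3, 2, 2      # replicas per key, write set size, read set size (W + R > N)

    class Replica:
        def __init__(self, healthy=True):
            self.healthy = healthy
            self.store = {}
        def put(self, key, value, version):
            if not self.healthy:
                raise IOError("replica unavailable")
            self.store[key] = (version, value)
        def get(self, key):
            if not self.healthy:
                raise IOError("replica unavailable")
            return self.store[key]

    replicas = [Replica(healthy=False), Replica(), Replica()]   # one box is down

    def quorum_put(key, value, version):
        acks = 0
        for r in replicas:
            try:
                r.put(key, value, version)
                acks += 1
            except IOError:
                pass                           # not every op has to succeed
        if acks < W:
            raise IOError("write quorum not reached")

    def quorum_get(key):
        answers = []
        for r in replicas:
            try:
                answers.append(r.get(key))
            except (IOError, KeyError):
                pass
        if len(answers) < R:
            raise IOError("read quorum not reached")
        return max(answers)[1]                 # highest version wins

    quorum_put("object-1", "hello", version=1)
    print(quorum_get("object-1"))              # "hello", despite the failed replica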

Game Days
network engineers and data-center technicians turn off a data center
- don't tell service owners
- accept the risk; it is going to happen anyway
- to start, build up to it gradually


real failure experiences
  • large outage last year
  • traced down to a single network card
  • once found, the problem was easy to reproduce
  • corrupt data leaked past the TCP checksum
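
Such an escape is plausible because the TCP checksum is only a 16-bit ones'-complement sum over 16-bit words, so (for example) it cannot detect two words being swapped. The snippet below is purely illustrative (not the actual S3 bug) and shows why an end-to-end application-level check like the one sketched earlier still matters:

    import struct

    def internet_checksum(data: bytes) -> int:
        # RFC 1071 ones'-complement sum over 16-bit words, as used by TCP
        if len(data) % 2:
            data += b"\x00"
        total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    good = b"ABCDEFGH"
    corrupt = b"CDABEFGH"   # e.g. a NIC or driver bug that swaps two 16-bit words
    print(internet_checksum(good) == internet_checksum(corrupt))   # True: corruption is invisible
    print(good == corrupt)                                         # False: the data really differs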