3. Integration Points
Integrations are the #1 risk to stability.
Every out-of-process call can and will eventually kill your system.
Yes, even database calls.
Saturday, June 23, 12
5. “In Spec” vs. “Out of Spec”
Example: Request-Reply using XML over HTTP
“In Spec” failures               “Out of Spec” failures
TCP connection refused           TCP connection accepted, but no data sent
HTTP response code 500           TCP window full, never cleared
Error message in XML response    Server replies with “EHLO”
                                 Server sends link farm HTML
                                 Server streams Weird Al mp3s
Well-Behaved Errors              Wicked Errors
6. Remember This
Necessary evil.
Peel back abstractions.
Large systems fail faster than small ones.
Useful patterns: Circuit Breaker, Use Timeouts, Use Decoupling Middleware, Handshaking, Test Harness
7. Chain Reaction
Failure moves horizontally across tiers
Common in search engines and app servers
8. Remember This
One server down jeopardizes the rest.
Hunt for Resource Leaks.
Useful pattern: Bulkheads
9. Cascading Failure
Failure moves vertically across tiers
Common in enterprise services & SOA
10. Remember This
“Damage Containment”
Stop cracks from jumping the gap
Scrutinize resource pools
Useful patterns: Use Timeouts, Circuit Breaker
11. Users
Too many, too clicky
Some malicious users
Buyers
Front-page viewers
Screen scrapers
12. Handle Traffic Surges Gracefully
Degrade features automatically
Shed load.
Don’t keep sessions for bots.
Reduce per-user burden:
IDs, not object graphs.
Query parameters, not result sets.
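A minimal sketch of "shed load", assuming a fixed cap on in-flight requests is an acceptable policy. The class name LoadShedder, the cap, and the fallback value are hypothetical and not from the slides. Once the cap is reached, new requests get the fallback immediately instead of queueing behind the work already in progress.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: rejects work once a fixed number of requests are in flight. */
class LoadShedder {
    private final Semaphore permits;

    LoadShedder(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /**
     * Runs the handler if capacity is available. Otherwise returns the
     * fallback right away rather than making the caller wait.
     */
    <T> T handle(Supplier<T> handler, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // shed: answer fast, do not block the request thread
        }
        try {
            return handler.get();
        } finally {
            permits.release(); // always give the permit back, even on exceptions
        }
    }
}
```

A request handler might call shedder.handle(() -> renderFullPage(), degradedPage), where both names stand in for whatever the application already has.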
13. Blocked Threads
All request threads blocked = “crash”
Impossible to test away
Learn to use java.util.concurrent or System.Threading.
(Ruby & PHP coders, just avoid threads completely.)
14. Pernicious and Cumulative
Hung request handlers = less capacity.
Hung request handler = frustrated user/caller
When one of N threads hangs, each remaining thread serves 1/(N-1) extra requests on top of its previous share
15. Example: Blocking calls
In a request-processing method:
    String key = (String) request.getParameter(PARAM_ITEM_SKU);
    Availability avl = globalObjectCache.get(key);
In GlobalObjectCache.get(String id), a synchronized method:
    Object obj = items.get(id);
    if (obj == null) {
        obj = strategy.create(id);  // runs while the cache-wide lock is held
    }
    …
In the strategy:
    public Object create(Object key) throws Exception {
        return omsClient.getAvailability(key);  // remote call inside the lock
    }
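One possible way out of the trap above, sketched under assumptions: keep the cache in a ConcurrentHashMap and make the remote call without holding any lock. NonBlockingCache and its loader function are hypothetical names, not the code from the incident. The trade-off is that two threads missing on the same key may both call the back end, and the loader itself should still carry a timeout.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical cache that never holds a lock while the loader runs. */
class NonBlockingCache<K, V> {
    private final ConcurrentMap<K, V> items = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a wrapper around the remote call

    NonBlockingCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V cached = items.get(key);
        if (cached != null) {
            return cached;
        }
        // Miss: call the loader with no lock held, so a hung back end
        // stalls only the threads asking for uncached keys.
        V loaded = loader.apply(key);
        if (loaded == null) {
            return null; // ConcurrentHashMap cannot store null values
        }
        // If another thread loaded the same key first, keep its value.
        V raced = items.putIfAbsent(key, loaded);
        return raced != null ? raced : loaded;
    }
}
```

Threads asking for keys that are already cached keep working even when the loader hangs, which the synchronized version cannot offer.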
16. Remember This
Use proven constructs.
Don’t wait forever.
Scrutinize resource pools.
Beware the code you cannot see.
Useful patterns: Use Timeouts, Circuit Breaker
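"Don't wait forever" can be sketched with java.util.concurrent, assuming the risky call can be run on a separate executor. TimedCall, callOrFallback, and the fallback value are hypothetical names chosen for illustration. The caller waits a bounded time, then cancels the task and moves on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical wrapper that bounds how long a caller waits for a remote call. */
class TimedCall {
    private final ExecutorService pool;

    TimedCall(ExecutorService pool) {
        this.pool = pool;
    }

    <T> T callOrFallback(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            future.cancel(true);
            return fallback;
        }
    }
}
```

Cancellation only helps if the underlying call responds to interruption. Socket-level timeouts on the client library are still needed for calls that ignore it.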
17. Attacks of Self-Denial
BestBuy: XBox 360 Preorder
Amazon: XBox 360 Discount
Victoria’s Secret: Online Fashion Show
Anything on FatWallet.com
18. Defenses
Avoid deep links
Static landing pages
CDN diverts or throttles users
Shared-nothing architecture
Session only on 2nd click
Deal pool
19. Remember This
Open lines of communication.
Support your marketers.
20. Unbalanced Capacities
                   Hosts   Instances   Threads
Online Store       20      75          3,000
Order Management   6       6           450
Scheduling         1       1           25
Also shown: Customers, SiteScope (NYC), SiteScope (San Francisco)
21. Scaling Ratios
                   Dev     QA      Prod
Online Store       1/1/1   2/2/2   20/300/6
Order Management   1/1/1   2/2/2   4/6/2
Scheduling         1/1/1   2/2/2   4/2
22. Unbalanced Capacities
Scaling effect between systems
Sensitive to traffic & behavior patterns
Stress both sides of the interface in QA
Simulate back end failures during testing
23. SLA Inversion
Frammitz                     99.99%
Dependencies:
Partner 1's Application      No SLA
Partner 1's DNS              99%
Message Queues               99.99%
Message Broker               99%
Inventory                    99.9%
Pricing and Promotions       No SLA
Corporate MTA                99.999%
Corporate DNS                99.9%
SpamCannon's DNS             98.5%
SpamCannon's Applications    99%
What SLA can Frammitz really guarantee?
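A rough way to answer the question, assuming Frammitz needs every dependency on every request and that failures are independent (neither assumption comes from the slides): multiply the stated SLAs together. The class and method names below are hypothetical. Dependencies with no SLA are left out, so the result is an optimistic upper bound.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical calculator for the best-case availability of a service. */
class SlaInversion {
    /** Product of the dependency SLAs: all must be up at once, failures independent. */
    static double compositeAvailability(Iterable<Double> dependencySlas) {
        double product = 1.0;
        for (double sla : dependencySlas) {
            product *= sla;
        }
        return product;
    }

    public static void main(String[] args) {
        // SLA figures as listed on the slide; the "No SLA" entries are omitted.
        Map<String, Double> slas = new LinkedHashMap<>();
        slas.put("Partner 1's DNS", 0.99);
        slas.put("Message Queues", 0.9999);
        slas.put("Message Broker", 0.99);
        slas.put("Inventory", 0.999);
        slas.put("Corporate MTA", 0.99999);
        slas.put("Corporate DNS", 0.999);
        slas.put("SpamCannon's DNS", 0.985);
        slas.put("SpamCannon's Applications", 0.99);
        double best = compositeAvailability(slas.values());
        System.out.printf("Best case: %.2f%%%n", 100 * best);
    }
}
```

Under these assumptions the product falls well short of 99.99%, and the two dependencies with no SLA mean that even this figure cannot be promised.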
24. Remember This
No empty promises.
Monitor your dependencies.
Decouple from your dependencies.
Measure availability by feature, not by server.
Beware infrastructure services: DNS, SMTP, LDAP.
25. Unbounded Result Sets
Development and testing is done with small data sets
Test databases get reloaded frequently
Queries often bonk badly with production data volume
26. Unbounded Result Sets: Databases
SQL queries have no inherent limits
ORM tools are bad about this
Appears as slow performance degradation
27. Unbounded Result Sets: SOA
Chatty remote protocols, N+1 query problem
Hurts caller and provider
Caller is naive, trusts server not to hurt it.
28. Remember This
Test with realistic data volumes
Don’t trust data producers.
Put limits in your APIs.
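"Put limits in your APIs" can be sketched on the consumer side, assuming results arrive through an Iterator. BoundedFetch, Page, and take are hypothetical names. The caller reads at most maxRows items and learns whether the producer had more, instead of trusting the producer to stop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper that caps how many rows a caller will accept. */
class BoundedFetch {
    /** The rows that were read, plus a flag saying whether more were available. */
    static final class Page<T> {
        final List<T> rows;
        final boolean truncated;

        Page(List<T> rows, boolean truncated) {
            this.rows = rows;
            this.truncated = truncated;
        }
    }

    /** Reads at most maxRows items from the source and stops, however many remain. */
    static <T> Page<T> take(Iterator<T> source, int maxRows) {
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && source.hasNext()) {
            rows.add(source.next());
        }
        return new Page<>(rows, source.hasNext());
    }
}
```

A limit in the query or request itself is still preferable where the protocol supports one, since this guard only protects the caller's memory.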
30. Circuit Breaker
Ever seen a remote call wrapped with a retry loop?
    int remainingAttempts = MAX_RETRIES;
    while (--remainingAttempts >= 0) {
        try {
            doSomethingDangerous();
            return true;
        } catch (RemoteCallFailedException e) {
            log(e);
        }
    }
    return false;
Why?
31. Faults Cluster
Fast retries are good for dropped packets (but let TCP do that)
Most other faults require minutes to hours to correct
Immediate retries very likely to fail again
32. Faults Cluster
Problems with the remote host, application or the network will probably persist for a long time... minutes or hours
33. Bad for Users and Systems
Users:
Wait longer to get an error response.
What happens after final retry?
Systems:
Ties up threads, reducing overall capacity.
Multiplies load on server, at the worst times.
Induces a Cascading Failure
34. Stop Banging Your Head
Wrap a “dangerous” call
Count failures
After too many failures, stop passing calls
After a “cooling off” period, try the next call
If it fails, wait some more before calling again
States (event / action):
Closed
    on call / pass through
    call succeeds / reset count
    call fails / count failure
    threshold reached / trip breaker
Open
    on call / fail
    on timeout / attempt reset
Half-Open
    on call / pass through
    call succeeds / reset
    call fails / trip breaker
Transitions:
    Closed -> Open (pop)
    Open -> Half-Open (attempt reset)
    Half-Open -> Open (pop)
    Half-Open -> Closed (reset)
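The state machine on this slide can be sketched in Java as follows. This is a minimal illustration under assumptions, not the author's implementation: the class name, the OpenCircuitException type, the injectable clock, and the threshold and cooling-off parameters are all hypothetical. The wrapped call runs outside the lock, and in this simplified version several callers may slip through while the breaker is half-open.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Hypothetical circuit breaker with Closed, Open and Half-Open states. */
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Thrown instead of making the call while the breaker is open. */
    static class OpenCircuitException extends RuntimeException {
        OpenCircuitException() { super("circuit breaker is open"); }
    }

    private final int failureThreshold;
    private final long coolingOffMillis;
    private final LongSupplier clock; // injectable so tests can control time
    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0L;

    CircuitBreaker(int failureThreshold, long coolingOffMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolingOffMillis = coolingOffMillis;
        this.clock = clock;
    }

    synchronized State state() { return state; }

    <T> T call(Callable<T> dangerous) throws Exception {
        beforeCall();
        try {
            T result = dangerous.call(); // no lock held during the remote call
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void beforeCall() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < coolingOffMillis) {
                throw new OpenCircuitException(); // Open: on call / fail
            }
            state = State.HALF_OPEN; // Open: on timeout / attempt reset
        }
    }

    private synchronized void onSuccess() {
        failureCount = 0; // call succeeds / reset
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failureCount++; // call fails / count failure
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN; // trip breaker
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller would typically build it with System::currentTimeMillis as the clock and treat OpenCircuitException as the signal to degrade or queue the work.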
35. Considerations
Sever malfunctioning features
Degrade gracefully on caller
Critical work must be queued for later
36. Remember This
Stop doing it if it hurts.
Expose, monitor, track, and report state changes
Good against: Cascading Failures, Slow Responses
Works with: Use Timeouts
37. Bulkheads
Partition the system
Allow partial failure without losing service
Applies at different granularity levels
38. Common Mode Dependency
[Diagram: Foo and Bar both call Baz]
Foo and Bar are coupled via Baz
39. With Bulkheads
[Diagram: Foo calls Baz Pool 1, Bar calls Baz Pool 2; both pools are part of Baz]
Foo and Bar have dedicated resources from Baz.
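A minimal sketch of dedicated resources per caller, assuming Baz does its work on thread pools. BulkheadedBaz and its parameters are hypothetical names. Each caller gets its own small pool with a bounded queue, so a caller that saturates its partition is rejected there and leaves the other partition untouched.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical service that gives every caller its own bounded thread pool. */
class BulkheadedBaz {
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();
    private final int threadsPerCaller;
    private final int queuePerCaller;

    BulkheadedBaz(int threadsPerCaller, int queuePerCaller) {
        this.threadsPerCaller = threadsPerCaller;
        this.queuePerCaller = queuePerCaller;
    }

    private ExecutorService newPool() {
        // Fixed thread count and a bounded queue; the default handler throws
        // RejectedExecutionException when both are full.
        return new ThreadPoolExecutor(threadsPerCaller, threadsPerCaller,
                0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queuePerCaller));
    }

    /** Submits work to the partition reserved for this caller. */
    <T> Future<T> submit(String caller, Callable<T> work) {
        return pools.computeIfAbsent(caller, c -> newPool()).submit(work);
    }

    void shutdown() {
        for (ExecutorService pool : pools.values()) {
            pool.shutdownNow();
        }
    }
}
```

The cost is the one named in the slides that follow: threads reserved for Bar sit idle while Foo is overloaded.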
40. Remember This
Save part of the ship
Decide if less efficient use of resources is OK
Pick a useful granularity
Very important with shared-service models
Monitor each partition’s performance to SLA
41. Test Harness
Real-world failures are hard to create in QA
Integration tests work for “in-spec” errors, but not “out-of-spec” errors.
42. “In Spec” vs. “Out of Spec”
Example: Request-Reply using XML over HTTP
“In Spec” failures               “Out of Spec” failures
TCP connection refused           TCP connection accepted, but no data sent
HTTP response code 500           TCP window full, never cleared
Error message in XML response    Server replies with “EHLO”
                                 Server sends link farm HTML
                                 Server streams Weird Al mp3s
Well-Behaved Errors              Wicked Errors
43. “Out-of-spec” errors happen all the time in the real world.
They never happen during testing... unless you force them to.
44. Killer Test Harness
Daemon listening on network
Substitutes for the remote end of an interface
Can run locally (dev) or remotely (dev or QA)
Is totally evil
45. Just a Few Evil Ideas
Port    Nastiness
19720   Allows connection requests into the queue, but never accepts them.
19721   Refuses all connections.
19722   Reads requests at 1 byte / second.
19723   Reads HTTP requests, sends back random binary.
19724   Accepts requests, sends responses at 1 byte / second.
19725   Accepts requests, sends back the entire OS kernel image.
19726   Sends an endless stream of data from /dev/random.
Now those are some out-of-spec errors.
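One such evil endpoint can be sketched in a few lines of Java. This version covers the "TCP connection accepted, but no data sent" case; the class name SilentServer and the use of an OS-assigned port are assumptions made for illustration, not part of the harness described in the slides.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical harness endpoint: accepts TCP connections and never sends a byte. */
class SilentServer implements AutoCloseable {
    private final ServerSocket listener;
    private final List<Socket> held = new CopyOnWriteArrayList<>();

    SilentServer() throws IOException {
        listener = new ServerSocket(0); // port 0 asks the OS for any free port
        Thread acceptor = new Thread(this::acceptForever, "silent-server");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return listener.getLocalPort();
    }

    private void acceptForever() {
        while (!listener.isClosed()) {
            try {
                held.add(listener.accept()); // keep the socket open, never reply
            } catch (IOException e) {
                return; // listener was closed
            }
        }
    }

    @Override
    public void close() throws IOException {
        listener.close();
        for (Socket socket : held) {
            socket.close();
        }
    }
}
```

Pointing a client at it shows quickly whether that client sets a read timeout: one that does gets a SocketTimeoutException, one that does not hangs until the server is closed.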
46. Remember This
Force out-of-spec failures
Stress the caller
Build reusable harnesses for L1-L6 errors
Supplement, don’t replace, other testing methods
47. [Diagram: map of relationships among the antipatterns and patterns. Edge labels: counters, mitigates, reduces impact, lead to, leads to, exacerbates, found near, finds problems in, mutual aggravation, results from violating, damage, avoids, prevents, works with, can avoid.]
Antipatterns shown: Integration Points, Chain Reactions, Cascading Failures, Users, Blocked Threads, Attacks of Self-Denial, Scaling Effects, Unbalanced Capacities, Slow Responses, SLA Inversion, Unbounded Result Sets
Patterns shown: Use Timeouts, Circuit Breaker, Bulkheads, Steady State, Fail Fast, Handshaking, Test Harness, Decoupling Middleware