Stability Patterns
                           …and Antipatterns




                              Michael Nygard
                        mtnygard@thinkrelevance.com
                                @mtnygard
                                  © Michael Nygard, 2007-2012
Saturday, June 23, 12
Stability Antipatterns




Integration Points

          Integrations are the #1 risk to stability.


          Every out of process call can and
          will eventually kill your system.


          Yes, even database calls.




Example: Wicked database hang




“In Spec” vs. “Out of Spec”
                                   Example: Request-Reply using XML over HTTP


        “In Spec” failures                                “Out of Spec” failures

        TCP connection refused                            TCP connection accepted, but no data sent
        HTTP response code 500                            TCP window full, never cleared
        Error message in XML response                     Server replies with “EHLO”
                                                          Server sends link farm HTML
                                                          Server streams Weird Al mp3s




                        Well-Behaved Errors                                Wicked Errors

Remember This

        Necessary evil.

        Peel back abstractions.

        Large systems fail faster than small ones.



        Useful patterns: Circuit Breaker, Use Timeouts, Use Decoupling Middleware,
                         Handshaking, Test Harness




Chain Reaction


        Failure moves horizontally across tiers
        Common in search engines and
        app servers




Remember This

         One server down jeopardizes the rest.
         Hunt for Resource Leaks.




         Useful pattern: Bulkheads




Cascading Failure

                          Failure moves vertically across tiers

                              Common in enterprise services
                                                    & SOA




Remember This

            “Damage Containment”
            Stop cracks from jumping the gap
            Scrutinize resource pools


            Useful patterns: Use Timeouts, Circuit Breaker




Users


          Too many, too clicky
          Some malicious users
          Buyers
          Front-page viewers
          Screen scrapers



Handle Traffic Surges Gracefully

          Degrade features automatically
          Shed load.
          Don’t keep sessions for bots.
          Reduce per-user burden:
                    IDs, not object graphs.
                    Query parameters, not result sets.



Blocked Threads


             All request threads blocked = “crash”
             Impossible to test away
             Learn to use java.util.concurrent or System.Threading.
             (Ruby & PHP coders, just avoid threads completely.)




Pernicious and Cumulative

                  Hung request handlers = less capacity.
                  Hung request handler = frustrated user/caller


                  Each remaining thread serves 1/(N-1) extra requests




Example: Blocking calls

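        In a request-processing method:
                        String key = (String)request.getParameter(PARAM_ITEM_SKU);
                        // Every request thread funnels through this one cache lookup.
                        Availability avl = globalObjectCache.get(key);


        In GlobalObjectCache.get(String id), a synchronized method:
                        Object obj = items.get(id);
                        if(obj == null) {
                          // Cache miss: the call below runs while the lock is still held.
                          obj = strategy.create(id);
                        }
                        …

        In the strategy:
                        public Object create(Object key) throws Exception {
                          // Out-of-process call: if it hangs, every thread waiting on get() hangs with it.
                          return omsClient.getAvailability(key);
                        }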



Remember This


         Use proven constructs.
         Don’t wait forever.
         Scrutinize resource pools.
         Beware the code you cannot see.


         Useful patterns: Use Timeouts, Circuit Breaker
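"Don't wait forever" could be sketched roughly as below: the remote call runs on a separate executor and the request thread waits for a bounded time only. `TimedCall`, `callWithTimeout`, and the fallback argument are names invented for this sketch, not something from the deck, and the unbounded cached pool is a simplification that a real system would replace with a sized pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long a request thread waits for a remote call.
final class TimedCall {
    // Daemon threads, so a hung call cannot keep the JVM alive.
    // Assumption for brevity: an unbounded cached pool.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-call");
        t.setDaemon(true);
        return t;
    });

    // Returns the call's result, or fallback if the call fails
    // or takes longer than timeoutMillis.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the caller moves on
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        }
    }
}
```

A request handler might then call something like `TimedCall.callWithTimeout(() -> omsClient.getAvailability(key), 500, null)`, where the 500 ms budget is an arbitrary example value.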


Attacks of Self-Denial


                            BestBuy: XBox 360 Preorder

                            Amazon: XBox 360 Discount

                        Victoria’s Secret: Online Fashion Show

                             Anything on FatWallet.com


Defenses

          Avoid deep links
          Static landing pages
          CDN diverts or throttles users
          Shared-nothing architecture
          Session only on 2nd click
          Deal pool

Saturday, June 23, 12
Remember This

                        Open lines of communication.

                        Support your marketers.




Unbalanced Capacities


                        Traffic sources: Customers, SiteScope NYC, SiteScope San Francisco

                                        Online Store    Order Management    Scheduling
                        Hosts                20                6                1
                        Instances            75                6                1
                        Threads           3,000              450               25


Scaling Ratios


                                             Dev      QA       Prod

                        Online Store         1/1/1    2/2/2   20/300/6

                        Order Management     1/1/1    2/2/2    4/6/2

                        Scheduling           1/1/1    2/2/2     4/2

Unbalanced Capacities


         Scaling effect between systems
         Sensitive to traffic & behavior patterns
         Stress both sides of the interface in QA
         Simulate back end failures during testing


SLA Inversion

                         Component                       SLA

                         Frammitz                        99.99%

                         Partner 1's Application         No SLA
                         Partner 1's DNS                 99%
                         Message Queues                  99.99%
                         Message Broker                  99%
                         Inventory                       99.9%
                         Pricing and Promotions          No SLA
                         Corporate MTA                   99.999%
                         Corporate DNS                   99.9%
                         SpamCannon's DNS                98.5%
                         SpamCannon's Applications       99%

                            What SLA can Frammitz really guarantee?

Remember This

          No empty promises.

          Monitor your dependencies.

          Decouple from your dependencies.

          Measure availability by feature, not by server.

          Beware infrastructure services: DNS, SMTP, LDAP.
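As a rough illustration of the question on the SLA Inversion slide: if Frammitz needs every one of its dependencies to be up, and their failures are independent (both are assumptions, not something the slide states), the best it can offer is the product of their figures. `SlaInversion` and `serialAvailability` are names made up for this sketch.

```java
// Hypothetical sketch: best-case availability of a service whose
// dependencies must all be up at the same time.
final class SlaInversion {
    // Product of the individual availabilities (each between 0.0 and 1.0),
    // which holds only if failures are independent.
    static double serialAvailability(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            if (a < 0.0 || a > 1.0) {
                throw new IllegalArgumentException("out of range: " + a);
            }
            product *= a;
        }
        return product;
    }

    public static void main(String[] args) {
        // SLA figures from the slide. The two "No SLA" dependencies are left
        // out, which makes this an optimistic bound.
        double bound = serialAvailability(
            0.99,    // Partner 1's DNS
            0.9999,  // Message Queues
            0.99,    // Message Broker
            0.999,   // Inventory
            0.99999, // Corporate MTA
            0.999,   // Corporate DNS
            0.985,   // SpamCannon's DNS
            0.99);   // SpamCannon's Applications
        System.out.printf("Best case: %.2f%%%n", bound * 100);
    }
}
```

Under those assumptions the eight published figures multiply out to roughly 95%, well short of the 99.99% shown for Frammitz, and the two dependencies with no SLA can only pull it lower.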



Unbounded Result Sets

         Development and testing are done with small data sets

         Test databases get reloaded frequently

         Queries often bonk badly with production data volume




Unbounded Result Sets: Databases


        SQL queries have no inherent limits

        ORM tools are bad about this

        Appears as slow performance degradation




Unbounded Result Sets: SOA


            Chatty remote protocols, N+1 query problem

            Hurts caller and provider

            Caller is naive, trusts server not to hurt it.




Remember This



                        Test with realistic data volumes
                        Don’t trust data producers.
                        Put limits in your APIs.
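One way "Put limits in your APIs" might look on the caller's side: stop reading after a fixed number of rows, however many the producer is willing to send. `BoundedResults` and `take` are hypothetical names for this sketch; for database queries, a SQL row limit or JDBC's `Statement.setMaxRows` serves a similar purpose closer to the source.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: a caller that does not trust the data producer.
final class BoundedResults {
    // Copies at most maxRows items from source, then stops reading.
    static <T> List<T> take(Iterator<T> source, int maxRows) {
        if (maxRows < 0) {
            throw new IllegalArgumentException("maxRows must be >= 0");
        }
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && source.hasNext()) {
            rows.add(source.next());
        }
        return rows;
    }
}
```

The cap holds even when the producer never stops, which is the case small test data sets tend to hide.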



Stability Patterns




Circuit Breaker

          Ever seen a remote call wrapped with a retry loop?
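             int remainingAttempts = MAX_RETRIES;

             // Retries immediately after each failure, with no pause between attempts.
             while(--remainingAttempts >= 0) {
               try {
                 doSomethingDangerous();
                 return true;
               } catch(RemoteCallFailedException e) {
                 log(e);
               }
             }
             // Every attempt failed; the caller has waited through all of them.
             return false;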




                                                        Why?

Faults Cluster



          Fast retries are good for dropped packets
          (but let TCP do that)

          Most other faults require minutes to hours to correct

          Immediate retries very likely to fail again



Faults Cluster

          Problems with the remote host, application or
          the network will probably persist
          for a long time... minutes
          or hours




Bad for Users and Systems

          Users:                                  Systems:

          Wait longer to get an error response.   Ties up threads, reducing overall capacity.

          What happens after final retry?         Multiplies load on server, at the worst times.

                                                  Induces a Cascading Failure




Stop Banging Your Head

     Wrap a “dangerous” call
     Count failures
     After too many failures, stop passing calls
     After a “cooling off” period, try the next call
     If it fails, wait some more before calling again

     State machine (event / action):

     Closed
          on call / pass through
          call succeeds / reset count
          call fails / count failure
          threshold reached / trip breaker

     Open
          on call / fail
          on timeout / attempt reset

     Half-Open
          on call / pass through
          call succeeds / reset
          call fails / trip breaker

     Transitions
          Closed to Open: trip breaker (pop)
          Open to Half-Open: attempt reset
          Half-Open to Closed: reset
          Half-Open to Open: trip breaker (pop)
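The Closed, Open, and Half-Open states on this slide could be sketched along these lines. This is a minimal illustration under stated assumptions, not the author's implementation: the class, its constructor parameters, and the injectable clock are all invented here, only unchecked exceptions count as failures, and several callers may pass through at once while the breaker is half-open.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of the three-state breaker described on the slide.
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    // Thrown instead of making the call while the breaker is open.
    static final class OpenException extends RuntimeException {
        OpenException() { super("circuit breaker is open"); }
    }

    private final int failureThreshold;
    private final long coolOffMillis;
    private final LongSupplier clock; // millisecond clock, injectable for tests
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    CircuitBreaker(int failureThreshold, long coolOffMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolOffMillis = coolOffMillis;
        this.clock = clock;
    }

    synchronized State state() { return state; }

    <T> T call(Supplier<T> dangerous) {
        synchronized (this) {
            if (state == State.OPEN) {
                if (clock.getAsLong() - openedAt < coolOffMillis) {
                    throw new OpenException(); // Open: on call / fail
                }
                state = State.HALF_OPEN;       // Open: on timeout / attempt reset
            }
        }
        try {
            // The lock is not held during the call, so a hung call
            // cannot block other threads on the breaker itself.
            T result = dangerous.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        failures = 0;                          // call succeeds / reset
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++;                            // call fails / count failure
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;                // trip breaker
            openedAt = clock.getAsLong();
        }
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`, and state changes would be logged or exposed to monitoring rather than kept private.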




Considerations


          Sever malfunctioning features
          Degrade gracefully on caller
          Critical work must be queued for later




Remember This

       Stop doing it if it hurts.
       Expose, monitor, track, and report state changes


       Good against: Cascading Failures, Slow Responses
       Works with:       Use Timeouts




Bulkheads

         Partition the system
         Allow partial failure without losing service
         Applies at different granularity levels




Common Mode Dependency


                                 Foo            Bar




                                         Baz



                           Foo and Bar are coupled via Baz

With Bulkheads

                                        Foo            Bar




                                        Baz            Baz
                                       Pool 1         Pool 2

                                                Baz

                        Foo and Bar have dedicated resources from Baz.

Remember This

          Save part of the ship
          Decide if less efficient use of resources is OK
          Pick a useful granularity
          Very important with shared-service models
          Monitor each partition’s performance to SLA
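The Foo/Bar/Baz picture might translate into code like this: each caller of Baz gets its own fixed allowance of concurrent calls, so one caller exhausting its share cannot starve the other. The `Bulkhead` class and its names are assumptions made for this sketch; separate thread pools or connection pools per caller are other ways to draw the same partition.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: a fixed number of concurrent calls per partition.
final class Bulkhead {
    // Thrown when the partition has no free capacity.
    static final class FullException extends RuntimeException {
        FullException(String name) { super("bulkhead full: " + name); }
    }

    private final String name;
    private final Semaphore permits;

    Bulkhead(String name, int maxConcurrentCalls) {
        this.name = name;
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the work if a permit is free; otherwise rejects at once
    // instead of queueing behind calls that may be hung.
    <T> T call(Supplier<T> work) {
        if (!permits.tryAcquire()) {
            throw new FullException(name);
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }
}
```

With one `Bulkhead` instance for Foo's calls into Baz and another for Bar's, a flood from Foo is rejected at Foo's partition while Bar keeps being served. The cost is the one the slide names: some capacity sits idle in each partition.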




Test Harness



        Real-world failures are hard to create in QA
        Integration tests work for “in-spec” errors,
        but not “out-of-spec” errors.



“In Spec” vs. “Out of Spec”
                                   Example: Request-Reply using XML over HTTP


        “In Spec” failures                                “Out of Spec” failures

        TCP connection refused                            TCP connection accepted, but no data sent
        HTTP response code 500                            TCP window full, never cleared
        Error message in XML response                     Server replies with “EHLO”
                                                          Server sends link farm HTML
                                                          Server streams Weird Al mp3s

                        Well-Behaved Errors                                Wicked Errors

“Out-of-spec” errors
                        happen all the time in the
                              real world.

                          They never happen
                            during testing...

                        unless you force them to.

Killer Test Harness

                        Daemon listening on network
                        Substitutes for the remote end of an interface
                        Can run locally (dev) or remotely (dev or QA)
                        Is totally evil




Just a Few Evil Ideas
                   Port                                           Nastiness
                   19720   Allows connection requests into the queue, but never accepts them.

                   19721   Refuses all connections

                   19722   Reads requests at 1 byte / second

                   19723   Reads HTTP requests, sends back random binary

                   19724   Accepts requests, sends responses at 1 byte / sec.

                   19725   Accepts requests, sends back the entire OS kernel image.

                   19726   Sends an endless stream of data from /dev/random


                                   Now those are some out-of-spec errors.
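The first row of the table (connections queue up but are never accepted) is about the simplest harness to sketch: open a listening socket and never call `accept()`. The `SilentListener` class and its `readTimesOut` probe are hypothetical names for this illustration, and it uses an OS-assigned port on loopback rather than 19720.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical harness: listens, lets the OS queue connections,
// and never accepts or answers them.
final class SilentListener implements AutoCloseable {
    private final ServerSocket server;

    private SilentListener(ServerSocket server) { this.server = server; }

    // Port 0 asks the OS for any free port; bound to loopback only.
    static SilentListener start() {
        try {
            return new SilentListener(
                new ServerSocket(0, 50, InetAddress.getLoopbackAddress()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int port() { return server.getLocalPort(); }

    @Override
    public void close() {
        try {
            server.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client-side probe: true if connecting or reading timed out
    // rather than returning data. Without setSoTimeout the read
    // would block indefinitely.
    static boolean readTimesOut(int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), port),
                timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            socket.getInputStream().read();
            return false; // data or end-of-stream arrived: not the case under test
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            return false; // some other failure, such as connection refused
        }
    }
}
```

Pointing the real client code at `port()` instead of the probe is the actual test: it shows whether the caller has a timeout at all.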

Remember This

          Force out-of-spec failures
          Stress the caller
          Build reusable harnesses for L1-L6 errors
          Supplement, don’t replace, other testing methods




[Diagram: how the antipatterns and patterns relate to one another.

 Antipatterns shown: Scaling Effects, SLA Inversion, Attacks of Self-Denial, Users, Blocked Threads,
 Chain Reactions, Integration Points, Slow Responses, Cascading Failures, Unbalanced Capacities,
 Unbounded Result Sets.

 Patterns shown: Decoupling Middleware, Bulkheads, Test Harness, Steady State, Circuit Breaker,
 Fail Fast, Handshaking, Use Timeouts.

 Edge labels: counters, mitigates, reduces impact, lead to / leads to, exacerbates, found near,
 finds problems in, mutual aggravation, results from violating, damage, avoids, prevents,
 works with, can avoid.]

 
Abuse prevention in the globally distributed economy presentation
Abuse prevention in the globally distributed economy presentationAbuse prevention in the globally distributed economy presentation
Abuse prevention in the globally distributed economy presentation
 
A web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationA web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentation
 
WordPress Optimization - WordCampLA 09-10-11
WordPress Optimization - WordCampLA 09-10-11WordPress Optimization - WordCampLA 09-10-11
WordPress Optimization - WordCampLA 09-10-11
 

Recently uploaded

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Stability patterns presentation

  • 1. Stability Patterns …and Antipatterns. Michael Nygard, mtnygard@thinkrelevance.com, @mtnygard. © Michael Nygard, 2007-2012. Saturday, June 23, 12
  • 2. Stability Antipatterns
  • 3. Integration Points: Integrations are the #1 risk to stability. Every out-of-process call can and will eventually kill your system. Yes, even database calls.
  • 4. Example: Wicked database hang
  • 5. “In Spec” vs. “Out of Spec”. Example: Request-Reply using XML over HTTP. “In Spec” failures (well-behaved errors): TCP connection refused; HTTP response code 500; error message in XML response. “Out of Spec” failures (wicked errors): TCP connection accepted, but no data sent; TCP window full, never cleared; server replies with “EHLO”; server sends link farm HTML; server streams Weird Al mp3s.
  • 6. Remember This: Necessary evil. Peel back abstractions. Large systems fail faster than small ones. Useful patterns: Circuit Breaker, Use Timeouts, Use Decoupling Middleware, Handshaking, Test Harness.
  • 7. Chain Reaction: Failure moves horizontally across tiers. Common in search engines and app servers.
  • 8. Remember This: One server down jeopardizes the rest. Hunt for Resource Leaks. Useful pattern: Bulkheads.
  • 9. Cascading Failure: Failure moves vertically across tiers. Common in enterprise services & SOA.
  • 10. Remember This: “Damage Containment”. Stop cracks from jumping the gap. Scrutinize resource pools. Useful patterns: Use Timeouts, Circuit Breaker.
  • 11. Users: Too many, too clicky. Some malicious users. Buyers. Front-page viewers. Screen scrapers.
  • 12. Handle Traffic Surges Gracefully: Degrade features automatically. Shed load. Don’t keep sessions for bots. Reduce per-user burden: IDs, not object graphs; query parameters, not result sets.
  • 13. Blocked Threads: All request threads blocked = “crash”. Impossible to test away. Learn to use java.util.concurrent or System.Threading. (Ruby & PHP coders, just avoid threads completely.)
  • 14. Pernicious and Cumulative: Hung request handlers = less capacity. Hung request handler = frustrated user/caller. Each remaining thread serves 1/(N-1) extra requests.
  • 15. Example: Blocking calls
      In a request-processing method:
          String key = (String)request.getParameter(PARAM_ITEM_SKU);
          Availability avl = globalObjectCache.get(key);
      In GlobalObjectCache.get(String id), a synchronized method:
          Object obj = items.get(id);
          if(obj == null) {
              obj = strategy.create(id);
          }
          …
      In the strategy:
          public Object create(Object key) throws Exception {
              return omsClient.getAvailability(key);
          }
  • 16. Remember This: Use proven constructs. Don’t wait forever. Scrutinize resource pools. Beware the code you cannot see. Useful patterns: Use Timeouts, Circuit Breaker.
  • 17. Attacks of Self-Denial: BestBuy: XBox 360 Preorder; Amazon: XBox 360 Discount; Victoria’s Secret: Online Fashion Show; anything on FatWallet.com.
  • 18. Defenses: Avoid deep links. Static landing pages. CDN diverts or throttles users. Shared-nothing architecture. Session only on 2nd click. Deal pool.
  • 19. Remember This: Open lines of communication. Support your marketers.
  • 20. Unbalanced Capacities (diagram): Customers; Online Store: 20 hosts, 75 instances, 3,000 threads; Order Management: 6 hosts, 6 instances, 450 threads; Scheduling: 1 host, 1 instance, 25 threads; SiteScope NYC; SiteScope San Francisco.
  • 21. Scaling Ratios
                          Dev      QA       Prod
      Online Store        1/1/1    2/2/2    20/300/6
      Order Management    1/1/1    2/2/2    4/6/2
      Scheduling          1/1/1    2/2/2    4/2
  • 22. Unbalanced Capacities: Scaling effect between systems. Sensitive to traffic & behavior patterns. Stress both sides of the interface in QA. Simulate back end failures during testing.
  • 23. SLA Inversion Partner 1's Partner 1's Message Message Application DNS Queues Broker No SLA 99% 99.99% 99% Pricing and Inventory Frammitz Promotions 99.9% 99.99% No SLA SpamCannon's SpamCannon's Corporate MTA Corporate DNS DNS Applications 99.999% 99.9% 98.5% 99% What SLA can Frammitz really guarantee?
  • 24. Remember This: No empty promises. Monitor your dependencies. Decouple from your dependencies. Measure availability by feature, not by server. Beware infrastructure services: DNS, SMTP, LDAP.
  • 25. Unbounded Result Sets: Development and testing are done with small data sets. Test databases get reloaded frequently. Queries often bonk badly with production data volume.
  • 26. Unbounded Result Sets: Databases. SQL queries have no inherent limits. ORM tools are bad about this. Appears as slow performance degradation.
  • 27. Unbounded Result Sets: SOA. Chatty remote protocols, N+1 query problem. Hurts caller and provider. Caller is naive, trusts server not to hurt it.
  • 28. Remember This: Test with realistic data volumes. Don’t trust data producers. Put limits in your APIs.
  • 29. Stability Patterns
  • 30. Circuit Breaker: Ever seen a remote call wrapped with a retry loop?
          int remainingAttempts = MAX_RETRIES;
          while(--remainingAttempts >= 0) {
              try {
                  doSomethingDangerous();
                  return true;
              } catch(RemoteCallFailedException e) {
                  log(e);
              }
          }
          return false;
      Why?
  • 31. Faults Cluster: Fast retries are good for dropped packets (but let TCP do that). Most other faults require minutes to hours to correct. Immediate retries are very likely to fail again.
  • 32. Faults Cluster: Problems with the remote host, application, or the network will probably persist for a long time... minutes or hours.
  • 33. Bad for Users and Systems. Users: Wait longer to get an error response. What happens after final retry? Systems: Ties up threads, reducing overall capacity. Multiplies load on server, at the worst times. Induces a Cascading Failure.
  • 34. Stop Banging Your Head: Wrap a “dangerous” call. Count failures. After too many failures, stop passing calls. After a “cooling off” period, try the next call. If it fails, wait some more before calling again.
      State diagram:
          Closed: on call / pass through; call succeeds / reset count; call fails / count failure; threshold reached / trip breaker
          Open: on call / fail; on timeout / attempt reset
          Half-Open: on call / pass through; call succeeds / reset; call fails / trip breaker
          Transitions: Closed to Open (pop), Open to Half-Open (attempt reset), Half-Open to Open (pop), Half-Open to Closed (reset)
  • 35. Considerations: Sever malfunctioning features. Degrade gracefully on caller. Critical work must be queued for later.
  • 36. Remember This: Stop doing it if it hurts. Expose, monitor, track, and report state changes. Good against: Cascading Failures, Slow Responses. Works with: Use Timeouts.
  • 37. Bulkheads: Partition the system. Allow partial failure without losing service. Applies at different granularity levels.
  • 38. Common Mode Dependency (diagram: Foo, Bar, Baz): Foo and Bar are coupled via Baz.
  • 39. With Bulkheads (diagram: Foo, Bar, and Baz, with Baz split into Baz Pool 1 and Baz Pool 2): Foo and Bar have dedicated resources from Baz.
  • 40. Remember This: Save part of the ship. Decide if less efficient use of resources is OK. Pick a useful granularity. Very important with shared-service models. Monitor each partition’s performance to SLA.
  • 41. Test Harness: Real-world failures are hard to create in QA. Integration tests work for “in-spec” errors, but not “out-of-spec” errors.
  • 42. “In Spec” vs. “Out of Spec”. Example: Request-Reply using XML over HTTP. “In Spec” failures (well-behaved errors): TCP connection refused; HTTP response code 500; error message in XML response. “Out of Spec” failures (wicked errors): TCP connection accepted, but no data sent; TCP window full, never cleared; server replies with “EHLO”; server sends link farm HTML; server streams Weird Al mp3s.
  • 43. “Out-of-spec” errors happen all the time in the real world. They never happen during testing... unless you force them to.
  • 44. Killer Test Harness: Daemon listening on network. Substitutes for the remote end of an interface. Can run locally (dev) or remotely (dev or QA). Is totally evil.
  • 45. Just a Few Evil Ideas
      Port     Nastiness
      19720    Allows connection requests into the queue, but never accepts them.
      19721    Refuses all connections.
      19722    Reads requests at 1 byte / second.
      19723    Reads HTTP requests, sends back random binary.
      19724    Accepts requests, sends responses at 1 byte / sec.
      19725    Accepts requests, sends back the entire OS kernel image.
      19726    Sends an endless stream of data from /dev/random.
      Now those are some out-of-spec errors.
  • 46. Remember This: Force out-of-spec failures. Stress the caller. Build reusable harnesses for L1-L6 errors. Supplement, don’t replace, other testing methods.
  • 47. Scaling Effects SLA Inversion mitigates counters Attacks of Decoupling reduces impact Self-Denial Middleware lead to counters Users exacerbates counters Bulkheads Blocked Threads Test Harness found leads to finds problems in near mutual Chain Reactions aggravation Integration Points results from counters violating damage leads to leads to Steady State counters avoids Slow Responses Cascading Failures counters prevents leads to counters Circuit Breaker Unbalanced leads to Capacities Fail Fast counters works with counters can avoid Unbounded Handshaking Result Sets Use Timeouts
  • 48. Michael Nygard, mtnygard@thinkrelevance.com, @mtnygard. © Michael Nygard, 2007-2012
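
The Closed / Open / Half-Open state machine described on the Circuit Breaker slides could be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not code from the talk: the class name `CircuitBreaker`, the parameters `failureThreshold` and `coolingOffMillis`, the `OpenCircuitException` type, and the injectable clock are all hypothetical names chosen for the example.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Illustrative circuit breaker: Closed -> Open -> Half-Open -> Closed. */
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Thrown instead of making the call while the breaker is open (hypothetical type). */
    static class OpenCircuitException extends RuntimeException {
        OpenCircuitException() { super("circuit breaker is open"); }
    }

    private final int failureThreshold;   // assumed: consecutive failures before tripping
    private final long coolingOffMillis;  // assumed: how long to stay open before a retry
    private final LongSupplier clock;     // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0L;

    CircuitBreaker(int failureThreshold, long coolingOffMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolingOffMillis = coolingOffMillis;
        this.clock = clock;
    }

    synchronized State state() { return state; }

    /** Wraps a "dangerous" call. The lock is never held while the call itself runs. */
    <T> T call(Callable<T> dangerousCall) throws Exception {
        beforeCall();
        try {
            T result = dangerousCall.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void beforeCall() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt >= coolingOffMillis) {
                // Cooling-off period is over: attempt reset by letting calls through.
                // Simplification: concurrent callers may all pass while half-open.
                state = State.HALF_OPEN;
            } else {
                throw new OpenCircuitException(); // fail fast, do not touch the remote side
            }
        }
    }

    private synchronized void onSuccess() {
        failureCount = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;           // trip the breaker
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(2, 30_000L, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            try {
                breaker.call(() -> { throw new java.io.IOException("remote call failed"); });
            } catch (Exception e) {
                System.out.println(e.getClass().getSimpleName() + ", breaker is " + breaker.state());
            }
        }
    }
}
```

In the `main` demo the first two calls reach the failing remote call and trip the breaker; the third is rejected with `OpenCircuitException` without invoking the remote call at all. Passing the clock in as a `LongSupplier` is a design choice for the sketch: it lets a test advance time past the cooling-off period without sleeping.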