
Google broke its own cloud, again, with dud DB config change

Memcache was gone in 20 seconds and down for nearly two hours

Google's again 'fessed up to cooking its own cloud.

This time the mess was brief – just under two hours last Monday – and took down its Memcache service. The result was “Managed VMs experienced failures of all HTTP requests and App Engine API calls during this incident.”

There's a little upside in the fact that Google now offers a replacement for Managed VMs, the “App Engine Flexible Environment”, and it stayed up.

But there's also egg on Google's face, because the wound was self-inflicted. Here's Google's root cause explanation:

The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters. The configuration which maps applications to datacenters is stored in a global database.

The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds. When the configuration could not be fetched by clients, Memcache became unavailable.
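In other words, clients cache the application-to-datacenter mapping and treat it as invalid if it cannot be refreshed within 20 seconds. Here's a minimal, hypothetical sketch of that pattern in Python – the names `DatacenterConfigCache` and `fetch_config` are illustrative assumptions, not Google's actual code:

```python
import time

CONFIG_TTL_SECONDS = 20  # refresh window described in the incident report


class ConfigUnavailableError(Exception):
    """Raised when the datacenter mapping cannot be fetched or has expired."""


class DatacenterConfigCache:
    def __init__(self, fetch_config):
        # fetch_config is a hypothetical callable that reads the
        # application -> serving-datacenter mapping from the global database.
        self._fetch_config = fetch_config
        self._config = None
        self._last_refresh = 0.0

    def refresh(self):
        """Periodically called to re-read the mapping from the global database."""
        try:
            self._config = self._fetch_config()
            self._last_refresh = time.monotonic()
        except Exception:
            # Keep the stale mapping; it will expire on its own below.
            pass

    def serving_datacenter(self, app_id):
        age = time.monotonic() - self._last_refresh
        if self._config is None or age > CONFIG_TTL_SECONDS:
            # The mapping is stale, so strong consistency can no longer be
            # guaranteed: fail the request rather than serve from the wrong
            # datacenter -- the failure mode that took Memcache down.
            raise ConfigUnavailableError("datacenter mapping is stale")
        return self._config[app_id]
```

Failing closed like this protects consistency, but it also means a single unreadable database entity can take the whole service offline within one refresh window.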

As you can see, whatever the update to the database did, it broke Memcache. And for all of Google's vaunted site reliability engineering expertise, this part of its cloud could fall over in 20 seconds.

It gets worse: Google's account of the incident also reports that it tried to roll back the database to its previous state, but doing so “required an update to the configuration in the global database, which also failed.” A workaround was eventually found.

The Register knows of at least six other occasions on which Google's own actions crocked its cloud. Those incidents include a typo, human error, automated services going down and missing a BGP problem, another bad configuration change, a bad patch and a failed attempt to patch two problems at once.

None of the outages were lengthy, or as embarrassing as IBM's load balancers disappearing because it mis-registered a domain name or AWS's major S3 outage caused by a typo. But they still happened. And keep happening. ®
