On Monday morning, Google services like Gmail and Drive were down for about 45 minutes, leaving many Workspace users unable to do their work. In the aftermath of the incident, Google promised it would conduct an investigation into what happened. In a post spotted by 9to5Google, it has now shared its findings.
At the center of the outage was a migration involving Google's User ID Service, which handles authenticating account credentials. The problem originated in October, when the company moved the service to a new system for allocating resources while leaving parts of the old one in place.
Those leftover components incorrectly reported the service's usage as zero. The outage would have occurred earlier if not for a grace period the company had put in place. Unfortunately, that grace period expired, and Google's automated systems started to act on the erroneous reading as if it were real. Google had safeguards in place to prevent those types of issues, but they weren’t built to handle the exact case that occurred on Monday morning.
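To make the failure mode concrete, here is a minimal sketch of how a grace period can hide a bad usage reading until it expires. It is only an illustration and not Google's actual system: the names `QuotaState`, `next_limit`, and the `floor` safeguard are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QuotaState:
    limit: int             # quota currently granted to the service
    grace_deadline: float  # time before which usage reports cannot lower the limit


def next_limit(state: QuotaState, reported_usage: int, now: float, floor: int = 0) -> int:
    """Return the limit to enforce after a usage report (hypothetical policy)."""
    if now < state.grace_deadline:
        # Grace period still active: keep the existing limit regardless of the report.
        return state.limit
    # Grace period over: shrink the limit toward reported usage,
    # but never below the (hypothetical) safety floor.
    return max(floor, min(state.limit, reported_usage))


state = QuotaState(limit=1000, grace_deadline=100.0)
during_grace = next_limit(state, reported_usage=0, now=50.0)
after_grace = next_limit(state, reported_usage=0, now=150.0)
with_floor = next_limit(state, reported_usage=0, now=150.0, floor=200)
```

In this sketch, an erroneous report of zero usage has no effect while the grace period lasts, then cuts the limit to zero once it ends. A minimum floor is one kind of safeguard that would keep the limit from reaching zero.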
“We would like to apologize for the scope of impact that this incident had on our customers and their businesses,” Google said. “We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span multiple regions.”
While the company’s engineers were able to address the problem relatively quickly, Google says it plans to implement new measures to prevent a similar situation in the future. In particular, one of its goals is to do a better job of communicating when an outage takes out its services. It also plans to improve its monitoring systems so that it can catch incorrect configurations sooner.