Incident Post Mortem: June 25–26, 2019
During the week of June 24, Coinbase.com had two periods of degraded service. On June 25, for a 30 minute period, most customers experienced error messages that prevented them from completing buys, sells, and trades. On June 26, for a 32 minute period, about a third of customers experienced error messages across the site and mobile apps.
Below is a detailed description of the cause as well as the changes we have implemented to prevent any similar service degradations.
June 25

From 15:16 PT to 15:46 PT (22:16–22:46 UTC), buy/sell/trade functionality on Coinbase.com was severely degraded with a 97% error rate. In other words, 97% of buy/sell/trade requests received an error response.
This severe service degradation was caused by an automatic failover of a MongoDB cluster that powers fraud-prevention logic, triggered as part of scheduled maintenance. Cluster failovers for scheduled maintenance are typically instantaneous and have minimal customer impact. The failover for this particular cluster lasted 25 minutes due to its size and replication configuration. During this time, queries and commands to the cluster could not be completed. Because the cluster is queried during the request/response cycle for submitting buys, sells, and trades, requests to those endpoints errored.
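To make the failure mode concrete, here is a minimal sketch (in Python with pymongo; the URI, database, and collection names are hypothetical, not our production code) of a fraud check that sits directly in the request path. While the cluster has no primary during a prolonged failover, writes and primary reads fail once server selection times out, and those errors surface to the customer-facing endpoints:

```python
# Minimal illustrative sketch: a fraud check queried synchronously in the
# request path. Names and thresholds are hypothetical.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

# Bound how long the driver waits for a usable primary so requests fail
# fast instead of piling up while an election is in progress.
client = MongoClient(
    "mongodb://fraud-cluster.example.com",  # hypothetical URI
    serverSelectionTimeoutMS=2000,
    retryWrites=True,
)
fraud_checks = client.fraud.checks

def check_trade(user_id: str, amount: float) -> bool:
    try:
        doc = fraud_checks.find_one({"user_id": user_id})
        return doc is None or doc.get("risk_score", 0) < amount
    except PyMongoError:
        # During a prolonged failover, calls like this error out, and the
        # error propagates to the buy/sell/trade endpoint.
        raise
```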
We are currently re-architecting trade functionality to move usage of this cluster out of the request path. We are also updating configuration to ensure failovers will be instantaneous going forward. We have ensured that failovers for this cluster may only be initiated during rare, scheduled downtime, when there will be no impact on customers.
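One common way to take a datastore out of the request path, shown here only as an illustrative sketch with an assumed Redis-backed queue rather than our actual architecture, is to enqueue the fraud review and let a background worker complete it asynchronously, so a slow failover delays the review instead of failing the customer's request:

```python
# Illustrative sketch only: enqueue the fraud review and process it outside
# the request/response cycle. Queue backend and names are assumptions.
import json
import redis  # assumed queue backend for this sketch

queue = redis.Redis(host="localhost", port=6379)

def submit_trade(user_id: str, amount: float) -> dict:
    trade = {"user_id": user_id, "amount": amount, "status": "pending_review"}
    # Enqueue instead of querying the fraud cluster synchronously.
    queue.rpush("fraud_review_queue", json.dumps(trade))
    return trade

def fraud_review_worker():
    # Runs in a background process; tolerates slow cluster failovers.
    while True:
        _, raw = queue.blpop("fraud_review_queue")
        trade = json.loads(raw)
        # ... query the fraud cluster and update the trade's status ...
```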
June 26

From 13:37 PT to 14:09 PT (20:37–21:09 UTC), Coinbase.com experienced sustained error rates across all endpoints. Error rates hovered around 35% for the duration of the incident.
This moderate service degradation was caused by sustained increased query latency on a MongoDB cluster that stores user account data. There were several contributing factors that led to increased query latency and degraded service:
- Before the incident began, a background job performed an aggregation for a large number of accounts, causing excessive reads into cache and cache evictions for the cluster's storage engine cache.
- Soon after, a large number of real-time price alerts were triggered, causing significantly increased query throughput for this cluster.
- Increased query throughput, combined with the prior cache evictions, caused additional cache pressure and resulted in query queueing and increased query latency. These are initial conclusions; in collaboration with MongoDB, our investigation is ongoing.
- Due to increased query latency during the request/response cycle, web workers became saturated, serving HTTP 502s.

[Image: storage engine cache activity for the affected cluster]

The incident was resolved when we manually failed over the affected cluster to instances with more memory. Since the incident, we've done the following:
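For readers who want to observe this kind of cache pressure themselves, the WiredTiger cache statistics exposed by serverStatus are a reasonable starting point. The sketch below is illustrative only; the URI is hypothetical and exact stat names can vary between MongoDB versions:

```python
# Illustrative sketch: inspect WiredTiger cache pressure via serverStatus.
from pymongo import MongoClient

client = MongoClient("mongodb://user-accounts-cluster.example.com")  # hypothetical URI
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

used = cache["bytes currently in the cache"]
limit = cache["maximum bytes configured"]
dirty = cache["tracked dirty bytes in the cache"]
app_evictions = cache["pages evicted by application threads"]

# A cache that stays near its configured maximum while application threads
# perform their own evictions suggests the working set no longer fits in
# memory, the kind of pressure described above.
print(f"cache fill:  {used / limit:.1%}")
print(f"dirty ratio: {dirty / limit:.1%}")
print(f"pages evicted by application threads: {app_evictions}")
```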
- We've audited working set size across all clusters, resizing instances and/or reducing the working set size as necessary.
- We've removed the background job and are performing an audit of similar queries, moving them to analytics nodes (a sketch of this routing follows below).
- We've tuned certain high-throughput endpoints that are hit hardest during price alerts.
- We're continuing ongoing work to reduce load on MongoDB through caching and reading from secondary data stores that can be scaled horizontally.

We take uptime very seriously, and we're working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency. If you're interested in solving scaling challenges like those presented here, come work with us.
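The analytics-node routing mentioned in the list above can be expressed with a tagged secondary read preference. The sketch below is illustrative only; it assumes Atlas-style analytics nodes tagged nodeType: ANALYTICS, and the URI, database, and pipeline are hypothetical:

```python
# Illustrative sketch: route a heavy aggregation to tagged analytics nodes
# so it does not compete for cache with request-path queries on the primary.
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://user-accounts-cluster.example.com")

analytics_accounts = client.coinbase.accounts.with_options(
    read_preference=Secondary(tag_sets=[{"nodeType": "ANALYTICS"}])
)

# The background aggregation that previously churned the primary's cache
# now reads only from analytics-tagged members.
pipeline = [
    {"$match": {"active": True}},
    {"$group": {"_id": "$currency", "total": {"$sum": "$balance"}}},
]
for row in analytics_accounts.aggregate(pipeline):
    print(row)
```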
Unless otherwise noted, all images provided herein are by Coinbase.