We’ve had several brief outages and periods of degraded performance over the last few weeks. I’d like to shed some light on what caused these incidents and what we’re doing to prevent them in the future.
As you may know, one of Pivotal Tracker’s core features is that your view of your project is always up to date; there’s no need to refresh your browser page. If one member of the project pushes a start button on a story, for example, everyone else sees the change immediately. This is an important aspect of keeping the entire team focused and on the same page.
Under the covers, this is accomplished via polling: the browser sends a request every few seconds, basically asking whether there is anything new. Given the large number of users out there, this translates to approximately 1000 requests per second.
Most of these requests never hit any of our application servers; they go straight to a very fast in-memory cache (in the form of multiple memcached processes). Only requests that involve a “stale” response (meaning there are changes to return to the client) make their way to an application server. These represent a very small fraction of all requests.
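As a rough illustration of that routing (not Tracker’s actual code), here is a minimal Python sketch. It assumes the cache stores a latest-version number per project and that each poll carries the version the browser last saw. The names `version_cache`, `handle_poll` and `fetch_changes` are hypothetical, and a plain dictionary stands in for memcached.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for the memcached tier: project id -> latest change version.
# (Hypothetical layout; the real cache contents are not described here.)
version_cache: Dict[int, int] = {}


def handle_poll(
    project_id: int,
    client_version: int,
    fetch_changes: Callable[[int, int], List[str]],
) -> Tuple[str, List[str]]:
    """Answer one poll; returns (who answered, changes to send)."""
    cached = version_cache.get(project_id)
    if cached is not None and cached == client_version:
        # Client is already current: answered from memory, no app server involved.
        return ("cache", [])
    # Client is behind, or the cache has no entry (for example, it just
    # restarted): only now does the request reach an application server.
    return ("app", fetch_changes(project_id, client_version))
```

Note that in this sketch a missing cache entry falls through to the application server, which is the same path every request takes when the cache is unavailable.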
This architecture works well, but the in-memory cache is a critical component, and if it goes down or has any problems, the 1000 requests per second end up hitting the app servers, which are not designed to handle that kind of load. The requests end up backing up, and it takes a few minutes for the system to recover even if the caches are brought back up quickly.
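To see why recovery can take minutes after an outage of only seconds, here is a back-of-the-envelope model in Python. Only the 1000 requests per second figure comes from this post; the app-server capacity, the stale fraction and the outage length are made-up numbers for illustration, and the model ignores client timeouts and retries.

```python
def recovery_seconds(
    total_rps: float,
    app_capacity_rps: float,
    stale_fraction: float,
    outage_seconds: float,
) -> float:
    """Estimate how long the app servers need to drain the backlog
    that built up while the cache was unavailable."""
    # While the cache is down, every poll reaches the app servers;
    # anything above their capacity queues up.
    backlog = max(0.0, total_rps - app_capacity_rps) * outage_seconds
    # Once the cache is back, only the stale fraction reaches the app servers,
    # so the spare capacity is what drains the queue.
    normal_app_rps = total_rps * stale_fraction
    drain_rate = app_capacity_rps - normal_app_rps
    if drain_rate <= 0:
        raise ValueError("app servers have no spare capacity to drain the backlog")
    return backlog / drain_rate


# Hypothetical inputs: capacity of 200 req/s, 2% stale polls, 30-second outage.
estimate = recovery_seconds(1000, 200, 0.02, 30)
```

With those assumed inputs, a 30-second cache outage leaves a backlog of 24,000 requests that takes a little over two minutes to clear.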
Some of the recent outages involved the cache processes hitting a few different configuration-specified limits (related to connections and the virtualization layer). We also saw a similar issue with our load balancers, which route all of the traffic to the right places in the cluster.
In all cases, the problem was identified and resolved quickly, and Tracker was brought back to normal.
To reduce the likelihood of similar issues in the future, we’ve added more monitoring, and we’re making some changes to the environment. These include additional layers of redundancy for the cache and moving the cache processes from virtual hosts to dedicated bare metal machines. We’re also considering similar changes to other parts of the cluster, but we’re taking it one step at a time to avoid introducing too many changes all at once.
We’re also considering moving away from the polling architecture, which requires a continuous high traffic rate, to a push approach, via HTML5 WebSockets. This would reduce the number of requests dramatically, but the HTML5 WebSockets protocol is still being finalized, and only some browsers support it natively (Chrome 4 and Safari 5 currently). One option that we’re thinking about is a hybrid approach – WebSockets push for browsers that support it, falling back to polling when push is not supported.
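The control flow of such a hybrid could look roughly like the sketch below. In a browser this logic would live in JavaScript; Python is used here only to show the shape of the fallback. `PushUnsupported`, `open_update_channel`, `connect_push` and `start_polling` are hypothetical names, not part of any real library.

```python
from typing import Any, Callable, Tuple


class PushUnsupported(Exception):
    """Raised by the push connector when the client cannot use WebSockets."""


def open_update_channel(
    connect_push: Callable[[], Any],
    start_polling: Callable[[], Any],
) -> Tuple[str, Any]:
    """Prefer a push connection; fall back to polling if push is unavailable."""
    try:
        # One long-lived connection: the server sends changes as they happen.
        return ("push", connect_push())
    except PushUnsupported:
        # Older clients keep the existing behaviour: ask every few seconds.
        return ("polling", start_polling())
```

Only clients that fall back would keep generating the steady request stream, so the load on the cache would shrink as browser support for push grows.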
We apologize if you were inconvenienced by any of these brief outages – we certainly understand what it means to lose access to Tracker, even momentarily.