Thursday, January 29, 2015

A surprise in graphite monitoring

Here at the office, we have been using graphite to collect runtime metrics broadcast by our JVM processes.
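For anyone unfamiliar with the pipeline: a datapoint reaches carbon as a single "path value timestamp" line on the plaintext port (2003 by default).  The Python below is only a stand-in for what a sender does; the hostname and metric path are made up, and our JVM apps use their own client code for this.

    import socket
    import time

    # One datapoint in carbon's plaintext line format: "path value timestamp".
    # "graphite01" and the metric path are placeholders.
    line = "servers.app001.jvm.heap.used 123456789 %d\n" % int(time.time())

    sock = socket.create_connection(("graphite01", 2003))
    sock.sendall(line.encode("ascii"))
    sock.close()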

Some time back, we discovered that our single graphite collector (carbon-cache) could no longer keep up with the load.  We replaced it with a carbon-relay, which uses consistent-hashing to distribute the work to multiple carbon-cache instances.
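The carbon.conf changes looked roughly like this (ports and instance names here are examples, not our real ones).  Each [cache:X] section is a separate carbon-cache process, and the relay hashes every metric name to one of the DESTINATIONS, which point at the caches' pickle receiver ports.

    [cache:a]
    LINE_RECEIVER_PORT = 2103
    PICKLE_RECEIVER_PORT = 2104
    CACHE_QUERY_PORT = 7002

    [cache:b]
    LINE_RECEIVER_PORT = 2203
    PICKLE_RECEIVER_PORT = 2204
    CACHE_QUERY_PORT = 7102

    [relay]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2004
    RELAY_METHOD = consistent-hashing
    DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b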

To keep an eye on things, I created a dashboard to monitor the activity of the carbon processes.  It was comforting to see the workload spread across the cache instances, even as we traded a cpu bottleneck for an i/o bottleneck.

We later moved the whisper databases to an SSD, removing the i/o bottleneck.

One thing I had noticed: the load on the caches wasn't balanced; one carbon-cache instance was being asked to do more than its share of the work.  I ported the hash ring logic into an IPython Notebook to run some experiments, and discovered that my initial scheme for instance naming was pathologically bad.  Once the cache instances were renamed, the load became a lot more evenly balanced.
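For the curious, the notebook experiment amounted to something like the sketch below.  It's my rough approximation of the ring in carbon/hashing.py, not the real code, and the host and instance names are invented; the point is simply to count how a sample of metric names spreads across the instances under a given naming scheme.

    import bisect
    from collections import Counter
    from hashlib import md5

    class HashRing(object):
        """Rough approximation of carbon's consistent hash ring."""

        def __init__(self, nodes, replica_count=100):
            entries = []
            for node in nodes:
                # each node appears many times on the ring,
                # keyed by str(node) plus a replica index
                for i in range(replica_count):
                    entries.append((self._position("%s:%d" % (node, i)), node))
            entries.sort()
            self.positions = [position for position, _ in entries]
            self.nodes = [node for _, node in entries]

        @staticmethod
        def _position(key):
            # md5 the key and keep only the first 16 bits, as carbon does
            return int(md5(key.encode("utf-8")).hexdigest()[:4], 16)

        def get_node(self, metric):
            # first ring entry at or after the metric's position, wrapping around
            index = bisect.bisect_left(self.positions, self._position(metric)) % len(self.positions)
            return self.nodes[index]

    def distribution(instance_names, sample):
        ring = HashRing([("graphite01", name) for name in instance_names])
        return Counter(ring.get_node(metric)[1] for metric in sample)

    sample = ["servers.app%03d.jvm.heap.used" % i for i in range(1000)]
    print(distribution(["cache1", "cache2", "cache3", "cache4"], sample))
    print(distribution(["a", "b", "c", "d"], sample))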

I was disappointed that my graph seemed to be a bit hit-or-miss on the most recent datapoints for the metrics.  graphite-web first tries to fetch metrics from disk, then queries the caches for any points that haven't been flushed to disk yet.  Clearly, something was wrong with that lookup.
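The cache lookup itself is a small length-prefixed pickle exchange against each cache's CACHE_QUERY_PORT.  The sketch below is my understanding of it; the port and metric name are assumptions, and the wire details may differ between graphite versions.

    import pickle
    import socket
    import struct

    def recv_exactly(sock, n):
        # read exactly n bytes, or fail if the connection drops
        data = b""
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise EOFError("connection closed mid-response")
            data += chunk
        return data

    def query_cache(host, port, metric):
        # ask one carbon-cache for the datapoints it is still holding in memory
        request = pickle.dumps({"type": "cache-query", "metric": metric}, protocol=2)
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(struct.pack("!L", len(request)) + request)
            length = struct.unpack("!L", recv_exactly(sock, 4))[0]
            response = pickle.loads(recv_exactly(sock, length))
            return response.get("datapoints", [])   # [(timestamp, value), ...]
        finally:
            sock.close()

    print(query_cache("127.0.0.1", 7002, "servers.app001.jvm.heap.used"))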

A review of local_settings.py revealed that I had not updated the instance names in CARBONLINK_HOSTS, which meant that when determining which cache should hold a metric, the webapp was using the wrong hash ring.  This would sometimes work (if the metric name happened to map to the same instance in both schemes, just by luck).  So I scrambled off to update the setting and restarted the server.
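The corrected setting looked something like this (ports and instance letters are examples only); each entry is host:cache-query-port:instance, and the instance has to match the [cache:X] section names in carbon.conf.

    CARBONLINK_HOSTS = [
        "127.0.0.1:7002:a",
        "127.0.0.1:7102:b",
        "127.0.0.1:7202:c",
        "127.0.0.1:7302:d",
    ]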

No improvement.  What's going on?

My theory is this: graphite-web uses the hash ring to determine which cache should hold the metric.  But the carbon metrics (the statistics each carbon-cache reports about itself) are always local to the cache that produces them, regardless of where their names hash.  So trying to find them via the hash ring is going to fail more often than not, which means the webapp cannot read those metrics until they have been flushed to disk.
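Reusing the HashRing sketch from earlier, you can see the mismatch: carbon names its self-metrics carbon.agents.<host>-<instance>.*, but the ring routes those names wherever they happen to hash.  The hostname below is made up.

    instances = [("graphite01", name) for name in "abcd"]
    ring = HashRing(instances)
    for host, instance in instances:
        metric = "carbon.agents.graphite01-%s.metricsReceived" % instance
        picked = ring.get_node(metric)
        print("%s is written by cache %s, but carbonlink asks %s"
              % (metric, instance, picked[1]))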