'Transient' Issues
By justin
Doctors today have access to dozens of tools for diagnosing a panoply of medical issues. We have blood tests that can detect hundreds of different conditions and diseases, imaging that can see into different parts of your organs, and we can even detect if you have COVID from shit water (risky click!). Despite all the tools that doctors have available, there is still a set of conditions that are reached through what’s known as a “Diagnosis of Exclusion.”
A diagnosis of exclusion is exactly what it sounds like. A series of tests for known diseases is performed, perhaps some blood work, some imaging, etc… and if all of those come back negative, you’re given the unsatisfying diagnosis of a condition that can only be explained by ruling out every other possibility. IBS (sticking with the shit theme here) is diagnosed in such a manner. People with IBS often never find out the underlying cause of their condition. Before I go further, I just want to clarify: these conditions are very real for people who suffer from them, despite the lack of a specific test to pinpoint a diagnosis. But sometimes doctors reach a “diagnosis of exclusion” conclusion prematurely. This happens when they aren’t aware of the existence of an underlying condition.
I have a friend who was diagnosed with IBS (Note: I swear I’m going somewhere with all this). They kept getting second opinions, even after a gauntlet of tests that ruled out the large majority of conditions. Eventually, my friend found a doctor who suggested an elimination diet. They kept a strict diary of everything they ate for a few weeks and noted whenever they got sick. Looking through the food journal, the doctor saw a common ingredient in the foods they were eating when they got sick: they all had red food dye in them. This is a pretty rare allergy! The doctor was able to verify it with a food challenge. Now they avoid any foods that are artificially dyed red and no longer have any symptoms of IBS. Had my friend not found a doctor willing to dive deeper, they would still be suffering from flare-ups and taking unnecessary medications.
Like doctors, engineers today have access to dozens of diagnostic tools. We have tools for high-level observation of system resources, tools for tracing kernel functions with minimal overhead, and we even have tools to detect memory leaks from servers’ (heap) dumps! But despite all the tools available to engineers, we too have cases of Diagnosis of Exclusion. Too often, when we can’t immediately explain why something occurred, we jump to the conclusion that it must be a transient issue.
We tend to think of a transient issue as a weird blip in the matrix and assume it was a one-off event that won’t happen again. The stars aligned in a weird way, a solar flare caused a bit to flip somewhere, and now everything is back to normal because the Nagios monitor recovered after you hit retry.
And again like doctors, we sometimes reach for a conclusion a bit too early. It’s easy to explain, and it doesn’t require us to dig further, since we’re operating under the assumption that it won’t happen again. Because the alert resolved on its own or that failed test succeeded on retry, we just want to go back to what we were previously doing. In my time as an SRE/Sys Admin/Web Ops Engineer/Devops/Nerd, I’ve seen these mythical transient issues pop up, and every time we’ve dug deeper into one of them, we’ve discovered an underlying cause.
First, let’s separate transient issues from issues for which we don’t know an underlying cause. A transient issue does have an underlying cause; it’s just that the cause was a temporary event (BGP flapping, a server removed from the load balancer, etc.). However, it’s important to be able to identify the event first. Even if no action is required by the operator, we should verify that the systems behaved as expected. If your organization is SLO driven, maybe you’re okay with the occasional transient issue eating into your error budget. An issue can be transient and not have an immediately known underlying cause.
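To make “eating into your error budget” concrete, here’s a rough back-of-the-envelope sketch. The SLO target and incident duration are made-up numbers, and it assumes a simple availability SLO measured over a 30-day window:

```python
# Rough sketch: how one brief "transient" blip eats into a monthly error budget.
# The SLO target and incident length are hypothetical.

SLO_TARGET = 0.999                # 99.9% availability
MINUTES_IN_WINDOW = 30 * 24 * 60  # 30-day rolling window

error_budget_minutes = MINUTES_IN_WINDOW * (1 - SLO_TARGET)  # ~43.2 minutes

blip_minutes = 7  # one "transient" incident that self-resolved
budget_spent = blip_minutes / error_budget_minutes

print(f"Error budget: {error_budget_minutes:.1f} minutes per 30 days")
print(f"A single {blip_minutes}-minute blip consumes {budget_spent:.0%} of it")
```

A handful of those “blips” in a month and the budget is gone, which is usually reason enough to go find the event behind them.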
The mistake I often see is people coming up with a reason for the transient issue before any investigation has been done.
“Oh! AWS must be having an issue”.
We need to push back when we see this type of behavior. Concluding that AWS is having issues (without a status page update or confirmation from your rep) is exactly the kind of diagnosis of exclusion that is only warranted after all other possibilities have been ruled out. As one medical journal paper put it:
“the DOE becomes nothing more than an ongoing educated guess” 1
I don’t think this post would be complete without my own admission of a time I chalked something up to just being a transient issue. At a previous job, I had recently switched a large set of servers over to one of Amazon’s new instance types. We began noticing a few servers reporting their instance status check as unhealthy. They wouldn’t respond to restart commands, so we’d terminate the instance, launch a replacement, and all was well again in the world. “Transient issue!” I declared.
Then the transient issues became persistent.
My first assumption was that Amazon’s newer hardware was going through the initial failure-rate stage of the Bathtub Curve. Given that we had a decent number of servers, we were simply likely to see more of these failures. The impact to us was minimal since our load balancer properly removed the bad instances each time. But still, it was causing a bit of toil for us.
We then started noticing that, even after being replaced, some instances would fail again within hours. Odd! We were now seeing 5-10% of our servers fail in a day. We opened a ticket with AWS and began a deeper dive on our side, trying to rule out anything we could. During that deep dive, we discovered kernel panic messages 2 on the console output of the servers.
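If you ever need to do a similar hunt, the instance console output is retrievable through the EC2 API. Here’s a rough sketch using boto3; the instance IDs are placeholders, and it assumes your credentials and region are already configured:

```python
# Sketch: scan EC2 console output for kernel panic messages across a set of
# suspect instances. Instance IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

suspect_instances = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

for instance_id in suspect_instances:
    resp = ec2.get_console_output(InstanceId=instance_id)
    output = resp.get("Output", "") or ""
    if "kernel panic" in output.lower():
        print(f"{instance_id}: kernel panic found in console output")
    else:
        print(f"{instance_id}: no panic message (console output can lag behind)")
```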
We never got to the bottom of exactly what was happening, but it appeared there was an issue between the underlying hypervisor and our kernel version. We did a lot of back-and-forth debugging with AWS, and ultimately we had to move off these instance types. The important thing was that we were able to eliminate causes on our end and pinpoint where the issue was occurring, even if we didn’t know the exact underlying cause. Had we not dug further, we would have been continuously replacing servers. A few months later, we attempted the new instance types again with a new major kernel version. Things were stable.
Remember that temporary symptomatic improvement can occur, even when the treatment is aimed at the wrong disease. 1
Before reaching the “diagnosis of exclusion” conclusion, have you ruled out all that you can with the information available? What data are you missing that would let you dig deeper into the issue and discover a contributing factor? Are you missing a log line, a system metric graph, or some application instrumentation?
In a cloud-driven world, we often won’t have the access or tools to fully know the “why” behind some issues. However, we should have enough information available to point to the issue at the cloud provider level. The “why” then becomes less important in many (Note: I’m not saying all) issues outside your direct control.
We should understand where the failure occurred so we can architect around these issues and provide information to the provider for deeper debugging. I don’t need to know what caused a networking hiccup on AWS between my DB and my application that resulted in having to restart my app servers to reconnect to the DB. I may never actually know the reason if it wasn’t a massive outage affecting other customers, but I have enough information to know that my application needs to handle that type of failure gracefully without a human intervening.
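For that particular hiccup, “handle it gracefully” mostly means retrying the reconnect with backoff instead of paging a human. A minimal sketch, where connect_to_db() and run_query() are hypothetical stand-ins for your actual client library:

```python
# Sketch: reconnect to the database with jittered exponential backoff instead
# of requiring a human to restart app servers. connect_to_db() and run_query()
# are hypothetical stand-ins for your real client library.
import random
import time


def with_reconnect(run_query, connect_to_db, max_attempts=5):
    """Run a query, reconnecting with jittered exponential backoff on failure."""
    conn = connect_to_db()
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(conn)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries; now it's worth alerting a human
            # Back off 1s, 2s, 4s, ... plus jitter to avoid a thundering herd.
            time.sleep(2 ** (attempt - 1) + random.uniform(0, 1))
            conn = connect_to_db()  # assume the old connection is dead
```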
1. “The Diagnosis of Exclusion: An Ongoing Uncertainty” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783127/)
2. Bug report with more detail on the issues we saw: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1178707