The USE method for troubleshooting

13 Aug 2018, 08:01

linux / troubleshooting

Julia Evans (@b0rk) has been posting fantastic little cartoons describing different UNIX commands. I learn a little something new from every single post, along with its comments. She posted this little gem about the top command: <blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr">top <a href="https://t.co/RV51i3K65n">pic.twitter.com/RV51i3K65n</a></p>— 🔎Julia Evans🔍 (@b0rk) <a href="https://twitter.com/b0rk/status/1022331694811099137?ref_src=twsrc%5Etfw">July 26, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

She then followed that up with a link to one of her sources for understanding top and, in particular, what the "load average" represents on Linux. <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">just reread <a href="https://twitter.com/brendangregg?ref_src=twsrc%5Etfw">@brendangregg</a>'s blog post on what linux load averages mean and it's SO GOOD <a href="https://t.co/S9y4Sontw0">https://t.co/S9y4Sontw0</a></p>— 🔎Julia Evans🔍 (@b0rk) <a href="https://twitter.com/b0rk/status/1022326293164122119?ref_src=twsrc%5Etfw">July 26, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> And she's right, that post is SO GOOD. He explains that on Linux the load averages approximate overall system load rather than CPU demand alone, and how and why that works for the purpose. His brief explanation at the top of the post shows a good way to think about the load average from top:

<blockquote> Some interpretations:

If the averages are 0.0, then your system is idle.
If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
If they are higher than your CPU count, then you might have a performance problem (it depends).

As a set of three, you can tell if load is increasing or decreasing, which is useful. They can be also useful when a single value of demand is desired, such as for a cloud auto scaling rule. But to understand them in more detail is difficult without the aid of other metrics. A single value of 23 - 25, by itself, doesn't mean anything, but might mean something if the CPU count is known, and if it's known to be a CPU-bound workload.

Instead of trying to debug load averages, I usually switch to other metrics. I'll discuss these in the "Better Metrics" section near the end. </blockquote>
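To make those rules of thumb concrete, here is a minimal Python sketch of them. It is my own illustration, not code from the quoted post. The function name `interpret_load` is hypothetical, and I'm assuming that "higher than the 5 or 15 minute averages" means higher than both, and that "higher than your CPU count" applies if any of the three averages exceeds it.

```python
import os


def interpret_load(one, five, fifteen, cpu_count):
    """Apply the quoted rules of thumb to a set of three load averages."""
    notes = []
    if one == 0.0 and five == 0.0 and fifteen == 0.0:
        notes.append("idle")
    elif one > max(five, fifteen):
        # Assumption: "higher than the 5 or 15" is read as higher than both.
        notes.append("increasing")
    elif one < min(five, fifteen):
        notes.append("decreasing")
    # Assumption: any average above the CPU count is worth a second look.
    if max(one, five, fifteen) > cpu_count:
        notes.append("possible performance problem (it depends)")
    return notes


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(interpret_load(one, five, fifteen, cpus))
```

As the quote says, this only gives a direction and a rough threshold. It tells you nothing about which resource is actually under pressure.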

In the post he references his own USE method for troubleshooting and analysis. As if the top post wasn't enough of an educational rabbit hole, I traveled down further into his site.

When you're starting to troubleshoot a performance problem, what do you check? The USE method, which he created, says that for every resource you should check metrics for Utilization, Saturation, and Errors. It is a strategic approach that helps you quickly narrow down where the issue lies and helps eliminate confirmation bias.
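One way to picture the method is as a checklist: every resource crossed with the three metric types. The sketch below is only my illustration of that idea. The resource names are a sample list I chose, and `use_checklist` and `unchecked` are hypothetical helpers, not anything from his site.

```python
from itertools import product

# Illustrative resource list; a real system has more (buses, interconnects, ...).
RESOURCES = ["CPU", "memory", "storage", "network"]
METRIC_TYPES = ["utilization", "saturation", "errors"]


def use_checklist(resources=RESOURCES):
    """Return every (resource, metric type) pair the USE method asks about."""
    return list(product(resources, METRIC_TYPES))


def unchecked(checklist, findings):
    """Return the pairs that have no recorded finding yet."""
    return [item for item in checklist if item not in findings]


if __name__ == "__main__":
    # Findings recorded so far, keyed by (resource, metric type).
    findings = {
        ("CPU", "utilization"): "about 40% busy",
        ("CPU", "saturation"): "run queue looks short",
    }
    for resource, metric in unchecked(use_checklist(), findings):
        print(f"still to check: {resource} {metric}")
```

The list of what is still unchecked is the useful part: it shows the gaps that working from experience alone tends to skip.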

Typically when I troubleshoot an issue, I start with some general assumptions about where I think the problem might lie, based on my experience, and then work from specific to general. As I come across indications of other or related issues, I chase them down. This works pretty well, but it relies heavily on my own experience and is subject to my biases. The USE method is more generic, and while specific familiarity and experience help, you can follow it without them. That said, troubleshooting may take longer as you learn the tools that produce metrics for individual resources. Even if I follow my own methods, it's good to have another strategy to step back and help answer the question: "Have I missed anything here that might alter my conclusions?"

In any given system you've got roughly the same resource categories, including CPUs, RAM, the buses and transports between them, storage, and networking. It's rather important to understand the communication paths between resources, how queuing may be employed, etc. If you don't have one already, take the time to locate or draw a quick block diagram. In doing so, you may find sections where you simply don't know the paths between resources, and seeing them will prompt you to suss out that detail.

The biggest thing to understand is, "what is this tool showing me: a utilization metric or a saturation metric?" The site provides a wealth of detail on the tools used to gather metrics. It is a fantastic reference. He also makes the important point that in a cloud/virtualization context you really need to understand the perspective you have. If you're using a shared resource and viewing memory stats within a guest, one guest's high utilization may be measured against a more global cap.

From a little bite-size Twitter post, I discovered a wealth of new learning material that I am still devouring. While I do take an organized approach to learning new things and developing my skills, it's both enjoyable and important to leave a little time for random discovery.
