
Sampling Profilers Suck

Carlos Reyes

2023-10-28

I've been working on making C++ programs faster for a long time. Most of the popular profilers for this job use a method called "sampling" to measure how well a program is doing. I've used them a lot, so I know they're not perfect.

Don't get me wrong, these tools are really popular and lots of people use them to get helpful information. But there's a catch: people often use them without understanding what these tools can't show. Plus, sometimes the results you get can be a bit off.

I want to give credit to two big-name tools in the C++ performance world. They've been around for a while and have lots of features. But let me tell you, getting clear, easy-to-use results from them can be a headache!

So, if this sounds like I'm venting, I'm sorry. It's just that dealing with these issues pushed me to make my own solution. That's why I created Giopler, a tool built to make profiling and debugging C++ programs easy.

Intel VTune Profiler [1]

When it comes to code profilers, VTune (used to be called Intel VTune Amplifier) is usually the first name that pops into our heads. It's a sleek tool with a user-friendly interface. It even has special features like GPU profiling, though I've never had to use that. But VTune comes with some strings attached. For instance, you need admin rights to install it because it messes with the Linux kernel. Beyond that, I can't say much about how it works. It's closed-source, and Intel keeps the details under wraps.

A few years back, I tried using VTune in a virtual environment, and it was a mess. I couldn't get any helpful error messages; it just didn't work. My guess is that VTune needs more access to the computer's hardware than a virtual environment allows. This was a big hurdle for me and the projects I was working on.

Lately, VTune has become part of Intel's oneAPI toolkit. I've tried getting it to work on Linux, but no luck. So I can't really talk about the newer versions. All I can say is that the older, pre-oneAPI versions are what I'm familiar with.

Intel vs AMD

VTune is a product of Intel and its primary purpose is to enable developers to profile and optimize code for Intel CPUs. Given that Intel manufactures the CPUs and has intimate knowledge of their architecture, micro-architecture, and performance counters, VTune is tailor-made to extract the best possible performance metrics from Intel chips. It is designed to leverage Intel-specific features and counters that may not be present in other architectures.

When it comes to AMD CPUs, there are several factors to consider:

  1. Different Architectures: AMD and Intel have fundamentally different architectures and micro-architectural features. This means the way they execute instructions, manage caches, or even handle threading can be quite different. A tool optimized for Intel's intricacies might not be the best fit for AMD's specific features.
  2. Performance Counters: One of the main ways that profilers like VTune gather information is through performance counters embedded in the CPU. These counters keep track of various metrics like cache misses, branch mispredictions, instruction retirements, and many more. Intel and AMD CPUs have different sets of counters, and even if they have counters with similar purposes, they might not be accessed in the same way. VTune is optimized to access and interpret Intel's performance counters, not AMD's.
  3. Lack of Full Support: Given that VTune is designed primarily for Intel CPUs, it's logical to assume that not all of its features and capabilities would be supported or work optimally on AMD CPUs. Some advanced analysis features might rely on Intel-specific technologies or extensions that simply aren't present in AMD CPUs.
  4. Potential Compatibility Issues: When using VTune on AMD systems, users might encounter compatibility issues or less reliable results. As mentioned, in virtual environments or certain configurations, VTune might not work smoothly with non-Intel hardware. This can lead to incomplete data, misinterpretations, or even tool crashes.
  5. Bias towards Intel Optimizations: Certain recommendations or insights that VTune provides could be based on optimizations that are most effective on Intel CPUs. Using these recommendations on AMD hardware might not yield the same performance benefits, and in some cases, could even be counterproductive.

That said, it's not that VTune cannot be used at all with AMD CPUs. Basic profiling tasks might still work, and some insights can be gleaned. However, for detailed, accurate, and comprehensive performance analysis on AMD hardware, it's often better to use tools specifically designed or optimized for AMD architectures, like AMD's own profiling tools. AMD's sampling profiler is called µProf [2]. I have never used it.

Remember, the key to effective profiling is not just collecting data but interpreting it correctly. Using a tool that doesn't fully understand the architecture it's analyzing can lead to misleading results, missed optimization opportunities, and wasted effort.

Linux Perf [3]

Ah, Linux Perf, you're a mixed bag. You've got your flaws, but you've saved me when VTune just wouldn't cooperate. Linux Perf is a bunch of tools you run from the command line. They tap into features built right into the Linux operating system. Because the profiling stuff is part of the system itself, Perf usually just works. At least, when you know what you are doing.

But there's a downside: Perf isn't easy to use. It gives you a super technical, raw look at what's going on inside the computer. I've tried explaining how to use the perf-record command to other programmers and hit a wall. You need to really get modern CPUs to use it, and that's a lot to ask of most programmers.

As for the tool that helps you look at the data you collect—well, it's pretty basic, to put it kindly. Mastering it is a steep climb, and even when you get the hang of it, it's slow.

Profiling events

Now, let's get into the nitty-gritty of how these profilers do their job—or sometimes don't. Imagine a profiler as a camera taking quick snapshots of a program while it's running. These snapshots are called "profiling events." When you run a profiling program, you get a whole series of these events, often in the thousands.

So what's in one of these snapshots, or events? At the very least, it shows you where the program is at that moment—like, what part of the code is running. Most of the time, you also get a look at the "program stack," a sort of to-do list the program is working through. Some profilers even check counters in the CPU to give you more data. And they can pull in all sorts of other info about what the system is up to.
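To make that concrete, here's a rough sketch of what a single event might look like as a data structure. The field names are mine, purely for illustration; every real profiler defines its own layout.

    // Rough sketch of one profiling event ("snapshot"). Field names are
    // illustrative only, not taken from any particular profiler.
    #include <array>
    #include <cstdint>

    struct ProfilingEvent {
        std::uint64_t timestamp_ns;                // when the snapshot was taken
        std::uint64_t instruction_pointer;         // where the program was
        std::array<std::uint64_t, 64> call_stack;  // return addresses (the "to-do list")
        std::uint32_t stack_depth;                 // how many stack entries are valid
        std::uint64_t cache_misses;                // optional CPU counter values
        std::uint64_t branch_mispredictions;
    };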

Sampling profilers

Sampling profilers, such as Intel VTune and Linux Perf, work like clockwork interrupters. They pause the program threads you're running many times every second. While the threads are paused, the profiler looks at what the program is doing and gathers the data for its snapshots, also known as profiling events.

You might end up with millions of these events from just one run of your program. That's a lot of data! So, these profilers are smart about it. They compress the information so it doesn't hog too much space. The fact that they collect data in a regular, simple way makes it easier to pack all that information down. Of course, then they have to work to decompress the data before they can use it.
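If you're curious what the interrupt-driven approach looks like at its simplest, here's a toy version using a POSIX profiling timer on Linux. Real profilers like Perf and VTune rely on kernel and hardware support rather than signals, so treat this strictly as a model of the idea.

    // Toy model of interrupt-driven sampling on Linux: a SIGPROF timer
    // fires at a fixed rate and the handler records that the program was
    // interrupted. A real handler would capture the instruction pointer
    // and call stack; here we only count samples (async-signal-safe).
    #include <atomic>
    #include <csignal>
    #include <cstdio>
    #include <sys/time.h>

    std::atomic<long> g_samples{0};

    void on_sample(int) {
        g_samples.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::signal(SIGPROF, on_sample);
        itimerval timer{};
        timer.it_interval.tv_usec = 1000;   // fire every 1 ms: a 1,000 Hz sample rate
        timer.it_value.tv_usec = 1000;
        setitimer(ITIMER_PROF, &timer, nullptr);

        volatile double x = 0;              // busy work for the timer to sample
        for (long i = 0; i < 100'000'000; ++i) x = x + i * 0.5;

        std::printf("collected %ld samples\n", g_samples.load());
    }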

Sample rate

Let's go deeper into a key setting in sampling profilers: the "sample rate." This decides how many times per second the profiler interrupts the running program to collect data. Understanding this can explain some of the issues we might face.

Imagine a super-short function that takes just a microsecond (a millionth of a second) to run. To get an accurate picture of how much CPU time that function uses, the profiler would need to sample it two million times a second. This requirement comes from something called the Nyquist rate [4], a concept from signal processing. The Nyquist rate assumes that you're collecting data at perfect, regular intervals [5]. But even tiny delays can make the data unreliable, and modern CPUs are full of these little delays.

Here's where it gets tricky. Linux Perf usually samples at 1,000 or 4,000 times per second, depending on the version of the operating system. Intel VTune samples at 100 or 1,000 times per second, depending on the mode. At 1,000 samples per second, for example, you start losing accuracy on any function that runs for less than two milliseconds. So, in most cases, you're sampling at rates much too low to capture the details of many functions accurately.
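The arithmetic behind those numbers is easy to check yourself. Assuming the Nyquist-style rule of thumb above, resolving a function of duration d seconds takes a sample rate of at least 2/d samples per second:

    // Back-of-the-envelope Nyquist check: resolving a function of
    // duration d seconds needs a sample rate of at least 2/d per second.
    #include <cstdio>

    int main() {
        const double durations[] = {1e-6, 1e-4, 2e-3};   // 1 us, 100 us, 2 ms
        for (double d : durations)
            std::printf("a %g-second function needs >= %.0f samples/sec\n",
                        d, 2.0 / d);
    }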

If you're lucky, the shorter functions might not even show up in your data. That's not ideal, but at least it won't give you a wrong idea. More likely, you'll get times that are just plain wrong. And you won't have any easy way to know how far off they are.

VTune's highest sampling rate is 100,000 samples per second, still 20 times too slow to capture our one-microsecond function accurately. Linux Perf adjusts its maximum rate based on how long it takes to process the event data. I've found that it's usually around 20,000, which is 100 times too slow.

And don't forget, cranking up the sample rate isn't a free ride. More data means bigger files and more analysis time. Plus, your program will run slower while you're collecting all that extra data.

Function call counts

Let's say you've got a function in your program that's running slower than you'd like. Naturally, you turn to your profiler for clues. A key question pops into your head: Is this function being called too often, or is it just slow to begin with? Maybe it's both. Knowing this would be super helpful for fixing the issue. But here's the catch: a sampling profiler can't give you that answer.

To me, that's a deal-breaker. It might sound harsh, but I can't see it any other way. If you want to make your program faster, you've got two main options. You can either cut down on how many times that slow function is called or speed up the function itself. These are totally different approaches. A sampling profiler doesn't help you choose between them, and that's a big problem.
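For contrast, here's the kind of minimal instrumentation a tracing profiler relies on to answer exactly that question. This is a bare-bones, single-threaded sketch of the general technique, not Giopler's actual API:

    // Minimal RAII-style call counting: exact call counts and total time
    // per function, which is the data a sampling profiler cannot give you.
    // Single-threaded sketch; a real tool would need thread-safe storage.
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    struct FunctionStats {
        long long calls = 0;
        std::chrono::nanoseconds total{0};
    };

    std::unordered_map<std::string, FunctionStats> g_stats;

    class ScopedTimer {
    public:
        explicit ScopedTimer(std::string name)
            : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            auto& s = g_stats[name_];
            ++s.calls;                            // exact call count
            s.total += std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start_);
        }
    private:
        std::string name_;
        std::chrono::steady_clock::time_point start_;
    };

    void slow_function() {
        ScopedTimer t("slow_function");
        // ... real work ...
    }

    int main() {
        for (int i = 0; i < 1000; ++i) slow_function();
        for (const auto& [name, s] : g_stats)
            std::printf("%s: %lld calls, %lld ns total\n",
                        name.c_str(), s.calls,
                        static_cast<long long>(s.total.count()));
    }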

Compiler optimizations

Sampling profilers are like detectives for computer programs. They look at the machine code (the language computers actually execute) to figure out what part of your original source code was running when an interrupt occurred. This is tricky because the machine code the computer runs is often very different from the code we write. Even at standard compiler optimization settings, the final machine code can look quite different from the source.

One common way compilers make code run faster is 'function inlining': copying a function's body directly into its callers instead of generating a call, and this happens even at low optimization settings. This can confuse sampling profilers, which struggle to identify where these inlined functions start and end in the machine code.
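A tiny example makes the problem obvious. At -O2, almost any compiler will inline square() below into its caller, so samples taken inside the loop get charged to sum_of_squares(); square() may never appear in the profile at all:

    // Why inlining confuses sample attribution: after inlining there is
    // no call and no stack frame for square(), so a sampling profiler
    // attributes all of its time to sum_of_squares().
    #include <cstdio>

    static int square(int x) { return x * x; }

    int sum_of_squares(const int* v, int n) {
        int total = 0;
        for (int i = 0; i < n; ++i)
            total += square(v[i]);   // likely inlined away at -O2
        return total;
    }

    int main() {
        int v[] = {1, 2, 3, 4};
        std::printf("%d\n", sum_of_squares(v, 4));   // prints 30
    }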

When I profile my code, I like to do it with the same optimization settings that I use in production. That seems like common sense. But if you use a sampling profiler on an optimized build of your application, you might not get accurate results.

Tracing profilers, like Giopler, don't have this problem. Their instrumentation is ordinary code in your program, and the compiler is obligated to preserve its behavior no matter what optimization level is used.

Profiling overhead

Sampling and tracing profilers are tools that help us understand how our programs behave. Both need to briefly pause the program and gather data to do their job. I've used both types a lot and know how they work inside and out. In practice, their overhead is pretty similar, even though they collect data in different ways.

Both kinds of profilers use smart tricks to do their job without slowing down the program too much. Giopler, our profiler, is already good at this, and we have ideas to make it even better. But remember, this section is just a general idea of how they work.

Sampling profilers pause your program many times every second to see what's happening. They're like taking a bunch of quick snapshots. Tracing profilers are a bit different. They only pause your program when it starts and finishes certain tasks you're interested in. This means they interrupt the program way less than sampling profilers.

Giopler does something cool with its snapshots. It also checks things like how much free memory the computer has and how busy the CPU is. This adds more useful information to the snapshots. The downside is that it takes a bit longer to gather this info. But I think it's worth it, and it was an easy decision to include this feature in Giopler.

Data instead of information

Sampling profilers capture data for every function they come across, not just the ones you've written. This includes all the extra functions from third-party libraries and possibly even the operating system functions your program uses. Sure, tools like Linux Perf offer ways to limit the scope, but I find those features clunky and error-prone.

Honestly, do I care how long some random library function takes? Probably not. That code is most likely already well-optimized. Plus, the chances of me wanting to rewrite it are slim to none. That's even assuming I know what that function does, which is often not the case.

I see all this extra data as just background noise. Sure, sampling profilers give you a mountain of data, but how much of it can you actually use? Not much. It's like trying to hear a soft melody in the middle of a loud concert; the real insights are there, but they're drowned out by all the other noise.

Advantages of sampling profilers

Sampling profilers are sometimes praised for giving a quick snapshot of a program's performance. But let's be real, the issues I've talked about make me question how useful that snapshot really is. Deciphering the data can be a headache and takes a lot of time. Worse yet, they don't even give you all the details you need to make smart changes to your code.

And let's not forget, the data itself can be iffy. Sure, you get a ton of numbers thrown at you, but what can you actually do with them? It's like having a map full of landmarks but no roads. You see where things are, but you have no idea how to get there. So, while these tools may be "quick" at generating data, how about turning that data into real improvements for your program? Not so quick at all.

Tracing profilers

Tracing profilers, such as Giopler, work a bit differently. Instead of periodically interrupting the program, they let the program's own threads send out events. At a minimum, these events happen when a function starts and stops. But that's not all. Giopler can send out events for many other reasons too. For instance, it has a special library to help you find bugs in your code. And we're planning to add other features, such as support for unit testing.

Because of the way they work, sampling profilers usually don't capture these extra types of events. They're focused more on taking quick, periodic snapshots. So, they miss out on a lot of the richer details that a tracing profiler like Giopler can provide.

Server offloading

At Giopler, our powerful servers do the grunt work so your computer doesn't have to. As soon as an event is received, it's rapidly broken up into 10+ tables using 50+ indexes. Why? Because we understand that data analysis is about asking "what if?" Being able to pivot your inquiry instantly is crucial, so we precompute as much data as we can. We aim to save you time, not waste it. That's what sets us apart from tools that are confined to your local computer.

But what about the code running on your end? We've designed it to be as lightweight as possible. All it does is quickly take a snapshot of the running thread and set it aside for later. We have a separate group of threads responsible for packaging, compressing, and sending this data to our servers. This way, the impact on your program's performance is minimal.
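The general shape of that "record now, ship later" pattern looks something like the sketch below. This is my illustration of the technique, not Giopler's actual implementation:

    // "Record now, ship later": the instrumented thread only enqueues a
    // small event record; a background thread batches events for
    // compression and upload, keeping the hot path cheap.
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct Event { std::uint64_t timestamp_ns; std::uint32_t function_id; };

    class EventQueue {
    public:
        void push(Event e) {                      // called from instrumented threads
            { std::lock_guard lock(mutex_); queue_.push(e); }
            cv_.notify_one();
        }
        [[noreturn]] void drain_loop() {          // runs on a background thread
            for (;;) {
                std::unique_lock lock(mutex_);
                cv_.wait(lock, [&] { return !queue_.empty(); });
                std::vector<Event> batch;
                while (!queue_.empty()) { batch.push_back(queue_.front()); queue_.pop(); }
                lock.unlock();
                send(batch);                      // compress + upload off the hot path
            }
        }
    private:
        static void send(const std::vector<Event>&) { /* network I/O goes here */ }
        std::queue<Event> queue_;
        std::mutex mutex_;
        std::condition_variable cv_;
    };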

In a nutshell, Giopler offers you the best of both worlds: minimal impact on your local resources, and robust data analysis on our end.

Intelligent tagging

In tracing profilers like Giopler, you have to manually tag or annotate specific functions to understand their behavior better. While some may consider this an extra step or even a downside, it can actually be an advantage when you look at the bigger picture.

By manually tagging functions in your code, you're essentially embedding your understanding and context directly into the data that the profiler will collect. This creates a richer, more nuanced dataset that is far easier to interpret and act upon later. It's a one-time investment of effort that can yield significant long-term benefits, such as:

  1. Clearer Insights: Because you know your code best, tagging key functions can help you and your team to immediately focus on the areas that matter the most.
  2. Reduced Noise: Tagging helps in filtering out irrelevant functions and processes, making it easier to drill down into the real issues affecting performance.
  3. Ease of Collaboration: Tagged code can serve as a sort of documentation, making it easier for team members to understand what each part of the code is supposed to do, and how it fits into the overall performance picture.
  4. Future-proofing: As your project evolves, these tags serve as markers to quickly identify if and how code changes are affecting performance in areas that were previously optimized or debugged.

So, rather than seeing manual tagging as an extra burden, view it as an empowering feature that enhances the utility of the profiling process. By doing it once in the code, you're saving time and adding clarity for every future profiling run.
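To make this concrete, here's roughly what a tagged function might look like. The names here are hypothetical, for illustration only; they are not the actual Giopler API:

    // Hypothetical RAII tagging annotation (illustrative, not the real
    // Giopler API): the tag embeds your own context into every event
    // emitted from the scope.
    #include <cstdio>
    #include <string_view>

    class ProfileScope {
    public:
        explicit ProfileScope(std::string_view tag) : tag_(tag) {
            std::printf("enter %.*s\n", static_cast<int>(tag_.size()), tag_.data());
        }
        ~ProfileScope() {
            std::printf("leave %.*s\n", static_cast<int>(tag_.size()), tag_.data());
        }
    private:
        std::string_view tag_;
    };

    void parse_config() {
        ProfileScope scope("config/parse");  // only tagged scopes show up
        // ... parsing work ...
    }

    int main() { parse_config(); }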

Unwanted overhead

Tracing profilers can be tricky. Why? Some think they'll make your program run slower, even if you're not using them. But with Giopler, thanks to some cool updates in C++20, this isn't much of an issue. Here's why.

Giopler has a compile-time build mode that controls how it behaves. When you turn a feature off for a build, its annotations compile down to nothing. In simple terms, they're just gone. No extra weight!

Now, for some technical bits: there are features in Giopler tied to local class variables. Due to C++ rules, these can't vanish entirely. But we've done the next best thing. We insert an unused std::unique_ptr variable, which is kind of like a placeholder that doesn't do much. And even in small functions, this barely adds any work for your program.
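Here's a sketch of that pattern as I understand it; this is my reading of the technique described above, not Giopler's actual source:

    // Compile-time on/off for annotations. When profiling is disabled,
    // the scope object is just an unused, null unique_ptr that the
    // optimizer can strip almost entirely.
    #include <cstdio>
    #include <memory>
    #include <string>

    #if defined(PROFILING_ENABLED)               // hypothetical build flag
    struct TraceScope {                          // enabled: owns real event data
        explicit TraceScope(const char* name)
            : data(std::make_unique<std::string>(name)) {}
        ~TraceScope() { std::printf("leave %s\n", data->c_str()); }
        std::unique_ptr<std::string> data;
    };
    #else
    struct TraceScope {                          // disabled: compiles to ~nothing
        explicit TraceScope(const char*) {}
        std::unique_ptr<int> data;               // stays null; no allocation
    };
    #endif

    void hot_function() {
        TraceScope scope("hot_function");        // near-zero cost in production
        // ... real work ...
    }

    int main() { hot_function(); }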

So, the bottom line? Don't stress about Giopler slowing things down in production mode. The benefits are totally worth it, and any slowdown is super tiny.

Summary

In conclusion, Giopler isn't just another profiler—it's an evolution. It embodies the understanding of the challenges developers face when working with sampling profilers. It offers a fresh perspective and provides solutions that bridge the gap between accuracy, efficiency, and usability.

While traditional tools like Intel VTune and Linux Perf have their merits, they also come with certain limitations. Sampling profilers can offer a quick view, but they often don't go deep enough or provide the actionable data you really need to improve a program's performance. With their set sample rates, it's easy to miss out on capturing the intricacies of short functions and understand their true impact on performance.

Giopler, on the other hand, operates on a different principle. As a tracing profiler, it offers rich and accurate insights into your program's behavior. By allowing you to manually tag functions, Giopler ensures that you're collecting data only on what truly matters, reducing noise and making analysis more meaningful. With our robust data analysis happening on powerful servers, Giopler ensures that the heavy lifting doesn't impact your local resources.

Moreover, with the Giopler C++20 client library, you're not just getting a tool, you're getting a highly optimized and streamlined library that's easy to integrate and work with. Its modular design ensures that adding it to your workflow is hassle-free, and it has been crafted keeping the best coding practices in mind.

So, if you've ever been frustrated with the limitations of traditional profiling tools or if you're simply looking for an easier, more intuitive, accurate, and efficient way to analyze your C++ programs, Giopler is here to change the game for you.


About Giopler

Giopler is a fresh approach for writing great computer programs. Use our header-only C++20 library to easily add annotations to your code. Then run our beautiful reports against your fully-indexed data. The API supports profiling, debugging, and tracing. We even support performance monitoring counters (PMCs) without ever requiring administrative access to your computer.

Compile your program, switching build modes depending on your current goal. The annotations go away with zero or near zero runtime overhead when not needed. Our website has charts, tables, histograms, and flame/icicle performance graphs. We support C++ and Linux today, others in the future.



Footnotes

  1. https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
  2. https://www.amd.com/en/developer/uprof.html
  3. https://perf.wiki.kernel.org
  4. https://www.wikiwand.com/en/Nyquist_rate
  5. https://www.dataq.com/data-acquisition/general-education-tutorials/what-you-really-need-to-know-about-sample-rate.html