I recently came across a post on the Intel® Many Integrated Core Architecture (Intel MIC Architecture) forum wherein the developer was expecting a certain count for a hardware event but this count was always zero. There are several cases in which you can encounter this behavior. Let’s walk through each scenario and take a closer look at what could be happening during the hardware event collection.
Let’s start with the simplest, though most improbable, cause. You are seeing zero counts for a particular event because that hardware event never occurred. If the application from which you are measuring events happens to fall into this category, then you are one of those super awesome, lucky developers, who have tuned their code to perfection and the code itself is a superb match for the hardware.
If you think that you are not one of “those” lucky developers, then the only other possibility is that the data provided by Intel VTune Amplifier is not representative of what really happened in the hardware (provided you did everything correctly!). To figure out why Intel VTune Amplifier didn’t correctly perform your data collection, let’s take a step back and understand how Intel VTune Amplifier collects hardware events. Intel VTune Amplifier depends on the Performance Monitoring Units (PMUs) present in the silicon to collect hardware event statistics. During a collection, Intel VTune Amplifier’s driver programs the PMU to monitor a set of hardware events and sets a `sample after` value (SA) for each event. During your program’s execution, the PMUs continuously monitor the hardware events, and each time `SA` occurrences of an event have been counted, the PMU generates an interrupt. On receiving the interrupt, Intel VTune Amplifier’s driver swoops in and collects one sample, which represents the `SA` occurrences of the hardware event that led up to it.
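The collection mechanism described above can be sketched with a few lines of Python. This is only a simplified model of sample-after counting, not Intel VTune Amplifier’s actual driver logic; the function name `sampled_estimate` is hypothetical and chosen for illustration.

```python
def sampled_estimate(true_event_count: int, sample_after: int) -> tuple:
    """Model a PMU counter that raises one interrupt per `sample_after` events.

    Returns (samples, estimated_count). Simplified illustration only;
    it ignores interrupt latency, skid, and multiplexing.
    """
    if sample_after <= 0:
        raise ValueError("sample_after must be positive")
    # One sample is taken each time the counter reaches the SA value.
    samples = true_event_count // sample_after
    # Each sample stands in for SA occurrences of the event.
    return samples, samples * sample_after


print(sampled_estimate(250_000, 100_000))  # (2, 200000)
print(sampled_estimate(99_999, 100_000))   # (0, 0)
```

In the second call the event really did occur 99,999 times, but the counter never reached the `sample after` value, so no sample was taken and the reported count is zero.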
Now we can see where things can go wrong. If the number of hardware events that occurred in your program is less than the `sample after` value, then no samples will be taken by Intel VTune Amplifier, and it will assume that no hardware events occurred during the run (even though they did), thus reporting a zero count. Another point to understand here is that Intel VTune Amplifier does not report the exact number of hardware events but rather gives you an approximation with a granularity equal to the `sample after` value of the event. Hence, one way to fix this is to create a custom analysis in Intel VTune Amplifier and set a lower `sample after` value. Empirically, events occurring fewer than 10,000 times are insignificant, and hence the `sample after` value should be no lower than 10,000. Lowering the `sample after` value further will also perturb the system and produce incorrect results. In general, the `sample after` value should be chosen such that it does not generate an interrupt more frequently than once every 10 ms.
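The two rules of thumb above (a floor of 10,000 and at most one interrupt every 10 ms) can be combined into a small helper. This is a sketch that assumes the event occurs at a roughly steady rate; the function name `lowest_safe_sample_after` and the constant names are hypothetical, and only the two thresholds come from the text.

```python
MIN_SAMPLE_AFTER = 10_000        # floor suggested in the text
MAX_INTERRUPTS_PER_SECOND = 100  # one interrupt per 10 ms


def lowest_safe_sample_after(events_per_second: int) -> int:
    """Smallest `sample after` value that satisfies both rules of thumb.

    Assumes a steady event rate, which real applications rarely have.
    """
    if events_per_second < 0:
        raise ValueError("events_per_second must be non-negative")
    # SA must be large enough that SA events take at least 10 ms to occur
    # (integer ceiling division keeps the arithmetic exact).
    by_rate = -(-events_per_second // MAX_INTERRUPTS_PER_SECOND)
    return max(MIN_SAMPLE_AFTER, by_rate)


print(lowest_safe_sample_after(1_000_000_000))  # 10000000
print(lowest_safe_sample_after(500_000))        # 10000
```

For an event firing about a billion times per second, the interrupt-rate rule dominates; for a rare event, the 10,000 floor does.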
This raises another interesting question: is my profiling data for the default `sample after` value statistically valid? Well, the Intel VTune Amplifier developers have carefully chosen the default `sample after` value for each event so that in most cases your profiling data will be statistically valid and will correctly reflect the behavior of your application. However, in certain cases the data is not statistically valid. The primary reason (besides the ones mentioned above) is that, by default, Intel VTune Amplifier multiplexes between hardware events. Due to hardware limitations, the PMU can monitor only a small set of hardware events at any one time. Thus, if you need more events than the PMU can monitor at once, you would ordinarily have to execute the application multiple times to get all the information you need. Intel VTune Amplifier overcomes this by multiplexing between events during a single run (which is extremely useful in most cases!). However, when multiplexing, Intel VTune Amplifier monitors each event for only part of the run and uses extrapolation to estimate the total number of events. If the number of hardware events for your application was small to begin with, perhaps because the application runtime was short, then this extrapolation will render your results statistically inaccurate. In such cases, you should allow Intel VTune Amplifier to perform multiple runs of the application instead of having it multiplex the events.
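To see why extrapolation can mislead on short or uneven runs, here is a toy model of multiplexing. It assumes that event groups rotate round-robin over equal time slices and that the estimate is scaled by the fraction of slices in which the event was monitored; the source does not describe Intel VTune Amplifier’s actual extrapolation method, so treat this purely as an illustration. The function name `multiplexed_estimate` is hypothetical.

```python
def multiplexed_estimate(events_per_slice, num_groups, group_index, sample_after):
    """Estimate an event's total count when it is monitored only in every
    `num_groups`-th time slice (assumed round-robin rotation)."""
    # Keep only the slices in which this event's group owned the PMU.
    monitored = [n for i, n in enumerate(events_per_slice)
                 if i % num_groups == group_index]
    if not monitored:
        return 0
    # Simplification: the counter carries over between monitored slices.
    samples = sum(monitored) // sample_after
    # Extrapolate from the monitored slices to the whole run.
    scale = len(events_per_slice) / len(monitored)
    return round(samples * sample_after * scale)


steady = [50_000] * 8                     # true total: 400,000
bursty = [0, 0, 0, 400_000, 0, 0, 0, 0]   # true total: 400,000
short = [2_000] * 8                       # true total: 16,000

print(multiplexed_estimate(steady, 4, 0, 10_000))  # 400000
print(multiplexed_estimate(bursty, 4, 0, 10_000))  # 0
print(multiplexed_estimate(short, 4, 0, 10_000))   # 0
```

The steady workload extrapolates well, while the bursty and the short workloads both report zero even though the event occurred.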
Lastly, we have the golden question: when should you use multiple runs over multiplexing? When running a simple hotspot analysis on your application, if you notice that the CPU_CLK_UNHALTED event has an Event Sample Count of less than 10,000, or that the INSTRUCTIONS_EXECUTED event has an Event Sample Count of less than 1,000, then your application probably does not run long enough to provide valid results with event multiplexing. In such cases, you should allow Intel VTune Amplifier to execute the application multiple times in one profiling run instead of multiplexing the events. Another scenario where it is wiser to allow Intel VTune Amplifier to use multiple runs for profiling is when your application’s behavior will not be in a steady state during sampling. You can enable multiple runs in Intel VTune Amplifier by selecting the appropriate check box in the project properties. At this point, I would also like to point out that when using multiple runs for profiling, you are implicitly assuming that your application has repeatable performance. If this is not true, then you should probably refrain from using multiple runs for your application. And lastly, always remember that higher Event Sample Counts indicate greater statistical validity.
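The rule of thumb above can be written down as a small check to run against the Event Sample Counts from a hotspot analysis. This is a sketch: the function name `prefer_multiple_runs` and the dictionary layout are hypothetical, and you would have to read the counts out of your own results by hand. Only the event names and thresholds come from the text.

```python
CPU_CLK_MIN_SAMPLES = 10_000      # threshold for CPU_CLK_UNHALTED
INSTRUCTIONS_MIN_SAMPLES = 1_000  # threshold for INSTRUCTIONS_EXECUTED


def prefer_multiple_runs(sample_counts: dict, steady_state: bool = True) -> bool:
    """Return True if multiple runs look safer than event multiplexing.

    `sample_counts` maps event names to Event Sample Counts (hypothetical
    layout). `steady_state` says whether the application's behavior is
    steady during sampling.
    """
    clk = sample_counts["CPU_CLK_UNHALTED"]
    instr = sample_counts["INSTRUCTIONS_EXECUTED"]
    too_short = clk < CPU_CLK_MIN_SAMPLES or instr < INSTRUCTIONS_MIN_SAMPLES
    return too_short or not steady_state


counts = {"CPU_CLK_UNHALTED": 4_200, "INSTRUCTIONS_EXECUTED": 3_100}
print(prefer_multiple_runs(counts))  # True
```

The check deliberately says nothing about repeatability: if your application’s performance varies from run to run, multiple runs may not be appropriate even when this returns `True`.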
