2015-09-14

Flame Graphs vs Instruments in OS X: using Intel's Performance Counters

TL;DR: as of last update for OS X 10.11 / Instruments.app 7.1, you can't really create meaningful CPI Flame Graphs on OS X because of bugs in Apple's tools, both GUI and command-line. There are some alternatives and workarounds; for an alternative to a CPI Flame Graph, Instruments.app might be good enough. I sent 6 bug reports to Apple, and 5 of them got marked as duplicates; so maybe some Instruments post-7.1 will get good enough to allow the data extraction needed for real Flame Graphs.



I have been trying to generate Brendan Gregg's Cycles-Per-Instruction Flame Graphs on OS X (10.10), mostly to check cache-related CPU stalls. However, after working on it for a good while, it's looking to me like the given method is somewhat shaky, and the same or better insights about the code can be gotten more easily; partly thanks to Instruments.app, … and partly in spite of Instruments.app.


Now, the first thing to notice is that those inspiring CPI Flame Graphs are in fact not showing CPI. Gregg is sampling independently "cycle" events and "resource stall" events, meaning that he's taking a stack trace every time the "cycle" counter counts 1 million, and everytime the "resource stall" event counter counts 1 million [1]

And, if I am not missing anything (unsure because the only example is the truncated stack trace in the blog post, there are no further examples in the GitHub repo, and I didn't get any response to my comments), then those events are treated by Gregg as independent. So, at the end of the sampling and graphing process, what the we get is actually a display of which parts of the code have an imbalance between "resource stall" events and "cycle" events. Which is interesting in itself, but I don't see how that can be called "CPI" – unless heavily quoted [2]. In fact, if anything, it would be (an approximation of) something more like "Cycles per Stall"; when plotted against proper CPI they are not far off, but they certainly are not the same.

So, if we decide that sampling for resource stalls is interesting in itself, turns out that good old [3] Instruments.app is enough for getting most of the bang-for-the-buck: 
  • fire it, 
  • create a Counter document, 
  • choose to sample on event, 
  • choose RESOURCE_STALLS.ANY, 
  • choose any number of events – probably 1M is ok to begin with, and choose the kind of stacks that interest you – user, kernel, both.
  • choose your executable and configure it if needed (params, environment vars)
  • run your code. 
That's it; the execution points (code lines, functions, PC register values, whatever) will be annotated with the quantity of times the trigger found them, and the call stack is there to navigate. If you add the same counter to be captured, then you'll also have an histogram showing how the counter values are distributed vs the running time.

Again, this is mostly useful for heavily looping code, since chance is that any events will happen most frequently on stall-heavy points. And I am betting that one could justify it properly in a Bayesian way, but I am not going to. Hey, exercise for the interested reader!

Instead I'm going to suggest that, for less loopy code, and frankly for a more general and intuitive usage [4], sampling by time might be better. One problem is that Instruments.app's GUI in v6.4, current as of OS X 10.10, does not work for time units different than millisecs; it lets you choose seconds and microseconds, but it doesn't honor that anyway. So if you find yourself wanting shorter periods than millisecs you'll have to for example sample on the "cycle" events. [5]

However, sampling at 1ms might be good enough. When sampling by time the values of the counters at every sample point reflect what happened on that period. So it is now trivial to collect the "cycles" and "instructions" counters for each sample, divide them, and you have the real (though averaged!) CPI for the last period; indeed, you can capture not only raw counters but formulas, and you can see that the IPC formula is provided by default in the gear menu. And it's even correctly defined: Instructions executed / Cycle. No need to take a detour through "resource stalls".

The histograms, though useful, where pretty crappy on v6.4; but in v7 they are noticeably improved. Pity that the rest of bugs seem in place still.

The other detail is that Instruments.app can capture more than PMC counters on the same run; for example, we could coordinate other instruments together, even DTrace custom scripts, to for example see how IO activity or thread switches relate to cache misses.

If only Instruments.app was more robust.


But I want a Flame Graph!


You can't have it, no-thanks to Instruments.app. Let me explain why, in heart-breaking detail.

If we really wanted now to go the extra mile and make a Flame Graph, we would have to export the captured data and digest it in some way that can be then fed to the flamegraph.pl script. Exporting is in principle simple, though cumbersome; there is no way to automate it from the command line, but rather one has to go into Instruments.app, fumble a bit and choose the data export option. There is some AppleScript support in Instruments, but it is not good enough for automating this, since the first step is visually organising the call tree table columns, because they define how the data is exported. Also one has to unfold manually all the twisties so they get exported. Of course one could script the GUI to do all of this, which is a great last-recourse option, but sucks badly when it's the only option. LAME, APPLE. [6]

Once the data was exported, which it won't as we'll see, we'd need to use the stackcollapse-Instruments script included in the flamegraph repo. It's a very simple script which expects the exported data in a CSV format, and as said the contents of the CSV data is defined by the table. So go to the page talking about the stackcollapse-Instruments script, take a look at the Instruments screenshot, and organize your calltree columns in the same way – only swapping the "Self" column for the data column you want to get represented in your flamegraph.

And that's where Instruments.app lets us down: it summarizes the stack traces, accumulating the time and number of times the stack trace was observed. But it does NOT allow to summarize on any different way, like accumulating the number of resource stalls, or cycles, or CPIs, or whatever. Which is what we need, of course. The counters are shown and exported, but they clearly aren't the sum of the counter values in the collected identical samples, which is what we need. (what are they? some debatably-useless average, or… what? I'm even afraid to ask at this point)

Just for completeness, as a note to myself in case in some future Instruments.app managed to suck less – or if anyone ever thinks about this after I already forgot – IF we could get that counter-summarized list of stack traces, that still would only get us one stack-collapsed file, which will only get us a vanilla flamegraph. But a CPI flamegraph as done by Gregg is an abuse of a differential flamegraph. So if following him we need two stack-collapsed files, each of them representing one of the counters used by him, meaning the "cycles" and the "resource stalls". So, we'd

  • organize the Instruments.app table for "cycles" export, and do the export
  • reorganize the table for "resource stalls", and do the export
  • feed each of the exported files to the stackcollapse-Instruments script
  • and apply the last step of Gregg's process: use his diffolded script on both stackcollapsed files and generate the flamegraph
Of course another option would be to modify the stackcollapse-Instruments script to understand your type of table. It's just a handful of perl lines, and even there are some experimental python versions of other scripts in the dev directory of the repo.

However the simplest and most future-proof way of doing things would to just get an unfiltered list of raw stack traces with their counter values. That way we could process them in whatever way we would feel like. Note that Instruments.app can export the full list of samples and their counters, but then each sample only lists the top of the stack instead of the full stacks. So close, yet so far.

Also, we had already stablished that we can get real CPIs instead of the cycles vs resource stalls mishmash. How to put that into a flamegraph? Well, the flamegraph scripts only accept positive integers as the weights for stacks (or so it seems), so we'd have to adapt the floating point CPI values; probably just multiplying by say 1000 would be enough. That way the wider "flames" would be representing the slower, higher-CPI traces.

Mac OS X 10.11 El Capitan is around the corner, so I will recheck Instruments.app on it and see if anything got improved. I'll report here either way – but can already foresee filing a bunch of bug reports to Apple. UPDATE: in Instruments.app v7.1 beta on Mac OS X 10.11, the histograms are much better, even pretty. All the other bugs are still there. I posted 6 bug reports to Apple. 5 of them got marked as duplicate. Heh. Not holding my breath.

It's kinda interesting/depressing for a long-time Mac user like me that all of this is making me want to make a permanent Linux / FreeBSD partition just for development. That way I could be fixing my problems directly in the applications' code, instead of having to just document them for the next explorer who tries to follow this rocky path.

That's supposing that there was something kinda equivalent to Instruments.app in either Linux of FreeBSD, that is. Which I am afraid there isn't; that's why flame graphs can be so useful in there.

But maybe that's what I should be doing then… 

Some background on profiling in OS X


Mac OS X has included DTrace for years already, which is terribly nice for profiling and all kinds of insight into the system. However, to get something at the level of CPU instructions (like a CPI calculation), you need data with that resolution. DTrace is too rough for that. [7]

There is a variety of Performance Counters on Intel processors (seemingly variously named PMC (Performance Monitoring Counters), PMU (Performance Monitoring Unit), CPC (CPU Performance Counters), PIC (Performance Instrumentation Counters), PME (Performance Monitoring Events), OMG WTF FFS), and Instruments.app can tap them with the Counters instrument. 

A bit of a stumbling block is that Instruments.app is terribly flimsy; look at it a bit too hard and it hangs. Look at it sideways and it runs your app but won't record anything, not even complaining of any problem. Look at it inquisitively and the machine restarts. BRAVO, APPLE!

One concrete example was already mentioned: time-based sampling of counters is broken for non-ms units. It looks to me like it's purely a GUI problem: you can choose seconds and microseconds as additional time units, but they are not honored [8]. When I wanted smaller than 1 ms, I settled on sampling every 50k Cycle events – which on my 2.8 GHz Mac is about 18 us. Much smaller than that, and the recording will either silently fail to record anything, or restart the machine with a menacing double-chime and long, black-screen, pondering.

Anyway, recording the Cycle values shows that the counter is captured with a lag of about 2k-3k cycles, so looks like we are already at a point where the latency between trigger and capture is around 5% of the requested period. Probably doesn't make sense to go smaller.

Next problem: even if Instruments.app manages to record, if the trace is biggish, the app unceremoniously dies. I have managed to see traces of about 600k samples, but it took the app long enough to show them that I had already assumed it was dead… and the moment I started moving around the small, anemic charts, it did die. Other times I tried some more demanding sampling which didn't work, and when I went back to parameters that previously had worked well, Instruments stubbornly kept not recording anything until the whole machine was restarted.


Profiling from the command line


Fortunately, OS X includes two command-line, er, commands for profiling: iprofiler and instruments. So at the very least the capturing can be done independently, and then we can try analysing offline. But don't get your hopes too high yet.

Iprofiler initially looks like the best option, because it looks independent from the app and its manpage even makes sense! It made me think it would be the equivalent to FreeBSD's pmcstat. But, NO. It's plain useless (for PMC recording at the very least). The -counters -pmc PMC_MNEMONIC option doesn't seem to work at all; will accept -pmc OHGODHOWIHATEYOU as happily as a real mnemonic. Of course that means that the error checking is none (THANKS AGAIN, APPLE) and then maybe I am using bad mnemonics, but I have tried the same used by Instruments.app (which are the same used by pmcstats and Intel's docs and Intel's profiler) and nothing works.

But using the instruments command (and a template created on Instruments.app with the desired settings) finally one can get a real profile with PMC values dumped into a file that won't evaporate the moment Instruments.app kicks the bucket. Of course you need Instruments.app to do anything with that trace

Furthermore, the trace seems to be cleaner this way; for example there is no awkward empty time at the beginning like when capturing directly from Instruments.app. 


Things to be aware of


Neither iprofiler, instruments nor Instruments.app allow you to do stdin/stdout redirection to/from a profiled program. In fact, when running instruments, the stdout from the profiled program is not even shown – though it is stored and shown once you open the trace file in Instruments.app. Retarded, I know, right? Clearly command-line programmers are not specially courted by Apple. Well, anyway, if Instruments.app is an example of what they do for the courted ones... ahem. I expected much more, in case it wasn't clear already.

Given the problems in OS X, you might be tempted to use Linux's perf or oprofile, or FreeBSD's pmcstat or a variety of options listed by Gregg in his posts. Probably most are in the longer run saner routes than anything touching Instruments.app. BUT, beware if using a Virtual Machine for this. VirtualBox as of version 5.0.4 does not support PMC counters (always report 0). VMware seems to support some virtual PMC for some years already (since VMware Fusion 5, Workstation 9). Parallels also mentions PMU virtualization since v9.

Another maybe interesting way to get stack traces with cache performance information (for the flame graphs or not) might be using Valgrind/callgrind. I still didn't check whether the export would be good for flamegraphing; the truth is that KCacheGrind has seemed up to now good enough to not bother with the flame graphs. 




[1] I am quoting "cycles" because the PMC event definition is more precise than that, but for informal reasoning we can just equal it to processor cycles. Same for "resource stalls".

[2] IF the code behaves well enough, like looping a lot over a section getting both kinds of events, then maybe the final relationship between collected events might even correlate to what the real CPI would show; but at the end of the day that is not so important for the purpose of finding CPU stall problems.

[3] not specially old nor good, but plenty of flimsy

[4] "Intuitive" is hardly adequate a word. You better have a very good idea of what the code is doing and what are the capabilities of the hardware if you plan to make sense of what the profiler shows and what could be improved. For example: what COULD be ideal IPC value in your machine? And WHY your program is not getting it? On the other hand, once you have some idea about that, the fact that sampling on time shows the evolution of the counter values certainly makes it more intuitive to me than sampling on events.

[5] But be careful, because if you go to too short periods (lower than say 50k cycles), Instruments.app will either drop events silently, or hang, or reboot the whole machine. And anyway, even if it seems to work right, I'd recommend saving your results often, since it has a tendency to die unprovokedly. I'm even suspecting that the whole OS/machine gets unstable; I have had a couple of reboot-on-wake-from-sleep cases the last couple of days in which I had been playing more intensively with Instruments.app. Beware also on the "cycles" event as a substitute of wallclock time; I didn't check on whether they differ when for example a thread is scheduled out.

[6] I am guessing that one possibility would be to create an Instruments.app template which defines not only what has to be sampled but how the results have to be shown in the call tree table, and then the GUI trip needed between sampling from the CLI and wanting to export is shortest, and maybe even simple enough to be automated though the limited AppleScript support in Instruments.app and without the need of GUI scripting.

[7] I hear that in Solaris, DTrace can be used to access Performance Counters but that that was still not implemented in FreeBSD as of 2015. I also don't see any DTrace provider for PMC access in OS X 10.10 (list of providers here).

[8] If you look for _sampleRateUnits in a saved Instruments.app template (you'll need to convert it with PlistBuddy), you will "ms" a few lines under it. Maybe changing the units in the file would work, but I haven't tried.

1 comment

  1. Somehow I missed that there are command-line tools by Intel for checking the Performance Counters on OS X! Still didn't check them, but as a reminder... https://software.intel.com/en-us/articles/intel-performance-counter-monitor

    ReplyDelete