For my own notes (and anyone else's!) Chandler's recommended flags for record and report are at 32:52
1:29:40 PGO (Profile-Guided Optimization) is what you're looking for: it lets the program gather data on how likely each branch is, and that data can be used for optimization during a second compilation.
BTW, there is a shortcut in `perf` to use an inverted call graph: `perf report -G`. You still need to record with `perf record -g`, of course, and have some debug info present in the binary (if needed, use -Og and/or -ggdb and/or -fno-omit-frame-pointer, or some combination, but be careful: some of these options might impact the measurements themselves).
Chandler Carruth is my spirit animal.
I agree with perf's decision to display the most expensive callees first (see the discussion at ~32 minutes). It's what you need to initially know. If your reaction is "WTF this function isn't supposed to show up in the list at all", you immediately know that it's way too expensive or called in places you don't expect.
A link to Bryce Adelstein-Lelbach “Benchmarking C++ Code" that Chandler mentions in his intro: th-cam.com/video/zWxSZcpeS8Q/w-d-xo.html
I appreciate the link.
Counts for "cycles" often seem to get attributed to the instruction that gets stuck waiting for the result (the one that reads that register), not just the next instruction in program order. At least that's true for cache-miss loads. At 1:09:30 we can see high counts on jmp instructions back into the main loop, and they directly follow idiv.
Out-of-order execution makes it hard to blame any specific instruction for a cycle because multiple instructions can be in flight at once. To make matters worse, the "cycles" event doesn't usually use PEBS.
The blame might go to the oldest instruction in the ROB, which would explain cache-miss loads blaming the instruction that's stuck waiting. The load itself may be able to retire, leaving the data incoming via a load buffer and not blocking retirement until an instruction actually needs the load result. Or not; I forget if that's true.
BTW, the "cache miss" explanation for the benefit of using UNLIKELY when unrolling is implausible in these tight loops. More likely the benefit is taken-branch throughput effects on the front-end vs. not-taken. When idiv doesn't have to run, instructions per clock is potentially quite high.
This is a great talk, even while I don't understand 50% or more of it.
Run on (48 x 2717.5MHz CPU s)
guy is simply amazing : )
I'm on the way to watch all Chandler Carruth's talks on YT.
This is what I want. Superb in how deeply it dives into the matter.
Here is the talk about macro benchmarking that's mentioned in the beginning: th-cam.com/video/zWxSZcpeS8Q/w-d-xo.html
Go to C++ talk and ask what kind of vim setup he uses :D
This is extremely brilliant!
If we are interested in the performance of vector::push_back() alone, given that we normally don't call vector::reserve() beforehand, then by benchmarking vector::push_back() after vector::reserve() as in the given sample, aren't we only measuring the performance of vector::push_back() partially?
Color scheme is Jellybean
Naren Allam bless you. That vimrc was noice.
Imagine being at google for the second week and you get Ken Thompson's code that was basically addressed to you. I'm sure it was beautiful.
Isn't it fascinating that it takes this long to explain everything you need to do just to get some numbers out of performance benchmarking, numbers that should more or less come from just running the program by default? I'm sure perf was made by and for level-99 Linux wizard engineers so they'd have a secret language for casting incantations and keeping their secrets safe from normal engineers and programmers. It isn't meant to be a nice tool for "normal" people; it just, as an accidental byproduct, gives them some useful features once you figure out the spell to call them forth.
Fast forward. I like his speed as i drink beer while watching.
Can anyone suggest a book that talks about the stuff mentioned in the video in C++? Thanks.
I don’t think there is a book on these specific microtopics that is why he does these talks.
Any decent (and thorough) book on compilers would be a good start. He's basically explaining why micro benchmarking is hard and how to do it correctly.
Why does my perf (ver 3.15.5, running on Gentoo) look totally different? I cannot select a line item to expand (it's already expanded), and therefore cannot hit 'a' to see the assembly.
Did you run both 'perf record' and 'perf report' with the -g parameter?
Yes. Could it be because I have a bare Gentoo install without an X server? Thanks.
I wonder which optimization passes deleted the v.reserve(1) and v.push_back(42) calls at 48:13. Actually, gcc doesn't optimize that away.
Anyone knows what kind of iTerm2/vim setup he uses?
cool video
Looks like 'vim-hybrid'. Check it out on GitHub
Can't find perf for MSYS2. Is this not available for it?
At 42:45 he says that "asm volatile ..." doesn't work on Microsoft's compiler, and that he suspects there is something else that would. Does anyone know what that something else is?
#pragma optimize( "g", off ) Try that.
Matter of fact... sorry, no: leave the "g" out and just use #pragma optimize( "", off ). It will turn EVERYTHING off for you; all optimizations will be disabled, which is what you want in a benchmark.
@@seditt5146 benchmarking un-optimized code is generally useless. The overhead vs. normal code can vary wildly depending on factors that aren't important in real code, like whether you used a std::vector or an array. e.g. stackoverflow.com/questions/57124571/why-is-iterating-an-stdarray-much-faster-than-iterating-an-stdvector .
Your best bet to get the asm you want to benchmark may be to assign to a `volatile int sink` variable, which costs you an extra store instruction. But you need to read the asm to make sure the compiler isn't CSEing across iterations or something.
Is there a profiler from LLVM ?
How is he running perf on osx? Is he remoting into a Linux box?
Yes, he sshes into a Linux machine.
Compilers and performance.
I never get an idea of what a piece of C++ code is doing at first glance.
Amazing! :)
I believe MSVC's '_ReadWriteBarrier' has a similar effect to 'asm volatile("" : : : "memory");'.
I would wonder whether that's the case. A barrier is just a sign that reads/writes must not be reordered across it. But it would by no means "clobber" all available memory and disable all further optimizations; if it did, that would be a performance nightmare for non-blocking data structures that use a lot of atomic ops and barriers.
_mm_mfence()? Idk, I'm not sure exactly what volatile does, just that the fence (and likely the barrier you mention) blocks reordering of memory past that specific point, so it should at the very least stop reordering of the code. What you need in MSVC is #pragma optimize( "g", off ), I'm pretty sure. Now that I look at his code above, I almost think that might even mean the same thing, as the "g" is for global optimization and turns it off between the blocks. I place it in a macro due to some issues I had with threading: the optimizer was defeating a dual spin-lock system I had in place, allowing it to deadlock when it shouldn't.
Hi just wanted to know whether static analysis of program code does work well even now 2020?
so uh. using reserve, the pushback takes... 2e-9 s * 3e9 1/s ... 1.5 cycles? 🤨
It looks like a simple tmux + vim setup. vim only has vim-airline.
it's faster... in THIS case where no other work is done.
good
The missing frames and video stutter are giving me a headache
wow
Talk starts at 40:20. Thank me later
Thx
I don't get it, what reason can you possibly have to downvote this video o0
that pushback benchmark. turns out the cpu does either 72 or 73 cycles for it. Huh.
51:45 Chandler Carruth didn't really explain anything during this stretch: he scrolls up and down through the assembly, assuming his audience understands what is going on, without explaining it at all. I'm not sure he himself understood all of the assembly he seems to talk about. He touches a topic and is about to explain it, but doesn't actually explain it, instead letting the audience see and understand it themselves (presumably).
Got tired after 30 minutes of listening to Captain Obvious.
does anybody know the .vimrc Chendler's using ?
vim-airline w/ powerline fonts (you'll need extensive trial-and-error to get it working).
good