For my own notes (and anyone else's!) Chandler's recommended flags for record and report are at 32:52
1:29:40 PGO (Profile-Guided Optimization) is what you're looking for: it lets the program gather data on how likely each branch is, and that data can be used for optimization in a second compilation.
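For anyone who hasn't tried it, a minimal PGO round trip with GCC looks roughly like this (the file and binary names are placeholders; Clang has the same flags but additionally wants `llvm-profdata` to merge its raw profiles):

```shell
# Write a tiny test program with a biased branch for PGO to learn.
cat > bench.c <<'EOF'
#include <stdio.h>
int main(void) {
    long s = 0;
    for (int i = 0; i < 100000; i++)
        if (i % 3)          /* branch whose bias the profile records */
            s += i;
    printf("%ld\n", s);
    return 0;
}
EOF
gcc -O2 -fprofile-generate bench.c -o bench_instr   # pass 1: instrumented build
./bench_instr                                       # training run writes .gcda files
gcc -O2 -fprofile-use bench.c -o bench_pgo          # pass 2: profile-guided build
./bench_pgo
```

The training run has to be representative of real workloads, or the second compile will optimize for the wrong branch biases.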
BTW, there is a shortcut in `perf` to use an inverted call graph: `perf report -G`. You still need to record with `perf record -g`, of course, and have some debug info present in the binary (if needed use -Og and/or -ggdb and/or -fno-omit-frame-pointer, or some combination, but be careful, as some of these options might affect the measurements themselves).
A link to Bryce Adelstein-Lelbach “Benchmarking C++ Code" that Chandler mentions in his intro: th-cam.com/video/zWxSZcpeS8Q/w-d-xo.html
I appreciate the link.
I agree with perf's decision to display the most expensive callees first (see the discussion at ~32 minutes). It's what you need to initially know. If your reaction is "WTF this function isn't supposed to show up in the list at all", you immediately know that it's way too expensive or called in places you don't expect.
Chandler Carruth is my spirit animal.
counts for "cycles" often seem to get attributed to the instruction that gets stuck waiting for the result (reads that register), not just the next instruction in program order. At least that's true for cache-miss loads. At 1:09:30 we can see high counts on jmp instructions back into the main loop, and they directly follow idiv.
Out-of-order execution makes it hard to blame any specific instruction for a cycle because multiple instructions can be in flight at once. To make matters worse, the "cycles" event doesn't usually use PEBS.
The blame might go to the oldest instruction in the ROB, which would explain cache-miss loads blaming the instruction that's stuck waiting. The load itself may be able to retire, leaving the data in flight in a load buffer, so retirement isn't blocked until an instruction that needs the load's result. Or not; I forget if that's true.
BTW, the "cache miss" explanation for the benefit of using UNLIKELY when unrolling is implausible in these tight loops. More likely the benefit is taken-branch throughput effects on the front-end vs. not-taken. When idiv doesn't have to run, instructions per clock is potentially quite high.
This is a great talk, even while I don't understand 50% or more of it.
guy is simply amazing : )
This is what I want. Superb in how deeply it dives into the matter.
Here is the talk about macro benchmarking that's mentioned in the beginning: th-cam.com/video/zWxSZcpeS8Q/w-d-xo.html
I'm on the way to watch all Chandler Carruth's talks on YT.
Run on (48 x 2717.5MHz CPU s)
This is extremely brilliant!
Imagine being at google for the second week and you get Ken Thompson's code that was basically addressed to you. I'm sure it was beautiful.
Isn't it fascinating how long it takes to explain all the stuff you need to do to get some numbers out of performance benchmarking, numbers that should sort of come from just running the program by default? I'm sure perf is made by and for lvl 99 Linux wizard engineers to have a secret language for casting incantations and keeping their secrets safe from normal engineers and programmers. It is not meant to provide a nice tool for "normal" people; it just, as an accidental byproduct, gives them some useful features once you figure out the spell to call them out.
Go to C++ talk and ask what kind of vim setup he uses :D
If we are interested in the performance of vector::push_back() alone (since we normally don't call vector::reserve() beforehand), then by benchmarking vector::push_back() after vector::reserve(), as in the sample given, aren't we only measuring the performance of vector::push_back() partially?
Color scheme is Jellybean
Naren Allam bless you. That vimrc was noice.
Can anyone suggest a book that talks about the stuff mentioned in the video in C++? Thanks.
I don’t think there is a book on these specific microtopics that is why he does these talks.
Any decent (and thorough) book on compilers would be a good start. He's basically explaining why micro benchmarking is hard and how to do it correctly.
I wonder which optimization passes deleted the v.reserve(1); push_back(42); calls at 48:13. Actually, gcc doesn't optimize that away.
Why does my perf (ver 3.15.5, running on Gentoo) look totally different? I cannot select a line item to expand (it's already expanded), and therefore cannot hit 'a' to see the assembly.
Did you run both 'perf record' and 'perf report' with the -g parameter?
Yes. Could it be because I have the bare Gentoo without X server? Thanks.
Fast forward. I like his speed as I drink beer while watching.
Can't find perf for MSYS2. Is this not available for it?
Anyone knows what kind of iTerm2/vim setup he uses?
cool video
Is there a profiler from LLVM ?
Looks like 'vim-hybrid'. Check it out on GitHub
At 42:45 he says that "asm volatile ..." doesn't work on Microsoft's compiler, and that he suspects there is something else that would. Does anyone know what that something else is?
#pragma optimize( "g", off ) Try that.
Matter of fact... sorry, no: leave the "g" out, just use #pragma optimize( "", off ). It will turn EVERYTHING off for you; all optimizations will be disabled, which is what you want in a benchmark.
@@seditt5146 benchmarking un-optimized code is generally useless. The overhead vs. normal code can vary wildly depending on factors that aren't important in real code, like whether you used a std::vector or an array. e.g. stackoverflow.com/questions/57124571/why-is-iterating-an-stdarray-much-faster-than-iterating-an-stdvector .
Your best bet to get the asm you want to benchmark may be to assign to a `volatile int sink` variable, which costs you an extra store instruction. But you need to read the asm to make sure the compiler isn't CSEing across iterations or something.
I never get an idea of what a piece of C++ code is doing at first look.
How is he running perf on osx? Is he remoting into a Linux box?
Yes, he SSHes into a Linux machine.
Amazing! :)
Compilers and performance.
I believe MSVC's '_ReadWriteBarrier' has a similar effect to 'asm volatile("" : : : "memory");'.
I wonder whether that's the case. A barrier is just a sign that reads/writes must not be reordered across it. But it would by no means "clobber" all available memory and disable all further optimizations; if it did, that would be a performance nightmare for non-blocking data structures using a lot of atomic ops and barriers.
_mm_mfence()? Idk, I'm not sure exactly what volatile does, just that the fence (and likely the barrier you mention) blocks reordering of memory past that specific point, so it should at the very least stop reordering of the code. What you need in MSVC is #pragma optimize( "g", off ), I'm pretty sure. Now that I look at his code above, I almost think it might even mean the same thing, as the "g" is for global optimization and turns it off between the blocks. I place it in a macro due to some issues I had with threading and the optimizer defeating a dual spin-lock system I had in place, allowing it to deadlock when it shouldn't.
Hi, just wanted to know: does static analysis of program code work well even now, in 2020?
so uh. using reserve, the push_back takes... 2e-9 s * 3e9 1/s ... 6 cycles? 🤨
Talk starts at 40:20. Thank me later
Thx
it's faster... in THIS case where no other work is done.
It looks like a simple tmux + vim setup. vim only has vim-airline.
good
51:45 Chandler Carruth didn't really explain anything during this section, scrolling up and down through the assembly and assuming his audience understood what was going on without him explaining it at all. I'm not sure he himself understood all of the assembly he seemed to be talking about. He touches a topic and is about to explain it, but doesn't actually explain it, instead letting the audience see and understand it themselves (presumably).
The missing frames and video stutter are giving me a headache
wow
I don't get it, what reason can you possibly have to downvote this video o0
that push_back benchmark. Turns out the CPU does either 72 or 73 cycles for it. Huh.
Got tired after 30 minutes of listening to Captain Obvious.
Does anybody know the .vimrc Chandler's using?
vim-airline w/ powerline fonts (you'll need extensive trial-and-error to get it working).
good