C++ cache locality and branch predictability

mCoding

มุมมอง 85 694

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 23 พ.ย. 2024

ความคิดเห็น • 150

@spaghettihair 3 ปีที่แล้ว ⁺³³¹
Why does the ternary expression at the end not count as branching code?
@robinm7079 3 ปีที่แล้ว ⁺⁶⁷
With a true branch, the statements that execute afterward the branch are different depending on the result of the condition being checked. So with the first example, the if statement, we could have added a bunch more lines of code in the first block (after p *= 2 * x), that wouldn't be executed if the condition was false. With the ternary expression, there's still a condition being checked, but it doesn't effect the statements afterward, the processor knows what's going to happen next and can start loading the next instruction, there's no uncertainty
@mCoding 3 ปีที่แล้ว ⁺²⁴⁸
What I was trying to say here was that the logic could be refactored where the contents of the if are the same with just a different constant. In that case, the ternary can be implemented with a value array instead of a branch. Note that the expression (x > thresh) is a comparison, not in itself a jump. So you can do arr[x > thresh] to read from index 0 to set factor to 1.5, or read from index 1 to set factor to 2.0, which is what clang does. Technically the compiler could use a branch or not use a branch for either the ternary or the if statement as the two are equivalent. I was presenting the ternary as a kind of visual stand-in to superficially show the branch being removed, which is of course not how C++ actually works. I see now that this storytelling only caused additional confusion so here is the actual assembly to keep in mind: godbolt.org/z/zoGWxcoeb. Notice the line
mulsd xmm2, qword ptr [8*rdx + .LCPI0_1]
that indexes into a 2-element in-binary array .LCPIO_1 instead of using a conditional jump. In the link, clang produces an identical binary whether you use the ternary or the if and it does not use a branch. On the same code gcc produces identical binary whether you use the ternary or the if, except it uses a branch for both.
@MithicSpirit 3 ปีที่แล้ว ⁺¹³
@@mCoding on gcc was it compiled with or without optimizations (-O2 flag)?
@mCoding 3 ปีที่แล้ว ⁺⁶⁶
Both at -O3.
@sebastianmestre8971 3 ปีที่แล้ว ⁺²²
The thing that branch prediction predicts is not the outcome of comparisions, but the destination of jumps.
With a plain if statement, a naive compiler might emit a comparison followed by a conditional jump, whose destination the CPU will try to predict at runtime.
With a simple ternary like that, a compiler might have an easier time compiling it down to a single conditional move instruction, which is not followed by any jumps. This way, there is no branch (no different destinations for jumps) in the loop, even though a conditional DOES occur.
(there are other ways to compile stuff into conditional-jump-free code, like using look up tables, etc)
@Broniath 3 ปีที่แล้ว ⁺⁷⁷
My first thought seeing this code: This is some weird ass python library, it almost looks like C++.
Several seconds later: Oh.......
@QuantumHistorian 3 ปีที่แล้ว ⁺⁹⁸
Always nice to get something a little closer to the metal than python from time to time
@ronyWeeb 3 ปีที่แล้ว ⁺²⁴⁵
It's a proper coding channel. Every other coding channel's I had seen are like 'pop culture' coding. But this is different. Great work man.
@mCoding 3 ปีที่แล้ว ⁺⁴⁰
Thanks for your kind words!
@10e999 3 ปีที่แล้ว ⁺²⁴
I completly agree with you.
This is a real software engineering channel. Not recycled dogmatic advices.
@creator-link 2 ปีที่แล้ว ⁺²
What are some pop culture coding channels? I want to watch some
@tannerted 2 ปีที่แล้ว ⁺⁸
Agreed. Also look at Jonathan Blow, Molly Rocket, and Cherno. All good channels like this one.
@Spiderboydk 3 ปีที่แล้ว ⁺¹¹
Great point. When optimizing, *always* measure, people! Merely picking a particular algorithm, because it's supposed to be fast, is nothing more than superstition. You *can't* know without measuring.
@mCoding 3 ปีที่แล้ว ⁺⁶
Always always always. "If you aren't measuring, you don't actually care about performance."
@Yazan_Majdalawi 2 ปีที่แล้ว
@@mCoding
What are the tool/s you used in this video?
@szaszm_ 3 ปีที่แล้ว ⁺¹⁴
I love the simplicity of your sorting algorithm implementations.
@mCoding 3 ปีที่แล้ว ⁺²⁴
Thanks, I stole all of them from various cppreference docs.
@mytech6779 3 ปีที่แล้ว ⁺²⁰
C++20 has added the attribute keywords Likely and Unlikely so the compiler can do some of its own predictive branching optimizations. It make me wonder if CPUs will start to include an instruction to make even more use of this in a few years.
@Qbe_Root 2 ปีที่แล้ว ⁺⁴
PowerPC conditional branch instructions include hints for prediction, so beq+ means the branch is likely and beq- means it's unlikely. I assume other architectures have the same feature but I'm not familiar with them
@bronkolie 3 หลายเดือนก่อน
@@Qbe_Root RISC-V specifies that programs should be compiled so that forward branches are most likely taken, and backward branches most likely not. So it's similar in a way
@LennyTheSniper 3 ปีที่แล้ว ⁺¹³
I watched the video 10 times. Now, my code is going 1024 times faster. Thanks a lot man!
@philipp6589 3 ปีที่แล้ว ⁺⁶²
Since you are doing Python and C++ topics, would be nice to see some video about developing C/C++ extension modules for Python. Is there any easy-to-use framework which allows just-in-time compilation of C++ code snippets inside of Python?
@mCoding 3 ปีที่แล้ว ⁺²⁵
I don't know about JIT for C++ code, but there is Jax, which JIT compiles Python to XLA and offers a numpy-like interface. There's also Cython, which allows you to embed C and C++ into Python, but this is pre-compiled, not JIT.
@FobazF 3 ปีที่แล้ว ⁺³
Check Numba
@pranavnyavanandi9710 2 ปีที่แล้ว
What is this just in time compilation? Do you mean an interpreter? Are there any other types? I thought compilation was just compilation.
@wChris_ 3 ปีที่แล้ว ⁺²¹
it would be nice to have shown the Compiler Explorer output for the branch predicting example, as a ternary is technically a branch that might get optimized into a conditional move
@mCoding 3 ปีที่แล้ว ⁺⁸
See the pinned comment!
@brunoais 3 ปีที่แล้ว ⁺²⁸
9:20 In some AMD (I don't know the latest intel yet) CPU, the CPU actually computes both multiplications at the same time and then waits for the condition to finish to choose which one to use. I'm unsure if it was intoduced in Zen or only Zen2
@spicybaguette7706 2 ปีที่แล้ว ⁺⁵
I believe It's called speculative execution, but this leads to some very funky side-effects that can be abused by attacks such as spectre and meltdown
@Windeycastle 11 หลายเดือนก่อน ⁺¹
I loved to see std::sort laughing at all other implementations.
It's a good call to use the standard library when you can, as it's often more optimised then what you'd write yourself (with some exceptions).
@ciCCapROSTi 2 ปีที่แล้ว ⁺³
What is missing is that YOU (the coder) yourself should prefer branchless code. Your example could have been written as an arithmetic formula. The compilers are smart, but we can help it. It's even better at optimizing algebra than optimizing branches.
@fabricio5p 3 ปีที่แล้ว ⁺²¹
Man, I wish you did more of these low level optimization videos. I don't work with Python but I subscribed just because of these types of videos
@mCoding 3 ปีที่แล้ว ⁺¹¹
More on the way!
@lefttraces7207 3 ปีที่แล้ว ⁺³
I love how I only know two languages and these happen to be the only two you cover, haha.
Great video as always!
@mCoding 3 ปีที่แล้ว ⁺⁵
Lucky both of us!
@mauricio-poppe 3 ปีที่แล้ว ⁺³
Awesome, thank you for an educational video straight to the point.
@mCoding 3 ปีที่แล้ว
Very welcome!
@Swedishnbkongu 2 ปีที่แล้ว ⁺¹
How much can the compiler do loop reordering, or arithmetic reordering (for more ILP), or reducing the number of branches? I've been taught to trust the optimizer more than myself. If i can confirm in compiler explorer that it is different when i micro optimize, maybe I'll do it
@rotbjorn 3 ปีที่แล้ว ⁺⁴
Interesting video! What program did you use to view the graphs & plots and what library did you use to create them?
@mCoding 3 ปีที่แล้ว ⁺⁶
The plots are viewable in any modern browser. I made them in Python using plotly express. You can see the code for making the plots on github!
@prashantrana1089 3 ปีที่แล้ว ⁺¹
this guys offer so good content for free. this guy is awesome
@mk-mc8yx 2 ปีที่แล้ว ⁺¹
In another article I read that we shouldn't be optimizing our code for branch-prediction. If that's true why should one be concerned about it, if so it seems like a good to know thing. If that's not correct, are there any good coding practices that optimize our code to take advantage of branch prediction ? Btw, I found out this channel just few days ago. Very informative and lots to learn for me from here. Thankyou very much.
@mCoding 2 ปีที่แล้ว ⁺²
Most articles that recommend this are not recommending that you avoid optimizing branches, but rather that there are most likely other more important things you should focus on first, and that you should profile your code to find out what is actually slow before blindly trying to optimize branches. Of course, in a high performance systems not otherwise limited by network or io, eventually heap allocations, branches mispredictions, and cache misses start to become the dominating factors that you need to optimize.
@efenestration 11 หลายเดือนก่อน
these will age like wine. thanks for the vid
@johnjoyce 3 ปีที่แล้ว
Nice soft intro to practical examples of performance tuning.
@dcorderoch 3 ปีที่แล้ว ⁺³
"depending on ... the time of day" at first I laughed, then I thought... depending on where the computer is, the temperature could affect the performance... nighttime might be better for performance
@yash1152 ปีที่แล้ว
1:11 > _"sorting small things is actually very common"_
i didnt know. thanks
@danielrhouck ปีที่แล้ว
1:00 I tried to read the example code for the length of the vector, but unless it’s 0 I don’t know the benchmarking library enough to have a guess. But I think I was on the right track and the vectors were too short.
@kinershah464 ปีที่แล้ว
How does the type of cache (k-way or associative or k-eay set associative) play any role in performance? Or does it?
@ProjectPhysX 3 ปีที่แล้ว
9:12 how about the ternary operator ?: or bit masking?
@kivicode 3 ปีที่แล้ว
Great video as always! Only one question: what editor theme are you using?
@mCoding 3 ปีที่แล้ว
Monokai Pro Theme
@robertbrummayer4908 3 ปีที่แล้ว ⁺³
Excellent video as usual, James!
@mCoding 3 ปีที่แล้ว ⁺¹
Thanks again!
@Yupppi 11 หลายเดือนก่อน
And this is why for example's intel's algorithm is far better than writing loops by yourself (the I assume famous video of matrix multiplication algorithm taking it from like 2000 ms to 2 ms).
I love branchless solutions too, they feel like magic.
@veritas7010 ปีที่แล้ว
Would be awesome if the use of compiler opt -O 0,1,2,3 was benched
@TDskirvin 3 ปีที่แล้ว ⁺¹
very informative & succinct, thank you
@mCoding 3 ปีที่แล้ว ⁺¹
You're very welcome!
@puddleglum5610 3 ปีที่แล้ว ⁺⁵
One note -- you should never optimize your code prematurely! A lot of times, it creates more bugs and development overhead than its worth. Optimization should be for when you're having performance issues. But, nonetheless, super great video! This brings me back to my old Computer Architecture classes in college.
@harveyroper5526 3 ปีที่แล้ว
great video, it would have been really useful if you had gone into a little more detail on why cache locality works, ie storing multiple addresses into a single cache line. other than that i really enjoyed seeing a high level application of this concept
@kyyrabb1121 3 ปีที่แล้ว ⁺¹
Please do more c++ videos!
@Zooiest ปีที่แล้ว
Can you make a video about vectorization and/or multithreading?
@danielbelsky8836 ปีที่แล้ว
That was sooo clear!! Thank You!!
@empireempire3545 2 ปีที่แล้ว
Thats great, but how to benchmark where your code is actually having problems with cache?
@mCoding 2 ปีที่แล้ว ⁺¹
Great question and unfortunately i don't have a great answer for you, at least not a definitive one. Generally you want to start with a higher level view of your application. Benchmark/profile the thing as a whole to get a feel for which parts are slow. If you have certain end to end tasks such as producing a complex value or rendering a frame, create benchmarks for those individual tasks you care about and profile them individually. This might involve engineering benchmark code into your application itself (turned on or off with a macro). The profile should help you determine at least broadly where any slowness is. Then you can start to hone in on those slow parts. Look at them and ask yourself WHY are they slow? What data is going in and out, how is it laid out in memory, are your branches predicatable, do calculations that happen near eachother live near each other in memory, and does as much data for any given calculation fit inside a cache line? You can also use tools like perf (linux only, but im sure windows has something similar). Your hardware has counters that track how many cache misses, branch mispredictipns, etc your program caused, and perd will report these numbers to you. You can then use perf and other such tools to get a feel for whether cache or branch predictions are the problem or something else, like a deeper architectural issue or inherently slow algorithm. In general there is no one answer and no one tool that works in all cases, it's just something you get better at with experience. Hope this helps.
@aratakarkosh9588 3 ปีที่แล้ว ⁺²
Holy smokes I already knew it, first time on this channel :DD
@mCoding 3 ปีที่แล้ว ⁺¹
You're becoming wiser!
@venkateshhariharan4341 3 ปีที่แล้ว
how these results will be affected when using with constexpr for ordered input?, probably get faster I don't know, I will be trying that soon.
thanks for letting us know these amazing concepts with practical examples
@mCoding 3 ปีที่แล้ว ⁺¹
constexpr could allow you to do some of the computation at compile-time, so that would be a big improvement over doing it at runtime, but depending on the use case this wouldn't necessarily be possible. As of C++20, vector is not constexpr. However, it could be possible with array.
@---hw6up 3 ปีที่แล้ว
Why do you usr auto for type deduction in function declarations? Looks infernal
@mCoding 3 ปีที่แล้ว ⁺¹
std::vector::iterator is too long to fit on screen
@---hw6up 3 ปีที่แล้ว
@@mCoding Didn't realise the sorting was done using iterators. My bad
@thiagosannafreiresilva4366 3 ปีที่แล้ว
So normally how much will compilers flag and / or optimize this sort of thing? I'm just moving from interpreted to compiled languages and I don't quite have a grasp yet to the extent of optimization done by compilers.
@mCoding 3 ปีที่แล้ว ⁺²
Compilers are fairly decent at optimizing branches away when possible. A compiler will almost never do an optimization like exchanging ijk for ikj in a for loop though because without knowledge of the whole program (e.g. we were computing the sum of integers which is commutative) changing the order of things might change the behavior of the program. Basic things like removing unused variables and not recomputing things that it can prove haven't changed will usually be done. I highly recommend Prof Leiserson's course on performance engineering in C (available on youtube, it's mit ocw) if you want to learn more. Also any CppCon talk.
@thiagosannafreiresilva4366 3 ปีที่แล้ว
@@mCoding thanks!
@szaszm_ 3 ปีที่แล้ว ⁺¹
@@mCoding GCC can perform the ijk -> ikj exchange with -floop-interchange (enabled at -O3) or with -floop-nest-optimize (which includes even more nested loop optimizations) if compiled with --with-isl.
@pengchengwu447 9 หลายเดือนก่อน
std::sort uses insertion sort for small sizes too I believe, why is it faster than yours?
@hievery9678 3 ปีที่แล้ว ⁺³
But is it always worth code readability?
I mean, getting x2 performance is cool and all, but maintaing sometimes almost unreadable code because of such optimizations may be even worse.
@mCoding
May be you have some production experience with them? Have you used them there?
How your coworkers reacted to such weird shananigans with code?
Or you tried to use them only when readability wasn't suffering?
@mCoding 3 ปีที่แล้ว ⁺¹²
I would say that writing code with good cache locality and branch predictability rarely involves any degradation in readability. There are exceptions, particularly when performance is highly critical, e.g. in electronic trading, that you might design your whole application around these properties, causing readability to suffer. However, in those cases the readability of the code was already chosen to be second priority to the performance due to the nature of the code.
@colin398 ปีที่แล้ว
Pretty sure std::sort defaults to some type of insertion sort for small sizes
@arnabthakuria2243 2 หลายเดือนก่อน
what font is this. Looks like Berkley Mono
@louisfghbvc 8 หลายเดือนก่อน
impressive about cache locality...
@aadityabhetuwal5990 2 ปีที่แล้ว
Which IDE do you use?
@alexander3293 3 ปีที่แล้ว
love this kind of videos!!
@10e999 3 ปีที่แล้ว ⁺¹
10:22 How is the line 11 not a branch?
@Yazan_Majdalawi 2 ปีที่แล้ว
See the pinned comment :)
@echoptic775 3 ปีที่แล้ว
Ik curious what are ur thought about rust lang?
@mCoding 3 ปีที่แล้ว ⁺¹
Haven't tried it but I've heard it has a lot of nice properties
@irwainnornossa4605 ปีที่แล้ว
{ belongs to a new line!
We say „2 times“, not „2 x“!
@lightify97 3 ปีที่แล้ว
Please make a C++ course in the same style.
@Khushpich 3 ปีที่แล้ว
Great video as always.
@mCoding 3 ปีที่แล้ว
I appreciate that!
@aadithyavarma 3 ปีที่แล้ว
Hi, I love your video. Are these principles applicable for Python also?
@mCoding 3 ปีที่แล้ว ⁺⁶
Yes but to a much lesser degree. Because every object in Python is a pointer, practically every access of an object behaves like the random order access example and is a cache miss, which is one of the reasons for Python's general slowness.
@switchblade6226 ปีที่แล้ว
@@mCoding WARNING: Resurrecting old comment section to expand a bit on that :P
Modern CPUs do preform unconditional branch target prediction via multi-level prediction tables (more info can be seen in Agner Fog's optimization papers), and those do help with dynamic languages' pointer hell, but obviously only to a degree.
Also, hypothetically, if you JIT python you would be able to potentially take advantage of conditional branch prediction too. And if you have an extremely sophisticated JIT that would flatten and simplify python structures, that can improve cache locality.
But that would require some genie to make a miracle JIT happen lol
@venkateshhariharan4341 3 ปีที่แล้ว
great video bro, helped alot thank you so much ❤️❤️❤️
@mCoding 3 ปีที่แล้ว
Glad to hear it!
@mytech6779 3 ปีที่แล้ว
Only 8M of cache? My 386 had 4M of RAM. (About $1000 of RAM adjusted for inflation.) And it could run MSwindows.
@marcusmors8485 3 ปีที่แล้ว
Good video! Thanks
@mCoding 3 ปีที่แล้ว
Glad you liked it!
@timog7358 3 หลายเดือนก่อน
amazing video
@BenjaminWheeler0510 3 ปีที่แล้ว ⁺⁵
I’m confused. What are all these curly braces??? Is this some weird Python fork?
@mCoding 3 ปีที่แล้ว ⁺⁴
Not sure if joking 🤔. This is C++.
@padnomnidprenon9672 3 ปีที่แล้ว
Im not the only one who told to myself : "Wow new word auto and var declaration, what is this package". And then i saw static void
@adicide9070 2 ปีที่แล้ว
@@mCoding yeah must have been joking. and a good one :D python fork :D:D:D
@andrewf8366 3 ปีที่แล้ว ⁺¹
Why didn't you plot more points in your comparison? You mentioned that the largest "kink" was at 8MB... but you also didn't have another point until 16, so it's not as good a demonstration as could be used if you had more points to show.
@ssspud3545 3 ปีที่แล้ว
I didn't even notice it was 7 hours long until the closing notes...
I was recommended this after never being recommended anything from pyrocynical in over a year while being subscribed. I have to admit, these documentary type videos that review or cover a TV show or piece of media are extremely popular now and basically equal free watch time.
I mean you have all the work cut out for you. Just watch the show and have a comprehensive knowledge about it and writing skills; which you can learn in college or even a US highschool(I meant college in Europe and in the US because community colleges are basically the same concept in both regions).
@mCoding 3 ปีที่แล้ว ⁺³
7 hours long? I think you might have commented on the wrong video.
@ItzSwiftyBoy_Gaming 2 ปีที่แล้ว
Damn, you're coding genius.
@subnumeric 2 ปีที่แล้ว
IEEE 754 floating point numbers are NOT commutative! You've used templates in the get_sum function, so depending on the data this implementation might return very incorrect sums. This is in fact what the unsafe -Ofast optimizations do (in gcc), as in some cases it doesn't produce significant errors and the speedup is worth it. I know averaging isn't really the focus of this video, but I felt like i had to mention it.
@mCoding 2 ปีที่แล้ว
Yes I'm aware of the that these implementations are not general purpose production ready. Indeed, even implementing the average of 2 numbers correctly is a monumental task, so I hope you unserstand why I wouldn't want to attempt this kind of thing unless it is was a key point of the video. But yes, always watch out for noncommitativity hiding in traditionally commutative operators like +!
@subnumeric 2 ปีที่แล้ว ⁺¹
@@mCoding I do understand, it's just that these sort of "details" get you thrown out of university exams, and much worse, neuter rocket defense systems and cost people their lives. I simply felt obligated to mention it, which I in turn hope you understand. Alas, this is indeed a confusing university-level topic and you have to simplify it into a 10 minute video, so I suppose details being cut is to be expected.
Please, do not misunderstand my capitalized "NOT" and the exclamation mark as some harsh critique, but rather as a "passionately highlighted" footnote in a hefty textbook. I appreciate the effort you've put into you video, which is clearly demonstrated by your response to a comment on a several years old video. =)
@ThePC007 3 ปีที่แล้ว
10:29 Ternary operators are just syntactic sugar for an if-else statement. They don't eliminate branching at all. That said, given that booleans are just 0s and 1s this code could probably be optimized to a branchless multiplication, which is what I believe eliminated the branching in this case.
@mCoding 3 ปีที่แล้ว ⁺²
See the pinned comment where I explain what I was trying to suggest here!
@CAMOBAP795 3 ปีที่แล้ว ⁺¹
Interesting should I thing about Branch Prediction when I work with High-level languages (Python, Java)?
P.S. Few years ago I saw some presentation about "cache locality aware" coding in Java
@netfri25 3 ปีที่แล้ว
what ide do you use?
@mCoding 3 ปีที่แล้ว ⁺¹
This is CLion.
@netfri25 3 ปีที่แล้ว
@@mCoding thanks a lot!
@gregorymorse8423 3 ปีที่แล้ว
?: is branching so is && and ||. int c=x > thresh; 2&c|1.5&!c would be the branch less form. Which is how the compiler would optimize it effectively. Also small sizes being faster for less complex algorithm is a known constant threshold which is hard to precisely compute given the complexity of the state machine in the processor. You engineer micro optimizations based on data sizes. Merge sort is not as fast as an in place merge sort, and you didnt specify this detail
@mCoding 3 ปีที่แล้ว
See the pinned comment for what I was trying to convey with the ternary. I'm also curious why you would call a 2x performance difference that I demonstrated across multiple different problems a micro optimization.
@gregorymorse8423 3 ปีที่แล้ว
@@mCoding yes I suppose table vs. Bitwise logic is debatable depending on exactly what is being optimized. But if there are side effects on the true or false expression, control flow required!!
@TheJackal917 3 ปีที่แล้ว
Nice japanese explanations, man. Cool.
@Pa_Nic 3 ปีที่แล้ว ⁺¹
If you watch this video 1000 times, you can break RSA-4096
@mCoding 3 ปีที่แล้ว
Shhhh :)
@falxie_ 3 ปีที่แล้ว ⁺¹
I really should probably go to college so I can learn techniques like this
@chainingsolid 3 ปีที่แล้ว ⁺⁴
Just use some of the terms from this video and google. Much can be learned online and testing stuff your self. If your curious message me and I'll dig up some info (aka more youtube videos).
@axe863 ปีที่แล้ว
Cache me outside .... really bro
@coder436 2 ปีที่แล้ว
7:30 Actually no
@trag1czny 3 ปีที่แล้ว ⁺⁴
discord gang 🤙🤙
@lunarmagpie4305 3 ปีที่แล้ว ⁺⁴
Discord gang
@Miparwo 26 วันที่ผ่านมา
I downvoted because you didn't even bothered to draw what you are trying to explain.
@MithicSpirit 3 ปีที่แล้ว ⁺¹
Discord gang

ต่อไป

เล่นอัตโนมัติ

C++ First Missing Int, faster than 100%!