I cannot get over how amazing these streams with Casey are. The way in which he explains things is so clear, and the chemistry you two have on stream is amazing.
God I love these videos with Casey, sitting down in the morning with coffee and listening to these two for an hour just motivates me and relaxes me at the same time.
At first I thought this would be some simple explanation. I enjoyed and absorbed the first half of the video, even the assembly part... And then when double was introduced I started to lose myself... After that came words I had never heard before, my knees felt weak, my neck started pulsating from the blood flow. I felt stupid again and I'm in awe of your knowledge! Great video, 10/10
I remember when the 1 billion loops one did the rounds on social media. I janked together a similar example locally between ruby and python. Both were slow, but ruby won. Then I switched from ruby 3.3.5 to truffle ruby and it went from ~30s to ~2s. What did I learn from this? Nothing. Absolutely nothing.
@@uberboat4512 Most limitations on compilers are set by the language standard, i.e. compilers are prohibited from doing some optimizations. That's why undefined behavior in C/C++ and strict aliasing appeared in the standard: to give compilers some room to optimize code. Judging by Rust's speed, it seems that was the wrong road. Basically, the first example they discussed is a fail for all compilers and languages, because the final code should not have any loops; they can all be optimized away. Just a simple sum without loops: "(int)(((u-1)*(i128)u*num_u + (numpercu + 1) * numpercu)>>1) + rand()%10000"
Language does matter quite a bit, or at least the design and spec itself. A static type system is needed to efficiently pack the data (and vectorize it); even if the JIT is capable of monomorphization, it still has to do a runtime check for that. Inlining can't be used if polymorphism is based on dynamic dispatch; you need static dispatch for that (like impl vs dyn in Rust). Undefined behavior in the C spec, even if it's generally a bad thing, gives the compiler room to do more aggressive optimization. There are way more examples, but you get the idea.
I think the real question here is "why are C, Rust, and Zig different at all?" If you're compiling with clang, then every single one is just an llvm frontend, and for for loop iterations, there's more than likely going to be exactly 0 difference in the binary output except for whatever timer they're using and the calling convention of whatever functions there are.
Because they provide different information to the compiler, the optimization is also different. For example, Rust provides richer semantic information to LLVM than C and C++ do, so LLVM can do optimizations for Rust that it can't do for C and C++. But for now LLVM can't fully utilize the rich semantic information provided (it's still in development), so the difference between Rust and C/C++ isn't much: the same or slightly better. It may be some time before LLVM can take full advantage of it.
Zig is ever so slightly faster because by default it targets your machine, making use of whatever features your cpu has. Also functions by default don't follow a strict calling convention, this saves some moving around of registers when calling a function.
Yeah I don't buy that part at all. I could just take this C code and load it in Zig and see what the difference would be but I'm too lazy and there's no way there's any meaningful difference. Even if C was 2x faster I'd never start a new project with it instead of Zig unless I had to because C is way more unsafe compared to Zig.
Profiling and measuring code execution in general is such a rare skill these days. You are more likely to find a sane Rust programmer, than a programmer who knows how to profile their code. Needle in a haystack.
Most people just want to write code that does what they need it to do and don't care about speed. If they do any measurement it's usually the date.now() before and after the code section to calculate approximate amount of time the code takes and that's where their debugging and profiling ends. The amount of programmers that use only print statements to debug is way more than it should be.
@ Well, sometimes adding a few print statements to an app with HMR is a lot faster than starting a debugger. Sometimes a debugger also changes execution of the program if it is async.
This was fantastic. Took something a lot of us know, which is that performance comparisons between languages are often contrived, prone to small accidents that make big differences, and really hard to make "fair", and made a really interesting discussion out of it. Seeing Casey vectorize that code so elegantly, idiomatically even, and explain why it worked was really inspiring, honestly. Not often that I deal with critical sections in my work, but it makes me want to learn more about SIMD and LLVM in the event that I do. Thanks for the stream!
It's not only the inner loop that is constant; it can be calculated quite easily. First remember the Gauss summation formula: (n-1)*n/2 = sum from 0..(n-1). If the compiler were smart, it would notice that we have 100000 / u full iterations of the sum from 0..(u-1) and a stopped iteration at 100000 % u. Defining const int u_div = 100000 / u, u_mod = 100000 % u; is one instruction and yields the computation: a[i] = a[r] = u*(u-1)/2 * u_div + u_mod*(u_mod-1)/2 + r; and this has exactly the same overflow behavior as the loop. The n*(n-1)/2 can even be vectorized simultaneously, so we have 3 vector instructions, one FMA and 1 add: Vec1 = ( u*(u-1) >> 1 ), Vec2 = ( u_mod*(u_mod-1) >> 1 ), FMA(Vec1, u_div, Vec2) + r. Another fun fact: using 2-adic numbers and a 3-6 (readable) liner, one can precalculate the p-adic inverse mod 2^32 and then use vectorized multiplication. That's also an optimization that compilers do when dividing by a constant.
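If anyone wants to sanity-check that, here's a rough C sketch of the closed form (function name is mine, u > 0 assumed, math done in 64-bit so the int32 wrap-around question doesn't get in the way):

#include <stdint.h>

/* Sum of (j % u) for j in [0, 100000): q complete 0..u-1 cycles plus one partial cycle of length m. */
static int64_t inner_sum(int64_t u) {
    int64_t q = 100000 / u;   /* number of complete cycles */
    int64_t m = 100000 % u;   /* length of the final partial cycle */
    return q * (u * (u - 1) / 2) + m * (m - 1) / 2;
}

Every a[i] in the benchmark would then just be inner_sum(u) + r.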
Take a cat, a dog, an ant, an elephant and a shark and grade them all at how well they swim and declare the best swimmer as "the best animal ever". Obviously this doesn't work...
@@alfagroupkz together with the JVM there is the JIT compiler; all of those things are tools to make programming in Java faster, and they make the comparison unfair and confusing. Many other languages don't use that. So a lot of the time it has more to do with the framework and the context in which the framework runs, rather than the language itself, that makes it fast or slow. Plus the JVM is not Java-exclusive; there are a few other languages that can take advantage of the JVM, such as Clojure, Scala, and Jython, which makes the debate even more chaotic. Imagine someone complaining Python is slow but then running it on the JVM using the "Jython" implementation, and all of a sudden it becomes even slower, know what I'm saying?
We need a different word for fully submerged swimming. If it's swimming, the shark would die on the surface and the ant would win on the technicality of floating. If walk is to run as swim is to dive, we are still missing a word. It can't just be "submergence", as that seems a bit clunky. Bless compiler makers' hearts for dealing with this type of stuff.
28:40 YouTube won't let me post links, but it seems that in Rust this is caught, _if_ you write it a bit idiomatically. If the a[i] access happens via an .iter_mut() loop instead ( `for arr_i in a.iter_mut() { .. }` instead of `for i in 0..100_000 {..}` ), then the compiler catches that the inner loop is entirely pure and computes it beforehand!
Interesting that we'd have to have a "loop go fast" option. I would hope that the compiler just "knows" to optimize it into a classic for loop even with the iterator pattern in the source, but what would I know about "zero cost abstractions" 😂
@@SimGunther I'm not sure I understand what you're saying? Rust optimizes the iterator pattern better than it can optimize a classic for loop, so iterators are actually a net positive abstraction because they let the compiler better analyze how the pieces of code interact with each other
@@SimGunther iter_mut() is that abstraction. The reason this can't be done with the classic for loop syntax is that it makes the lack of side effects much harder (if not impossible) to detect. I'm not sure you know what zero-cost abstraction even refers to.
25:55 Looking at this, the outer loop could definitely be removed, because every element of `a` is calculated to be the same value. And the inner loop *could* be calculated in constant time using some math. (And that's even a kind of optimization that could be added to a compiler backend like LLVM.)
Edit: Because just *saying* that the inner loop could be calculated in O(1) wasn't satisfying enough to me, I just went ahead and implemented it: v1 is the test implementation from the program and v2 is the optimized version without having to do a loop.

fn v1(u: u64) -> u64 {
    let mut o = 0;
    for j in 0..100_000 {
        o += j % u
    }
    o
}

fn v2(u: u64) -> u64 {
    let d = (100_000 - 1) / u;
    let m = (100_000 - 1) % u;
    d * gauss(u - 1) + gauss(m)
}

/// Calculate sum of 1, 2, 3, .. n
fn gauss(n: u64) -> u64 {
    n * (n + 1) / 2
}
@@dexterman6361 The outer loop sets every element of the array to the same number. (And then reads out *one* of them) I assume that's what you mean. The inner loop can be calculated with some *math* (which I put in my original comment).
for Julia to be that low they probably didn't even put the code inside a function, showing that there wasn't even a care to write the loops in the most performant way in each language
Guarantee the R implementation is similarly fucking stupid, has unvectorised operations and ultimately executes functionality for which loops just wouldn't be used
@@BIGAPEGANGLEADER I mean, I *would* argue that the fact that you have to vectorize everything in R to get anything performant is annoying, plus not everything can even be vectorized. But yeah, you also shouldn't pretend vectorization doesn't exist and isn't at least okayish. In Julia loops are fast (but you do have to put em in a function, and there are definitely more non-obvious things you have to know to make code performant)
One thing that's interesting is that they talk about the CPU like it's some god-given inherent law of nature. But remember, they created the CPU and these adder units etc. BECAUSE it leads to fast execution of the code they run. Like at 50:00, CPUs having spare adders and jump capacity BECAUSE it makes for-loops fast BECAUSE CPUs run for-loops a lot. CPUs co-developed with code and they're designed to make it fast, and vice versa.
And those CPU specifics can matter a lot, as the -march=native difference shows. And somewhat counterintuitively, the jitted languages can have an advantage there. Because they compile on the fly on a specific machine, they can do optimizations that cannot be used when producing a 'should run everywhere' binary. It comes at a cost in startup time, but for longer-running processes you can gain those costs back. But that's something that rarely shows in those micro benchmarks.
-march=native will generate instructions for the native CPU. If the CPU that the code gets distributed to doesn't support the extensions that the host CPU does, the executable will crash with SIGILL, which means illegal instruction. And even if it does support them, the result still might not be optimal. Ideally you want to compile for the target machine architecture, for example zen5. If you don't know it, you either need different versions of the executable to support the major extensions that matter for perf, like AVX-512, but then you rely on the user to choose the correct one, or you use runtime detection, which is the only seamless solution: it dispatches the proper version of the code at startup. This is done for high-performance libraries like simdjson, though it adds complexity.
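A minimal sketch of that runtime-dispatch idea in C, using GCC/Clang's __builtin_cpu_supports (the two kernels are hypothetical placeholders, one compiled with AVX2 enabled and one generic):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical kernels, built separately with and without -mavx2. */
void sum_avx2(const int32_t *a, size_t n, int64_t *out);
void sum_scalar(const int32_t *a, size_t n, int64_t *out);

typedef void (*sum_fn)(const int32_t *, size_t, int64_t *);

/* Pick the best implementation once, based on what the CPU running the binary supports. */
static sum_fn select_sum(void) {
    __builtin_cpu_init();   /* required before __builtin_cpu_supports on GCC */
    return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
}

GCC can also generate this kind of dispatch for you with function multi-versioning, e.g. __attribute__((target_clones("avx2","default"))), if the toolchain supports it.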
Please drop a bash oneliner here in order to robustly detect march of the precompiled binary (produced by gcc) that will definitely run at the target machine. Please keep in mind that it should work with a virtual environments (with cut-off instruction set) too. Thank you so much.
I've spent some time, over the past 30 years, on various forums, newsgroups, and email lists for then new languages, and every single time I have seen people posting micro-benchmarks that have had their inner loops optimized away because the computed results were never used. Such benchmarks are invariably posted with a message saying something like "wow! this is the fastest language I have ever seen!"
A lot of people also don't write Go correctly. If you fill your program with interfaces containing numerous methods and shit when there's no clear benefit, you will slow your program down. Not understanding how to leverage the compiler and lower GC load slows it down further. Honestly, the more simply you write the code, the better it performs. Aside from annoying things like no generics on methods without explicit type casting, no optional function parameters without allocating slices via variadic arguments, and a couple of other things, Go is by far one of the best languages out there. People hate the verbose error handling. Just build your own error system; you don't have to return the std library error type. It's your code. Build what you want.
@@RomvnlyPlays No, it's a primitive that you would otherwise abstract upon replication/redundancy, like you would any primitive. That's the whole point of programming. Go makes errors as expressions for this very reason. Anyone who is a competent programmer should be able to put one and two together and abstract for the problem set.
I am the fastest language:
1) all the numbers are the same, so just run once and print that value
2) even with only the inner loop, you can compute it in closed form without a loop:
if u < 100000, print (100000/u)*u*(u-1)/2 + (100000%u)*(100000%u - 1)/2 + r; // all calculated using integer arithmetic
if u >= 100000, print 4999950000 + r
And I got to this result around minute 24:00, where Prime mentioned the constness of the result of the inner loop, while 1) wasn't mentioned yet.
Hrmm. I would say the for loops there are for easier human conceptualization: the kind of "hoops" the author is asking the compiler to jump through in order to come up with the expected answer. Otherwise we could have started with one of those "in the fewest characters write a..." competition submissions. Also, how easily we are able to transcribe the actions/concept of this program into simple/"human readable" math is not surprising, as that is exactly why we learn to program. The question there was "are we really making it do that much work? was this any sort of a benchmark beyond adding a bunch of 1s?" It still stands for me though, there @ 24:00, that it would be very difficult to optimize given that "u" is to be considered unknown, and so "j%u" would need to vary down to @primes ;p of acceptable values of "u"... Or wait, should the compiler recognize it needs to calculate the primes outside of the first loop? [how far are we from AI compilers :)] Love talking about this stuff, especially when I'm corrected.
43:58 I'm no hardware engineer but I have designed my own CPU. Integer division (by an undetermined N) is NOT a very common operation; usually you divide by two or a power of two, which the compiler will optimize into a simple bit shift (e.g. 1024 shift right 1 becomes 512, 69 shift right 1 becomes 34). Even division by some odd numbers can be optimized out. Division by a constant power of two, for instance, takes literally no logic gates to implement; it can be as simple as changing the wiring from the input to the output and could fundamentally be done in less than a clock cycle. However, when you divide by an unknown N, this process has to be broken up on the fly into a horrible amount of complexity that involves dealing with faults, fractionality, etc. This takes up an enormous amount of silicon space that could otherwise be used for SIMD, floating point, etc., all of which would be FAR more beneficial to the overall performance of the chip. Divides by an arbitrary N almost never happen; the OS usually divides by 2, 512, 4096, all capable of being done with a single bit shift. For instance, a modulus by two takes no logic gates whatsoever, while a modulus by 17 might take thousands of logic gates to implement. When you call IDIV you're using a monstrous piece of silicon that has to break the division up into thousands of bitwise shift and bitwise addition operations. In fact many CPUs have a single integer divider that all cores must share.
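To see the constant-vs-unknown distinction from the compiler's side, a tiny illustrative C snippet (unsigned on purpose, since signed division by a power of two needs a small fix-up):

#include <stdint.h>

uint32_t mod_const(uint32_t x)           { return x % 4096; }   /* compiles to a single AND: x & 4095 */
uint32_t mod_var(uint32_t x, uint32_t n) { return x % n; }      /* the compiler has to emit a real divide */

Same source-level operator, wildly different cost, purely because of what the compiler knows about the divisor.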
But a division on floating point would be more or less the same hardness, since it's also dividing integers, just additionally scaled by the exponents, not to mention having to deal with more edge cases. At the very least, a division on 64-bit floating point would surely be harder than a division on 32-bit signed integers, given the bigger mantissa. Even with 64-bit integers, division could be done by adding a little more bit precision to the already existing unit for 64-bit floating-point division. It's weird that AVX supports SIMD floating-point division but not SIMD integer division.
@@phucminhnguyenle250 Not the same; it is way harder, since the exponents have to be dealt with (multiply N times), and generally the floating-point stuff is already done by a monolithic beast in the silicon. On Zen 4 this is a giant FPU block that is about the same size as the L2 cache and all of its necessary control logic (maybe 30% of the silicon). If you care about performance, don't touch floating point, since generally speaking floating-point operations are scheduled to a giant FPU shared amongst all cores.
@@linuxguy1199 the point is since integer division can be done by floating point division, then a CPU that supports SIMD floating point division should also be able to support SIMD integer divsion too. You don't even need new circuit, you only need extra microcodes. There's no reason not to do so when in fact, a non SIMD integer division in a loop is *slowlier* than an SIMD floating point division, as proven in the video, and that is 32-bit integer against 64-bit floating point.
@@phucminhnguyenle250 I see what you're saying; they do still have separate integer and floating-point dividers though. But yes, you can use the SIMD block, though when multi-threading comes in it gets weird, since the SIMD block is shared by all cores while some processors have cores each with their own integer divider. SIMD in a single-threaded context will always be faster; multi-threaded, it can vary heavily by CPU model.
For those curious on why Go performance is so bad: The Go code uses int, which will default to 64 bit integers; basically every other language is using 32 bit integers. Weirdly it seems like on Apple Silicon, switching it to 32 bit integers reduces the performance. On x64, when you switch it to 32 bit integers then it runs about as fast as C.
This is right but wrong. The int, uint, and uintptr types are usually 32 bits wide on 32-bit systems and 64 bits wide on 64-bit systems. Go has these int types:

int  int8  int16  int32  int64
uint uint8 uint16 uint32 uint64
uintptr
byte // alias for uint8
I knew there's no way go is almost the same speed as node. Go is going to get you similar speeds to other compiled languages like C, rust if your application is not memory heavy.
When talking about all the overhead for the stuff outside the loop, Casey and Prime mentioned srand() initialization, rand(), and fetching command-line arguments, but neither one got to the elephant in the room: printf() in C is _notoriously_ slow, enough that I wouldn't be surprised if the time for that one printf() really did equal the time for several iterations of the outer loop. Also, that bit Prime noticed about how the %u doesn't actually have to force a loop in the final compile? I'm not entirely certain, but given the example Casey tossed together and Prime said is exactly what he was talking about, I _think_ what I noticed when I was trying to figure out what he meant is even worse, because you could replace the entire outer loop with, like, a couple of multiplications and an add plus one more run of an inner loop (so 100,001 loop iterations instead of 1,000,000,000, with an add in that last loop), even with the run-time user input and random number. (Apologies if that actually _is_ what he meant - I'm not entirely certain I understood that part - but I got the impression they were talking about optimizing the inner loop.)
If you look at all the code that the original author submitted, it is all written by someone who is very new to programming. They didn't write anything in any of the original languages' code that made sense
25:30 Not only does the sum of j%u always produce the same number, I am pretty sure modulo as used by most languages (if not all of them?) is distributive so you can effectively compute the sum of j and then just use modulo u once, completely solving the inner loop in constant time.
Maybe the j%u const replacement doesn't work because you add it to the array value, which is an int, for which overflow is undefined behaviour (at least in earlier versions of C++, maybe here too), and it isn't allowed to optimize because it doesn't know a priori the result of a possibly occurring overflow.
@@JuanMamian depends on how you define it… in purely mathematical sense (1 mod 3)+(2 mod 3) = 0, obviously that’s not how C defines it, but some other languages do…
Very interesting discussion. Regarding "comparing languages on real workloads" by for example http servers: there is a channel (Anton Putra) that does that, for example comparing Go to Rust. However he usually uses frameworks (understandably) but does compare decent server performance metrics. All in all, it's insightful but again, should be taken with care due to the complexity of framing and measuring.
Agree. Comparing languages with a lightweight load will not show the differences between the languages, because all they're doing is that lightweight load. Just like comparing an Olympic math person vs a common person on a lightweight question such as 1 + 1 = ...
Casey is right about Lua: I don't know about vanilla Lua, but LuaJIT is indeed slower on the outset. I always run my Lua benchmarks in waves because of that. The first wave or two are almost always slower. It takes a bit for LuaJIT to gather information and do its magic and actually start accelerating the language.
Thanks for the discussion of this...I love when you guys chat together, I learn a whole lot. I understand the point that Casey makes about this particular benchmark, but there's many reasons why we use AOT languages like C instead of interpreted & garbage-collected languages like Python. To say that "Language Performance Comparisons are Junk" seems to ignore the performance tradeoffs that some languages make. This is a very interesting, applicable, and important area of study. As Casey has demonstrated here and many times before, performance has lots to do with how you write code. But it does also have to do with the mechanics of the language and its build/execution strategy. The experience of using different languages is different not just because of semantics, but also because they make performance tradeoffs in the way they build and execute a program. I get it...don't plant our flag in this particular benchmark. But is it really such a stretch to want something like this to look at? I would've loved if an alternative was proposed (or even executed) and presented. Yes, there's some issues with this one...but if you can make the same computation run faster in Lua / Python than in C, using the same semantics, I would love to see that demonstrated as a follow-up.
We just did a Rust vs Elixir comparison. We ended up finding Elixir slightly faster, but it took a lot of work to optimize for the Erlang VM architecture. Given the huge value of OTP we would prefer Elixir even if it were somewhat slower. The Rust version took 4 months and the Elixir one month.
My question for this 'benchmark' initially was "why force all languages to follow a model that might not fit the language in question?" Elixir's going to be recursing along, and is known (at least the BEAM is) for easy parallelism and concurrency. Why not try that? A couple minutes later:

Name          ips    average     deviation   median      99th %
HandWritten   4.64   215.44 ms   ±2.40%      215.21 ms   234.31 ms
FlowBased     4.56   219.53 ms   ±2.48%      218.48 ms   238.11 ms

Just tweaking the outer loop to chunk based on the number of cores, and we get the fastest time on that chart. Just shows exactly what they were saying in the video. You can compare, but you need to be comparing strength vs strength, not some mediocre middle ground. The Elixir parallel code is like 6 lines longer and adds almost no complexity, so I feel it's a pretty fair comparison. In fact, if you use Flow, it only adds an extra line of actual code and simplifies the "outer_loop" code. I'm sure you made the right choice for your use case, especially with the wins on dev time. And, if done right, you'll really not be losing too much on performance (though you knew that already).
Without showing the code, your comparison is meaningless, and without showing how high the load used in the test was, the real difference under a production workload isn't shown. A common newbie Rust mistake is comparing a debug build instead of an optimized release build, and doing lots of clones because they don't know how to borrow or share ownership. Rust can also be optimized further via settings in Cargo.toml, and I bet you didn't do that.
@ in fact I did use a release build. The rust code is about 20k lines solving a real problem, not a benchmark. I have no need to try to convince anyone since every problem stresses languages and runtimes differently. My only point would be that for all people deciding between languages to do a real test that addresses the main risk areas of your problem. It is worth spending a month or two to get that right for a specific problem.
@michaellatta yeah, it could be that you use clone all over the place, which does reallocation everywhere, because it's the easiest way to get rid of the value-moved error rather than using borrowing or shared ownership. Idiomatic Rust is far more performant than idiomatic Elixir, because there's no garbage collector and the compiler gets rich semantic information to do optimizations.
@ I certainly expected rust to be faster. I do some clone of small values when needed for things like dictionary keys, no clones of large structs. Given my use case is managing a large number of maps built from a much larger set of json data it is possible the elixir map implementation (presumably in c) is dominating. I have multiple smaller benchmarks built to ease into the comparison to see if we could eliminate elixir right away. So, I have a pretty good idea about what is going on. While I would not claim to be a rust expert, I have worked commercially in 17 programming languages including c and assembly, and have a good understanding of architecture impact on execution. As in all performance discussions the only way to know anything is to measure it. Intuition is too often wrong. We are satisfied (for now) that elixir will meet our needs performance wise, and is FAR better from a programmer productivity and distributed computing point of view. But, if we have issues down the road, we will revisit the decision.
Based on the discussion at the end, a thing I'd personally rather see than the attempt to compare across languages is: here is a list of common programming patterns (loops, whatever) written in various languages; under these conditions they should take about x time to run. That way, if you write code that can't be vectorized but should be, or whatever, you can notice "oh crap, my version is 3x slower, what's going on?" Because workloads vary this is still wildly imperfect, but I think it'd be wildly more useful to various developers than attempting to do some weird "my language is better than yours" garbage.
I think where it falls down is: what do you define as the performance of a language? Can you actually measure the performance of a language? Or do you measure the performance of the runtime, or the compiler, or more precisely the capability of the compiler to compile to efficient machine code? Also, are you actually measuring the process that is executing, or are you measuring the whole operating system and everything it is doing? That matters if anti-virus kicks in at some point, or "disk clean-up", or any other "idle" process, because there isn't actually much happening on the other 23 cores of the CPU. Or the CPU is running hot at some point, so it throttles because a previous test built up heat that has not yet dissipated. That is all assuming these tests were performed on the exact same system under the exact same environmental conditions, because ambient temperature influences the temperature of the CPU, which can influence the performance of the CPU. In fact, I would go as far as to say that there are far too many "moving" parts to ever measure this accurately.
what's the difference between the compiler and the language when the end result is what the compiler produces? the language is just your steering wheel of the compiler at all times anyway
@@harier64 Take C, you have GCC, Clang, and another host of different compilers. The language is separate from the compiler. Which is my point. You are testing the capabilities of the compiler, not the language. There's even differences between versions of compilers, even if the language doesn't change, the output of the compiler of version 1 and the output of compiler version 2 can be different, even for the exact same code.
"The performance of a language" doesn't make any sense as you point out, the only things you can measure are its implementation(s) so compilers/interpreters. When it comes to external factors then yea it can be really hard to measure and be sure that your results are not because of some factors you didn't control for. As an example the position of your code in memory can have a big effect so if you add/remove unused code it could shift things around and make your program faster/slower. I saw a paper a while ago and the authors made a framework to do benchmarks while trying to take into account a lot of these external factors. Some of the factors I remember were restarting the computer before the benchmark, having the network card disabled, disabling some daemons/services, running the program multiple times but with different stack positions/alignments, shorter/longer environment variables, different working directories. I don't think it's impossible to properly measure because of too many moving parts, it's just hard but at the end of the day the more variable you take into account the better. Most of these factors are relatively small so you don't need to control for all of them to have a realistic picture of what's actually going on. There is also the fact that some of these factors are independent from each others and will "cancel out" their effects so in average you'll get a similar performance regardless of the environment (not talking about different hardware here obviously)
Even a longer username can slow down a program because it makes the PATH environment variable longer which again can shift the address space of the program and make it less cache efficient. And some CPUs may not be (yet) supported well in the operating system or lack a microcode update and so on. There are simply too many variables for a single benchmark to ever be relevant by itself.
The language makes the compiler output different: if the language can provide rich semantic information to the compiler, the compiler can do more optimization (provided the compiler is actually able to fully utilize the rich semantic information it's given).
I think it's interesting just to know what overhead comes with a language, such as what memory tracking (like ref counting or garbage collection) is enforced and can't be circumvented, and what kind of memory access is mediated. I wouldn't bother ever comparing interpreted languages to compiled languages, but these smaller performance tests are pretty useful for comparing interpreted languages, since virtual machines are not equal, whereas languages sitting on LLVM are fairly equal.
@@thedeemon Ok the 'n' (binary exponent) is not known at compile time, so the compiler can't inject the result and eliminate all computation. Yes given what I've shown you it can unroll the loop into SIMD, I think this is a legitimate optimization and programming languages benchmarks should take this into account.
In C you can just use an empty asm block with the loop value as input; in Rust you can use std::hint::black_box. Lots of languages have optimization barriers that you can use to inhibit code removal.
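For the C side, a sketch of what such a barrier can look like with GCC/Clang extended asm (the helper name is mine):

#include <stdint.h>

/* Pretends to read and possibly modify `value`, so the compiler must actually compute it
   and can't delete the work that produced it. The empty asm emits no instructions. */
static inline void keep_alive(int64_t value) {
    __asm__ volatile("" : "+r"(value) : : "memory");
}

Calling keep_alive(sum) after the loop keeps the computation from being thrown away even if the result is never printed; std::hint::black_box in Rust and DoNotOptimize in Google Benchmark work on the same principle.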
42:58 The reason divide isn't in SIMD is that those lookup tables are expensive in terms of chip real estate. Designing a CPU is largely about packing as much processing capability as possible into a limited area, and area depends directly on transistor size. So operations like trig ops or divide aren't in SIMD not because of the time per op but because of area. There are libs that can approximate them via SIMD ops, but they can't get the 0.5 units in the last place of precision which IEEE requires, as that takes lookup tables, and SIMD isn't efficient for such lookups. Otherwise you risk making it slower than a non-vectorized version on the old FPU or CPU.
To my recollection, Zig doesn't actually have a decent native benchmarking framework that's cross-platform. I think the most popular one is Linux specific and leans on performance counters. I actually have been working on porting nanobench from C++ to Zig just so I could have some numbers I could trust for some code I was testing, and in my experience it's about even with or slightly slower than C. However, in some cases Zig is actually significantly slower because the language doesn't give you subclassing or interfaces, which means that you have to roll those sorts of things yourself. When Zig developers roll their own, I find they tend to gravitate towards v-table driven design instead of duck typing or interfaces, which is naturally going to be slower due to the extra indirections inherent to pointers. Plus there's also the fact that Zig is still a work in progress so naturally it will implement some things suboptimally. Don't get me wrong, Zig is still a great language, and the up-and-comer I'm most excited about by a long shot. Just take performance numbers with an enormous grain of salt.
I think the meaning of the visualization is that the rate of the bouncing == the speed of the algo. Which would be better visualized by a bar chart (that doesn't bounce around) Or even more simple, an ordered list of numbers (without bouncing rectangles in the background)
Seems like the kind of thing that would make a great community project: both isolating valid comparisons and implementing them with real or real-ish data (recorded, versioned payloads, etc.), and assembling them into a feasible full-fat demo application. Shims to ignore shared or otherwise external codepaths would be an important function, of course. Ideally you'd have some sort of constructor for your torture test with some arch flexibility, and it would be able to pull from a repo of community tests, slowly getting refined, graded, replaced with better ones.
Actually pretty sure even the inner loop can be optimized away entirely. A bit of maths:

1. We can use the pigeonhole principle to remove the modulo and shorten the loop from (0 to 100k) to (0 to u). (It's possible I have an off-by-one here somewhere)

int sum = 0;
int multiple = 100000 / u;
int switch_point = 100000 % u;
for (int i = 0; i < u; i++) {
    sum += multiple * i;
}
for (int i = 0; i < switch_point; i++) {
    sum += i;
}

2. We have now eliminated the modulus from the summations, which means we can use a math trick where a sum of integer inputs to a polynomial results in a polynomial of one degree higher. In this case 1 + 2 + 3 + 4 + ... + x = x*(x+1)/2. Therefore:

int multiple = 100000 / u;
int switch_point = 100000 % u;
int sum = switch_point*(switch_point-1)/2 + multiple*u*(u-1)/2;
I'm a big fan of kostya/benchmarks on Github, because they dive into the CPU, Memory, Energy in Joules, each with ± variation in time, and have different scenarios e.g. brainf***, mandelbrot, Base64, JSON encoding etc. Still need a grain of salt on hand of course. But every time I see speed or whatever benchmarks shared on Twitter or Reddit, it's always something far less rigorous, like the infamous "TS is slower than JS" one. But in the end the nuance all gets lost in the usual "C fast python slow 👈😮" discussion (only 7:20 in so far in case my comment is preempted).
I totally agree with this video. Benchmarks in which there is only one variable, and which are produced under precisely the circumstances of intended use, are of some use. Operational Acceptance Testing (OAT) exists for the same reason. It is virtually impossible to guess the performance of a modestly complex system, or the impact of relatively small changes. Donald Knuth used an assembler for a simplified processor architecture to compare complex algorithms, because high-level languages hide too much in terms of memory usage and process complexity, and processors can do buckets of work in the microcode. The speed of the network, the IOPS on the disk, memory bandwidth, the processor: these have so much more impact than how fast you can do a loop.
Nested for loops are not a proxy for how fast a language is, because computational speed is not the only metric you need to take into account when selecting a tool for a job. I would love to see a comparison across multiple domains: live servers, I/O, concurrency, etc. Truly these languages have specializations in specific domains. But also, why not take into account programming ergonomics??! i.e. Rust vs. JS? It matters!
When I'm writing a backend, the more important things to consider are:
1. Developer availability. Everyone knows JS, so it may be a good idea to write the example with NestJS if it's unknown who will continue the work after me.
2. How to authenticate. Language ecosystems differ in what's available here, and it usually doesn't make sense to build critical technology yourself.
3. Integrations with databases, payment systems, and similar.
4. Framework availability. It's better to write code using a production-quality framework that standardizes conventions, so other people can understand the code better.
Performance comes after these. What's usually required is good enough overall performance. Performance improves anyway as implementations mature.
I was wondering what the code they did for Lua looked like, and it's worse than I expected. Instead of initializing each entry of the table normally by doing a[i] = 0 before the inner loop, they decided to write this: a[i] = (a[i] or 0) + j % u. Writing a[i] = 0, or just using r as the value to initialize before the loop, is about 10% to 15% faster on my machine. Or you could just register a C function for such a task...
48:22 Due to the out-of-order window being so huge now, the unrolling of 4 or 8 should not be a big issue, as only the loop-end part should interfere. Also the code basically does what Prime said earlier, but for each i: instead of pulling it out before the outer loop, it just adds the sum + r to a[i] after the inner loop and stores it in line 57.
I think the most effective optimisation would be a compiler recognizing that only the r-th element is accessed, and thus reducing the 10k array to a single value and computing the inner loop once (i.e. just doing the outer loop a single time with i = r).
regarding the stuff around 1:19 or so, doing the work of decompiling and looking at the underlying instructions to investigate loops clearly takes more expertise and willingness to do work than the guy that posted the original benchmark is capable of if you read how he replies to people. I can't tell if he's actually trying to do a thing or if he's just farming engagement, but either way I wouldn't expect him to do anything useful
Ok that was so scary as it took me through a very simple performance memory sanity test loop I wrote a couple of decades ago to exercise memory in sets of 10000 item arrays in Java . This benchmark was so similar
[mod(a,c) + mod(b,c) = mod(a+b,c)] - this is untrue (you probably conflated being congruent modulo c with the operator itself or something), but some of the other comments show the correct closed form
Vectorization is what had me question my sanity, when C++ library functions like ranges::find ended up 4x faster than a raw loop, despite doing exactly the same. I wanted to get some numbers to show that the standard algorithms are NOT slower than writing your own raw loop to any significant degree (if at all). Seeing them not just equal, but 4x faster was a fun surprise.
So many people think they are smarter than compilers and write unnecessarily low-level code only to make compiler optimization worse. I've also seen people write unnecessarily low-level code in high-level io-bound code "because of performance" where it makes zero difference/sense.
I do like the criticism of a very narrow, works-on-my-machine-bro benchmark. I will say the moving bars make sense to me: they show the relative slowness between the tests, which I think is pretty intuitive when shown visually, even though the downside is that the time scale is arbitrary.
27:14 Since it only reads a[r], the compiler could also optimise out the computation for all the other values of i theoretically & make the thing O(1) when in combination with optimising the sum over j values.
I guess the size of the array is only 10000 because this is what fits on the stack (and is a nice number). For a larger number they might have had to call malloc in the C code, which would have made it (somewhat) slower. I could easily see compilers fully optimizing away both loops. In general it could happen that the compiler notices that the outer loop's iterations are independent and only one result is printed; this could optimize away the outer loop. I'm not sure if a compiler is smart enough to figure out that all elements of the array are the same (and thus just compute one single element and always return that). I would have expected the inner loop's iterations to be optimized away (maybe edge cases with overflow combined with modulus don't allow that). As overflow is undefined behavior in C++, it could certainly be optimized by mathematical rules: according to the Gauss summation rule the sum of all numbers from 0 through n-1 is (n * (n-1))/2 (and I think compilers implement that optimization in general). Furthermore, the sum of modulos equals the modulo of the sum (here is where you might need UB because of overflow). So, the inner loop reduces to a[i] += (99999*99998/2)%u. From there, some compilers might even figure out how to optimize the outer loop.
The constant Prime talks about around the 25-28 minutes of the video is larger than int32 (almost 5 billion). So the constant actually overflows, but since signed integer overflow is undefined behavior in C, the compiler should've optimized that still.
I don't know when or how it got into my head, but I used to think that doubles/ floats were generally much slower than integers at everything, plus the fact they cannot precisely represent some integers. Now it seems because of their ability to do many computations in one go, doubles or floats can be better when doing certain math operations
That inner loop works out to about 100k * u / 2, give or take some bits to handle odd/even u and remainder of 100k / u. Because it's actually adding [0, u) (sums up to u*(u-1)/2) for 100k / u number of times, and then adding [0, (100k % u)). And all the a's are the same, so, the whole program is really just printing out (100k / u) * (u * (u-1) / 2) + (100k % u) * (100k % u - 1) / 2 + r, and the loops could be skipped entirely. So, it's just using the fact that no compiler or language can optimize that far (maybe the semantics don't allow it (overflows, weird mod semantics, etc.), and/or the compiler just can't work out that much math).
Another optimization compilers don't do that would likely be worth doing in this code: computing the needed constants for fast modulo by u at runtime, and using that instead of an IDIV.
u is not known ahead of time, so it can't be done. If the divisor is known, the compiler replaces integer division with multiplication by a modular multiplicative inverse, where the modulus is 2^(width of the register).
@@Hexanitrobenzene but you can run the logic the compiler uses to find the constants at runtime, after u is known; it's probably faster than 100k IDIVs and makes the modulos cost almost nothing.
@@__-dy9gv Oh, now I see what you mean. Still, the remainder would cost two multiplications and a subtraction each time. I'm no longer invested in optimising this particular calculation. It's so regular that the solution can be found in closed form, it's basically a series of arithmetic progressions with an incomplete progression at the end. The best benchmarks are based on excerpts from real world programs, like image processing.
52:15 -march=native allows for including all instructions available on the native machine. Probably not running on older/different cpus due to illegal (non-existing) instructions being in the binary. To the contrary -mtune=native supports scheduling/optimization that assumes the code runs on the machine that builds the code, without generating code that doesn't work on other cpus.
Isn't it dividing the loop into four parts because it's checking i < n and the processor can do an addition every 0.25 cycles, so it can do four at a time and then check them all at once, running the results of those four after? IDK but entertained
I think microbenchmarks could make sense if they formed a suite of tests covering things like memory intensive tasks, IO intensive tasks, startup time, TCP/IP, poll/interrupt, signaling/realtime tasks, memory mapped io devices, compute intensive tasks with/without large dependency chains, etc. They would also need two versions of code for each language, an idiomatic version and an optimized version. The first should be the obvious code a knowledgeable dev would write following best practices for that language, and the second the peak micro-optimization you might do inside a key hot loop. It would be even better if relevant languages had an additional test run for embedded/kernel code without the normal runtime features.
Copilot answer: What is the definition of modulus in a computer language?
In computer programming, the modulus operator is a very handy tool. It is represented by the % symbol and is used to calculate the remainder of a division operation. For example, if you have the expression 5 % 2, it means dividing 5 by 2, which gives a quotient of 2 and a remainder of 1. The modulus operation will return that remainder, which in this case is 1. Here's a quick code example in Python:

a = 5
b = 2
result = a % b
print(result)  # This will output 1

It's particularly useful in scenarios like determining if a number is even or odd (by checking if number % 2 is 0 or not), or when you're working with cyclic data structures like circular buffers.
That comparison is so sad. If you get rid of the var a [10000]int32 and use a single int32, then performance improves 3x, and that is even before applying any -pgo pprof runs. Essentially they weren't testing loops; they were testing how the [10000]int32 array type performs in Go! If you also get rid of the operations inside those loops, you get a further doubling of performance, i.e. the loops run 6 times faster.
The only type of benchmark which makes sense to me is framework benchmarks, with the condition that the code should be written using what's considered best practices in the community/documentation... Then you can see how fast your project will work generally, without any additional effort on your part. Sure you can optimize it further, but then the benchmark becomes useless, we just start comparing which benchmark writer is smarter :) Because you can always optimize further and further...
LuaJIT and Node.js are compiled (at some point) too. Also, sometimes JIT compilers can be faster, since the JIT can analyze the usage, branches taken, etc. With C or other 'pre-compiled' languages that takes 3 steps: 1. generate the original binary, 2. run the program with metrics enabled, 3. compile again with the optimization metrics that were collected. I forget if LuaJIT is actually just AOT or actually JIT, though. However, there's no reason it would be slower than Go afaik (except for maybe the fact that everything is a hash table..)
Just adding on to the comment that no one runs massive loops as a service: also, in data science languages like R the assumption is that if you use loops, you are doing something seriously wrong. A typical use case of applying a function to every column element in a massive data frame is where the data science languages and the Python libraries for such cases shine. Compute speed is an important metric though, since in the cloud it has a direct correlation with the compute bill. It just should be based on a relevant case and not these silly ones.
Great catch that the inner loop will be constant. But it made me wonder, if the compiler can't optimize away operations with unknown values, could a jit compiler get rid of this loop after parsing the input? Is this one of those mythical operations where jit can outperform aot?
I don't usually compare languages, but I've done it a couple times with a game of life. I made it using CLI graphics just to make sure it's working properly, and then I turned off graphics and just compute some amount of generations. It's important that all tests start from the same board configuration, as GoL is deterministic, so it's a leveled playing field. I don't know if this is a good benchmark, but at least it's an actual program doing actual stuff, and the implementations between languages turn out to be very similar.
Amazing. I knew division is super slow and compilers despise it, but it's so eye-opening that writing code that looks like it would take longer can actually be executed 3-4x faster. Just wow!
The things compiler devs have to wrestle with nowadays - unreal. Zen 4 and AVX-512 have gotten so complex, it's become impossible to predict what the machine code will look like, and how modern CPUs might sequence it.
The biggest problem with benchmarks I find is they write the same code for each language and not either idiomatic code or optimised code. All the code should be reasonably optimised for each language individually first.
I already know about and agree with the premise of the video so I don't need any convincing. Still, Casey's here, so there goes my next hour and a half lol
Another thing to consider is the languages on the lower end of that diagram are often utilizing libraries written in languages at the top for some of its more computationally intensive operations.
Given the poor scores though, I tend to think they probably didn't use those libraries. It'd defeat the whole purpose of the speed test anyway if they're just using python as a wrapper for a C function call.
The R code is atrociously written for anyone who is actually familiar with the language. Two great things about R are that you can treat it like a functional language, and that you can outsource expensive subroutines to C implementations.
Knew this would be the case when I looked at the speed. Not to mention if we take a step even further back, I wouldn't be surprised if a loop isn't required to execute whatever task their loop implementation is doing
So the R code is bad, the PHP code is unnecessarily slow, the C code prevents compiler optimizations, the Julia code is much slower than it could be... Or in other words that benchmark is so bad the only way to make it worse is to literally just fill a table with random numbers.
Wow, that is soo bad... the most idiomatic way to write that in R would probably be by preallocating the entire "j" array, then calculating the entire "a" array using a map with the inner function doing a vectorized modulo of "j" by "u" followed by sum and adding "r"... (assuming that at this point it isn't painfully obvious that you can make the optimization they are talking about and just precompute "sum(j %% u)". 1) Theirs 2) "a
The 4x unroll is maybe more than enough. IDIV is 10 µops and one can start every 6 cycles, so it's using an average of 1.7 ports per clock cycle. There are a few other µops in there, but that 6-cycle throughput will still be the limiting factor with up to 20 other instructions in each loop, so unrolling is probably not helping much in this case.
Totally agree with the video. Just a minor note at 13:20: what you mention about loops (inc, cmp, jmp) is correct for compiled languages, but for interpreters it's not necessarily true. Also, in the specific case of NESTED loops there is a section (usually before the inner loop starts) that needs to reset the inner loop. That section (when not in nested loops) is irrelevant, but in nested loops (and especially for languages like Lua or Python) it can become a lot more relevant, so measuring (for example) empty nested loops can help the interpreter's developer improve the loop overhead (using the right data types etc.). In interpreters, loop overhead is quite important, given that most computation happens in loops and we therefore want loops that are as lightweight as possible; that's again from the interpreter developer's perspective, while for the language user you are absolutely correct and those tests are pretty useless. [edit] Not a biggie, Casey mentions exactly this later on, so I commented too quickly I guess.[/edit]
As someone who works with compilers, specifically those developed with LLVM as a backend, as my day job, and often looking at the "code generated by the compiler, and how can we make that better for ", this was obviously fairly beginner-level, but very good, informative stuff. One of my hobby projects is doing prime number calculations. It spends over 80% of the time doing integer divides, exactly because integer divide is such a slow operation. I agree, these benchmarks are nonsense. If we want to compare compilers, let's write something useful. For my day job we use the SPEC benchmark suite. It's definitely not perfect, but it's a much better variant than comparing "noddy loops". In my personal project, which is a Pascal compiler, I have a Sudoku solver that I run on my compiler and the FreePascal compiler to compare the two for performance. That's a small enough project, but implementation details matter quite a bit [like what method you use for figuring out what values are and aren't used in a block, row or column], and there are some very clever tricks the compiler can do to make a big difference in something like that. I will compare my Pascal compiler with the C version of this benchmark, and my Pascal version. I expect very similar results, but who knows... :) Be back in a bit, need to feed myself before it gets too late... :)
So, using my LLVM-based Pascal compiler on my machine: 1.22s; flang (LLVM-based Fortran): 1.19s; Clang (C code): 1.22s. Using gcc for the C code gives very similar results, and Free Pascal (the generally used Pascal compiler) is a fraction slower at 1.38s; from what I can tell, because it extends some of the integer values to 64-bit for no particularly meaningful reason. This is on an x86-64 AMD Ryzen machine, using -O3 for optimisation. Also, unrolling the loop won't help in this case; the divide takes so much longer than all the other work inside the loop. The Fortran compiler seems to unroll more, like 32 steps; I think that's the main difference in performance. The benchmark does measure the time it takes to convert the command-line argument to an integer, and the printing, but those are literally zero percent of the time. We're doing 1 billion integer divides, and those take 1.2 seconds (on my machine). In the unrolled code there are a total of 6 instructions, and they add up to 99.9% of the total time. (My MacBook M1 processor is significantly faster at integer divide/modulo than my x86 machine; I have not investigated why that is.)
The problem with benchmarks: the code gets arguably optimized to fit the benchmark; the benchmark tests the wrong thing so it is useless. Usually it's a combo of the two mixed with the fact that giving that the code being tested in different languages might have been written by authors with different levels of skill you can't really compare them anyway.
About the relative performance of AVX2 and AVX-512: you need to know the microarchitecture of the chosen CPU really well. You can have a single AVX-512 ALU and two AVX2 ALUs, so AVX2 code could run only slightly slower than AVX-512. You always need to know the configuration of your compute ports/ALUs to make any performance predictions.
My first reaction to this was just "zero chance this is a valid comparison", and the more I think about it the worse I think it is. Here are a few items on my list: 1. There is a big difference between compiled and interpreted languages. C is fast because the compiler will look at the code and understand "hey, you are doing this thing, let me chop off some unneeded calls or use the fastest method", whereas if you toss Python something on the fly it has to work it out and then spit out the answer. So if you are saying compiled languages are by and large faster than interpreted languages, the answer is "duh". 2. Even with an interpreted language there are ways to make things run fast. I went and looked at the code that was run for this comparison and got Python to run faster than most languages, and with less code, by using NumPy instead of the original for-loop approach. NumPy plus standard Python got within 2ms of C on my machine. 3. The test itself is technically invalid too because it starts off with a random number generator. I assume this is to stop compiled languages cheating and pre-computing the result with compiler tricks, but it causes serious issues with how valid the comparison is. If you generate a number that has 5 zeros in one language and 15 in another, you aren't comparing apples with apples.
At minimum they really should have taken two args from the CLI, made a list of, say, 1000 pairs, and run each language against each pair - it would be fairer, it would balance out any strange edge cases a language might have on specific pairs, and it would still be reproducible.
1. With JIT-compilation the border between compiled and interpreted languages becomes very fuzzy. Is Java compiled or interpreted? Kinda both. With "cling", C++ is an interpreted language. Etc. 2. Numpy is cheating, as is any delegation of benchmarked work to a C library. 3. That random number doesn't really affect anything at all.
@ To be fair, NumPy isn't cheating, because that's what a person doing that kind of workload in Python would use. It isn't all C either - there is a lot of native Python in there, and there is also Cython compilation, which is a valid option for Python devs. And yes, the random number matters a lot, because it introduces variance into the environment. I ran the tests a few times and Python was a good example of variance with it. When I hard-coded the number the variance went down; before porting to NumPy you could literally see seconds of difference between runs.
If I had to guess, the `a[i] += r;` line in the C code is there to make sure the compiler doesn't vectorize the inner loop, as is the int32_t data type. Basically, the guy was trying for readable assembly output in Godbolt, which is a weird thing to optimize for. I bet the JS guy didn't do that. Anyway, gcc has a Fortran compiler. IDK if that's where the Fortran team went or what, but if a language has its own compiler, there's a pretty good chance it's in gcc.
Wrong guesses. You see in the video that the write doesn't prevent vectorization. And the original poster didn't care about assembly output at all, if you look at his Twitter posts.
Very surprised clang isn't using the multiplicative inverse for the modulo - you can get a modulo with two multiplies, which would be more efficient here since those are just 2 ops that can fully pipeline, and I thought LLVM-based compilers would default to that when the modulus is known before entering the loop, so the inverse can be calculated once and reused across all the iterations... Not that I would use that anyway; in a case like this you'd just have a second variable that is incremented and reset each time it reaches the limit.
@@thedeemon The math to calculate the multiplication constant is just a single DIV, plus adding 1 if the remainder is nonzero. Doing that once outside the loop is easily more efficient than doing 1,000,000,000 DIVs in the inner loop. The modulus then takes 2 multiplies, so it has about 3x the throughput possible with DIV.
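To make the wrapping-counter idea mentioned above concrete, here is a rough C sketch (the function name is made up, the 10000x100000 shape mirrors the benchmark, and u > 0 is assumed; this is not the original benchmark code):

#include <stdint.h>

/* Rough sketch: same result as a[i] += j % u, but the inner loop has
   no divide. Assumes u > 0. */
void fill(int32_t a[10000], int32_t u, int32_t r)
{
    for (int i = 0; i < 10000; i++) {
        int32_t rem = 0;                  /* tracks j % u as j counts up */
        for (int j = 0; j < 100000; j++) {
            a[i] += rem;
            if (++rem == u) rem = 0;      /* wrap instead of dividing */
        }
        a[i] += r;
    }
}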
Whenever I see an upload with Prime and Casey, the next 2 hours is gone. Everything else can wait.
jup 😂
Casey sounds better at 2x speed!
@@CobaiaHardcore all talking sounds better at 2x speed. maybe it's the coffee
@@CobaiaHardcore
why this comment is so accurate xD
I remember when the 1 billion loops one did the rounds on social media. I janked together a similar example locally between ruby and python.
Both were slow, but ruby won. Then I switched from ruby 3.3.5 to truffle ruby and it went from ~30s to ~2s.
What did I learn from this? Nothing. Absolutely nothing.
I did an experiment between python and C++ where I switched from CPython to PyPy and it went faster than the C++ code. Also nothing to learn there.
you learnt to use truffle ruby if you want performance. sometimes its not the language, its the interpreter or compiler that matters.
No. It's only the interpreter/compiler that matters. Languages themselves don't determine the speed.
@@uberboat4512 most limitation on compilers is set by language standard, i.e. compilers is prohibited to do some optimizations, that's why undefined behavior in c/C++ and strict aliasing appeared in standard - give compilers some room to optimize code, judging by Rust speed it seems it was wrong road, basically first example they discussed is fail for all compilers and languages, because final code should not have any loops, they can be all optimized away. just simple sum without loops: "(int)(((u-1)*(i128)u*num_u + (numpercu + 1) * numpercu)>>1) + rand()%10000"
Language is matter quite a bit, or at least the design and spec itself.
Static typed system is needed for efficiently pack the data (and vectorize it), even if the JIT is capable of monomorphism, it still has to do runtime check for that.
Inlining cant be used if polymorphism is based on dynamic dispatch, you need static dispatch for that (like impl vs dyn in Rust).
Undefined behavior in the C spec, even if it’s generally a bad thing, gives a room for compiler to do more aggressive optimization.
There’s way more examples, but you get the idea.
Seeing Casey on the channel always warms my heart. I love these guys' talks.
I think the real question here is "why are C, Rust, and Zig different at all?" If you're compiling with clang, then every single one is just an llvm frontend, and for for loop iterations, there's more than likely going to be exactly 0 difference in the binary output except for whatever timer they're using and the calling convention of whatever functions there are.
Because the time measurement was also inaccurate lmao
because they provide different information to the compiler so the optimization is also different, for example rust provides richer semantic information to llvm than c and c++, so llvm can do more optimization to rust that they can't provide to c and c++, but for now llvm can't fully utilize the rich semantic information provided (still in the development process) so the difference between rust and c and c++ is not much different, the same or slightly higher, it may be some time before llvm can take full advantage of it.
Zig is ever so slightly faster because by default it targets your machine, making use of whatever features your cpu has.
Also functions by default don't follow a strict calling convention, this saves some moving around of registers when calling a function.
maybe the loading time of the binary and the time it takes to initialise (whereas the loop itself could be 100% the same for all 3)
Yeah I don't buy that part at all. I could just take this C code and load it in Zig and see what the difference would be but I'm too lazy and there's no way there's any meaningful difference. Even if C was 2x faster I'd never start a new project with it instead of Zig unless I had to because C is way more unsafe compared to Zig.
Casey is the goat. Always learn so much when listening to him.
Profiling and measuring code execution in general is such a rare skill these days. You are more likely to find a sane Rust programmer, than a programmer who knows how to profile their code. Needle in a haystack.
It's so weird too. Intel VTune is free and works quite well, has a nice GUI, etc...
@@nicholasredmon9851 Never tried VTune - how does it compare to the gprof CLI?
Most people just want to write code that does what they need it to do and don't care about speed. If they do any measurement, it's usually a date.now() before and after the code section to get an approximate time, and that's where their debugging and profiling ends. The number of programmers who only use print statements to debug is way higher than it should be.
@ Well, sometimes adding a few print statements to an app with HMR is a lot faster than starting a debugger. Sometimes a debugger also changes execution of the program if it is async.
Are you kidding me? Dude, there is a program called 'gperf'. Try it some time.
This was fantastic. Took something a lot of us know - that performance comparisons between languages are often contrived, prone to small accidents that make big differences, and really hard to make "fair" - and made a really interesting discussion out of it. Seeing Casey vectorize that code so elegantly, idiomatically even, and explain why it worked was honestly really inspiring.
Not often that I deal with critical sections in my work, but it makes me want to learn more about SIMD and LLVM in the event that I do. Thanks for the stream!
It's not only the inner loop that is constant; it can be calculated quite easily. First recall the Gauss summation formula: (n-1)*n/2 = sum from 0..(n-1). If the compiler were smart, it would notice that we have 100000 / u full runs of the sum from 0..(u-1), plus a partial run that stops at 100000 % u.
Defining const int u_div = 100000 / u, u_mod = 100000 % u; costs one division and yields the computation:
a[i] = a[r] = u*(u-1)/2 * u_div + u_mod*(u_mod-1)/2 + r;
and this has exactly the same overflow behavior as the loop. the n*(n-1)/2 can even be vectorized simultaneously and therefore we have 3 vector instructions, one FMA and 1 add.
Vec1 = ( u*(u-1) >> 1 )
Vec2 = ( u_mod*(u_mod-1) >> 1 )
FMA(Vec1, u_div, Vec2) + r
Another fun fact: using 2-adic numbers and a 3-6 line (readable) routine, one can precalculate the p-adic inverse mod 2^32 and then use the vectorized multiplication.
That's also an optimization that compilers do, when using a constant.
Take a cat, a dog, an ant, an elephant and a shark and grade them all at how well they swim and declare the best swimmer as "the best animal ever". Obviously this doesn't work...
True, Java is not fast outside of the JVM for instance.
@@trex511ft, java works only within jvm. What do you mean?
@@alfagroupkz Together with the JVM there is the JIT compiler; all of those things are tools to make Java programs run faster, and they make the comparison unfair and confusing. Many other languages don't use that. So a lot of the time it has more to do with the framework and the context in which the framework runs than with the language itself being fast or slow.
Plus the JVM is not Java-exclusive; there are other languages that can take advantage of the JVM, such as Clojure, Scala and Jython, which makes the debate even more chaotic. Imagine someone complaining Python is slow and then running it on the JVM via the Jython implementation - all of a sudden it becomes even slower, know what I'm saying?
@@alfagroupkz that's the joke
we need a different word for fully submerged swimming.
If it's swimming the shark would die on the surface and the ant would win on technicality of floating.
If walk is to run as swim is to dive, we are still missing a word.
It can't just be submergence as that seems a bit klunky.
Bless compiler makers' hearts for dealing with this type of stuff.
28:40 YouTube won't let me post links, but it seems that in Rust this is caught, _if_ you write it a bit idiomatically. If the a[i] access happens via an .iter_mut() loop instead, (
`for arr_i in a.iter_mut() { .. }`
instead of
`for i in 0..100_000 {..}`
) then the compiler catches that the inner loop is entirely pure and computes it before-hand!
wow. thanks for sharing that. compiler writers really are geniuses
Interesting that we'd have to have a "loop go fast" option. I would hope that the compiler just "knows" to optimize it into a classic for loop even with the iterator pattern in the source, but what would I know about "zero cost abstractions" 😂
@@SimGunther I'm not sure I understand what you're saying? Rust optimizes the iterator pattern better than it can optimize a classic for loop, so iterators are actually a net positive abstraction because they let the compiler better analyze how the pieces of code interact with each other
Clippy even warns about this:
the loop variable `i` is only used to index `a`
consider using an iterator: `<item>`, `&mut a`
@@SimGunther iter_mut() is that abstraction. The reason this can't be done with the classic for loop syntax is that it makes the lack of side effects much harder (if not impossible) to detect. I'm not sure you know what zero-cost abstraction even refers to.
25:55 Looking at this the outer loop could definitely be removed, because every element of `a` is calculated to be the same value.
And the inner loop *could* be calculated in constant time using some math. (And that is even the kind of optimization that could be added to a compiler backend like llvm)
Edit: Because just *saying* that the inner loop could be calculated in O(1) wasn't satisfying enough for me, I just went ahead and implemented it:
v1 is the test implementation from the program and v2 is the optimized version just without having to do a loop.
fn v1(u: u64) -> u64 {
    let mut o = 0;
    for j in 0..100_000 {
        o += j % u;
    }
    o
}

fn v2(u: u64) -> u64 {
    let d = (100_000 - 1) / u;
    let m = (100_000 - 1) % u;
    d * gauss(u - 1) + gauss(m)
}

/// Calculate sum of 1, 2, 3, .. n
fn gauss(n: u64) -> u64 {
    n * (n + 1) / 2
}
Yeah the whole program can actually be simplified down to `print(9999900000 % u + r)`
@@nbjornestol Where did you get that number from? Because that's not how modulo works.
Wait what am I missing here? How come a[i] = a[i] + j % u is always the same number?
@@dexterman6361 The outer loop sets every element of the array to the same number.
(And then reads out *one* of them)
I assume that's what you mean.
The inner loop can be calculated with some *math* (which I put in my original comment).
@@remrevo3944 Edit: Nvm you're 100% right, I was thinking too much in moduli, the sum isn't done modulo u, only each number.
For Julia to be that low, they probably didn't even put the code inside a function - showing there wasn't even an attempt to write the loops in the most performant way in each language.
just checked the code, they didn't use a function, right on the money lmao
@@eleven5707 I was also guessing and was correct
Guarantee the R implementation is similarly fucking stupid, has unvectorised operations and ultimately executes functionality for which loops just wouldn't be used
@@BIGAPEGANGLEADER I mean, I *would* argue that having to vectorize everything in R to get anything performant is annoying, plus not everything can even be vectorized. But yeah, you also shouldn't pretend vectorization doesn't exist and isn't at least okay-ish. In Julia loops are fast (but you do have to put them in a function, and there are definitely more non-obvious things you have to know to make code performant).
Couldn't agree more.
One thing that's interesting is that they talk about the CPU like it's some god-given law of nature. But remember, we created the CPU, and these adder units etc. exist BECAUSE they lead to fast execution of the code we run. Like at 50:00 - CPUs have spare adders and jump capacity BECAUSE that makes for-loops fast, BECAUSE CPUs run for-loops a lot.
CPUs co-developed with code; they're designed to make it fast, and vice versa.
And those CPU specifics can matter a lot, as the -march=native differences show. Somewhat counter-intuitively, the JITed languages can have an advantage there: because they compile on the fly on a specific machine, they can do optimizations that can't be used when producing a 'should run everywhere' binary. It comes at a cost in startup time, but for longer-running processes you can win those costs back. That's something that rarely shows up in these micro benchmarks.
-march=native will generate instructions for the native CPU. If the CPU the code gets distributed to doesn't support the extensions the host CPU does, the executable will crash with SIGILL, which means illegal instruction. And even if it does support them, the result still might not be optimal: ideally you want to compile for the target machine's architecture, for example zen5. If you don't know it, you either need different versions of the executable for the major extensions that matter for perf, like AVX-512 - but then you rely on the user to choose the correct one - or you use the only seamless solution, runtime detection that dispatches the proper version of the code. This is done in high-performance libraries like simdjson, though it adds complexity.
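A rough sketch of that runtime-detection idea in C, assuming GCC or Clang on x86 (the function names here are made up, and this is not how simdjson actually structures its dispatch):

#include <stdint.h>
#include <stddef.h>

/* Build one binary, pick the best code path on the machine it runs on. */

__attribute__((target("avx2")))                  /* this copy is built with AVX2 enabled */
static int64_t sum_avx2(const int32_t *v, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];    /* compiler may auto-vectorize */
    return s;
}

static int64_t sum_scalar(const int32_t *v, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

int64_t sum(const int32_t *v, size_t n)
{
    if (__builtin_cpu_supports("avx2"))          /* checked on the CPU actually running this */
        return sum_avx2(v, n);
    return sum_scalar(v, n);
}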
march=native is SUCH a funny flag. It's so architecture-specific that I can't imagine anyone but a gentoo user trying it
@@izd4 not a gentoo user but I guess that most Arch users if they compile something from the AUR are also using march=native
Please drop a bash oneliner here in order to robustly detect march of the precompiled binary (produced by gcc) that will definitely run at the target machine. Please keep in mind that it should work with a virtual environments (with cut-off instruction set) too. Thank you so much.
@@izd4 yes, it's mostly for testing. I'm on Gentoo and I use -march=znver2
@@alekseyburrovets4747 objdump -d /usr/bin/fish > fish_disassembly.txt
grep -i 'avx' fish_disassembly.txt
add whatever instruction you are looking for
YES, I was hoping you’d do a video on that!!
I've spent some time, over the past 30 years, on various forums, newsgroups, and email lists for then new languages, and every single time I have seen people posting micro-benchmarks that have had their inner loops optimized away because the computed results were never used. Such benchmarks are invariably posted with a message saying something like "wow! this is the fastest language I have ever seen!"
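A minimal sketch of that failure mode, assuming a typical optimizing C compiler at -O2 or higher (whether the loop really disappears depends on the compiler):

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t t0 = clock();

    long sum = 0;
    for (long i = 0; i < 1000000000L; i++)
        sum += i;                 /* result is never used ... */

    clock_t t1 = clock();
    /* ... so the optimizer may delete the whole loop, and this prints ~0 */
    printf("%.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}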
A lot of people also don't write Go correctly. If you fill your program with interfaces containing numerous methods and shit when there's no clear benefit, you will slow your program down. Not understanding how to leverage the compiler and lower GC load slows it down further. Honestly, the more simply you write the code, the better it performs.
Aside from annoying things like no generics on methods without explicit type casting, and no optional function parameters without allocating slices via variadic arguments, and a couple of other things, go is by far one of the best languages out there.
People hate the verbose error handling. Just build your own error system. You don’t have to return the std library error type. It’s your code. Build what you want.
You can literally abstract the error handling to be simple. This is such a junior level complaint.
@@dunebuggy1292 Just because you can get around something doesn't mean it's not a pothole in the road. Nice try.
@@RomvnlyPlays No, it's a primitive that you would otherwise abstract upon replication/redundancy, like you would any primitive. That's the whole point of programming. Go makes errors as expressions for this very reason. Anyone who is a competent programmer should be able to put one and two together and abstract for the problem set.
@@dunebuggy1292 I think the argument is more that you shouldn't have to do that.
> no optionals
could be sidestepped with pointers or builder pattern, perhaps
I am the fastest language:
1) all the numbers are the same - just run once and print that value
2) even with only the inner loop, you can compute in closed form without a loop:
if u < 100000, print (100000/u)*u*(u-1)/2 + (100000%u)*(100000%u - 1)/2 + r; // all calculated using integer arithmetics
if u >= 100000, print 4999950000 + r
and I got this result around minute 24:00, where Prime mentioned the constness of the result of the inner loop, while 1) hadn't been mentioned yet.
Hrmm.
I would say the for loops there are for easier human conceptualization of the kind of "hoops" the author is asking the compiler to jump through in order to come up with the expected answer. Otherwise we could have started with one of those "in the fewest characters, write a..." competition submissions.
Also, how easily we are able to transcribe the actions/concept of this program into simple, "human readable" math is not surprising, as that is exactly why we learn to program.
The question there was "are we really making it do that much work? was this any sort of a benchmark beyond add a bunch of 1s?"
It still stands for me though, there @ 24:00, that it would be very difficult to optimize given that: "u" is to be considered unknown, and so "j%u" would need to vary down to @primes ;p of acceptable values "u"...
Or wait, compiler should recognize it needs to calculate the primes outside of the first loop? [how far are we from AI compilers:)]
Love talking about this stuff, especially when I'm corrected
Best of luck with your 2025, Primeagen. Always love your informative approach to programming. I've learnt a lot, thank you.
43:58 I'm no hardware engineer, but I have designed my own CPU. Integer division (by an arbitrary N) is NOT a very common operation; usually you divide by two or a power of two, which the compiler will optimize into a simple bit shift (e.g. 1024 shifted right by 1 becomes 512, 69 shifted right by 1 becomes 34). Even division by some odd numbers can be optimized out. Division by a constant power of two takes literally no logic gates to implement - it can be as simple as changing the wiring from the input to the output - and could fundamentally be done in less than a clock cycle. However, when you divide by an unknown N, the work has to be broken up on the fly into a horrible amount of complexity that involves dealing with faults, fractionality, etc. That takes up an enormous amount of silicon area that could otherwise be used for SIMD, floating point, etc., all of which would be FAR more beneficial to the overall performance of the chip. Divides by an arbitrary N almost never happen; the OS usually divides by 2, 512 or 4096, all doable with a single bit shift. For instance, a modulus by two takes no logic gates whatsoever, while a modulus by 17 might take thousands of logic gates to implement. When you call IDIV you're using a monstrous piece of silicon that has to break the division up into thousands of bitwise shift and addition operations. In fact, many CPUs have a single integer divider that all cores must share.
But a floating-point division would be about as hard, since it's also dividing integers (the mantissas), just additionally scaled by the exponents, not to mention having to deal with more edge cases. At the very least a 64-bit floating-point division should be harder than a 32-bit signed integer division, given the bigger mantissa. Even with 64-bit integers, division could be done by adding a little more precision to the already existing unit for 64-bit floating-point division. It's weird that AVX supports SIMD floating-point division but not SIMD integer division.
@@phucminhnguyenle250 Not the same - it is way harder, since the exponents have to be dealt with, and the floating-point work is generally already done by a monolithic beast in the silicon. On Zen 4 that's a giant FPU block about the same size as the L2 cache and all of its control logic (maybe 30% of the silicon). If you care about performance, don't touch floating point; floating-point operations are generally scheduled to a giant FPU shared amongst the cores.
@@linuxguy1199 The point is that since integer division can be done via floating-point division, a CPU that supports SIMD floating-point division should be able to support SIMD integer division too. You don't even need new circuitry, only extra microcode. There's no reason not to, when in fact a non-SIMD integer division in a loop is *slower* than a SIMD floating-point division, as shown in the video - and that's 32-bit integer against 64-bit floating point.
@@phucminhnguyenle250 I see what you're saying; they do still have separate integer and floating-point dividers, though. And yes, you can use the SIMD block, but when multi-threading comes in it gets weird, since the SIMD block is shared while some processors give each core its own integer divider. SIMD in a single-threaded context will always be faster; multi-threaded, it can vary heavily by CPU model.
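A scalar C sketch of the remainder-via-double idea being discussed, assuming 32-bit operands (whose quotient fits exactly in a double's 53-bit mantissa) and u > 0; a double-based SIMD version would do roughly this per lane:

#include <stdint.h>

/* Compute a 32-bit remainder with a double divide instead of an integer
   divide. For 32-bit j and u > 0 the truncated quotient matches j / u,
   so j - q*u equals j % u. */
static inline uint32_t mod_via_double(uint32_t j, uint32_t u)
{
    uint32_t q = (uint32_t)((double)j / (double)u);  /* truncated quotient */
    return j - q * u;                                /* remainder          */
}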
For those curious on why Go performance is so bad:
The Go code uses int, which defaults to 64-bit integers; basically every other language is using 32-bit integers. Weirdly, on Apple Silicon switching it to 32-bit integers reduces performance. On x64, when you switch to 32-bit integers it runs about as fast as C.
This is right but wrong. The int, uint, and uintptr types are usually 32 bits wide on 32-bit systems and 64 bits wide on 64-bit systems. Go has these int types:
int int8 int16 int32 int64
uint uint8 uint16 uint32 uint64 uintptr
byte // alias for uint8
It is automatically based on the target architecture. It also has specific types like uint32 when you care about size.
I knew there was no way Go is almost the same speed as Node. Go is going to get you speeds similar to other compiled languages like C and Rust if your application is not memory heavy.
32 or 64 bits integers have the same speed until you have to transfer a huge amount of them to/from RAM
Go programs also tend to consume a second CPU cleaning up all the garbage the language lets programmers create.
When talking about all the overhead for the stuff outside the loop, Casey and Prime mentioned srand() initialization and rand() and fetching CLAs, but neither one got to the elephant in the room: printf() in C is _notoriously_ slow, enough that I wouldn't be surprised if the time for that one printf() really did equal the time for several iterations of the outer loop.
Also, that bit Prime noticed about how the %u doesn't actually have to force a loop in the final compile? I'm not entirely certain, but given the example Casey tossed together and Prime said is exactly what he was talking about, I _think_ what I noticed when I was trying to figure out what he meant is even worse, because you could replace the entire outer loop with, like, a couple of multiplications and adds (EDIT: I meant a couple of multiplications and an add and one more run of an inner loop. So, 100,001 loops instead of 1,000,000,000 loops, with an add in that last loop.), even with the run-time user input and random number. (Apologies if that actually _is_ what he meant - I'm not entirely certain I understood that part - but I got the impression they were talking about optimizing the inner loop.)
If you look at all the code that the original author submitted, it is all written by someone who is very new to programming. They didn't write anything in any of the original languages' code that made sense
25:30 Not only does the sum of j%u always produce the same number, I am pretty sure modulo as used by most languages (if not all of them?) is distributive so you can effectively compute the sum of j and then just use modulo u once, completely solving the inner loop in constant time.
Yes, but you would have to use 64-bit integer for accumulation.
Maybe the j%u const replacement doesn't work because you add it to the array value, which is an int, for which the overflow is undefined behaviour (at least in C++ in prior versions, maybe here too), and it isn't allowed to optimize because it doesn't know a prio the result of a possibly occuring overflow.
Wait no, j%u can't overflow the int I think.
Modulo is not distributive though. 1 % 3 + 2 % 3 != (1 + 2) % 3.
@@JuanMamian depends on how you define it… in purely mathematical sense (1 mod 3)+(2 mod 3) = 0, obviously that’s not how C defines it, but some other languages do…
Great video. Very informative. Learned a lot!
Very interesting discussion. Regarding "comparing languages on real workloads", for example HTTP servers: there is a channel (Anton Putra) that does that, e.g. comparing Go to Rust. He usually uses frameworks (understandably) but does compare decent server performance metrics. All in all it's insightful, but again it should be taken with care due to the complexity of framing and measuring.
Agree. Comparing languages on a lightweight load won't show the difference between the languages, because all they're doing is a lightweight load. It's like comparing an olympiad math person vs a common person on a lightweight question such as 1 + 1 = ...
Look up "techempower benchmarks", they've done this for years, for many many languages, on different web serving tasks.
Casey is right about Lua: I don't know about vanilla Lua, but LuaJIT is indeed slower at the outset. I always run my Lua benchmarks in waves because of that. The first wave or two are almost always slower; it takes a bit for LuaJIT to gather information, do its magic, and actually start accelerating the code.
Thanks for the discussion of this...I love when you guys chat together, I learn a whole lot. I understand the point that Casey makes about this particular benchmark, but there's many reasons why we use AOT languages like C instead of interpreted & garbage-collected languages like Python. To say that "Language Performance Comparisons are Junk" seems to ignore the performance tradeoffs that some languages make. This is a very interesting, applicable, and important area of study. As Casey has demonstrated here and many times before, performance has lots to do with how you write code. But it does also have to do with the mechanics of the language and its build/execution strategy. The experience of using different languages is different not just because of semantics, but also because they make performance tradeoffs in the way they build and execute a program.
I get it...don't plant our flag in this particular benchmark. But is it really such a stretch to want something like this to look at? I would've loved if an alternative was proposed (or even executed) and presented. Yes, there's some issues with this one...but if you can make the same computation run faster in Lua / Python than in C, using the same semantics, I would love to see that demonstrated as a follow-up.
We just did a Rust vs Elixir comparison. We ended up finding Elixir slightly faster, but it took a lot of work to optimize for the Erlang VM architecture. Given the huge value of OTP we would prefer Elixir even if it were somewhat slower. The Rust version took four months and the Elixir one month.
My question for this 'benchmark' initially was "why force all languages to follow a model that might not fit the language in question?" Elixir's going to be recursing along, and is known for (at least the beam is) easy parallelism and concurrency. Why not try that?
Couple minutes later:
Name ips average deviation median 99th %
HandWritten 4.64 215.44 ms ±2.40% 215.21 ms 234.31 ms
FlowBased 4.56 219.53 ms ±2.48% 218.48 ms 238.11 ms
Just tweaking outer loop to chunk based on # of cores, and we get the fastest time on that chart.
Just shows exactly what they were saying in the video.
Can compare, but need to be comparing strength vs strength, not some mediocre middle ground.
The Elixir parallel code is like 6 lines longer, and doesn't add almost any complexity, so I feel a pretty fair comparison. In fact, if you use Flow, only adds an extra line of actual code, and simplifies the "outer_loop" code.
I'm sure you made the right choice for your use case, especially with the wins on dev time. And, if done right, you'll really not be losing too much on performance (though you knew that already).
Without showing the code your comparison is meaningless, and without showing how high the load used in the test was, the real difference under a production workload isn't shown. A common mistake Rust newbies make is comparing a debug build instead of an optimized release build, and doing lots of clones because they don't know how to borrow or share ownership. Rust can also be further optimized with settings in Cargo.toml, and I bet you didn't do that.
@ In fact I did use a release build. The Rust code is about 20k lines solving a real problem, not a benchmark. I have no need to convince anyone, since every problem stresses languages and runtimes differently. My only point is that anyone deciding between languages should do a real test that addresses the main risk areas of their problem. It is worth spending a month or two to get that right for a specific problem.
@michaellatta Yeah, it could be that you use clone all over the place, which reallocates everywhere, because it's the easiest way to get rid of the "value moved" error rather than borrowing or sharing ownership. Idiomatic Rust is far more performant than idiomatic Elixir: no garbage collector, and rich semantic information for the compiler to optimize with.
@ I certainly expected rust to be faster. I do some clone of small values when needed for things like dictionary keys, no clones of large structs. Given my use case is managing a large number of maps built from a much larger set of json data it is possible the elixir map implementation (presumably in c) is dominating. I have multiple smaller benchmarks built to ease into the comparison to see if we could eliminate elixir right away. So, I have a pretty good idea about what is going on. While I would not claim to be a rust expert, I have worked commercially in 17 programming languages including c and assembly, and have a good understanding of architecture impact on execution. As in all performance discussions the only way to know anything is to measure it. Intuition is too often wrong. We are satisfied (for now) that elixir will meet our needs performance wise, and is FAR better from a programmer productivity and distributed computing point of view. But, if we have issues down the road, we will revisit the decision.
I didn't know how slow % was. I just managed to make a function I wrote a few days ago twice as fast by avoiding it.
If the modulus is a power of 2, the compiler will optimize `x % 2**n` into a bitwise AND with `2**n - 1`.
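In C terms, the identity being referred to looks like this (a sketch for unsigned values and a power-of-two divisor, assuming n < 32):

#include <stdint.h>
#include <assert.h>

/* The rewrite compilers do when the divisor is a known power of two:
   for unsigned x and n < 32, x % (1u << n) == x & ((1u << n) - 1). */
uint32_t mod_pow2(uint32_t x, unsigned n)
{
    uint32_t mask = (1u << n) - 1u;       /* 2^n - 1 */
    assert((x & mask) == x % (1u << n));  /* both forms agree */
    return x & mask;
}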
Based on the discussion at the end, a thing I'd personally rather see than attempts to compare across languages is:
a list of common programming patterns (loops, whatever) written in various languages, with "under these conditions they should take about x time to run". That way, if you write code that can't be vectorized but should be, you can notice "oh crap, my version is 3x slower, what's going on?"
Because workloads vary this is still wildly imperfect, but I think it'd be wildly more useful to various developers than attempting to do some weird "my language is better than yours" garbage.
I think where it falls down is: what do you define as the performance of a language? Can you actually measure the performance of a language, or do you measure the performance of the runtime, or the compiler - more precisely, the compiler's ability to produce efficient machine code? Also, are you measuring just the process that is executing, or the whole operating system and everything it is doing: anti-virus kicking in at some point, "disk clean-up", or any other "idle" process, since not much is happening on the other 23 cores of the CPU? Or the CPU throttling because a previous test built up heat that has not yet dissipated? And that's assuming these tests were all performed on the exact same system under the exact same environmental conditions, because ambient temperature influences the temperature of the CPU, which can influence the performance of the CPU. In fact, I would go as far as to say there are far too many "moving" parts to ever measure this accurately.
what's the difference between the compiler and the language when the end result is what the compiler produces?
the language is just your steering wheel of the compiler at all times anyway
@@harier64 Take C, you have GCC, Clang, and another host of different compilers. The language is separate from the compiler. Which is my point. You are testing the capabilities of the compiler, not the language. There's even differences between versions of compilers, even if the language doesn't change, the output of the compiler of version 1 and the output of compiler version 2 can be different, even for the exact same code.
"The performance of a language" doesn't make any sense as you point out, the only things you can measure are its implementation(s) so compilers/interpreters.
When it comes to external factors then yea it can be really hard to measure and be sure that your results are not because of some factors you didn't control for. As an example the position of your code in memory can have a big effect so if you add/remove unused code it could shift things around and make your program faster/slower. I saw a paper a while ago and the authors made a framework to do benchmarks while trying to take into account a lot of these external factors. Some of the factors I remember were restarting the computer before the benchmark, having the network card disabled, disabling some daemons/services, running the program multiple times but with different stack positions/alignments, shorter/longer environment variables, different working directories.
I don't think it's impossible to properly measure because of too many moving parts, it's just hard but at the end of the day the more variable you take into account the better. Most of these factors are relatively small so you don't need to control for all of them to have a realistic picture of what's actually going on. There is also the fact that some of these factors are independent from each others and will "cancel out" their effects so in average you'll get a similar performance regardless of the environment (not talking about different hardware here obviously)
Even a longer username can slow down a program because it makes the PATH environment variable longer which again can shift the address space of the program and make it less cache efficient. And some CPUs may not be (yet) supported well in the operating system or lack a microcode update and so on. There are simply too many variables for a single benchmark to ever be relevant by itself.
The language does make the compiler output different: if the language can provide richer semantic information to the compiler, the compiler can do more optimization (provided the compiler is actually able to make use of that information).
I think it's interesting just to know what overhead comes with a language, such as what memory tracking like reference counting or garbage collection is enforced and can't be circumvented, and what kind of memory access is mediated. I wouldn't bother comparing interpreted languages to compiled languages, but these smaller performance tests are pretty useful for comparing interpreted languages, since virtual machines are not all equal, whereas languages sitting on LLVM are fairly equal.
I think they could've optimized the division (remainder/modulo) by choosing a random power of two. Then the modulo operation would look like "j & ((1 << n) - 1)" instead of an actual divide.
With something this simple, compilers could remove the loop, as they did in earlier versions of this "benchmark".
@@thedeemon Ok, the 'n' (binary exponent) is not known at compile time, so the compiler can't inject the result and eliminate all computation. Yes, given what I've shown it can unroll the loop into SIMD; I think this is a legitimate optimization and programming language benchmarks should take it into account.
In C you can just use an empty asm block with the loop value as input, in Rust you can use std::hint::black_box, lots of languages have optimization barriers that you can use to inhibit code removal
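A sketch of the empty-asm-block trick in C, assuming GCC/Clang extended asm syntax and u != 0 (Rust's std::hint::black_box plays the same role):

/* An empty asm block that claims to read 'value': the optimizer must keep
   whatever computed it, but no instructions are emitted for the block itself. */
static inline void keep(long value)
{
    __asm__ volatile("" : : "r"(value) : "memory");
}

void bench(long u)
{
    long sum = 0;
    for (long i = 0; i < 1000000000L; i++)
        sum += i % u;
    keep(sum);   /* the result is "observed", so the work can't be dropped as dead code */
}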
This discussion is so out of the world for me. Got to know about different kinds of benchmarking.
42:58 The reason divide isn't in SIMD is that those lookup tables are expensive in terms of chip real estate. Designing a CPU is largely about packing as much processing capability as possible into a limited area, and area depends directly on transistor size. So operations like trig ops or divide aren't in SIMD not because of the time per op but because of the area. There are libs that can approximate them via SIMD ops, but they can't hit the 0.5 units-in-the-last-place precision that IEEE requires, as that takes lookup tables, and SIMD isn't efficient for such lookups. Otherwise you risk making it slower than a non-vectorized version on the old FPU or CPU.
It should be compiled to target the native CPU if you want to use SIMD well.
To my recollection, Zig doesn't actually have a decent native benchmarking framework that's cross-platform. I think the most popular one is Linux specific and leans on performance counters. I actually have been working on porting nanobench from C++ to Zig just so I could have some numbers I could trust for some code I was testing, and in my experience it's about even with or slightly slower than C.
However, in some cases Zig is actually significantly slower because the language doesn't give you subclassing or interfaces, which means that you have to roll those sorts of things yourself. When Zig developers roll their own, I find they tend to gravitate towards v-table driven design instead of duck typing or interfaces, which is naturally going to be slower due to the extra indirections inherent to pointers. Plus there's also the fact that Zig is still a work in progress so naturally it will implement some things suboptimally.
Don't get me wrong, Zig is still a great language, and the up-and-comer I'm most excited about by a long shot. Just take performance numbers with an enormous grain of salt.
my man beating the compilers, lets goo!
I think the meaning of the visualization is that the rate of the bouncing == the speed of the algo.
Which would be better visualized by a bar chart (that doesn't bounce around)
Or even more simple, an ordered list of numbers (without bouncing rectangles in the background)
What a great conversation about 12 lines of code. This is why I fell in love with computer science as a kid.
Seems like the kind of thing that would make a great community project, both isolating valid comparisons and implementing them with real or real-ish data (recorded, versioned payloads, etc.) and assembling them into a feasible full-fat demo application. Shims to ignore shared or otherwise external codepaths would be an important function of course.
Ideally you'd have some sort of constructor for your torture test with some arch flexibility, and it would be able to pull from a repo of community tests, slowly getting refined, graded, replaced with better ones.
Actually pretty sure even the inner loop can be optimized away entirely. A bit of maths:
1. We can use the pigeonhole principle to remove the modulo and shorten the loop from (0 to 100k) to (0 to u).
(It's possible I have an off-by-one here somewhere)
int sum = 0;
int multiple = 100000 / u;
int switch_point = 100000 % u;
for (int i = 0; i < u; i++) {
    sum += multiple * i;
}
for (int i = 0; i < switch_point; i++) {
    sum += i;
}
2. We have now eliminated the modulus from the summations, which means we can use a math trick where a sum of integer inputs to a polynomial results in a polynomial of one degree higher. In this case 1 + 2 + 3 + 4 + ... + x = x*(x+1)/2.
Therefore:
int multiple = 100000/u;
int switch_point = 100000%u;
int sum = switch_point*(switch_point-1)/2 + multiple*u*(u-1)/2;
Guess I'll still have to know how my computer works, and how to write efficient algorithms. Which is exciting!
I'm a big fan of kostya/benchmarks on Github, because they dive into the CPU, Memory, Energy in Joules, each with ± variation in time, and have different scenarios e.g. brainf***, mandelbrot, Base64, JSON encoding etc. Still need a grain of salt on hand of course. But every time I see speed or whatever benchmarks shared on Twitter or Reddit, it's always something far less rigorous, like the infamous "TS is slower than JS" one. But in the end the nuance all gets lost in the usual "C fast python slow 👈😮" discussion (only 7:20 in so far in case my comment is preempted).
The best thing about these "benchmarks" is this video :)
I totally agree with this video.
Benchmarks in which there is only one variable - and which are produced under precisely the circumstances of intended use - are of some use. Operational Acceptance Testing (OAT) exists for the same reason. It is virtually impossible to guess the performance of a modestly complex system, or the impact of relatively small changes.
Donald Knuth used an assembler for a simplified processor architecture to compare complex algorithms, because high-level languages hide too much in terms of memory usage and process complexity, and processors can do buckets of work in the microcode.
The speed of the network, the IOPS on the disk, memory bandwidth, the processor - these have so much more impact than how fast you can do a loop.
Nested for loops are not a proxy for how fast a language is, because computational speed is not the only metric you need to take into account when selecting a tool for a job. I would love to see a comparison across multiple domains: live servers, I/O, concurrency, etc. Truly, these languages specialize in specific domains. But also, why not take into account programming ergonomics??! i.e. Rust vs. JS? It matters!
When I'm writing a backend, a few more important things to consider:
1. Developer availability. Everyone knows JS, so it may be a good idea to write it with, say, NestJS if it is unknown who will continue the work after me.
2. How to authenticate. Language ecosystems differ in what's available here, and it usually doesn't make sense to build such critical technology yourself.
3. Integrations with databases, payment systems and similar.
4. Framework availability. It is better to write code using a production-quality framework that standardizes conventions so other people can understand the code better.
Performance comes after these. What's usually required is overall performance that is good enough, and performance tends to improve anyway as implementations mature.
I was wondering what the code they did for lua looked like and it's worse than I expected.
Instead of initializing each element of the table normally by doing a[i] = 0 before the inner loop, they decided to write this:
a[i] = (a[i] or 0) + j % u
Writing a[i] = 0 or just using r as the value to initialize before the loop is about 10% to 15% faster on my machine
Or you could just register a C function for such a task...
48:22 With the out-of-order window being so huge now, unrolling by 4 or 8 should not be a big issue, as only the loop-end part should interfere.
Also, the code basically does what Prime said earlier, but for each i: instead of pulling it out before the outer loop, it just adds the sum + r to a[i] after the inner loop and stores it at line 57.
Prime & Casey Show let's go!!!!
I think the most effective optimisation would be the compiler recognizing that only the r-th element is accessed, and thus shrinking the 10k array to a single value and computing the inner loop once (i.e. doing the outer loop a single time with i = r).
regarding the stuff around 1:19 or so, doing the work of decompiling and looking at the underlying instructions to investigate loops clearly takes more expertise and willingness to do work than the guy that posted the original benchmark is capable of if you read how he replies to people. I can't tell if he's actually trying to do a thing or if he's just farming engagement, but either way I wouldn't expect him to do anything useful
Ok that was so scary as it took me through a very simple performance memory sanity test loop I wrote a couple of decades ago to exercise memory in sets of 10000 item arrays in Java . This benchmark was so similar
the modulus loop can be expressed in closed form [ since mod(a,c) + mod(b,c) = mod(a+b,c) ], so basically mod(100k * (100k+1)/2, u)?
[mod(a,c) + mod(b,c) = mod(a+b,c)] - this is untrue (you probably conflated being congruent modulo c with the operator itself or something), but some of the other comments show the correct closed form
@@disquettepoppy ok, of course... untrue by inspection as there exist a,b,c (a=b=c-1) such that mod(a,c) + mod(b,c) = 2c - 2; while mod(a+b,c) < c
Vectorization is what had me question my sanity, when C++ library functions like ranges::find ended up 4x faster than a raw loop, despite doing exactly the same. I wanted to get some numbers to show that the standard algorithms are NOT slower than writing your own raw loop to any significant degree (if at all). Seeing them not just equal, but 4x faster was a fun surprise.
So many people think they are smarter than compilers and write unnecessarily low-level code only to make compiler optimization worse.
I've also seen people write unnecessarily low-level code in high-level io-bound code "because of performance" where it makes zero difference/sense.
I do like the criticism of a very narrow, works-on-my-machine-bro benchmark. I will say the moving bars make sense to me: they show the relative slowness between the tests, which I think is pretty intuitive when shown visually, even though the downside is that the time scale is arbitrary.
27:14
Since it only reads a[r], the compiler could theoretically also optimise out the computation for all the other values of i and make the whole thing O(1), in combination with optimising the sum over j.
create cache friendly code so that the cpu will not recompute if the computation is the same, but instead directly fetch the result from the cache
I guess the size of the array is only 10000 because that's what fits on the stack (and is a nice number). For a larger number they might have had to call malloc in the C code, which would have made it (somewhat) slower.
I could easily see compilers fully optimizing away both loops. In general, the compiler could notice that the outer loop's iterations are independent and that only one result is printed, which would optimize away the outer loop. I'm not sure if a compiler is smart enough to figure out that all elements of the array are the same (and thus just compute one single element and always return that).
I would have expected the inner loop's iterations to be optimized away (maybe edge cases with overflow combined with modulus don't allow that). As overflow is undefined behavior in C++ it could certainly be optimized by mathematical rules: by the Gauss summation rule, the sum of all numbers from 0 through n-1 is (n * (n-1))/2 (and I think compilers implement that optimization in general). Furthermore, the sum of the modulos equals the modulo of the sum (here is where you might need UB because of overflow). So the inner loop reduces to a[i] += (99999*100000/2)%u. From there, some compilers might even figure out how to optimize the outer loop.
The constant Prime talks about around the 25-28 minutes of the video is larger than int32 (almost 5 billion). So the constant actually overflows, but since signed integer overflow is undefined behavior in C, the compiler should've optimized that still.
No necessarily, it depends on "u", user input. For u=2 it's like 50000.
I don't know when or how it got into my head, but I used to think that doubles/floats were generally much slower than integers at everything, plus the fact that they cannot precisely represent some integers. Now it seems that, because of their ability to do many computations in one go, doubles or floats can be better for certain math operations.
That inner loop works out to about 100k * u / 2, give or take some bits to handle odd/even u and remainder of 100k / u. Because it's actually adding [0, u) (sums up to u*(u-1)/2) for 100k / u number of times, and then adding [0, (100k % u)). And all the a's are the same, so, the whole program is really just printing out (100k / u) * (u * (u-1) / 2) + (100k % u) * (100k % u - 1) / 2 + r, and the loops could be skipped entirely. So, it's just using the fact that no compiler or language can optimize that far (maybe the semantics don't allow it (overflows, weird mod semantics, etc.), and/or the compiler just can't work out that much math).
Another optimization compilers don't do, and that would likely be worth doing in this code, is computing the constants needed for fast modulo by u at runtime, and using that instead of an idiv.
u is not known ahead of time, so it can't be done. If the divisor is known, the compiler replaces the integer division with a multiplication by a modular multiplicative inverse, where the modulus is 2^(width of the register).
@@Hexanitrobenzene But you can run the logic the compiler uses to find those constants at runtime, after u is known; it's probably faster than 100k idivs and makes each modulo cost almost nothing.
@@__-dy9gv
Oh, now I see what you mean. Still, the remainder would cost two multiplications and a subtraction each time.
I'm no longer invested in optimising this particular calculation. It's so regular that the solution can be found in closed form, it's basically a series of arithmetic progressions with an incomplete progression at the end.
The best benchmarks are based on excerpts from real world programs, like image processing.
52:15 -march=native allows the compiler to use all instructions available on the build machine; the result probably won't run on older/different CPUs, due to illegal (non-existent) instructions in the binary. By contrast, -mtune=native tunes scheduling/optimization for the machine building the code, without generating code that doesn't work on other CPUs.
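As a rough illustration, assuming GCC (Clang behaves similarly; the file and output names are placeholders):

gcc -O3 -march=native main.c -o fast_here_only    # may emit AVX2/AVX-512 etc.; can SIGILL on older CPUs
gcc -O3 -mtune=native main.c -o portable_tuned    # baseline instruction set, scheduling tuned for this CPU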
Isn't it splitting the loop into four parts because it's checking i < n, and the processor can do an addition every 0.25 cycles, so it can do four at a time and then check them all at once, running the results of those four after? IDK, but entertained.
I think microbenchmarks could make sense if they formed a suite of tests covering things like memory intensive tasks, IO intensive tasks, startup time, TCP/IP, poll/interrupt, signaling/realtime tasks, memory mapped io devices, compute intensive tasks with/without large dependency chains, etc. They would also need two versions of code for each language, an idiomatic version and an optimized version. The first should be the obvious code a knowledgeable dev would write following best practices for that language, and the second the peak micro-optimization you might do inside a key hot loop. It would be even better if relevant languages had an additional test run for embedded/kernel code without the normal runtime features.
Copilot answer: What is the definition of modulus in a computer language?
In computer programming, the modulus operator is a very handy tool. It is represented by the % symbol and is used to calculate the remainder of a division operation.
For example, if you have the expression 5 % 2, it means dividing 5 by 2, which gives a quotient of 2 and a remainder of 1. The modulus operation will return that remainder, which in this case is 1.
Here's a quick code example in Python:
a = 5
b = 2
result = a % b
print(result) # This will output 1
It's particularly useful in scenarios like determining if a number is even or odd (by checking if number % 2 is 0 or not), or when you're working with cyclic data structures like circular buffers.
That comparison is so sad.
If you get rid of the var a [10000]int32 and use a single int32 instead, performance improves about 3x, and that's even before applying any -pgo with pprof profiles.
Essentially they weren't testing loops - they were testing how the [10000]int32 array type performs in Go!
If you also get rid of the operations inside those loops you get a further doubling of performance, i.e. the loops run about 6 times faster.
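For concreteness, here's roughly the shape of the two variants, sketched in C rather than Go (u and r stand in for the benchmark's inputs; this just shows what changes, not a faithful port of the Go code, and it mirrors the benchmark's int32 arithmetic, wraparound and all):

#include <stdint.h>

/* Original shape: 10,000 accumulators in an array, each fed by the inner loop. */
static void array_version(int32_t a[10000], int32_t u, int32_t r) {
    for (int32_t i = 0; i < 10000; i++) {
        for (int32_t j = 0; j < 100000; j++)
            a[i] += j % u;
        a[i] += r;
    }
}

/* "Single int32" shape: one accumulator, no array stores in the hot loop. */
static int32_t scalar_version(int32_t u, int32_t r) {
    int32_t a = 0;
    for (int32_t i = 0; i < 10000; i++) {
        for (int32_t j = 0; j < 100000; j++)
            a += j % u;
        a += r;
    }
    return a;
}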
The only type of benchmark that makes sense to me is framework benchmarks, with the condition that the code should be written using what's considered best practice in the community/documentation... Then you can see how fast your project will work generally, without any additional effort on your part. Sure, you can optimize it further, but then the benchmark becomes useless; we just start comparing which benchmark writer is smarter :) Because you can always optimize further and further...
These two should make this a regular routine.
Benchmark Busters
@@SaHaRaSquad && -> !exception
LuaJIT and Node.js are compiled (at some point) too. Also, sometimes JIT compilers can be faster, since the JIT can analyze actual usage, which branches are taken, etc.
With C or other 'pre-compiled' languages, that kind of profile feedback (PGO) takes 3 steps: 1. generate the initial, instrumented binary, 2. run the program with metrics enabled, 3. compile again using the optimization metrics that were collected.
I forget if LuaJIT is actually just AOT or really JIT though. However, there's no reason it would be slower than Go afaik (except maybe the fact that everything is a hash table...).
God bless you Prime
I've learnt a lot from you
Just adding to the comment that no one runs massive loops as a service: in data science languages like R, the assumption is that if you're using loops, you're doing something seriously wrong. The typical use case of applying a function across every column element of a massive data frame is where the data science languages and the Python libraries for that kind of work shine.
Compute speed is an important metric though, since in the cloud it correlates directly with the compute bill. It just should be based on a relevant case and not these silly ones.
Hell yeah another long video with Casey
Great catch that the inner loop will be constant. But it made me wonder: if the compiler can't optimize away operations with unknown values, could a JIT compiler get rid of this loop after parsing the input? Is this one of those mythical cases where JIT can outperform AOT?
I don't usually compare languages, but I've done it a couple of times with a Game of Life. I made it with CLI graphics just to make sure it works properly, then turned the graphics off and just computed some number of generations. It's important that all tests start from the same board configuration - GoL is deterministic, so that keeps it a level playing field. I don't know if this is a good benchmark, but at least it's an actual program doing actual stuff, and the implementations across languages turn out to be very similar.
Amazing. I knew division is super slow and compilers despise it, but it's so eye-opening that code that looks like it would take longer can actually execute 3-4x faster. Just wow!
The things compiler devs have to wrestle with nowadays - unreal.
Zen 4 and AVX-512 have gotten so complex that it's become impossible to predict what the machine code will look like and how modern CPUs will sequence it.
"We test C by hindering all the optimizations that are available for C" :D
Lua 5.3 introduced integers, so this program as written would probably run with integer math.
The biggest problem I find with benchmarks is that they write the same code for each language rather than either idiomatic code or optimised code.
All the code should be reasonably optimised for each language individually first.
I already know about and agree with the premise of the video so I don't need any convincing. Still, Casey's here, so there goes my next hour and a half lol
Another thing to consider is that the languages at the lower end of that diagram often rely on libraries written in the languages at the top for their more computationally intensive operations.
Given the poor scores though, I tend to think they probably didn't use those libraries. It'd defeat the whole purpose of the speed test anyway if they're just using Python as a wrapper for a C function call.
@ I would agree. I'm just making the point that you're not using Python to do the kind of stuff that's being measured.
The R code is atrociously written for anyone who is actually familiar with the language. Two great things about R are that you can treat it like a functional language, and that you can outsource expensive subroutines to C implementations.
Knew this would be the case when I looked at the speed. Not to mention, if we take a step even further back, I wouldn't be surprised if a loop isn't even required for whatever task their loop implementation is doing.
So the R code is bad, the PHP code is unnecessarily slow, the C code prevents compiler optimizations, the Julia code is much slower than it could be...
Or in other words that benchmark is so bad the only way to make it worse is to literally just fill a table with random numbers.
You can use R for ages without ever having to write a for loop.
@@Houshalter I've written an entire R package without a single for-loop in it
Wow, that is so bad... The most idiomatic way to write that in R would probably be to preallocate the entire "j" vector, then calculate the entire "a" vector using a map, with the inner function doing a vectorized modulo of "j" by "u" followed by a sum and adding "r"... (assuming that at this point it isn't painfully obvious that you can make the optimization they're talking about and just precompute "sum(j %% u)").
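That precompute trick carries over to any of the languages here. A minimal sketch in C (not R) of hoisting the invariant inner sum, assuming the usual 10,000 x 100,000 benchmark shape with divisor u and random offset r:

#include <stdint.h>

static void hoisted_version(int32_t a[10000], int32_t u, int32_t r) {
    /* The inner sum doesn't depend on i, so compute it once... */
    int64_t inner = 0;
    for (int32_t j = 0; j < 100000; j++)
        inner += j % u;
    /* ...and reuse it for every element instead of re-running it 10,000 times. */
    for (int32_t i = 0; i < 10000; i++)
        a[i] = (int32_t)inner + r;
}

(Accumulating in 64 bits sidesteps the int32 wraparound the original would hit, so treat this as the idea rather than a bit-exact replacement.)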
1) Theirs
2) "a
The 4x unroll is maybe already more than enough. IDIV is about 10 µops and can only start one every 6 cycles, so it's using an average of about 1.7 ports per clock cycle. There are a few other µops in there, but that 6-cycle throughput will still be the limiting factor with up to 20 other instructions in each loop, so unrolling is probably not helping much in this case.
Amazing, thank you. Really well explained also.
Totally agree with the video. Just a minor note at 13:20: what you mention about loops (inc, cmp, jmp) is correct for compiled languages, but for interpreters it's not necessarily true. Also, in the specific case of NESTED loops there is a section (usually before the inner loop starts) that resets the inner loop; outside of nested loops that section is irrelevant, but in nested loops (and especially in languages like Lua or Python) it can become a lot more relevant, so measuring (for example) empty nested loops can help the interpreter's developer reduce the loop overhead (by using the right data types etc.). In interpreters, loop overhead matters quite a bit, given that most computation happens in loops, so we want loops that are as lightweight as possible - but again, that's for the interpreter's developer; for the language user you are absolutely correct and those tests are pretty useless. [edit] Not a biggie, Casey mentions exactly this later on, so I commented too quickly I guess. [/edit]
I would be very surprised if, for the compiled languages, we'd see any differences in the generated assembly for these loops in Compiler Explorer.
As someone who works with compilers (specifically ones built on LLVM as a backend) as my day job, often looking at "the code generated by the compiler, and how can we make that better for ...", this was obviously fairly beginner-level, but very good, informative stuff.
One of my hobby projects is doing prime number calculations. It spends over 80% of the time doing integer divide, exactly because integer divide is such a slow operation.
I agree, these benchmarks are nonsense.
If we want to compare compilers, let's write something useful. For my day job we use the SPEC benchmark suite. It's definitely not perfect, but it's a much better option than comparing "noddy loops". In my personal project, which is a Pascal compiler, I have a Sudoku solver that I run on my compiler and on the FreePascal compiler to compare the two for performance. That's a small enough project - but implementation details matter quite a bit [like what method you use for figuring out which values are and aren't used in a block, row or column], and there are some very clever tricks the compiler can do to make a big difference in something like that.
I will compare my Pascal compiler against the C version of this benchmark and against my Pascal version. I expect very similar results, but who knows... :) Be back in a bit, need to feed myself before it gets too late... :)
So, using my LLVM-based Pascal compiler on my machine: 1.22s; flang (the LLVM-based Fortran compiler): 1.19s; clang (the C code): 1.22s. Using gcc for the C code gives very similar results, and Free Pascal (the commonly used Pascal compiler) is a fraction slower at 1.38s - from what I can tell, because it extends some of the integer values to 64-bit for no particularly meaningful reason.
This is on an x86-64 AMD Ryzen machine, using -O3 for optimisation. Also, unrolling the loop won't help in this case - the divide takes so much longer than all the other work inside the loop. The Fortran compiler seems to unroll more - like 32 steps. I think that's the main difference in performance.
The benchmark does also measure the time it takes to convert the command-line argument to an integer and to print the result, but those are literally zero percent of the time. We're doing 1 billion integer divides, and those take 1.2 seconds (on my machine). In the unrolled code there are a total of 6 instructions, and they add up to 99.9% of the total time.
(My MacBook's M1 processor is significantly faster at integer divide/modulo than my x86 machine - I have not investigated why that is.)
The problem with benchmarks: either the code gets optimized to fit the benchmark, or the benchmark tests the wrong thing and is useless. Usually it's a combination of the two, mixed with the fact that, given the code in the different languages might have been written by authors with different levels of skill, you can't really compare them anyway.
About the relative performance of AVX2 and AVX-512: you need to know the microarchitecture of the chosen CPU really well. You can have a single AVX-512 ALU and two AVX2 ALUs, so AVX2 code could run only slightly slower than AVX-512. You always need to know the configuration of your compute ports/ALUs to make any performance predictions.
My first reaction to this was just "zero chance this is a valid comparison", and the more I think about it, the worse I think it is. Here are a few from my list:
1. There is a big difference between compiled and interpreted languages. C is fast because the compiler will look at the code and go "hey, you are doing this thing, let me chop off some unneeded calls or use the fastest method for it", whereas Python, if you toss it something on the fly, has to work it out and then spit out the answer. So if you are saying compiled languages are by and large faster than interpreted languages, then the answer is "duh".
2. Even with an interpreted language there are ways to make things run fast. I went and looked at the code that was run for this comparison and got Python to run faster than most languages, and with less code, by using NumPy instead of the original for-loop approach. With NumPy plus standard Python it came in only 2ms slower than C on my machine.
3. The test itself is technically invalid too, because it starts off with a random number generator. I'd assume this is to stop compiled languages from cheating and pre-computing the result with compiler tricks, but it does cause serious issues with how valid the comparison is. If you generate a number that has 5 zeros in one language and 15 in another, you aren't comparing apples with apples.
At minimum they really should have taken two args from the CLI, made a list of, say, 1000 pairs, and run each language against each pair. That would be more fair, balance out any strange edge cases a language might have on specific pairs, and still be reproducible, yeah.
@@TurtleKwitty The video itself had a good explanation of why the test was invalid regardless
1. With JIT-compilation the border between compiled and interpreted languages becomes very fuzzy. Is Java compiled or interpreted? Kinda both. With "cling", C++ is an interpreted language. Etc.
2. Numpy is cheating, as is any delegation of benchmarked work to a C library.
3. That random number doesn't really affect anything at all.
@ To be fair, NumPy isn't cheating, because that's what a person doing that kind of workload in Python would use. It isn't all C either; there is a lot of native Python in there, and there's also Cython compilation, which is a valid option for Python devs.
Yes, the random number matters a lot, because it introduces variance into the environment. I ran the tests a few times and Python was a good example of variance with it. When I hard-coded the number the variance went down; before porting to NumPy you could literally see seconds of difference between runs.
@@ShaneFagan Interesting, can you explain how exactly that random number affects it?
If I had to guess, the `a[i] += r;` line in the C code is there to make sure the compiler doesn't vectorize the inner loop, as is the int32_t data type. Basically, the guy was trying for readable assembly output in Godbolt, which is a weird thing to optimize for. I bet the JS guy didn't do that.
Anyways, GCC has a Fortran compiler (gfortran). IDK if that's where the Fortran team went or what, but if a language has its own compiler, there's a pretty good chance it's in GCC.
Wrong guesses. You can see in the video that that write doesn't prevent vectorization, and the original poster didn't care about assembly output at all, if you look at his Twitter posts.
The moment they had to trick the compiler should have been the moment they realized that their methodology is fundamentally flawed.
Great video. This is good stuff.
Very surprised clang isn't using the multiplicative-inverse trick for the modulo - you can get a modulo with 2 multiplies, which would be more efficient in this case since those are just 2 ops that can fully pipeline, and I thought LLVM-based compilers would default to that if the modulus is known before entering the loop, so the inverse could be calculated once and used 10000000 times...
Not that I would use that anyway; in a case like this you'd just have a second variable that is incremented and reset each time it reaches the limit.
How well does this work with "u" only known at run time?
@@thedeemon The math to calculate the multiplication constant is just a single DIV, plus adding 1 if the remainder is nonzero. I think doing that once outside the loop is easily more efficient than doing 1000000000 DIVs in the inner loop. The modulus then takes 2 multiplies, so it has about 3x the throughput that's possible with DIV.
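And for the even simpler approach mentioned a couple of comments up, here's a minimal C sketch of tracking j % u with a counter that wraps instead of dividing at all (the function name and loop bound are just illustrative, mirroring the benchmark's inner loop; assumes u > 0):

#include <stdint.h>

static int64_t inner_sum(int32_t u) {
    int64_t sum = 0;
    int32_t m = 0;                    /* tracks j % u without any division */
    for (int32_t j = 0; j < 100000; j++) {
        sum += m;
        if (++m == u) m = 0;          /* wrap instead of computing a remainder */
    }
    return sum;
}

No DIV and no magic constants - just a compare and a conditional reset per iteration.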
I would expect most languages compiled with LLVM to perform nearly identically in simple cases like this.
I want to know what Prime's dog was doing with a quarter of a deer carcass.