You remind me of the algorithm i designed to check an array of positive integers to return true for each value divisible by 3. My Algorithm: Return false It is fast, it doesn't contain unpredictable failure, and it is correct most of the time given the set of all positive integers.
Dark background and blue color aren't a good combination of colors for readable text. Try white or yellow next time or something like that. Anyway great comparison and great job on the optimizations!
9:35 don't divide by 2, multiply by 0.5. The compiler is not allowed to do so, as it will sometimes produce different bitwise results. ... Unless you have fast unsafe math enabled, which will also allow for FMA reordering, which is a pretty significant optimization... Also, much of the runtime is actually float division. It's not pipelined and you can only start a division after another has finished, which is 14 clock cycles at best, which is _extremely_ relevant when you can replace a FP division by using "RCP_PS" and do some Newton-Raphson, which has _WAY_ higher throughput than FP divison (relevant for tight loops such as this one), and can be FMA fused when multiplying by the reciprocal instead. Here's the code for an accurate reciprocal (only _possibly_ worth it if your CPU supports FMA; total latency (fastest possible): 4 (rcp) + 2 * 4 (FMA) = 12 cycles vs 11 cycles (fastest possible division), but way higher throughput and FMA "fusable", meaning another - up to - 3 clock cycles saved): __m128 rcp23_ps(__m128 a) { const __m128 ONE = _mm_set1_ps(1f); __m128 r = _mm_rcp_ps(a); return _mm_fnmadd_ps(_mm_fmsub_ps(r, a, ONE), r, r); }
I was looking for this one. I also want to add: even if fast unsafe math is enabled, on some compilers (vcc: /fp:fast) you should also enable optimizations for speed to rewrite constants as needed (done when using /O2 or /Ot but NOT /Ox). /Ot is default. /O2 does some more afaik. (no indepth knowledge use at your own risk. Please corret me if im wrong) P.S.: AFAIK in gcc constant rewriting is auto enabled when using fast unsafe math (-ffast-math) and optimization levels do not affect constants, -ffasm-math takes precedence?
Dividing by powers of two can and will actually be replaced by multiplying with the inverse by the compiler even without fast-math flags, since the result will be exactly the same. One could replace the divisions by a with a multiplications by (1/a) -- this is a thing the compiler won't do in exact-math mode. In fast-math mode, it may even use the fast rcpps (reciprocal packed single) instruction with a refinement step to do this faster.
I’m guessing the new version might actually use less power! They should be practically the same speed due to ILP, but the the new version requires less calculations
The combo of a branch + floating point, may have resulted in the compiler not realizing it was redundant. And having a branch that also probably didn't predict very well (its based soley on incoming data) it may have done major performance damage. Would be cool to see the pattern of which way the branch goes, given its based solely on the input. Wonder if "sorting" the input to make it predictable would allow removal of the divide (by the CPU's branch predictor), for some of the solve attempts and thus, maybe go even faster for those cases. @Creel looking at the branch predictability could be an interesting video.
26:24 it’s worth noting that because you’re dividing my a constant power of two for the third div in the new version, all modern compilers will optimize that to a single mul instruction. I’m almost certain that -x/y will always become an xor and mul when y is a power of two Edit: I don’t want to litter the comment section too much, so I’ll just put this here: awesome video, man! Some of my work involves optimization, and I absolutely love when other people make this knowledge super accessible!
I think you're right about the division being swapped for a mul! I can't remember what I did in the ASM, but hopefully a compiler would figure that one out... Cheers for watching mate, have a good one :)
An actual implementation would distinguish 4 main cases, depending on if 'a' is zero or not, and on if 'b' is zero on not: - 1. When (a=b=0), if (c=0) there's a single root (r1=r2=0), otherwise there's no solution (r1=r2=NaN). - 2. Otherwise when only (a=0), there's no solution if (b=0), otherwise a single solution (r1=r2=-c/b). - 3. Otherwise when only (b=0), there are two solutions of opposite signs (r2=-(r1=√(c/a)). - 4. Otherwise, we have the two quadratic roots, possibly complex depending on the sign of the determinant (b²-4ac). For the last 4th case, you need **numeric stability** (due to limited precision for "float", "double" or "long double" types used by FPU) and the two quadratic solutions should be given by (r{1,2}=[√(1-4ac/b²)±1]*b/2a), where the square root may be imaginary if the "normalized" determinant (1-4ac/b²) is negative, i.e. when the usual determinant (b²-4ac) is negative (both conditions are equivalent because b≠0 in that 4th case). This gives a total of 7 cases, where conditionals (if statements) are justified. But the test on if b=1 in this video is just useless and stupid (it tries to avoid a trivial division by 1 but at the cost of a branch, which disables the optimization or parallelisation by the compiler: divisions by 1 are trivial also in FPUs, may just cost 1 cycle, LESS than the cost of a branch).
(-x/y) cannot be "optimized" with a xor and mul: we're working here with floatting point values, not integers! The negation of floatting points just flips & single bit, the sign bit, it does not perform any "xor" on the whole value (the exponent or the mantissa part are not modified), and this can be done before the division on any operand x or y, or on the result; but in fact the result sign is NOT[sign(x)AND sign(y)], and the absolute quotient is the separate division of abs(x)/abs(y). To compute that (positiive) dividend of these two positive floatting points, - the FPU separately computes the quotient of the two mantissa, which are in (0.5, 1.0], so that non-normalized quotient is in (0.5, 2], and the difference of exponent parts; - finally it renormalizes the quotient into (0.5, 1.0] and accordingly adjusts the difference of exponent parts by conditionally substracting 1. The final result may become a denormal (or zero) floatting point value, or an "infinite" floatting point value, depending if the final exponent part results in an underflow or overflow (for its allowed range), but it will be a "NaN" (or would raise a "'division-by-zero" exception) if both mantissa parts (including the implicit most significant bit if it's not stored in the floatting point format) are zero. In all cases, the result sign computed separately is blindly applied as is (possibly giving "positive or negative zero", "'positive or negative NaN", or "positive or negative infinite" values; some libraries may or may not discard the sign distinction for zeroes and NaNs, but with some extra cost using conditional branches)
@@PhilippeVerdy, you probably should have checked godbolt before that whole explanation. I just confirmed that on some compilers it becomes a mul and xor, on others it just becomes a mul by -0.5. Remember, in this case we are dividing by a positive constant power of two and negating. This means the compiler can take shortcuts.
@@pyromen321 if you are dividing à float by a negative power of two, you don't need a mul, just z decrementation of the exposant part, and flipping the sign bit, this tales two instructions, worse than a single FPU instruction performing a single multiplication by constant like -0.5 (consider data path on thd same word containing thd sign bit and the exponent, plus you need to test underflows when decrementing the exponent, or you'll wrong results !
just so everybody knows: the OG quadratic formula is the closed form of the Po-Shen Lo algorithm. the PSL algorithm is simpler to understand and perform by a mere human, but I would not expect optimized code for the two variants to differ al to much.
If the least significant bit of precision can be sacrificed in the new method, the two divisions by a can be replaced by one reciprocal (i.e. 1.0/a) and two multiplications with that reciprocal. Thus one division can be traded for two multiplications, which may be faster.
Honestly, that's the real solution here. How much accuracy are you allowed to sacrifice in order to speed things up. I'm taking accounting courses right now and there's a bunch of shortcuts that are taken for similar reasons. The result is somewhat different from what it would be with proper math, but it's usually a small amount and you're rarely dealing with more than 4 decimal places when dealing with currency anways. Plus, since people are using one of two sets of rules regarding how these things are don, there usually isn't much trouble.
Accuracy is also why std:complex is usually slow. The C++ standard library has to work in all cases and for complex numbers not sacraficing precidion means a lot slower code. You also cant use a fast-math optimised version because the standard library is precompiled (altough you can work around this by compiling a glibc with fast-math and using that with static linking it would probably still preform worse than hand-written speed-for-accuracy optimised code). I would personally not recommend std:complex in any case unless you are sure you need super-duper accurate answers but than you would probably go with an arbitrary precidion library wich has complex numbers.
Isn't the step B = b/a and then mid = B/2 the real waste here? You only need mid. Also you can use a trick used for the quadratic, where r2 is used for the discriminant, r1 = (-b + r2)/2a and r2 = (-b - r2)/2a. For the Po Shen Lo it works even better: mid = b/2a c = c/a r2 = sqrt(mid*mid - c) r1 = mid - r2 r2 = mid + r2 the last step I imagine can be faster as it's adding mid to itself ( r2 += mid?) This should already always beat the quadratic. Would using reciprocals be even faster?
CPU's & Compilers can vectorize Array of Structures as well, the difference between this and SOA is more about optimizing cache access patterns for functions that dont use every element in the structure. You should double check your compiler optimization settings if you are getting wildly differing results from a structure this small.
After a little bit of tinkering, I got the Po Shen method down to: 2 Float subtractions 2 Transfers (a := b) 1 Integers subtraction (from a float to reduce the exponent by 1 aka dividing by 2) 1 XOR (to multiply by -1) 1 Float addition 1 Square root It also only uses 3 Registers. It goes as follows (from x² + Ax + B; A, B, C are floats; p is the number of significand bits of the type of float, l is the size of the total number for example 32, 64): 1. Subtract from A 2^p 2. XOR A with 2^l 3. Set C to A 4. Multiply C by A 5. Subtract from C B 6. Take the square root of C 7. Set B to C 8. Add to B A 9. Subtract from A C
As a tip: at 3:45 the contrast with the blue text is so low that I can’t read it. So I’m following along with your commentary by imagining the route in my head rather than reading along with you! EDIT: ok so this happens a lot in this video.
I felt that you haven't watch Po-Shen Loh's video at its full, because he explicitly mentions this is NOT his own equation. It's his way to interpret the logic in solving quadratic equations. He even mentions who originally had this idea.
For the CPU/FPU code, if the bottleneck is reading from RAM, you might try putting in some prefetch loads of the next set of data (particularly for your ASM code). It may not make much difference, though, given the loads are cache line at a time. I remember way back in the day we were looking at the Fastest Fourier Transform in the West (FFTW) on the UltraSparc and telling the C compiler to heavily optimize it (including using static profile guided optimization) and the ASM it generated was pretty slick, including prefetch loading so the data for the next loop was in L1 when the next loop iteration needed it. If the compiler you used didn't use PGO to optimize but has the option, it might be interesting to see if it helps. Also, the dark blue text used for Po-Shen Loh code is pretty hard (for at least me) to read. The red was OK even if a little dark.
Manual prefetching will be slower because there are now additional instructions to be executed. He is just iterating over arrays, the hardware prefetcher likes that pattern.
When doing computations not only performance is important, but also the accuracy of the results. Quadratic equations have two cases which need to be treated carefully from a numerical point of view, or you might suffer a loss of accuracy: (1) where -b ~ sqrt(b^2 - 4ac) and (2) where b^2 ~ 4ac. In both cases you basically subtract two almost equal numbers. In combination with the finite precision of floats the siginificant digits of the result are reduced. Both algorithms (as shown here) suffer from these problems. Obviously, for numerically stable code these cases should be handled appropriately.
I do remember this coming up in Numerical Analysis class. Avoid subtracting two numbers if the result might be close to zero! College may have been a long time ago for me but that one guideline has stuck with me.
ปีที่แล้ว
Totally underrated comment. This is overlooked so often.
So, you take (A ± B)/C expression, separate it into A/C ± B/C and call it a new revolutionary formula? Seriously? The whole comparison is about how smart is the optimizer in your compiler and how efficiently or inefficiently can you put the two into source code. But in the end of the day, it's exactly the same formula. Programmers go to far greater lengths in deviating from formulae optimized for viewing to formulae optimized for execution when it's really important.
4:31 skip the if. Just always divide. If the coefficients are random, the probability of a being equal to 1 are very slim. Also it might be worthwhile to multiply by the precomputed inverse of a.
I was just thinking what you put down here. The if statement is actually probably his biggest adder of time here! AND usually the vectorization wants to have pure math, nothing with conditionals
When I first saw Po-Shen Loh's method, I thought it looked like he was solving it by substitution, just as Tartaglia and Cardano solved cubics with at first before generalizing the formula. Po-Shen Loh's way is cool and nifty even though it's slower than the regular quadratic formula. Plus, there are situations where we'd get solutions in human-friendly form faster using Po-Shen Loh's method than we would with the regular quadratic formula.
Super interesting video! At the end, you do mention how multiplications versus divisions may affect running time, and list that Po-Shen Loh has 3 divisions. However, one of the divisions from Po-Shen Loh is a simple division by 2, which is effectively a multiplication by 0.5. With that said, Po-Shen Loh would have the same addition, 1 less subtraction, 3 less multiplication, the same division, and the same square root! Making it strictly better in terms of operation counts! I'd be curious to understand what you think.
I found it very hard to count the operations. The negatives can become subtractions, the divisions can be invterted (as you mentioned), and the FMADD instructions can perform add/sub along with a multiplication. My operation counts were a little hit and miss! I think you’re right :)
Really interesting. I was really hoping Po-Shen Loh was going to be faster across the board :P But the results and the overall lesson of the video was still very interesting. It would be cool to see a video where you only talk about the Op Times table you showed at 26:36 and maybe showing other examples of algorithms and how they stack up in terms of Op Times. On top of that, I also found myself wondering a little bit about "compiler vectorization". Idk if you have any video's on that subject, but I am interested in learning more about compiler vectorization. Especially including parts where you talked about "Structure of Arrays" and how the compiler (I guess) can more easily take advantage of simd instructions.
Worked on old 32-bit mainframes and we used to pour over our code for hours to hand-optimize the FORTRAN. Looking for division operations that could be replaced with multiplication by an inverse. And the bane of our work were pesky 'implicit type converions' where someone would do something like * instead of *. Yeah, those old compilers weren't very good at optimization so we had to do it ourselves. Love your channel and loved this dive into some interesting bit of math.
CPU latency of modern computers are roughly: fadd 4 (as well as fsub) fmul 4 fdiv 11-25 fsqrt 35 At the very least, this puts the original at between 97 and 125 latency, while the new formula takes between 88 and 130 latency. This means on natural programming it is both slower and faster depending on division requirements. The issue I have with your assessment at the end is you say that the new formula takes 3 divisions, it actually takes 2 to normalize the equation off the first coefficient. You are dividing by 2.0f, this is wasting 11-25 cpu latency on chip where you could simply multiply by 0.5f which is 4 latency. Even worse is the fact that floats are in powers of 2, so dividing by 2 is just decrementing the exponent, which is a 1 latency to perform an integer decrement prior to loading the float into register for use (even before normalizing division in the fpu). Some compilers would have noticed the 2.0f division and possibly switched to a fmul on 0.5f for speed, but it wouldn't have figured out that you could do a dec prior to any of it, which means that you didn't actually test the right code. If done correctly, the new method should have used 1xDEC (1), 2xFDIV (22-50), 1xFMUL (4), 1xFSQRT (35), and 3xFADD/FSUB (12), which comes to 74-102, faster than the original by 21%. Also, complex numbers are automatically formatted correctly with the new method, the old method has a division after sqrt, which takes longer to simplify (squaring it, dividing the sqrt answer by it to get the complex part formatted). You can make the code a little bit faster by taking the reciprocal of the leading coefficient (1xFDIV) and then multiplying it to the other coefficients to speed up the process further, but both can make use of that trick....despite that, the new method becomes almost 23% faster. The real bottleneck comes down to the square root function, it is the most expensive part in both scenarios. If we could get that down to 15 latency, the new method would be 28% faster than the old.
Love all the content!!! Reminds me of my random coding adventures! And very inspiring!!! Very happy that you are active again! I am struggling to get myself back up and going.., would love to see a continuation of your GraceAI neural network series!! ❤
Instead of: b[i] /= a[i]; c[i] /= a[i]; do: float a_inverse = 1.0 / a[i]; b[i] *= a_inverse; c[i] *= a_inverse; This will remove one division and replacing it with a much faster multiply.
I like the *= steps. Someone else remarked instead of mid b /= 2 or b *=.5. will using r2 instead of u be faster still? so r2 = sqrt(mid*mid - c) r1 = mid - r2 r2 += mid
For the red code at 3:30, wouldn't it be faster to compute the sqrt(...) part only once in a variable and then insert the variable twice for the +/- solutions? Or does the compiler do that automatically? Edit: Nevermind, just saw you test this later in the video and the compiler does indeed optimize it :) I personally never trust the compiler :D
is this actually a legal optimisation for the compiler though? if r1 points to the same address as a, b or c then the first computation would change the parameters for the second. So optimising this would lead to the wrong result. I guess if the function is only called once and it knows all the arguments at compile time, then it knows this can't happen, but I'm still surprised it would do this.
love the video, but some of the text is a little hard to read, maybe add a centered shadow text effect to make it pop out a bit, and lighten the text colours just a small amount too?
Would really like to see a discussion of whether Po-Shen Loh's method is preferable to the one we all learned from a numerical analysis POV. Not sure error analysis leads to quite as spiffy graphics, though. :)
Just to be clear here, these are algebraically identical. There was no 'new math' discovered here as some people claim. This is merely an optimization for computation
Po-Shen Lo did NOT derive a different quadratic formula, his’ is the same! He just derives it a little bit differently. The way he writes the formula just reshuffled common factors.
15:09 - my guessing that the compiler is using branch prediction, meaning that the program is already precomputing "a" and prepares for the branch as part of low level optimisation
The two algorithms are practically indistinguishable. One can regard the principle difference between them as an algebraically motivated optimization. The main value of this investigation is to underscore the fact that divisions are more expensive than multiplications. This difference would be substantially amplified by the use of double precision math. I haven't looked closely; but it may be possible to reduce the number of divisions in the Po Shen Loh approach. Regardless of approach, trading division for multiplication should improve performance - especially for double precision math. Note, however, I am not referring to constant division by powers of two. Though I am not certain, it is possible those will be converted to arithmetic shifts.
That's no longer true with modern FPUs and GPUs, where divisions are now actually a bit FASTER than multiplications! But this is actually visible only with GPUs, not with existing x64 CPU/CPU where both perform the same in one cycle... except when using "vectorized" expressions: AVX2 processor extensions (or newer), if they are used by your C/C++ compiler or its libraries, may show that divisions CAN perform faster than multiplications! You may also get different figures depending on CPU brands and models (how their internal FPU units are internally built). Do not assume things from the time where divisions were still using our "naive" algorithms (with bit-by-bit compares, shifts, and substraction) when we had no FPU to compute them in a single hardware instruction. As well, there are optimizations in modern FPUs to efficiently compute square roots. Modern FPUS can use "FFT-based" algorithms implemented by hardware (using many more silicon gates but structured in a much larger "tree" or parallel branches, but with lower depth, so with faster total propagation time): this is true for the multiplication, the division, the square root, and other trigonometric/exponential/logarithmic functions natively supported by FPU or GPU instructions.
As well there are new types of processing units, notably for IA and signal processing, using optimization of tensor flows. They start appearing on CPUs for smartphones and will be soon available on desktop PCs, and are already used in large computers performing massive computation (think about what the CERN needs for the LHC, or what Google needs for its IA with BigData, or what is needed in astronomy, fluid mechanic, or computer-aided design, notably in the aerospatial and defense sectors). These processing units have new specialized instructions for efficiently computing "convolution products" and "FFT's" by hardware. Here also, they can accelerate basic divisions faster than basic multiplications (and the difference is even more visible with massively parallel computing, with very large vectors or matrices rather than individual scalars).
It's interesting that the std::complex is so much slower than the scalar. It would have been interesting to see a complex implementation in C that didn't use the std::complex methods.
It does make sense that for complex numbers you're looking at much more compute. In the complex domain you have to store 2 values per number for the same level of precision. You'd think that means you only have double the work, but you actually end up with substantially more in certain operations. Consider addition and subtraction: that maps quite nicely to ReC = ReA + ReB, and ImC = ImA + ImB. So double the operations. Now consider multiplication: ReC = ReA × ReB - ImA × ImB, and ImC = ReA × ImB + ImA × ReB. Now you have 4 multiplications instead of 1, and two additions. Division and square roots are even worse. Now granted, you could speed this up by working in polar form, but then you need trig functions to handle addition and subtraction.
@@zactron1997 it is true that when operating in the complex domain that some maths operations require a lot more operations than its real counterpart. My point is that when he used std::complex implementation it took 10,000 ms to run as opposed to a few hundred ms. But his asm real vs complex implementations did not have such a wide time difference between the two.
An idea for future videos: Can you put the standard deviation with the speed result if its less than 2? I found myself really wanting to see it for some of those closer matches. Its only us nerds watching so there is no shame! Awesome video btw!
I was excited for a new formula, but it's just the regular one we are taught in school here in Germany :( It's called "pq-Formel", with p=b/a and q=c/a
I love your videos. I don't want to critique your style and this is the first I've seen with this style...but if you speed up all the animations (mainly the countdowns a lot, like 5 seconds less) I think it would flow better. I know animation is hard and a trial by error process sometimes.
that's the form for finding x axis intercepts. if just computing the y value for any particular a,b,c and x, then just plug it into y=axx+bx+c. no square roots and no division.
Can anyone explain these instructions which I found in a shellcode document for getting a bind shell? function hashes (executable as nop-equivalent) _emit 0x59 ; LoadLibraryA ; pop ecx _emit 0x81 ; CreateProcessA ; or ecx, 0x203062d3 _emit 0xc9 ; ExitProcess _emit 0xd3 ; WSAStartup _emit 0x62 ; WSASocketA _emit 0x30 ; bind _emit 0x20 ; listen _emit 0x41 ; accept ; inc ecx
At first I was angry with you for implementing it using a branch, the I was very happy that you did so because it made me feel good spotting the issue from the start :D So thank you for making me feel clever ;D
I would try to calculate 1 / (2.0) * a[i]) and then do multiplications instead of the two fivisions. Also I aliasing rules prevent further compiler optimizations, making the pointers restrict should help in this case, in particular in Quad2 as it does more memory access.
I am a bit curious about the aliasing rules here. If the compiler was reusing the value for the square root in the first case (at around 7:00), wouldn't this be against the aliasing rules, as if r1 would be the same as a, b or c, it would have changed the values for the square root for the second calculation. So the compiler may already assume that there is no aliased pointers
¡ Excelente trabajo !. You deserve the gold medal in computer olympics. For real numbers OQ performs very well.I'm going to try this on ARM V8 and compare with similar Intel. Cheers
fascinating. I haven't watched the other video yet to better understand this other method but from the 1-sentence description early on in your video, it doesn't sound like they're doing different things. The description of Po Shen Lo as finding the middle and then adding and subtracting the difference to find the roots... that's exactly what the quadratic formula does. -b/2a is the middle and then the difference is sqrt(b^2-4ac)/2a. You might be able to speed up the quadratic code by a tiny amount by calculating the -b/2a as well as the discriminant. Although in the scalar version, the compiler probably already has. I'm really curious to go see the more explanatory video soon, though. Maybe there's something I'm missing...
Suppose every program is just a number. I'm wondering how that number would be calculated. Is it left to right, or right to left? Then there is the stack and heap. Is there a stage before it enters memory where it is one complete number? Suppose we added 1 to it, where would it be? Is there a halt at the end? Does the increment have to take place before the halt?
I don't get how you can get rid of the if branch.... It says if a is not 1 then you adjust b and c. so for all values of a that are not 1 you need that adjustment. How can you just remove it? That's changing the formula. Am I missing something here?
The branch said to not do the division if it would divide by 1 " which is completely irrelevant cause dividing by 1 does not change the result but having a branch can seriously impact performance
Is summary, we cannot decide which approach ("original" from maths on real or complex numbers with infinite precision, or "PSL"-optimized) is faster or slower as it largely depends on how you program your code, the programming language you use (notably how it rounds the results per its semantics), the compiler you use (and its compilation options notably about rounding of intermediate results, and how it may cache multiple reads from external structures/variables into local registers), the library you use, and the hardware you use: finally you can only decide after really making some local benchmark. NEVER forget the language semantics (effects of "implicit" roundings, and of "if" branches) when creating ANY implementation with that language.
26:12 One of the divisions in Po-Shen Loh is x/2. It can be turned into a multiplication x*0.5. I'm kinda surprised if the compiler doesn't do that. So PSL vs "old" should have 1 less sub, 3 less mul and all the others the same. I.e. 7 clocks less than the "old" in the first table, and 2.5 less in the second table.
How does the tie rule work? Is it something like abs(resultsA[i] - resultsB[i]) < stddev(resultsA)? I'm not really a statistician, but I like learning these things
I think processor architecture plays a big role in this. The operation times depend on FPU implementation - for example Intel improved division by a factor of two and sqrt by 4 in the Penryn architecture if I remember correctly. More cache or memory bandwidth will target the key bottleneck here - that's probably the main gain from using a GPU. Branch prediction is powerful, but difficult and I think still evolving, so this affects different CPUs differently, as well as depending on the data fed in, and the order it's in. (The branch was only there to demonstrate bad code but I thought I'd include it.)
Hey friend. Why didnt you upload for such a long time? You were a reallllllllly good introduction for me for assembly C++ and in general compsci. Please reply
Could you use something like the inverse root hack from quake 3? I'm sure there's other approximations of square roots that could also improve that calc.
You can substitute one division in the po-shen lo formula by a multiplication if you instead of dividing b by a and then by 2 simply devide it by 2*a. On the other hand, multiplication and division by 2 shouldn't be a problem at all for a binary machine.
The 'blue' version @ 9:26, since 'bt' is only used for 'mid', why not eliminate it completely? Just 'float mid = - b[i] / (a[i] + a[i]). You eliminate a division operation as well as eliminate the storage 'bt' usage (and you don't need the constant '2' anymore). Just seems this code could be 'tweaked' a bit further and maybe beat the original quadratic formula. Later at 26:20, you would then beat the original by having fewer multiplications and equal number of divisions.
At differences so low you more quickly run into what compiler can optimize better at given flags (O0 vs O1 vs O3, -ffast-math/-fno-fast-math) For example: final table is unoptimized. Nobody counts 2a as (2*a). It's (a+a). (Unless compiler would decide that 0.5 * X/a is faster than X/(a+a) -- it generally shouldn't, floats have no instruction for quickly halving a value)
√(b²-4ac)/(2a) == √((b/2a * b/2a) - c/a) == u. Therefore both formulas are the same. If a is > 1 in the majority of your dataset, then the if block adds 2 division operations. Your red code can be optimized to make it even faster.
From a numerical point of view, the quadratic formula might give you roots wich are very distant from the true ones, depending on the values of a,b,c. So we will need to use more different methods for computing the roots. Btw the "red an blue tower thing" was really anxiety-inducing 😂
"From a numerical point of view, the quadratic formula might give you roots wich are very distant from the true ones, depending on the values of a,b,c" Really? Like, in a way that po-shen loh wouldn't?
@@user-sl6gn1ss8p That's a good question, po's formula divides all the coefficients by a, and then does the calculations (but when abs(a) is small there might be some problems in general). @Creel would you share how you generated the test cases?
@@monad_tcp Not all parabolas cross the x axis... But even then there are some operations (like subtraction, sum and division) that introduces errors because the computer works with a finite number of digits
First, the two algorithms really are not separate: They have exactly the same numerical qualities and effectively do the exact same operations, only in a slightly modified order! On a modern CPU, sqrt() and FDIV are both far slower than fadd/fsub/fmul & fmadd, so I would start by calculating the a_reciprocal = (1/a) term: The FDIV in that operation will then be nearly or completely free because it can overlap with all the rest of the calculations. In the PSL version we immediately use the result of that division to scale b and c, which simplifies the logic but generates a latency barrier. If we instead calculate half_a_recip = (0.5/a) then we can multiply the final answers by that value. The sqrt term is sqrt(b*b-4*a*c), here the b2=b*b term and the ac=a*c term can run at the same time, then we combine them in tsqrt=sqrt(fmadd(ac,4.0,-b2)). By the time that has finished, the fdiv will be be all done, so we just have to do the final res0 = (-b + tsqrt) * half_a_recip and res1 = (-b-tqrt) * half_a_recip calculations which will also overlap. Since sqrt is slightly slower than fdiv, we can in fact delay the half_a_recip calculation a smidgen, starting it just after the b2=b*b; and ac=a*c; operations, it will still finish before the sqrt. On an x64 cpu with at least two fmadd pipelines, but in scalar mode, these calculations should take ~40 clock cycles (I'm assuming 5-cycle fmadd and 20-cycle fsqrt).
I doubt the latency of the instructions matters at all in case you calculate millions of independent roots -- only instruction throughput and port-usage should matter (and in fact memory throughput). Instruction latency would matter, if the output of the previous calculation influences the input of the next.
@@cH3rtzb3rg I mostly agree: When processing huge amounts of data, typically more than last level cache, almost any calculation will be totally dominated by memory bandwidth. OTOH, that is pretty much unimagninable for the case of solving quadratic roots. :-)
@@TerjeMathisen I also agree. The benchmark in the video is pretty much meaningless for most practical purposes. One could compare the latency of the algorithms by calculating new coefficients based on the previous roots (that way, there would also be no memory bottleneck).
In the end, both are good enough, most of the time goes to pure moving data around, and the square root itself. So it doesn't matter much. But it was a cool thing to know.
The way I learnt it, the expression (double==constant) is a bug anyway. (9:12) Before going to assembly I would have replaced std::complex, which seems to me to be the culprit messing up memory access.
That "if" breaks a lot of processor optimizations and it's not needed i think. if "quads[i].a" is 1.0f the equation inside the if is: b/=1.0f; c/=1.0f which didn't change the values. This would speed up the code a lot. With that it should have a fighting chance. The slowest part of both codes is the "sqrt" it is a magnitude slower than any other operation in that code. So using it as little as possible is always preferable. So this code: void Po_Shen_Loh_Simple_AoS(QuadStruct* quads, int count){ for(int i =0; i< count; i++){ float mid = ( quads[i].b / quads[i].a ) * -0.5f float u = sqrt(mid * mid - quads[i].c / quads[i].a ); quads[i].r1 = mid + u; quads[i].r2 = mid - u; } } should be faster if i haven't made a stupid logical error. 🙂
The way floats are structured means that bit shifts don't work for division nor multiplication by powers of 2. Floats are structured like this: SignBit PositiveAndBiasedExponent FractionalPart SignBit is 0 for positive float, 1 for negative float PositiveAndBiasedExponent is exponentYouWantToStore+bias = exponentYouWantToStore+127 FractionalPart is like the .345 in the number 12.345 SignBit is 1 bit long. PositiveAndBiasedExponent is 8 bits long. FractionalPart is 23 bits long. Why these kinda random #s? It's standardized and together they use exactly 32 bits, which is a nice power of 2. Example float in memory: 1 10001000 11000000000000000000000 SignBit = 1, so float is negative 10001000 = 136. However, to get the exponent we intended, subtract the bias to get 136-127=9 11000... = (1/2)+(1/4) = .75 So the example float in memory is -1.75*2^(9) = -1.75*512 = -896.0 1 10001000 11000000000000000000000 = -896.0 Why 1.75 instead of .75? The leading 1 is always present in binary scientific notation, and the value 1 is the same no matter which base you're in (except Base 0 and Base 1, which are extraordinarily difficult/annoying to deal with). Now imagine shifting it to the right because you think dividing a float by 2 is the same as dividing an integer by 2. What you're doing is moving the sign bit into the exponent, dividing the EXPONENT by 2, and then (what your intention originally was) dividing the FractionalPart by 2. 1 10001000 11000000000000000000000 shifted to the right by 1 is 0 11000100 01100000000000000000000 Putting this into www.h-schmidt.net/FloatConverter/IEEE754.html shows that it's equal to 8.11656739243e+20 -896 is pretty different when compared to +8 with 20 digits after it. HOWEVER, if you could find a way to decrease the exponent by 1 (i.e. subtract place value of 1 from the exponent) WITHOUT modifying any other part of the float, you would successfully divide by 2.
What a nice video! In the original quadratic formulae you get into trouble if a, c or both are small compared with b. You should compute the roots as q=== -1/2(b+sqn(b)sqrt(b*b-4ac)). Roots are x1=q/a and x2=c/q.
Is storing computation result into previously defined variable (thinking that we won't need its content anymore) faster than defining new (local to function) variable? And does this even matter?
@@monad_tcp I think Rogo was asking about code like: int x1 = SOME_EXPRESSION; // use x1 int x2 = SOME_OTHER_EXPRESSION; // use x2 If one should reuse x1 instead of introducing x2 for that.
@@cH3rtzb3rg oh, in that case it really doesn't matter, variables are renamed either way, the compiler either do register allocation or store in the stack. After data-flow analysis, it will reuse stack variables if possible. Unless your compiler is really, really dumb. Like Microchip ones.
Based on code complexity, I would think Po-Shen Lo might do better on an FPGA. It could probably be worked out that both take the same time, but even so, Po-Shen Lo looks like it would use less blocks on the FPGA.
So suppress the "if", and just use ONLY two intermediate variables "mid" and "u" declared as "long double", storing their complete expression; then you get "r1=mid+u" and "r2=mid-u"; you can even assign the intermediate valid of "mid" and "u" within the expression of "r1", and reuse "r1" into the expression of "r2" in a single statement, so you get "r2=(r1=(mid=sqrt(b*b-a*c*4)/a/2)+(u=...))-u*2;" As well avoid converting "float" read from your "quad" structure multiple times to "double" or "log double", just read them once in a local temp variable of type "long double" (this can be done in BOTH "formulas" to drop that overhead). Note that if you use assembly, those temp variables would be FPU registers of type "long double" (but you need a good C/C++ compiler to do the same optimization and use "IEEE rounding modes" flags in the compiler or with macros in the soruce code to avoid all unnecessary intermediate conversions and roundings). You can redo your test as well by changing your "quad" structure to declare "double" or "long double" fields (in both "formulas") So what you are really testing is how C/C++ compilers optimize the generated code. There's in fact NO difference between the formulas, except that if you incorrectly round intermediate results in your naive "Poh-Shen-Loh" implementation, you get DIFFERENT results between the two "formulas", where your "Poh-Shen-Loh" implementation is LESS PRECISE than your "original" implementation! You did not even compare the results between the two formulas to see that their results have differences (visible here on the 9th significant digit which can deviate by +/-2 !)
interesting video! Though I find it hard to believe that the whole gpu is only twice as fast as a single cpu core for such a massively parallel task. Is there some cuda optimization flag that you did not turn on? Also the gpu may not have had the time to increase its frequency to the max if the task was too short. Did you compute the effective bandwidth vs theoretical bandwidth as you did for the cpu test? I'm curious about that as well.
When you're doing stuff with GPGPU, if you try to simply time the time from when you "submit" the work to the GPU to when it finishes, you end up also measuring the overhead that is involved with deploying kernels and eventually having all of them hit the synchronization point at the end. There's a good chance that the amount of work that was submitted was not even to keep the GPU busy for long enough, and so the vast majority of the time is spent on that overhead. Depending on how things are done, you may end up also measuring the transfer of the inputs/outputs to/from the GPU, which would make the results look even more similar. The "dumb" way to measure how much overhead there is, would be to give it a ridiculously small amount of work, like only one calculation, and see if it ends up being ~75ms again. The better way is that Nvidia supplies tools for measuring how long each part of the process takes. (Launching the kernel vs memory transfers vs the actual work).
Would probably be faster to just return 1 for every entry.
It would be accurate enough in astronomical scales.
Unless your quadratics are astronomical in nature ;)
@@rayniac211 then it would be accurate on a bigger scale
@@WitherLele th-cam.com/video/ICv6GLwt1gM/w-d-xo.html
You remind me of the algorithm i designed to check an array of positive integers to return true for each value divisible by 3.
My Algorithm:
Return false
It is fast, it doesn't contain unpredictable failure, and it is correct most of the time given the set of all positive integers.
Dark background and blue color aren't a good combination of colors for readable text. Try white or yellow next time or something like that. Anyway great comparison and great job on the optimizations!
If you want Blue to be seen, add some Green!
LOL, he said blue code and i said what code ?
I had to change the brightness of my monitor to see it
Not just it bad by itself, but it is also more damaged by TH-cam compression algorithms.
yellow on white?
ha ha
@@NoNameAtAll2 no on the dark background
9:35 don't divide by 2, multiply by 0.5. The compiler is not allowed to do so, as it will sometimes produce different bitwise results.
... Unless you have fast unsafe math enabled, which will also allow for FMA reordering, which is a pretty significant optimization...
Also, much of the runtime is actually float division. It's not pipelined and you can only start a division after another has finished, which is 14 clock cycles at best, which is _extremely_ relevant when you can replace a FP division by using "RCP_PS" and do some Newton-Raphson, which has _WAY_ higher throughput than FP divison (relevant for tight loops such as this one), and can be FMA fused when multiplying by the reciprocal instead.
Here's the code for an accurate reciprocal (only _possibly_ worth it if your CPU supports FMA; total latency (fastest possible): 4 (rcp) + 2 * 4 (FMA) = 12 cycles vs 11 cycles (fastest possible division), but way higher throughput and FMA "fusable", meaning another - up to - 3 clock cycles saved):
__m128 rcp23_ps(__m128 a)
{
const __m128 ONE = _mm_set1_ps(1f);
__m128 r = _mm_rcp_ps(a);
return _mm_fnmadd_ps(_mm_fmsub_ps(r, a, ONE), r, r);
}
I was looking for this one.
I also want to add: even if fast unsafe math is enabled, on some compilers (vcc: /fp:fast) you should also enable optimizations for speed to rewrite constants as needed (done when using /O2 or /Ot but NOT /Ox). /Ot is default. /O2 does some more afaik.
(no indepth knowledge use at your own risk. Please corret me if im wrong)
P.S.: AFAIK in gcc constant rewriting is auto enabled when using fast unsafe math (-ffast-math) and optimization levels do not affect constants, -ffasm-math takes precedence?
Dividing by powers of two can and will actually be replaced by multiplying with the inverse by the compiler even without fast-math flags, since the result will be exactly the same.
One could replace the divisions by a with a multiplications by (1/a) -- this is a thing the compiler won't do in exact-math mode. In fast-math mode, it may even use the fast rcpps (reciprocal packed single) instruction with a refinement step to do this faster.
bit shift >> may be faster still
@@JamesEdwards780 You can't just bitshift floats.
@@vytah Not with that attitude!
I’m guessing the new version might actually use less power! They should be practically the same speed due to ILP, but the the new version requires less calculations
whats ILP?
@@-mwolf Instruction-level parallelism en.wikipedia.org/wiki/Instruction-level_parallelism .
Thank u for the wisdom, dolan
But it simply does not take less computational powe
@@ABaumstumpf I believe they mean literal power, like electricity. It might be less expensive in terms of electricity.
My man really putting the work in on the Blender effects 😂 but damn I was so shocked to see deleting a useless branch cut the processing time in half
He's done an interesting video on branchless coding.
In modern CPUs, a branch can be very expensive.
@@protonmaster76 I've seen it and enjoyed it, I just assumed that as a false branch it would have been optimized out by the compiler.
The combo of a branch + floating point, may have resulted in the compiler not realizing it was redundant. And having a branch that also probably didn't predict very well (its based soley on incoming data) it may have done major performance damage. Would be cool to see the pattern of which way the branch goes, given its based solely on the input. Wonder if "sorting" the input to make it predictable would allow removal of the divide (by the CPU's branch predictor), for some of the solve attempts and thus, maybe go even faster for those cases. @Creel looking at the branch predictability could be an interesting video.
26:24 it’s worth noting that because you’re dividing my a constant power of two for the third div in the new version, all modern compilers will optimize that to a single mul instruction.
I’m almost certain that -x/y will always become an xor and mul when y is a power of two
Edit: I don’t want to litter the comment section too much, so I’ll just put this here: awesome video, man! Some of my work involves optimization, and I absolutely love when other people make this knowledge super accessible!
I think you're right about the division being swapped for a mul! I can't remember what I did in the ASM, but hopefully a compiler would figure that one out... Cheers for watching mate, have a good one :)
An actual implementation would distinguish 4 main cases, depending on if 'a' is zero or not, and on if 'b' is zero on not:
- 1. When (a=b=0), if (c=0) there's a single root (r1=r2=0), otherwise there's no solution (r1=r2=NaN).
- 2. Otherwise when only (a=0), there's no solution if (b=0), otherwise a single solution (r1=r2=-c/b).
- 3. Otherwise when only (b=0), there are two solutions of opposite signs (r2=-(r1=√(c/a)).
- 4. Otherwise, we have the two quadratic roots, possibly complex depending on the sign of the determinant (b²-4ac).
For the last 4th case, you need **numeric stability** (due to limited precision for "float", "double" or "long double" types used by FPU) and the two quadratic solutions should be given by (r{1,2}=[√(1-4ac/b²)±1]*b/2a), where the square root may be imaginary if the "normalized" determinant (1-4ac/b²) is negative, i.e. when the usual determinant (b²-4ac) is negative (both conditions are equivalent because b≠0 in that 4th case).
This gives a total of 7 cases, where conditionals (if statements) are justified.
But the test on if b=1 in this video is just useless and stupid (it tries to avoid a trivial division by 1 but at the cost of a branch, which disables the optimization or parallelisation by the compiler: divisions by 1 are trivial also in FPUs, may just cost 1 cycle, LESS than the cost of a branch).
(-x/y) cannot be "optimized" with a xor and mul: we're working here with floatting point values, not integers! The negation of floatting points just flips & single bit, the sign bit, it does not perform any "xor" on the whole value (the exponent or the mantissa part are not modified), and this can be done before the division on any operand x or y, or on the result; but in fact the result sign is NOT[sign(x)AND sign(y)], and the absolute quotient is the separate division of abs(x)/abs(y).
To compute that (positiive) dividend of these two positive floatting points,
- the FPU separately computes the quotient of the two mantissa, which are in (0.5, 1.0], so that non-normalized quotient is in (0.5, 2], and the difference of exponent parts;
- finally it renormalizes the quotient into (0.5, 1.0] and accordingly adjusts the difference of exponent parts by conditionally substracting 1.
The final result may become a denormal (or zero) floatting point value, or an "infinite" floatting point value, depending if the final exponent part results in an underflow or overflow (for its allowed range), but it will be a "NaN" (or would raise a "'division-by-zero" exception) if both mantissa parts (including the implicit most significant bit if it's not stored in the floatting point format) are zero. In all cases, the result sign computed separately is blindly applied as is (possibly giving "positive or negative zero", "'positive or negative NaN", or "positive or negative infinite" values; some libraries may or may not discard the sign distinction for zeroes and NaNs, but with some extra cost using conditional branches)
@@PhilippeVerdy, you probably should have checked godbolt before that whole explanation. I just confirmed that on some compilers it becomes a mul and xor, on others it just becomes a mul by -0.5.
Remember, in this case we are dividing by a positive constant power of two and negating. This means the compiler can take shortcuts.
@@pyromen321 if you are dividing à float by a negative power of two, you don't need a mul, just z decrementation of the exposant part, and flipping the sign bit, this tales two instructions, worse than a single FPU instruction performing a single multiplication by constant like -0.5 (consider data path on thd same word containing thd sign bit and the exponent, plus you need to test underflows when decrementing the exponent, or you'll wrong results !
What a great idea! Gamifying the benchmarks makes it so much fun. Love the animations and benchmarking song. Really interesting stuff.
just so everybody knows:
the OG quadratic formula is the closed form of the Po-Shen Lo algorithm.
the PSL algorithm is simpler to understand and perform by a mere human, but I would not expect optimized code for the two variants to differ al to much.
If the least significant bit of precision can be sacrificed in the new method, the two divisions by a can be replaced by one reciprocal (i.e. 1.0/a) and two multiplications with that reciprocal. Thus one division can be traded for two multiplications, which may be faster.
Honestly, that's the real solution here. How much accuracy are you allowed to sacrifice in order to speed things up. I'm taking accounting courses right now and there's a bunch of shortcuts that are taken for similar reasons. The result is somewhat different from what it would be with proper math, but it's usually a small amount and you're rarely dealing with more than 4 decimal places when dealing with currency anways. Plus, since people are using one of two sets of rules regarding how these things are don, there usually isn't much trouble.
Accuracy is also why std:complex is usually slow. The C++ standard library has to work in all cases and for complex numbers not sacraficing precidion means a lot slower code. You also cant use a fast-math optimised version because the standard library is precompiled (altough you can work around this by compiling a glibc with fast-math and using that with static linking it would probably still preform worse than hand-written speed-for-accuracy optimised code). I would personally not recommend std:complex in any case unless you are sure you need super-duper accurate answers but than you would probably go with an arbitrary precidion library wich has complex numbers.
Isn't the step B = b/a and then mid = B/2 the real waste here? You only need mid.
Also you can use a trick used for the quadratic, where r2 is used for the discriminant, r1 = (-b + r2)/2a and r2 = (-b - r2)/2a.
For the Po Shen Lo it works even better:
mid = b/2a
c = c/a
r2 = sqrt(mid*mid - c)
r1 = mid - r2
r2 = mid + r2
the last step I imagine can be faster as it's adding mid to itself ( r2 += mid?)
This should already always beat the quadratic. Would using reciprocals be even faster?
CPU's & Compilers can vectorize Array of Structures as well, the difference between this and SOA is more about optimizing cache access patterns for functions that dont use every element in the structure. You should double check your compiler optimization settings if you are getting wildly differing results from a structure this small.
After a little bit of tinkering, I got the Po Shen method down to:
2 Float subtractions
2 Transfers (a := b)
1 Integers subtraction (from a float to reduce the exponent by 1 aka dividing by 2)
1 XOR (to multiply by -1)
1 Float addition
1 Square root
It also only uses 3 Registers.
It goes as follows (from x² + Ax + B; A, B, C are floats; p is the number of significand bits of the type of float, l is the size of the total number for example 32, 64):
1. Subtract from A 2^p
2. XOR A with 2^l
3. Set C to A
4. Multiply C by A
5. Subtract from C B
6. Take the square root of C
7. Set B to C
8. Add to B A
9. Subtract from A C
As a tip: at 3:45 the contrast with the blue text is so low that I can’t read it. So I’m following along with your commentary by imagining the route in my head rather than reading along with you!
EDIT: ok so this happens a lot in this video.
I felt that you haven't watch Po-Shen Loh's video at its full, because he explicitly mentions this is NOT his own equation. It's his way to interpret the logic in solving quadratic equations. He even mentions who originally had this idea.
For the CPU/FPU code, if the bottleneck is reading from RAM, you might try putting in some prefetch loads of the next set of data (particularly for your ASM code). It may not make much difference, though, given the loads are cache line at a time. I remember way back in the day we were looking at the Fastest Fourier Transform in the West (FFTW) on the UltraSparc and telling the C compiler to heavily optimize it (including using static profile guided optimization) and the ASM it generated was pretty slick, including prefetch loading so the data for the next loop was in L1 when the next loop iteration needed it. If the compiler you used didn't use PGO to optimize but has the option, it might be interesting to see if it helps.
Also, the dark blue text used for Po-Shen Loh code is pretty hard (for at least me) to read. The red was OK even if a little dark.
Manual prefetching will be slower because there are now additional instructions to be executed. He is just iterating over arrays, the hardware prefetcher likes that pattern.
When doing computations not only performance is important, but also the accuracy of the results. Quadratic equations have two cases which need to be treated carefully from a numerical point of view, or you might suffer a loss of accuracy: (1) where -b ~ sqrt(b^2 - 4ac) and (2) where b^2 ~ 4ac. In both cases you basically subtract two almost equal numbers. In combination with the finite precision of floats the siginificant digits of the result are reduced. Both algorithms (as shown here) suffer from these problems. Obviously, for numerically stable code these cases should be handled appropriately.
I do remember this coming up in Numerical Analysis class. Avoid subtracting two numbers if the result might be close to zero!
College may have been a long time ago for me but that one guideline has stuck with me.
Totally underrated comment. This is overlooked so often.
So, you take (A ± B)/C expression, separate it into A/C ± B/C and call it a new revolutionary formula? Seriously?
The whole comparison is about how smart is the optimizer in your compiler and how efficiently or inefficiently can you put the two into source code. But in the end of the day, it's exactly the same formula. Programmers go to far greater lengths in deviating from formulae optimized for viewing to formulae optimized for execution when it's really important.
@@jeremiahbullfrog9288 idk because i did not wathed but formula was proposed by someone else whose video is in the discription
4:31 skip the if. Just always divide. If the coefficients are random, the probability of a being equal to 1 are very slim. Also it might be worthwhile to multiply by the precomputed inverse of a.
I was just thinking what you put down here. The if statement is actually probably his biggest adder of time here! AND usually the vectorization wants to have pure math, nothing with conditionals
When I first saw Po-Shen Loh's method, I thought it looked like he was solving it by substitution, just as Tartaglia and Cardano solved cubics with at first before generalizing the formula. Po-Shen Loh's way is cool and nifty even though it's slower than the regular quadratic formula. Plus, there are situations where we'd get solutions in human-friendly form faster using Po-Shen Loh's method than we would with the regular quadratic formula.
It’s also just quadratic formula but removing the fluff
Super interesting video! At the end, you do mention how multiplications versus divisions may affect running time, and list that Po-Shen Loh has 3 divisions. However, one of the divisions from Po-Shen Loh is a simple division by 2, which is effectively a multiplication by 0.5.
With that said, Po-Shen Loh would have the same addition, 1 less subtraction, 3 less multiplication, the same division, and the same square root! Making it strictly better in terms of operation counts!
I'd be curious to understand what you think.
I found it very hard to count the operations. The negatives can become subtractions, the divisions can be invterted (as you mentioned), and the FMADD instructions can perform add/sub along with a multiplication.
My operation counts were a little hit and miss!
I think you’re right :)
The Po-Shen Loh method is not new. Learned that method over ten years ago in school.
yea like same
Same. It’s called the pq formula in Swedish schools. It’s basically just dividing the polynomial by a and deriving the quadratic formula again.
Really interesting. I was really hoping Po-Shen Loh was going to be faster across the board :P But the results and the overall lesson of the video was still very interesting.
It would be cool to see a video where you only talk about the Op Times table you showed at 26:36 and maybe showing other examples of algorithms and how they stack up in terms of Op Times.
On top of that, I also found myself wondering a little bit about "compiler vectorization". Idk if you have any video's on that subject, but I am interested in learning more about compiler vectorization. Especially including parts where you talked about "Structure of Arrays" and how the compiler (I guess) can more easily take advantage of simd instructions.
Worked on old 32-bit mainframes and we used to pour over our code for hours to hand-optimize the FORTRAN. Looking for division operations that could be replaced with multiplication by an inverse. And the bane of our work were pesky 'implicit type converions' where someone would do something like * instead of *. Yeah, those old compilers weren't very good at optimization so we had to do it ourselves. Love your channel and loved this dive into some interesting bit of math.
This video was absolutely amazing. Beautifully done. WOW.
CPU latency of modern computers are roughly:
fadd 4 (as well as fsub)
fmul 4
fdiv 11-25
fsqrt 35
At the very least, this puts the original at between 97 and 125 latency, while the new formula takes between 88 and 130 latency. This means on natural programming it is both slower and faster depending on division requirements.
The issue I have with your assessment at the end is you say that the new formula takes 3 divisions, it actually takes 2 to normalize the equation off the first coefficient. You are dividing by 2.0f, this is wasting 11-25 cpu latency on chip where you could simply multiply by 0.5f which is 4 latency. Even worse is the fact that floats are in powers of 2, so dividing by 2 is just decrementing the exponent, which is a 1 latency to perform an integer decrement prior to loading the float into register for use (even before normalizing division in the fpu). Some compilers would have noticed the 2.0f division and possibly switched to a fmul on 0.5f for speed, but it wouldn't have figured out that you could do a dec prior to any of it, which means that you didn't actually test the right code.
If done correctly, the new method should have used 1xDEC (1), 2xFDIV (22-50), 1xFMUL (4), 1xFSQRT (35), and 3xFADD/FSUB (12), which comes to 74-102, faster than the original by 21%. Also, complex numbers are automatically formatted correctly with the new method, the old method has a division after sqrt, which takes longer to simplify (squaring it, dividing the sqrt answer by it to get the complex part formatted). You can make the code a little bit faster by taking the reciprocal of the leading coefficient (1xFDIV) and then multiplying it to the other coefficients to speed up the process further, but both can make use of that trick....despite that, the new method becomes almost 23% faster.
The real bottleneck comes down to the square root function, it is the most expensive part in both scenarios. If we could get that down to 15 latency, the new method would be 28% faster than the old.
Love all the content!!! Reminds me of my random coding adventures! And very inspiring!!! Very happy that you are active again! I am struggling to get myself back up and going.., would love to see a continuation of your GraceAI neural network series!! ❤
Instead of:
b[i] /= a[i];
c[i] /= a[i];
do:
float a_inverse = 1.0 / a[i];
b[i] *= a_inverse;
c[i] *= a_inverse;
This will remove one division and replacing it with a much faster multiply.
That replaces a division with TWO multiplication, which... might still be faster? You'd have to benchmark, and it may depend on the CPU.
I like the *= steps. Someone else remarked instead of mid b /= 2 or b *=.5.
will using r2 instead of u be faster still?
so r2 = sqrt(mid*mid - c)
r1 = mid - r2
r2 += mid
And even faster to not be doing I/O back to the arrays.
And that change will give slightly different results
What a great way to pad the runtime
For the red code at 3:30, wouldn't it be faster to compute the sqrt(...) part only once in a variable and then insert the variable twice for the +/- solutions? Or does the compiler do that automatically?
Edit: Nevermind, just saw you test this later in the video and the compiler does indeed optimize it :)
I personally never trust the compiler :D
is this actually a legal optimisation for the compiler though?
if r1 points to the same address as a, b or c then the first computation would change the parameters for the second. So optimising this would lead to the wrong result.
I guess if the function is only called once and it knows all the arguments at compile time, then it knows this can't happen, but I'm still surprised it would do this.
Great channel ... really enjoying it ... thank you. Cheers
Love the format, very nice animation, and even nicer maths!
love the video, but some of the text is a little hard to read, maybe add a centered shadow text effect to make it pop out a bit, and lighten the text colours just a small amount too?
Yeah the blue on dark gray in particular is really hard to see
Would really like to see a discussion of whether Po-Shen Loh's method is preferable to the one we all learned from a numerical analysis POV. Not sure error analysis leads to quite as spiffy graphics, though. :)
I absolutely loved this, thanks for making such great content!
Just to be clear here, these are algebraically identical. There was no 'new math' discovered here as some people claim. This is merely an optimization for computation
Your videos are always super interesting!
Thank you
Po-Shen Lo did NOT derive a different quadratic formula, his’ is the same! He just derives it a little bit differently. The way he writes the formula just reshuffled common factors.
Nobody said anything about deriving
with Po=Shen Loh's, you don't need to calculate with complex numbers. you just put an i out the front
if c^2 < 0:
c = |c|
ans = ic
15:09 - my guessing that the compiler is using branch prediction, meaning that the program is already precomputing "a" and prepares for the branch as part of low level optimisation
The two algorithms are practically indistinguishable. One can regard the principle difference between them as an algebraically motivated optimization.
The main value of this investigation is to underscore the fact that divisions are more expensive than multiplications. This difference would be substantially amplified by the use of double precision math. I haven't looked closely; but it may be possible to reduce the number of divisions in the Po Shen Loh approach.
Regardless of approach, trading division for multiplication should improve performance - especially for double precision math. Note, however, I am not referring to constant division by powers of two. Though I am not certain, it is possible those will be converted to arithmetic shifts.
That's no longer true with modern FPUs and GPUs, where divisions are now actually a bit FASTER than multiplications! But this is actually visible only with GPUs, not with existing x64 CPU/CPU where both perform the same in one cycle... except when using "vectorized" expressions: AVX2 processor extensions (or newer), if they are used by your C/C++ compiler or its libraries, may show that divisions CAN perform faster than multiplications! You may also get different figures depending on CPU brands and models (how their internal FPU units are internally built). Do not assume things from the time where divisions were still using our "naive" algorithms (with bit-by-bit compares, shifts, and substraction) when we had no FPU to compute them in a single hardware instruction. As well, there are optimizations in modern FPUs to efficiently compute square roots. Modern FPUS can use "FFT-based" algorithms implemented by hardware (using many more silicon gates but structured in a much larger "tree" or parallel branches, but with lower depth, so with faster total propagation time): this is true for the multiplication, the division, the square root, and other trigonometric/exponential/logarithmic functions natively supported by FPU or GPU instructions.
As well there are new types of processing units, notably for IA and signal processing, using optimization of tensor flows. They start appearing on CPUs for smartphones and will be soon available on desktop PCs, and are already used in large computers performing massive computation (think about what the CERN needs for the LHC, or what Google needs for its IA with BigData, or what is needed in astronomy, fluid mechanic, or computer-aided design, notably in the aerospatial and defense sectors). These processing units have new specialized instructions for efficiently computing "convolution products" and "FFT's" by hardware. Here also, they can accelerate basic divisions faster than basic multiplications (and the difference is even more visible with massively parallel computing, with very large vectors or matrices rather than individual scalars).
awesome video, well done!
It's interesting that the std::complex is so much slower than the scalar. It would have been interesting to see a complex implementation in C that didn't use the std::complex methods.
It does make sense that for complex numbers you're looking at much more compute. In the complex domain you have to store 2 values per number for the same level of precision. You'd think that means you only have double the work, but you actually end up with substantially more in certain operations.
Consider addition and subtraction: that maps quite nicely to ReC = ReA + ReB, and ImC = ImA + ImB. So double the operations.
Now consider multiplication: ReC = ReA × ReB - ImA × ImB, and ImC = ReA × ImB + ImA × ReB. Now you have 4 multiplications instead of 1, and two additions.
Division and square roots are even worse. Now granted, you could speed this up by working in polar form, but then you need trig functions to handle addition and subtraction.
@@zactron1997 How fast is conversion between polar and grid forms?
@@zactron1997 it is true that when operating in the complex domain that some maths operations require a lot more operations than its real counterpart. My point is that when he used std::complex implementation it took 10,000 ms to run as opposed to a few hundred ms. But his asm real vs complex implementations did not have such a wide time difference between the two.
@@Mernom cartesian/polar conversation requires about the same amount of trig as you're trying to avoid in the add and sub.
std::complex is pretty terrible
brain neglects reading darkblue code :D
An idea for future videos:
Can you put the standard deviation with the speed result if its less than 2? I found myself really wanting to see it for some of those closer matches.
Its only us nerds watching so there is no shame! Awesome video btw!
bloody awesome! never knew this existed, extremely cool, hell it'll be useful for school work too ;)
I reckon! Such a clever formula! Cheers for watching :)
@@WhatsACreel Cheers :D
I was excited for a new formula, but it's just the regular one we are taught in school here in Germany :(
It's called "pq-Formel", with p=b/a and q=c/a
For completeness, I would suggest doing the benchmark calculations in both float and double. I would have liked to see complex. 🙃
I love your videos. I don't want to critique your style and this is the first I've seen with this style...but if you speed up all the animations (mainly the countdowns a lot, like 5 seconds less) I think it would flow better. I know animation is hard and a trial by error process sometimes.
That branch really muddied the water. I think the first test should've been redone with the optimized code.
that's the form for finding x axis intercepts. if just computing the y value for any particular a,b,c and x, then just plug it into y=axx+bx+c. no square roots and no division.
Glad round 2 took a Data Oriented approach!
just getting close to such a tiny we ll established formula is amazing. epic.
Can anyone explain these instructions which I found in a shellcode document for getting a bind shell?
function hashes (executable as nop-equivalent)
_emit 0x59 ; LoadLibraryA ; pop ecx
_emit 0x81 ; CreateProcessA ; or ecx, 0x203062d3
_emit 0xc9 ; ExitProcess
_emit 0xd3 ; WSAStartup
_emit 0x62 ; WSASocketA
_emit 0x30 ; bind
_emit 0x20 ; listen
_emit 0x41 ; accept ; inc ecx
At first I was angry with you for implementing it using a branch, the I was very happy that you did so because it made me feel good spotting the issue from the start :D
So thank you for making me feel clever ;D
I would try to calculate 1 / (2.0) * a[i]) and then do multiplications instead of the two fivisions.
Also I aliasing rules prevent further compiler optimizations, making the pointers restrict should help in this case, in particular in Quad2 as it does more memory access.
I am a bit curious about the aliasing rules here. If the compiler was reusing the value for the square root in the first case (at around 7:00), wouldn't this be against the aliasing rules, as if r1 would be the same as a, b or c, it would have changed the values for the square root for the second calculation. So the compiler may already assume that there is no aliased pointers
¡ Excelente trabajo !. You deserve the gold medal in computer olympics. For real numbers OQ performs very well.I'm going to try this on ARM V8 and compare with similar Intel. Cheers
I wish this was the kind of game show they'd show on TV. great video
fascinating. I haven't watched the other video yet to better understand this other method but from the 1-sentence description early on in your video, it doesn't sound like they're doing different things. The description of Po Shen Lo as finding the middle and then adding and subtracting the difference to find the roots... that's exactly what the quadratic formula does. -b/2a is the middle and then the difference is sqrt(b^2-4ac)/2a. You might be able to speed up the quadratic code by a tiny amount by calculating the -b/2a as well as the discriminant. Although in the scalar version, the compiler probably already has. I'm really curious to go see the more explanatory video soon, though. Maybe there's something I'm missing...
I can't see the code, the contrast is not good. The letters are small.
Suppose every program is just a number. I'm wondering how that number would be calculated. Is it left to right, or right to left? Then there is the stack and heap. Is there a stage before it enters memory where it is one complete number? Suppose we added 1 to it, where would it be? Is there a halt at the end? Does the increment have to take place before the halt?
I don't get how you can get rid of the if branch.... It says if a is not 1 then you adjust b and c. so for all values of a that are not 1 you need that adjustment. How can you just remove it? That's changing the formula. Am I missing something here?
The branch said to not do the division if it would divide by 1 " which is completely irrelevant cause dividing by 1 does not change the result but having a branch can seriously impact performance
at 4:30, friend, the code in blue is unreadable unless i use 1080p.
Is summary, we cannot decide which approach ("original" from maths on real or complex numbers with infinite precision, or "PSL"-optimized) is faster or slower as it largely depends on how you program your code, the programming language you use (notably how it rounds the results per its semantics), the compiler you use (and its compilation options notably about rounding of intermediate results, and how it may cache multiple reads from external structures/variables into local registers), the library you use, and the hardware you use: finally you can only decide after really making some local benchmark. NEVER forget the language semantics (effects of "implicit" roundings, and of "if" branches) when creating ANY implementation with that language.
I'd be interested and you using the fast square root algorithm for quake as a substitute for the square root for both of these
hard to see the blue code around 4 minutes in, dark text on dark background :x
"benchmarking as entertainment" is next to godliness
26:12 One of the divisions in Po-Shen Loh is x/2. It can be turned into a multiplication x*0.5. I'm kinda surprised if the compiler doesn't do that.
So PSL vs "old" should have 1 less sub, 3 less mul and all the others the same. I.e. 7 clocks less than the "old" in the first table, and 2.5 less in the second table.
Once you are doing GPGPU, your limit is becoming the speed of copying to and from the card over the PCI bus rather than the compute.
Dark blue on a grey background? Good going.
How does the tie rule work? Is it something like abs(resultsA[i] - resultsB[i]) < stddev(resultsA)? I'm not really a statistician, but I like learning these things
I think processor architecture plays a big role in this. The operation times depend on FPU implementation - for example Intel improved division by a factor of two and sqrt by 4 in the Penryn architecture if I remember correctly. More cache or memory bandwidth will target the key bottleneck here - that's probably the main gain from using a GPU. Branch prediction is powerful, but difficult and I think still evolving, so this affects different CPUs differently, as well as depending on the data fed in, and the order it's in. (The branch was only there to demonstrate bad code but I thought I'd include it.)
Nice vid, wish the code text in red and blue (especially blue) was easier to read though.
Why would the original Po-Shen Loh code check for a!=1, but no code actually checked for a!=0?
Where can I download the assembly version of the code? Many thanks!
Hey friend. Why didnt you upload for such a long time? You were a reallllllllly good introduction for me for assembly C++ and in general compsci. Please reply
Could you use something like the inverse root hack from quake 3? I'm sure there's other approximations of square roots that could also improve that calc.
Scratch that. Both methods have the same number of square root operations
man I love your videos haha
You can substitute one division in the po-shen lo formula by a multiplication if you instead of dividing b by a and then by 2 simply devide it by 2*a. On the other hand, multiplication and division by 2 shouldn't be a problem at all for a binary machine.
The font used in the animation always reminds me of the animations from surreal entertainment. Just waiting for "Sheldon is kill" to appear.
The 'blue' version @ 9:26, since 'bt' is only used for 'mid', why not eliminate it completely? Just 'float mid = - b[i] / (a[i] + a[i]). You eliminate a division operation as well as eliminate the storage 'bt' usage (and you don't need the constant '2' anymore). Just seems this code could be 'tweaked' a bit further and maybe beat the original quadratic formula.
Later at 26:20, you would then beat the original by having fewer multiplications and equal number of divisions.
At differences so low you more quickly run into what compiler can optimize better at given flags (O0 vs O1 vs O3, -ffast-math/-fno-fast-math)
For example: final table is unoptimized. Nobody counts 2a as (2*a). It's (a+a).
(Unless compiler would decide that 0.5 * X/a is faster than X/(a+a) -- it generally shouldn't, floats have no instruction for quickly halving a value)
Changing a float by a factor of 2 is a single addition/subtraction
√(b²-4ac)/(2a) == √((b/2a * b/2a) - c/a) == u. Therefore both formulas are the same.
If a is > 1 in the majority of your dataset, then the if block adds 2 division operations. Your red code can be optimized to make it even faster.
From a numerical point of view, the quadratic formula might give you roots wich are very distant from the true ones, depending on the values of a,b,c. So we will need to use more different methods for computing the roots. Btw the "red an blue tower thing" was really anxiety-inducing 😂
"From a numerical point of view, the quadratic formula might give you roots wich are very distant from the true ones, depending on the values of a,b,c"
Really? Like, in a way that po-shen loh wouldn't?
@@user-sl6gn1ss8p That's a good question, po's formula divides all the coefficients by a, and then does the calculations (but when abs(a) is small there might be some problems in general). @Creel would you share how you generated the test cases?
@@mattewlefty991 now that you mention it, comparing the results might be fun as well (if fairly slow)
why would it do that ? the roots is basically when the curve crosses the 0. we are talking about a parabola, it must cross the origin
@@monad_tcp Not all parabolas cross the x axis... But even then there are some operations (like subtraction, sum and division) that introduces errors because the computer works with a finite number of digits
First, the two algorithms really are not separate: They have exactly the same numerical qualities and effectively do the exact same operations, only in a slightly modified order!
On a modern CPU, sqrt() and FDIV are both far slower than fadd/fsub/fmul & fmadd, so I would start by calculating the a_reciprocal = (1/a) term: The FDIV in that operation will then be nearly or completely free because it can overlap with all the rest of the calculations. In the PSL version we immediately use the result of that division to scale b and c, which simplifies the logic but generates a latency barrier. If we instead calculate half_a_recip = (0.5/a) then we can multiply the final answers by that value.
The sqrt term is sqrt(b*b-4*a*c), here the b2=b*b term and the ac=a*c term can run at the same time, then we combine them in tsqrt=sqrt(fmadd(ac,4.0,-b2)).
By the time that has finished, the fdiv will be be all done, so we just have to do the final res0 = (-b + tsqrt) * half_a_recip and res1 = (-b-tqrt) * half_a_recip calculations which will also overlap.
Since sqrt is slightly slower than fdiv, we can in fact delay the half_a_recip calculation a smidgen, starting it just after the b2=b*b; and ac=a*c; operations, it will still finish before the sqrt. On an x64 cpu with at least two fmadd pipelines, but in scalar mode, these calculations should take ~40 clock cycles (I'm assuming 5-cycle fmadd and 20-cycle fsqrt).
I doubt the latency of the instructions matters at all in case you calculate millions of independent roots -- only instruction throughput and port-usage should matter (and in fact memory throughput).
Instruction latency would matter, if the output of the previous calculation influences the input of the next.
@@cH3rtzb3rg I mostly agree: When processing huge amounts of data, typically more than last level cache, almost any calculation will be totally dominated by memory bandwidth. OTOH, that is pretty much unimagninable for the case of solving quadratic roots. :-)
@@TerjeMathisen I also agree. The benchmark in the video is pretty much meaningless for most practical purposes. One could compare the latency of the algorithms by calculating new coefficients based on the previous roots (that way, there would also be no memory bottleneck).
What were the compiler settings?
In the end, both are good enough, most of the time goes to pure moving data around, and the square root itself. So it doesn't matter much. But it was a cool thing to know.
The way I learnt it, the expression (double==constant) is a bug anyway. (9:12)
Before going to assembly I would have replaced std::complex, which seems to me to be the culprit messing up memory access.
I find it both interesting and very funny how the GPU doing complex numbers is almost exactly twice the time it took for it to do scalars.
Which CPU and GPU was used? A speedup of only factor 2 to go to the GPU seems to be barely worth it imho.
specially considering the time it takes to get stuff into the gpu
That "if" breaks a lot of processor optimizations and it's not needed i think. if "quads[i].a" is 1.0f the equation inside the if is: b/=1.0f; c/=1.0f which didn't change the values. This would speed up the code a lot. With that it should have a fighting chance. The slowest part of both codes is the "sqrt" it is a magnitude slower than any other operation in that code. So using it as little as possible is always preferable. So this code:
void Po_Shen_Loh_Simple_AoS(QuadStruct* quads, int count){
for(int i =0; i< count; i++){
float mid = ( quads[i].b / quads[i].a ) * -0.5f
float u = sqrt(mid * mid - quads[i].c / quads[i].a );
quads[i].r1 = mid + u;
quads[i].r2 = mid - u;
}
}
should be faster if i haven't made a stupid logical error. 🙂
Just trying, but using branched code in cuda code severly slows things down. Maybe you could try branchless version of po-shen lo cuda?
Is one of those divisions by four? Can you optimise it with a bit shift? Or is that not how floats and std::complex numbers work?
The way floats are structured means that bit shifts don't work for division nor multiplication by powers of 2.
Floats are structured like this: SignBit PositiveAndBiasedExponent FractionalPart
SignBit is 0 for positive float, 1 for negative float
PositiveAndBiasedExponent is exponentYouWantToStore+bias = exponentYouWantToStore+127
FractionalPart is like the .345 in the number 12.345
SignBit is 1 bit long.
PositiveAndBiasedExponent is 8 bits long.
FractionalPart is 23 bits long.
Why these kinda random #s? It's standardized and together they use exactly 32 bits, which is a nice power of 2.
Example float in memory: 1 10001000 11000000000000000000000
SignBit = 1, so float is negative
10001000 = 136. However, to get the exponent we intended, subtract the bias to get 136-127=9
11000... = (1/2)+(1/4) = .75
So the example float in memory is -1.75*2^(9) = -1.75*512 = -896.0
1 10001000 11000000000000000000000 = -896.0
Why 1.75 instead of .75? The leading 1 is always present in binary scientific notation, and the value 1 is the same no matter which base you're in (except Base 0 and Base 1, which are extraordinarily difficult/annoying to deal with).
Now imagine shifting it to the right because you think dividing a float by 2 is the same as dividing an integer by 2.
What you're doing is moving the sign bit into the exponent, dividing the EXPONENT by 2, and then (what your intention originally was) dividing the FractionalPart by 2.
1 10001000 11000000000000000000000 shifted to the right by 1 is 0 11000100 01100000000000000000000
Putting this into www.h-schmidt.net/FloatConverter/IEEE754.html
shows that it's equal to 8.11656739243e+20
-896 is pretty different when compared to +8 with 20 digits after it.
HOWEVER, if you could find a way to decrease the exponent by 1 (i.e. subtract place value of 1 from the exponent) WITHOUT modifying any other part of the float, you would successfully divide by 2.
What a nice video!
In the original quadratic formulae you get into trouble if a, c or both are small compared with b. You should compute the roots as q=== -1/2(b+sqn(b)sqrt(b*b-4ac)). Roots are x1=q/a and x2=c/q.
Can’t read the code. Color blends in.
Is storing computation result into previously defined variable (thinking that we won't need its content anymore) faster than defining new (local to function) variable? And does this even matter?
For the compiler it should not matter at all, at least for simple types. Do whatever is easier to read.
@@cH3rtzb3rg its usually easier to store duplicated terms in a variable to make the code simpler, not for performance, the compiler can do that
@@monad_tcp I think Rogo was asking about code like:
int x1 = SOME_EXPRESSION;
// use x1
int x2 = SOME_OTHER_EXPRESSION;
// use x2
If one should reuse x1 instead of introducing x2 for that.
@@cH3rtzb3rg oh, in that case it really doesn't matter, variables are renamed either way, the compiler either do register allocation or store in the stack. After data-flow analysis, it will reuse stack variables if possible.
Unless your compiler is really, really dumb. Like Microchip ones.
Based on code complexity, I would think Po-Shen Lo might do better on an FPGA. It could probably be worked out that both take the same time, but even so, Po-Shen Lo looks like it would use less blocks on the FPGA.
code is deceptive. What is the actual optimized machine code for each fn?
4:23 already i see a problem. The +/- part should be computed once and reused, not just computed once for each root
So suppress the "if", and just use ONLY two intermediate variables "mid" and "u" declared as "long double", storing their complete expression; then you get "r1=mid+u" and "r2=mid-u"; you can even assign the intermediate valid of "mid" and "u" within the expression of "r1", and reuse "r1" into the expression of "r2" in a single statement, so you get "r2=(r1=(mid=sqrt(b*b-a*c*4)/a/2)+(u=...))-u*2;"
As well avoid converting "float" read from your "quad" structure multiple times to "double" or "log double", just read them once in a local temp variable of type "long double" (this can be done in BOTH "formulas" to drop that overhead). Note that if you use assembly, those temp variables would be FPU registers of type "long double" (but you need a good C/C++ compiler to do the same optimization and use "IEEE rounding modes" flags in the compiler or with macros in the soruce code to avoid all unnecessary intermediate conversions and roundings).
You can redo your test as well by changing your "quad" structure to declare "double" or "long double" fields (in both "formulas") So what you are really testing is how C/C++ compilers optimize the generated code. There's in fact NO difference between the formulas, except that if you incorrectly round intermediate results in your naive "Poh-Shen-Loh" implementation, you get DIFFERENT results between the two "formulas", where your "Poh-Shen-Loh" implementation is LESS PRECISE than your "original" implementation! You did not even compare the results between the two formulas to see that their results have differences (visible here on the 9th significant digit which can deviate by +/-2 !)
interesting video! Though I find it hard to believe that the whole gpu is only twice as fast as a single cpu core for such a massively parallel task. Is there some cuda optimization flag that you did not turn on? Also the gpu may not have had the time to increase its frequency to the max if the task was too short. Did you compute the effective bandwidth vs theoretical bandwidth as you did for the cpu test? I'm curious about that as well.
When you're doing stuff with GPGPU, if you try to simply time the time from when you "submit" the work to the GPU to when it finishes, you end up also measuring the overhead that is involved with deploying kernels and eventually having all of them hit the synchronization point at the end. There's a good chance that the amount of work that was submitted was not even to keep the GPU busy for long enough, and so the vast majority of the time is spent on that overhead. Depending on how things are done, you may end up also measuring the transfer of the inputs/outputs to/from the GPU, which would make the results look even more similar.
The "dumb" way to measure how much overhead there is, would be to give it a ridiculously small amount of work, like only one calculation, and see if it ends up being ~75ms again. The better way is that Nvidia supplies tools for measuring how long each part of the process takes. (Launching the kernel vs memory transfers vs the actual work).
Thrilling stuff 😃