Tell me you were surprised by using a vector to speed up instead of using a hashset but telling me i am beautiful (and liking the video) Slight correction. When I say left shift is multiplied by 10, I meant 10 in binary. Therefore it is two in decimal
it did surprise me. I was always confused how a hashset was O(1) just like a vector. I didn't realize it was O(1*c) which makes wayyyy more sense. Also the bits I didn't know so thank you :)
SipHash 1-3 (rust collections' default hashing algorithm) is fairly slow for integer keys, so this is kinda to be expected, I am quite surprised a vec is faster though, genius move.
It does make sense. My mind immediately jumped to the u32 "set", as I have seen it used before :D The threading was cool, but c'mon, only 64 threads? Should've used a 128 core ;)
I have started reading the Learning Rust with Entirely Too Many Linked Lists, and it has a link to a clip of mr Bjarne S. Talking about vectors vs lists, and why one should use vectors most of the time
that's true for algorithms but not for data. how your data is laid out in memory is just as important as the algorithm, and some languages limit the control you have over memory allocation, which makes them unsuitable for any program that deals with large amount of data, like games, audio systems etc.
Zig has clean simple syntax like Go with 90% of Rusts safety and are continually decreasing the margin between Rust and Zig's safety gap. It may even get to parity party soon on safety but with a much cleaner API and the entire C lib interop. I have a strong feeling Zig will take the cake within the next few years.
as a beginner developer, seeing you go from an array to bit manipulation was like watching black magic. incredible to see what we can do with these technologies and break them down to their atomic pieces.
Oh you should definetly look into bit and byte representation of several native datatypes. Even if you later code in something really high level aligning data correctly makes everything soo much faster. And some compilers will choose bitmanipulation over actual multiplication or division if you can keep something in a power of two for example.
@@MrHaggyy Could you not replace the mod 32(decimal) with And 011111(binary) , I can't make out from the disassembly( been a long time ) how the modulus operator is compiled. But being mod 32 the remainder of that will be the result of And 11111. Mod if compiled as a instruction using division/remainder takes longer clock cycles to execute than an And. At least that is how it seems to me.
For anyone who is interested into these kinds of algorithms. Go look into genome matching algorithms and also in-memory processing. Both categories heavily depend on these kinds of bit manipulation algorithms to optimize known problems like searching for a exact or fuzzy match.
This is really awesome because you don't just show the final kick ass solution, but also present a mental model on how to get from the naive solution to the fast one.
In the (relatively common) case where there's a 14-character window with more than one duplicate, going backwards within the window allows you to find the duplicate furthest along in the input. This means you can skip ahead further in those cases. It is a good example of a 'greedy' optimization looking for the best case (a late duplicate) first.
But what is the difference between finding the duplicate from back to front first or from front to back? You can skip ahead either way, right? Or am i missing something (which is likely because its 4AM)?
@@nimmidev9122 Consider the example "oabcaefghijlml...". Looking at the first window, 'oAbcAefghijLmL'. If we check from the front, once we encounter the second 'A', we will start the next window after the first occurrence of that character, skipping 1 window. But, if we check from the back, once we find the second (right-to-left) 'L', we will start the next window before the second occurrence of that character, skipping 10 windows. Since we are predictably accessing only 14 bytes at a time, prefetching and, by extension, reading backwards should not be an issue. On average, we will end up skipping more windows going backwards.
@@nimmidev9122 try the easiest case of "aaaaa aaaaa aaaa...", if you go from the front, you can't skip ahead, if you go from the back, you can skip ahead, so they are clearly different. But I don't know if there are any case where going from the front is beneficial
@@TheJocadasa Yee, makes much more sense now. Idk what i was thinking. Should probably stop watching such content in the middle of the night 😅 Thanks for the explanation.
But what if you have abcdefghijklman? The first duplicate is the second 'a' in the 14th position, but the first unique sequence is the b-n run. If you skip to the second 'a', you miss that entire sequence. Or if you want more than one duplicate, change the 'c' to a 'b' and add an 'o' at the end (abbdefghijklmano). Now the correct sequence is b-o. Either way, it feels like it would miss correct solutions.
Really like videos going over practical algerithm optimization. I feel in school there is a lot of focus on theory and very little on how to make things run fast in the real world. Love to see more of these types of videos!
I feel like we talk a lot about solving problems, and if it’s an acceptable result, that’s good enough. Most of the time, I am completely unaware that things even can be optimised this much. Blows mind.
More of this is needed! I never knew I needed to know this stuff. Best part of TH-cam is getting mentored by a senior dev without having to be in the same office!
@9:40 You don't actually have to collect up all the elements in the iterator for this! In Rust, if an iterator implements DoubleEndedIterator then it can iterate in reverse. std::slice::iter implements this trait, so calling any of the "r" iterator methods will just start counting from the end instead of from the start.
The assembly suggests that optimization is being applied by the compiler, since it is already iterating in reverse (over completely unrolled loops). I might be misunderstanding you: are you saying there are more things that don't need iterated (from your first sentence), and the second sentence about reverse iterator was a separate note? I don't use Rust, so "collect up all the elements in the iterator" might have a Rust technical/idiomatic meaning that I am missing.
And a note of caution: Although yes the performance increases significantly, the more optimised it becomes, the more specialised and hard to maintain it will also be. For the purposes of the Advent of Code challenge, it's perfectly suitable. However for daily development, the tradeoffs need to be considered really carefully. Finally, before going down the rabbit hole of extreme optimisation, don't forget to *profile your code first* ! You don't want to spend an inordinate amount of time hyper-optimizing a portion of your code that only affects 1% of total execution time. Find pieces of code that affects 10% or more of execution time and focus your efforts there first.
this mindset is why today we need 500MB to install a shitty app on a phone. back when resources were scarse, it was people job to come up with clever optimizations and techniques. nowadays people are just lazy and hide behind the "premature optimization" bullshit
for a task of the sort as presented in the video, there is NO REASON AT ALL to stick with the "simple" solution. the clever algorithm reminds of en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and is something any decent programmer is supposed to understand easily (possibly with introductory comments to explain some details). the best practice would be to code the optimized algorithm and unittest it against the simple reference solution for correctness.
This sort of mewling adds nothing. "Be sure to check whether the floor is dirty before mopping the floor or else it will be a waste of time" is not the brilliant insight you want it to be. What's really going on is that people like you are often coming from a corporate managerial sort of mindset where you want all your cogs to churn out the most uniformly basic code humanly possible with no intelligent or individualized thought given to things like engineering or efficiency, but saying that out loud in so many words doesn't sound very sexy so instead you go around offering your contempt for people who enjoy the prospect of writing better code as if it is some sort of counter-intuitively enlightened insight you have descended from the heavens to grace us with, but it is no such thing.
This reminds me of the FISR algorithm in Quake 3 that enabled it to run blazingly fast. That was a beautiful abomination peice of code. I wonder whether there are tools that calculate which of your functions take the most overall time.
I just had my imposter syndrome gone, and then I watched this video. Damn it! Great video, a lot of small details and bits of information. Being a nerd is a lot of fun. Thanks ;)
It's not mentioned, so one way to split the problem for multiple threads, is to divide the string into 64 pieces and ask each thread to solve one piece. You would also need to extend the boundaries of each piece by 14 (maybe 13?) characters on one side so that you can catch 14 size windows that are in multiple pieces. Report the min index from all threads as the final answer.
That would parallelize the list, but has 1 minor issue. We are looking for the first instance of a unique string, and one of the threads could find a unique string later in the sequence before the threads looking through the beginning of the sequence found the first unique string. Maybe, set up a queue for the threads. There are different ways to do so, and in this instance, you could set up two variables. queue_end = total_string_length - window length queue_value = 0 A thread would take a window with its starting position at queue_value and increment queue_value by one, so that the next thread would take the next window, and so on, until one of the threads found a unique sequence or queue_value reaches queue_end
@@LtdJorge And that all will be slower, because threads in programming language are abstractions, that involve allocation lots of memory and creating bunch of stuff, but in the fastest solution in the video there's direct use of simd instructions of the processor, which is on the whole another level
@@daedalus_00 Every thread will return the index where it found the first unique string. This index is the index in the original string and not the smaller piece. So we can take the minimum of the indexes returned by all the threads at the end as the final answer.
@@loam Multi threading works along with SIMD. The function that solves each part of the whole string is exactly the same as the one that solves the entire string with SIMD alone. If the input is large enough, the multi threading optimization can be taken a step further by using GPUs and distributed systems. Both optimizations have their pros and cons.
The stack allocated array and the vector array both have cache locality benefits due to them _both_ being contiguous. The difference is the vector is making a syscall (potentially) to get memory off of the heap. You could reuse the vector after an iteration and you could speed up that variant some. The stack allocated array, gets further cache locality, since you don't even need the first fetch of a cache line - its already in the cache. But both have the benefits of spatial and temporal locality.
I like the comments about the order of hashing. I think in university alot of the time they fail to mention the constants in the complexity conversation and you often end up with people always thinking hashmaps and sets being the fastest completely unaware of the cost of hashing and the common list iterator for collisions.
When I went to university, hash sets didn't even have linked lists inside of containers. We actually used linear and exponential back off to determine it's insertion location. It's a neat improvement to how much space you need for a set, and it also seems pretty fun to implement.
@@ThePrimeagen there's a cpp con lecture about googles flat hashmap implementation I watched years ago that was pretty interesting in terms of the decisions they made with writing a faster implementation. One of their optimizations was to use no pointers at all so once the hash gave an address you would just iterate across the key value pairs till you found your equality. On average you wouldn't need to iterate across many elements.
This is the type of videos I look for, where they scratch the surface good algorithms, give examples with actual code, and share similar ones. Amazing!
Nice explanation! I pretty much went for full speed with no regards for readability / maintainability with this AoC (using rust ofc lol). I managed to get my day 6 solution down to about 4 microseconds on one of the discord callenge bots and I had a rather similar approach to you. The performance on the bot was a bit flakey, but some guy had it down to 2us on the bot which was what I got locally on my ryzen 7 2700x
When I was first learning to code I had this one problem. A robot is in a maze, you do not now the ending point, make a program to solve the maze. I just implied the right hand rule, however I noticed that when searching for the right most wall it looked a lot like a 3 bit number. I ended up figuring out now to implement it, and was really proud!
I just started learning about hash tables, and immediately saw incredible performance boost. I would have never imagined that an array or a vector could do the trick
@@Winnetou17 that's the thing, if you know where the things are then the array will be faster since it will always be very close to a true O(1). The problem is keeping track of where everythign is, wich is where the hashtable comes into play. It's a question of using the right tool for the right job.
Love this content. Really cool and even though I haven't looked at some of the concepts for a while, i can tell you've got fundamental knowledge to bank on. Can't wait to see more!
I'd really love to see more videos like this about how crazy optimization can be. It was really intriguing to see how much overhead there is in what seems like the best approach, and how quickly a solution can run if you remove all of that.
At around 5:40 you use modulo to turn a letter (a..z) to a number between 0..31. Modulo is overkill here. There are only 26 letters, so a simple subtraction c - 'a' would give you a number between 0..26, with 'a' = 0, 'b' = 1, and so on.
As a new programmer, and an aspiring front end web dev and indie game dev, I don't understand a single fucking thing you're talking about, but you get me excited to learn. But uh...that really was blazingly fast. Maybe I'll change career paths. 👁️ 👄 👁️
Also worth noting that reverse iterators in rust don’t first go forwards, they rely on double-ended iterators, which for things like slices and vectors can start from the end
I'd love more of these code optimization videos, like okay maybe we're not using these techniques and optimizations everyday but so much fun to nerd out about them
The last version (the one that was going backwards), while technically still O(1), contains a substantial change in the algorithm that changes the underlying distribution of all possible runtimes (from an Analytical Combinatorics sense). If you would like to learn more about this, watch Robert Sedgewick's course on Coursera (AofA and AC)
This is absolutely fascinating. I’ve a few questions: 1) How does one goes about developing these fast solutions? In order words, how does one learn to develop the skill of algorithmic thinking? 2) There’s always a tradeoff between performance and readability. So how realistic is it to use the fastest solution in a real world code base?
1. well you have to determine if the speed up is meaningful something simple as "just because its O(1) doesn't mean its fast" is a great thing to remember. Experiment to find what makes a difference
For a real code base, you write your code in a readable way, then measure it. If it's not fast enough, find where it's slowest in the hot path and make that bit faster. Repeat until it's fast enough.
@@ThePrimeagen I think that a good take that big O notation is not about whether an algorithm is fast or slow. It is about if it scales. A O(n) solution that is slow but viable at small scales will continue to be slow and viable at large scales.
1) Never stop programming new things/projects. This will challenge your intuition and improve it. 2) It depends on the use case. When performance is critical, readability is sometimes sacrificed, though this can be mitigated by, for example, using comments explaining what's going on. However, it's often more encouraged to prioritze readability when performance isn't a concern, so as to make the code easier to scale
@@georgehelyar This, or see if you can abstract the problem to a more general problem where optimizations are already known. An easy example example: sorting characters can be abstracted to sorting numbers, and sorting algorithms for numbers are well known.
Your understanding of concepts always amazes me. I just got off learning JavaScript and now learning c++ and data structures for job placement but my true goal is to be like you. Have a deep knowledge of code and write blazingly fast solutions.
@@codesymphony it depends on who you work for. in my job having to process 50+ million records does require inner knowledge to deliver better experiences
@@earthling_parth not sure I agree. The fact that front end developers have a reputation for being less educated on CS fundamentals doesn’t excuse the utility of knowing them.
@@TehKarmalizer Games is a great example, programmers used to rely on fast and optimized code to the game even run, nowadays it seems that it as shifted and more and more optimized garbage shows up every day.
0:58 An ad perfectly cut you off at "The simp" and I was waiting excitedly for what you were going to say about simps and coding once the ad ended. My disappointment is immeasurable.
3:51 you get cache locality with vectors too, it's more about accessing the heap which takes longer than accessing on the stack since everything is known compile time as opposed to having to look up the page (might not or might not be in the TLB), deferencing, etc.
I think a step you missed in your explanation that would make the approach more clarifying is to mention explicitly that you want to use each bit to represent whether a certain character has been seen, and that you use the character's ordinal representation modulo 32 to determine the index of the bit this information is stored in. I think it's worth remarking that you can only do this because all the characters in your input are adjacent in their ordinals and there is less than 32 of them (or else you'd get position clashes).
Awesome video, thank you! I love it when you dive into the lower-level aspects of programming. Also a nice reminder that approximations can obfuscate significant differences.
Bit manipulation is how I solved the memory issue in my "because I can" project, that was a virtual filesystem. I had to store a map of all free and allocated sectors, and having a byte for each of them would result in 1GB of memory used for every 1TB of space... Quick fix and now it's roughly 130MB, accounting for all overhead for during mapping and splitting the sector maps, and other memory-saving things I did... Moral of the story is that bit manipulation is such a nifty trick to get around memory access time issues, especially when there is a lot ot memory calls needed to perform something. Not only do you cut on your (quite expensive) memory calls, but also move a large chunk of your task into the cache.
The going backwards to skip large number of characters was likely inspired by the Boyer-Moore string-search algorithm, which is also what powers "grep". It's actually faster to grep for long strings than it is to grep for short strings because of this optimization. Longer search strings let you skip faster.
small correction at 5:41 c%32 will result in a value between 1 and 26 since (a)97%32 = 1 and (z)122%32= 26 so total of 26 possibilities (letters of the alphabet) another form of optimization would have been creating a custom binary type of 26 bits and doing the same thing (correct if i’m wrong) Also don’t know if it’s possible but maybe mapping each character to a 5 bit type (32 possibilities closest power of 2 to 26) and then doing the window comparasion technique might also add some speed not 100% sure.
The array solution is the one I would have come up with, but I think there may be a little room for improvement there, depending on what the compiler is already doing with that code.
I love this style of rust. It looks pretty confusing at first sight but it's much better than for loops with random looking operations. The algorithms kick ass. And the browsing through at low level was pretty neat as well. It's pretty wild to consider how a human would just look at the 15th value and say "I've seen that already" instead of comparing all of them individually. Despite being pretty slow at some things, human brain just keeps amazing with all its optimizations. The Perez algorithm was really cool, I only recently saw it explained somewhere but I didn't understand it originally nor did I remember it. What surprises me is that it isn't part of standard algorithms.
This video shows perfectly why I love writing high performant code. You get to be creative: Use your existing knowledge, e.g. the Boyer Moore algorithm for string matching, to come up with solutions for new problems. Know the internals of the underlying system to select the right data structures. All theory aside, you have to actually test and experiment. Measure something odd? Then fire up that compiler and check the assembly. It just never gets boring!
You are presenting everything with such an ease, and it is highly understandable to me, self-taught (no university degree) DevOps engineer. If you were my teacher in high school I would be probably working for NASA or Roskosmos now... Make a course related to Rust development, Rust for beginners or similar...
I watched this all unfold on twitch but love having the video here to review. As someone that didn't get a formal education in CS, this low level stuff is so helpful to be able to return to.
I had the binary method in mind, but you explained it beautifully, and benny implemented it much better than what i would do. Though, i'm just a junior with 0 work experience, but with 6 years of programming experience
reminds me of the massive perf increase on an algorithm by standupmaths to find 5x5 words with no character repeats. went from days down to milliseconds due to work by his community before watching this vid past problem description, my first thought is to use char -> ints 1-26 and bitshift/xor into an int and popcnt ===whatever the length needed is... or & / | s to check and insert and abandon early if before == before&char
He is a math teacher who used Python because he doesn’t understand the first thing about optimization. I spent 1 day writing C++ code and it took 20 mins to find all solutions. Spent the next morning and it took 2 minutes. The afternoon and it took 20 seconds.
In Julia, the naïve solution (using Sets and moving window just +1) runs in 7-8ms and the fast solution in 7-8µs on chars, or ~3µs if the input is already bit states. (Didn't bother doing a threaded version.)
I know I’ve seen you do this many times, but I think it would be SUPER cool to see you do the setup and generation of stats/graphs for perf. The more I perf test, the more I see people are often optimizing for the wrong things
Checking if something is in a vector/array is btw not constant time, but O(n). It's just constant here, because n the size of the vector is constant. It's often faster than hash sets, if your n is in the low tens, sometimes even low hundreds. That's less true for interpreted languages without raw/non object number types though. I've seen cases in python where hash sets of 5 elements were faster than lists for such checks... I guess less interpreter work almost always wins the performance race (especially with GIL)
It used to happen a lot earlier in 80s and 90s. CPU was a scarce commodity. You wanted to squeeze as much performance out as possible. Zen of Assembly programming by Michael Abrash and How to solve it by Computer by Dromey were some amazing books I read growing up.
@@KaranChecker technically compute time was not a commodity back then because software was run on hardware owned by the company and thus is not sold on a market, for compute time to be a commodity it has to be sold on the market, eg AWS, Linode etc. But i dont think those kind of things where common back then.
That was really fascinating! Also, what language(real programming language) would you recommend to learn Data structures and algorithms in? I primarily use JS/TS for side projects and stuff.
js is fine for algorithms. You might want to look at C for stuff using pointer operations, but Javascript can pretty much do the same thing by wrapping things in arrays or objects
@@ElektrykFlaaj @shadamethyst1258 Thanks for the insight! Will stick with Js until I find a good reason to switch. Also, I was concerned because people complain about JS lacking some fundamental DS like heap, trees, etc which i thought could be a deal breaker during the interviews?
If you want to deal with manual memory management, then Rust would probably be the way to go. As far as just learning how data structures work, you're probably best just googling the structures you're interested in. Most languages can represent them, but obviously a systems programming language can do so more efficiently. That being said, a lot of the time JS is just fine speedwise, it's actually a pretty fast language most of the time.
If written slightly better, the innermost loop can have only one simple branch check, which is whether a duplicate is found. And it is a cold check. No need for bounds checking in the inner loop.
Paused at ~ 5:05 as I heard the next and last one will be bitwise, to ask this After the array optimization, but before the bitwise optimization, what causes the fact that this following optimization doesn't score between those? (Which we know because you didn't do that in the vec or array optimization, and didn't even list it - and we trust that your rankings are exhaustive and only excluded anything worse than naive algorithm #0 aka "simple"): Instead of using a 14-length array and checking contains (or your array equivalent), you could have a 29-length 0-initialized array, and a hashfunction that takes a character and gives a corresponding index. Should be something like `ch - 65` (assuming the input string is uppercase), because utf-8 is kind towards the alphabet. Except you do need to remember to handle the last 3 characters differently, as utf-8 is *not* kind towards those. Then, as should be obvious, you check if the index is non-zero. If it is, there's a duplicate. If not, you make it non-zero. Hmm, now that I think about it, this same algorithm should be able to also use bitwise manipulation. Not sure if it would be faster, if it is I doubt it is a large improvement, but should nevertheless be doable. Main issue (with getting it more performant) should lie in the fact that you need to operate over 29 bits, aka 4 bytes, aka 32 bits (as you generally don't get to be more choosy than 8 bits at a time). I guess with a language that allows the bitwise operators to be performed on `int`s and the like, and not only `byte`s could achieve some performance... But that is far from all, so says more about the languages performance than the algorithms. Edit: lol, at 5:20, and you just started talking about the utf-8 stuff I mentioned (albeit lowercase, and forgot exceptions the last 3 letters of the alphabet) :p ..And for some reason seem to assume we have access to 32+ bit types with bitwise operators? edit2: hmm, using modulo instead of subtraction... Unsure why, and wouldn't module be slower? Surely some good reason will pop up soon. Maybe that is how you will deal with the exceptional letters? edit3: Huh, that Snake-esque optimization was really clever! Is the window operation even required then? If done this way it should be enough to keep track of 2 indices i_0 & i_1 where i_1 = i_0+13, no? I have trouble imagining that generating all sub-arrays with the windows operation beats incrementing 2 counters each iteration. Maybe it's something language specific? Like, how I am fairly sure windows in Rust is lazy. The "count_ones" operation is very iffy to me though. I have no idea how that is done as the function is a black box, and my naive guesses are that it literally iterates over every bit and foldLs them together. I wonder... Well, either way it should probably, maybe, hopefully, remain faster thanks to the Snake-optimization, even if it needs to iterate 32 bits every iteration. Also surprised you opted to not relate it to the well-known Snake-optimization during your explanation, for clarity and brevity. edit4: Oh, david's returns more to the initial idea (checking each insertion + not playing Snake), but instead noticed a huge optimization in the fact that when you found a duplicate, you should be able to skip to the index where your window loses the first occurrence of the letter (cheap because of the reverse() call, and opting to not play Snake). Surely that could have been adapted to the other algorithms too? How come they scored worse than the naive algorithm #0 in your rankings, and was thus excluded? Starting to believe your rankings weren't exhaustive, as if you intentionally opted to not rank an infinite number of candidates because you decided it's unrealistic... edit5: Threading... That's honestly kind of a boring reveal, even though I have to admit it is exactly what I expected when I saw the ranking named it "async" :/ If that was all you did, then the comparisons should have had the other algorithms (including "simple") threaded to begin with, as it didn't appear like the ability to thread it seemed in any way limited to the last one. Also, curious, did you simply thread it on your cpu, or did you translate the algorithm into compute shaders and use the GPU? If not the latter (with the choice of 64 thread it sounded like the former), where does that last algorithm rank when you do? edit end: Huh, you never explained why module was preferred over simply subtracting. And never did mention the exceptional 3 last letters of the alphabet (and I really doubt module magically was lucky and took care of it in secret)). Feels like those last algorithms will only work so long as your input constrains itself to the first 26 characters, and any occurrences of 27th, 28th or 29th will cause bugs. edit beyond end [because I started wondering why they didn't just modulo by 26 anyway, since they seemed to already expect the input to only have 26 first characters + only lowercase - so I went to verify something on the utf-8 tables]: Hmm, think I figured out why module 32. An interesting facet about UTF-8 which I was unawares of is that: after uppercase ended, the number of random characters before lowercase starts, is exactly enough that A to Grave accent ('A' .. 'a'-1) are 32 characters. I figured 32 was taken because of the choice of an u32, but seems it was choosen after a look at the UTF-table, and checking `len = 'a'-1 - 'A'`. You should probably have mentioned that (and in fact, from how you explained things, you appear to have had the same misunderstanding as me). Though all that said, this only removes the "only lowercase" requirement. The last 3 letters are still buggy, and in fact the last letter gives this `'Ö'%32=214%32 = 22`
Fun fact about modulo. When it is modulo by 32, it is not anything other than a bit mask operation So it's actually quite optimized ------- Brief explanation. Think about moduloing by a power of two, all it really is is a simple bit mask :)
@@ThePrimeagen Yeah, makes sense. Another advantage to mod 32 instead of 26, even if UTF-8 hadn't (probably intentionally) had such useful "a and A are both multiple of 32 + alphabet mostly ordered sequentially" properties (ps: iirc 'A' is also a multiple of 32, so the "array"/u32 is actually indexed in alphabetical order: a is at 'index' 0). Wasn't aware the ALU actually did such optimizations tbh. Or wait, is this another one of those cases where I have to remember it's Rust and thus has compilation optimization? The topvoted comment and that 'chapter' about how the gains are the same in JS, makes me constantly try to think in a language-independent context. Since you didn't answer, and I had a wall of text so you likely missed them, what about the other questions?: - Did you try the async 'algorithm' as compute_shaders on GPU? To what results? - Did you recognise the Snake(game)-optimization? - Is u32.count_zeros doing any nifty stuff beyond a foldL or equivalent? - Were you similarly 'fooled' by the mod 32 to think it was choosen by the "array"s/u32s "length", when it was in reality to fix the lowercase/uppercase issue through mentioned quirk of UTF8 being intelligently designed with this kind of stuff taken into account? - And finally, the most important question: what about the letters after Z? Any ideas on how to fix those algorithms so they aren't bugged? My best idea atm is a replace of the input, turning them into the characters that mods to the letters location, and then run the algorithm. How much is the performance hurt when compared to the algorithms that aren't dependent on intelligent design of character-encodings?
I use this bit manipulation to store a series of flags as a DWORD to keep track of e.g. user configurations in a database. Much more efficient than storing a zillion of true/false fields. Never thought about using it to compare hashes. Good one 😁
Pedantic alert: the “x times faster” needs to be +1 that of the percentage, after allowing for the 100x factor to covert one to the other. I’ll let the reader work out why.
Thanks for the awesome content! Minor quibble: left shift always increases by the base, in binary that’s 2, not 10, it just looks like you multiply by 10 because the only digits in the numbers are 1 and 0
I just saw it for a second in the twitter post, but using c - 'a' for bit-indexing will be much faster than c % 32. And if there's a possibility that the total set in general contains less than 14 distinct characters you could traverse it quickly once and then shorten the window accordingly.
I went through this video not spoiling myself the rest and I guessed that it was going to go the SIMD way + the index skipping to the duplicate at some point. I don't see how SIMD operations should result from the reverse iteration though. I think that the reverse iteration just lowers the number of comparisons after we skip idx after finding a duplicate, so iterating in reverse has nothing to do with the SIMD implementation per se, but works in synergy with it.
what if in 2nd last code instead of counting the ones again and again we remove that check and increment or decrement just by checking the bit at which doing XOR will change the value
I think my biggest optimization achievement was rewriting a simulation code we had in our lab so that it exploits parallelism and GPU. Later on, I improved it further so that it uses the CPU/GPU resources more intelligently. Initially, it took about 2 days for the code to finish a simulation. After 2 weeks of throwing ideas around and scratching my head, I was able to make it work in around 30 seconds.
If you went from 80 bit FP to 24 bit GPU FP, then you aren't actually running the same simulation. Not even close. So, yeah, you might be getting a wrong result in 30 seconds. At the very least you are getting a different wrong result. ;-)
I'm not sure, but working with a xor signature accu instead of working with the whole 14 chars window on each iteration seems could be even MUCH faster... something like this: let xor_table = [...]; // TODO static based on sequences on quiz, index are xor signatures, value the sequence #i pub fn cutrez(input: &[u8]) -> Option { let mut xor = 0x0; // 14 chars window xor accumulator/signature for &e in input[0..13] { // preload xor signature xor ^= e; } let len = input.len(); let mut i = 14; while i < len { xor ^= input[i]; // 14 chars window xor signature let sequencei = xor_table[xor]; if (sequencei != 0) { // sequence possible match let mut j = 0; let mut found = true; while j < 14 { // hard check if (sequences[sequencei][j] != input[i - 14 + j]) { found = false; } } if (found) return i - 14; } xor ^= input[i - 14]; // clean oldest char from xor signature } return None; // not found } And I think that this could be optimized even a lot more by threads and by working over input with dwords or qwords instead of byte by byte and taking account of that by loading all offsetted signature variants on xor_table.
i thought using bitwise AND could be faster than using mod (n & 31 does the same thing as n % 32), since mod is a relatively slow operator. does the compiler optimize this?
These kind of optimizations replacing some instructions with another ones are probably the simplest to implement, so I'd be really suprised if the compiler didn't do it. This particular technique is called peephole optimization (was proposed in a paper from 1965) and is applied in many similar cases. Contemporary compilers can do much more that so you should avoid these kind of tricks that often hurt code readability and let the compiler do its job. I recommend checking out talks by Matt Godbolt "What Has My Compiler Done for Me Lately".
Tell me you were surprised by using a vector to speed up instead of using a hashset but telling me i am beautiful (and liking the video)
Slight correction.
When I say left shift is multiplied by 10, I meant 10 in binary. Therefore it is two in decimal
it did surprise me. I was always confused how a hashset was O(1) just like a vector. I didn't realize it was O(1*c) which makes wayyyy more sense. Also the bits I didn't know so thank you :)
SipHash 1-3 (rust collections' default hashing algorithm) is fairly slow for integer keys, so this is kinda to be expected, I am quite surprised a vec is faster though, genius move.
It does make sense. My mind immediately jumped to the u32 "set", as I have seen it used before :D The threading was cool, but c'mon, only 64 threads? Should've used a 128 core ;)
I have started reading the Learning Rust with Entirely Too Many Linked Lists, and it has a link to a clip of mr Bjarne S. Talking about vectors vs lists, and why one should use vectors most of the time
You're beautiful
Exactly. It's not about Rust vs C vs Cpp vs Zig.
It's about how much you understand system programming.
yayaya
Rust vs Cpp vs C vs Zig is still an important question, but the differences are stuff like memory safety and maintainability
it's not understand of system programming. it's general understanding how a hardware works irl
that's true for algorithms but not for data. how your data is laid out in memory is just as important as the algorithm, and some languages limit the control you have over memory allocation, which makes them unsuitable for any program that deals with large amount of data, like games, audio systems etc.
Zig has clean simple syntax like Go with 90% of Rusts safety and are continually decreasing the margin between Rust and Zig's safety gap. It may even get to parity party soon on safety but with a much cleaner API and the entire C lib interop. I have a strong feeling Zig will take the cake within the next few years.
as a beginner developer, seeing you go from an array to bit manipulation was like watching black magic. incredible to see what we can do with these technologies and break them down to their atomic pieces.
Oh you should definetly look into bit and byte representation of several native datatypes. Even if you later code in something really high level aligning data correctly makes everything soo much faster. And some compilers will choose bitmanipulation over actual multiplication or division if you can keep something in a power of two for example.
@@MrHaggyy
Could you not replace the mod 32(decimal) with And 011111(binary) , I can't make out from the disassembly( been a long time ) how the modulus operator is compiled. But being mod 32 the remainder of that will be the result of And 11111. Mod if compiled as a instruction using division/remainder takes longer clock cycles to execute than an And.
At least that is how it seems to me.
But manipulation is a lot simpler than it seems. The hard part is determining where to use it. Definitely test for performance as you use it.
It can be hard and tricky to implement, but if you can learn how to quickly go up and down in levels of abstraction, it can make all the difference.
For anyone who is interested into these kinds of algorithms. Go look into genome matching algorithms and also in-memory processing. Both categories heavily depend on these kinds of bit manipulation algorithms to optimize known problems like searching for a exact or fuzzy match.
This is really awesome because you don't just show the final kick ass solution, but also present a mental model on how to get from the naive solution to the fast one.
Yayaya
@@ThePrimeagen That's why you're the best waifu
This is actually to increase watch depth of the video. You know the next one will be better so you keep watching.
In the (relatively common) case where there's a 14-character window with more than one duplicate, going backwards within the window allows you to find the duplicate furthest along in the input. This means you can skip ahead further in those cases. It is a good example of a 'greedy' optimization looking for the best case (a late duplicate) first.
But what is the difference between finding the duplicate from back to front first or from front to back? You can skip ahead either way, right? Or am i missing something (which is likely because its 4AM)?
@@nimmidev9122 Consider the example "oabcaefghijlml...".
Looking at the first window, 'oAbcAefghijLmL'. If we check from the front, once we encounter the second 'A', we will start the next window after the first occurrence of that character, skipping 1 window. But, if we check from the back, once we find the second (right-to-left) 'L', we will start the next window before the second occurrence of that character, skipping 10 windows.
Since we are predictably accessing only 14 bytes at a time, prefetching and, by extension, reading backwards should not be an issue. On average, we will end up skipping more windows going backwards.
@@nimmidev9122 try the easiest case of "aaaaa aaaaa aaaa...", if you go from the front, you can't skip ahead, if you go from the back, you can skip ahead, so they are clearly different. But I don't know if there are any case where going from the front is beneficial
@@TheJocadasa Yee, makes much more sense now. Idk what i was thinking. Should probably stop watching such content in the middle of the night 😅 Thanks for the explanation.
But what if you have abcdefghijklman? The first duplicate is the second 'a' in the 14th position, but the first unique sequence is the b-n run. If you skip to the second 'a', you miss that entire sequence. Or if you want more than one duplicate, change the 'c' to a 'b' and add an 'o' at the end (abbdefghijklmano). Now the correct sequence is b-o. Either way, it feels like it would miss correct solutions.
Really like videos going over practical algerithm optimization. I feel in school there is a lot of focus on theory and very little on how to make things run fast in the real world. Love to see more of these types of videos!
fast? you mean blazingly fast?
so true...cs teachers are setting us up to fail oh well thank god for chtGPT
Theory is important
I feel like we talk a lot about solving problems, and if it’s an acceptable result, that’s good enough. Most of the time, I am completely unaware that things even can be optimised this much. Blows mind.
How can u do optimization without knowing some basic theories?
More of this is needed! I never knew I needed to know this stuff. Best part of TH-cam is getting mentored by a senior dev without having to be in the same office!
can you recommend some other senior devs on youtube who teach good concepts...
Awesome way of thinking!!
@9:40 You don't actually have to collect up all the elements in the iterator for this! In Rust, if an iterator implements DoubleEndedIterator then it can iterate in reverse. std::slice::iter implements this trait, so calling any of the "r" iterator methods will just start counting from the end instead of from the start.
The assembly suggests that optimization is being applied by the compiler, since it is already iterating in reverse (over completely unrolled loops). I might be misunderstanding you: are you saying there are more things that don't need iterated (from your first sentence), and the second sentence about reverse iterator was a separate note? I don't use Rust, so "collect up all the elements in the iterator" might have a Rust technical/idiomatic meaning that I am missing.
And a note of caution: Although yes the performance increases significantly, the more optimised it becomes, the more specialised and hard to maintain it will also be.
For the purposes of the Advent of Code challenge, it's perfectly suitable.
However for daily development, the tradeoffs need to be considered really carefully.
Finally, before going down the rabbit hole of extreme optimisation, don't forget to *profile your code first* ! You don't want to spend an inordinate amount of time hyper-optimizing a portion of your code that only affects 1% of total execution time. Find pieces of code that affects 10% or more of execution time and focus your efforts there first.
You hit the nail on the head
this mindset is why today we need 500MB to install a shitty app on a phone. back when resources were scarse, it was people job to come up with clever optimizations and techniques. nowadays people are just lazy and hide behind the "premature optimization" bullshit
for a task of the sort as presented in the video, there is NO REASON AT ALL to stick with the "simple" solution. the clever algorithm reminds of en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and is something any decent programmer is supposed to understand easily (possibly with introductory comments to explain some details). the best practice would be to code the optimized algorithm and unittest it against the simple reference solution for correctness.
This sort of mewling adds nothing. "Be sure to check whether the floor is dirty before mopping the floor or else it will be a waste of time" is not the brilliant insight you want it to be. What's really going on is that people like you are often coming from a corporate managerial sort of mindset where you want all your cogs to churn out the most uniformly basic code humanly possible with no intelligent or individualized thought given to things like engineering or efficiency, but saying that out loud in so many words doesn't sound very sexy so instead you go around offering your contempt for people who enjoy the prospect of writing better code as if it is some sort of counter-intuitively enlightened insight you have descended from the heavens to grace us with, but it is no such thing.
This reminds me of the FISR algorithm in Quake 3 that enabled it to run blazingly fast. That was a beautiful abomination peice of code.
I wonder whether there are tools that calculate which of your functions take the most overall time.
Amazing, I really love this type of content and inspire me to always thinking about performance
my man! i forgot to say thanks for the comment.
I like making these videos because its fun!
@@ThePrimeagen Anytime!! I'm waiting for more videos like that!!!
I just had my imposter syndrome gone, and then I watched this video. Damn it!
Great video, a lot of small details and bits of information. Being a nerd is a lot of fun. Thanks ;)
It's not mentioned, so one way to split the problem for multiple threads, is to divide the string into 64 pieces and ask each thread to solve one piece. You would also need to extend the boundaries of each piece by 14 (maybe 13?) characters on one side so that you can catch 14 size windows that are in multiple pieces. Report the min index from all threads as the final answer.
That would parallelize the list, but has 1 minor issue. We are looking for the first instance of a unique string, and one of the threads could find a unique string later in the sequence before the threads looking through the beginning of the sequence found the first unique string. Maybe, set up a queue for the threads. There are different ways to do so, and in this instance, you could set up two variables.
queue_end = total_string_length - window length
queue_value = 0
A thread would take a window with its starting position at queue_value and increment queue_value by one, so that the next thread would take the next window, and so on, until one of the threads found a unique sequence or queue_value reaches queue_end
@@daedalus_00just do one window per thread. Mark the order of the threads and return the lowest occurrence that no windows to complete before it.
@@LtdJorge And that all will be slower, because threads in programming language are abstractions, that involve allocation lots of memory and creating bunch of stuff, but in the fastest solution in the video there's direct use of simd instructions of the processor, which is on the whole another level
@@daedalus_00 Every thread will return the index where it found the first unique string. This index is the index in the original string and not the smaller piece. So we can take the minimum of the indexes returned by all the threads at the end as the final answer.
@@loam Multi threading works along with SIMD. The function that solves each part of the whole string is exactly the same as the one that solves the entire string with SIMD alone.
If the input is large enough, the multi threading optimization can be taken a step further by using GPUs and distributed systems. Both optimizations have their pros and cons.
good thing it only did it for 11.9 milli seconds because you'd have a pretty hard time keeping your CPU fed at 617GB/s
Benny's solution with bin manip was great. Slicing and jumping forward, rather than step by step is brilliant
The stack allocated array and the vector array both have cache locality benefits due to them _both_ being contiguous. The difference is the vector is making a syscall (potentially) to get memory off of the heap. You could reuse the vector after an iteration and you could speed up that variant some. The stack allocated array, gets further cache locality, since you don't even need the first fetch of a cache line - its already in the cache.
But both have the benefits of spatial and temporal locality.
6:33 That is NOT the Logical AND Operator (which is &&) but actually the Bitwise AND Operator (&).
Great video! I learnt a few things here!
I like the comments about the order of hashing. I think in university alot of the time they fail to mention the constants in the complexity conversation and you often end up with people always thinking hashmaps and sets being the fastest completely unaware of the cost of hashing and the common list iterator for collisions.
When I went to university, hash sets didn't even have linked lists inside of containers. We actually used linear and exponential back off to determine it's insertion location.
It's a neat improvement to how much space you need for a set, and it also seems pretty fun to implement.
@@ThePrimeagen there's a cpp con lecture about googles flat hashmap implementation I watched years ago that was pretty interesting in terms of the decisions they made with writing a faster implementation. One of their optimizations was to use no pointers at all so once the hash gave an address you would just iterate across the key value pairs till you found your equality. On average you wouldn't need to iterate across many elements.
Implementation details matter. It’s important to have a rough idea how data structures work to understand the performance implications.
This is the type of videos I look for, where they scratch the surface good algorithms, give examples with actual code, and share similar ones. Amazing!
“Binary ten thousand” is a phrase that makes be very uncomfortable. Lol
😂
Because it looks like it's already in binary yes?
More videos like this please! Love the deep dive into blazingly fast computer science topics
Nice explanation! I pretty much went for full speed with no regards for readability / maintainability with this AoC (using rust ofc lol). I managed to get my day 6 solution down to about 4 microseconds on one of the discord callenge bots and I had a rather similar approach to you. The performance on the bot was a bit flakey, but some guy had it down to 2us on the bot which was what I got locally on my ryzen 7 2700x
When I was first learning to code I had this one problem. A robot is in a maze, you do not now the ending point, make a program to solve the maze.
I just implied the right hand rule, however I noticed that when searching for the right most wall it looked a lot like a 3 bit number.
I ended up figuring out now to implement it, and was really proud!
I just started learning about hash tables, and immediately saw incredible performance boost. I would have never imagined that an array or a vector could do the trick
If you can use arrays, they are always the fastest
@@Winnetou17 that's the thing, if you know where the things are then the array will be faster since it will always be very close to a true O(1). The problem is keeping track of where everythign is, wich is where the hashtable comes into play. It's a question of using the right tool for the right job.
Love this content. Really cool and even though I haven't looked at some of the concepts for a while, i can tell you've got fundamental knowledge to bank on. Can't wait to see more!
I'd really love to see more videos like this about how crazy optimization can be. It was really intriguing to see how much overhead there is in what seems like the best approach, and how quickly a solution can run if you remove all of that.
At around 5:40 you use modulo to turn a letter (a..z) to a number between 0..31. Modulo is overkill here. There are only 26 letters, so a simple subtraction c - 'a' would give you a number between 0..26, with 'a' = 0, 'b' = 1, and so on.
As a new programmer, and an aspiring front end web dev and indie game dev, I don't understand a single fucking thing you're talking about, but you get me excited to learn.
But uh...that really was blazingly fast. Maybe I'll change career paths.
👁️ 👄 👁️
Also worth noting that reverse iterators in rust don’t first go forwards, they rely on double-ended iterators, which for things like slices and vectors can start from the end
That's only sometimes true. You just have to make sure you have a data structure with a known size.
I'd love more of these code optimization videos, like okay maybe we're not using these techniques and optimizations everyday but so much fun to nerd out about them
The last version (the one that was going backwards), while technically still O(1), contains a substantial change in the algorithm that changes the underlying distribution of all possible runtimes (from an Analytical Combinatorics sense). If you would like to learn more about this, watch Robert Sedgewick's course on Coursera (AofA and AC)
This is absolutely fascinating. I’ve a few questions:
1) How does one goes about developing these fast solutions? In order words, how does one learn to develop the skill of algorithmic thinking?
2) There’s always a tradeoff between performance and readability. So how realistic is it to use the fastest solution in a real world code base?
1. well you have to determine if the speed up is meaningful
something simple as "just because its O(1) doesn't mean its fast" is a great thing to remember. Experiment to find what makes a difference
For a real code base, you write your code in a readable way, then measure it. If it's not fast enough, find where it's slowest in the hot path and make that bit faster. Repeat until it's fast enough.
@@ThePrimeagen
I think that a good take that big O notation is not about whether an algorithm is fast or slow. It is about if it scales.
A O(n) solution that is slow but viable at small scales will continue to be slow and viable at large scales.
1) Never stop programming new things/projects. This will challenge your intuition and improve it.
2) It depends on the use case. When performance is critical, readability is sometimes sacrificed, though this can be mitigated by, for example, using comments explaining what's going on. However, it's often more encouraged to prioritze readability when performance isn't a concern, so as to make the code easier to scale
@@georgehelyar This, or see if you can abstract the problem to a more general problem where optimizations are already known. An easy example example: sorting characters can be abstracted to sorting numbers, and sorting algorithms for numbers are well known.
Long time fan and this is my favorite video I’ve seen from you, more of this please!!
Your understanding of concepts always amazes me. I just got off learning JavaScript and now learning c++ and data structures for job placement but my true goal is to be like you. Have a deep knowledge of code and write blazingly fast solutions.
blazingly fast is fun, but it's really all not that important in the workforce
@@codesymphony it depends on who you work for. in my job having to process 50+ million records does require inner knowledge to deliver better experiences
@@epicgameryt4052 Correction to symphony's comment: *not that important in the majority of the workforce.
@@earthling_parth not sure I agree. The fact that front end developers have a reputation for being less educated on CS fundamentals doesn’t excuse the utility of knowing them.
@@TehKarmalizer Games is a great example, programmers used to rely on fast and optimized code to the game even run, nowadays it seems that it as shifted and more and more optimized garbage shows up every day.
0:58 An ad perfectly cut you off at "The simp" and I was waiting excitedly for what you were going to say about simps and coding once the ad ended. My disappointment is immeasurable.
This deep dive into optimization was fantastic! Keep up the great content Prime.
Thank you . Great content. I’m not very proficient with material related to any STEM type of work but I was very interested in this video
Another upload. You love to see it.
yaya!
3:51 you get cache locality with vectors too, it's more about accessing the heap which takes longer than accessing on the stack since everything is known compile time as opposed to having to look up the page (might not or might not be in the TLB), deferencing, etc.
I think a step you missed in your explanation that would make the approach more clarifying is to mention explicitly that you want to use each bit to represent whether a certain character has been seen, and that you use the character's ordinal representation modulo 32 to determine the index of the bit this information is stored in. I think it's worth remarking that you can only do this because all the characters in your input are adjacent in their ordinals and there is less than 32 of them (or else you'd get position clashes).
Sameeee.
Still don't understand the those "you can skip stuff if there are duplicates" why is this? Because our string doesn't have them? Or??
Subscribed, Thanks for the quality content
Awesome video, thank you! I love it when you dive into the lower-level aspects of programming. Also a nice reminder that approximations can obfuscate significant differences.
Bit manipulation is how I solved the memory issue in my "because I can" project, that was a virtual filesystem.
I had to store a map of all free and allocated sectors, and having a byte for each of them would result in 1GB of memory used for every 1TB of space... Quick fix and now it's roughly 130MB, accounting for all overhead for during mapping and splitting the sector maps, and other memory-saving things I did...
Moral of the story is that bit manipulation is such a nifty trick to get around memory access time issues, especially when there is a lot ot memory calls needed to perform something.
Not only do you cut on your (quite expensive) memory calls, but also move a large chunk of your task into the cache.
step 1: remove the `sleep(10) # replace with algorithm`
I have 0 understanding of wtf are you doing but i'm enjoying this so much for some reason, great vid!
man you just say "blazingly fast" and i automatically click the like button.
omg i just realized i did that too
The going backwards to skip large number of characters was likely inspired by the Boyer-Moore string-search algorithm, which is also what powers "grep". It's actually faster to grep for long strings than it is to grep for short strings because of this optimization. Longer search strings let you skip faster.
1606240% faster does NOT mean its 1.6 million times faster.
ya 16K time faster
David's version is a clever use of the Boyer-Moore heuristic of good suffix for string pattern matching
My dad makingan algorithm to save 2 dollars at the mall
small correction at 5:41 c%32 will result in a value between
1 and 26
since (a)97%32 = 1
and
(z)122%32= 26
so total of 26 possibilities (letters of the alphabet)
another form of optimization would have been creating a custom binary type of 26 bits
and doing the same thing (correct if i’m wrong)
Also don’t know if it’s possible but maybe mapping each character to a 5 bit type (32 possibilities closest power of 2 to 26) and then doing the window comparasion technique might also add some speed not 100% sure.
The array solution is the one I would have come up with, but I think there may be a little room for improvement there, depending on what the compiler is already doing with that code.
I love this style of rust. It looks pretty confusing at first sight but it's much better than for loops with random looking operations. The algorithms kick ass. And the browsing through at low level was pretty neat as well. It's pretty wild to consider how a human would just look at the 15th value and say "I've seen that already" instead of comparing all of them individually. Despite being pretty slow at some things, human brain just keeps amazing with all its optimizations.
The Perez algorithm was really cool, I only recently saw it explained somewhere but I didn't understand it originally nor did I remember it. What surprises me is that it isn't part of standard algorithms.
Honestly I expected to see a mind-bending mathematically motivated algo but this is still very impressive
This video shows perfectly why I love writing high performant code. You get to be creative: Use your existing knowledge, e.g. the Boyer Moore algorithm for string matching, to come up with solutions for new problems. Know the internals of the underlying system to select the right data structures. All theory aside, you have to actually test and experiment. Measure something odd? Then fire up that compiler and check the assembly. It just never gets boring!
Most exciting
Could you show your code that used all of your threads? I would like to learn more about how to use threads on a program.
its on my github, advent of code (let me link the code in the description)
You are presenting everything with such an ease, and it is highly understandable to me, self-taught (no university degree) DevOps engineer. If you were my teacher in high school I would be probably working for NASA or Roskosmos now... Make a course related to Rust development, Rust for beginners or similar...
I watched this all unfold on twitch but love having the video here to review. As someone that didn't get a formal education in CS, this low level stuff is so helpful to be able to return to.
byte % 32 can be optimized away with byte & (0b11111). Not sure if the compiler optimizes that away.
it does (i believe)
ahh so that's why (x % 32) is better than (x - 'a')
@@ThePrimeagen cool. Good to know that you write clear code and the compiler rewards you.
I had the binary method in mind, but you explained it beautifully, and benny implemented it much better than what i would do.
Though, i'm just a junior with 0 work experience, but with 6 years of programming experience
Tbh this might be my favorite video of you so far. Very well explained, hats off to you! Also congratz to David for the solution, very clever stuff!
The idea of iterating from the back of the window reminds me of the Boyer Moore algorithm for substring search.
reminds me of the massive perf increase on an algorithm by standupmaths to find 5x5 words with no character repeats. went from days down to milliseconds due to work by his community
before watching this vid past problem description, my first thought is to use char -> ints 1-26 and bitshift/xor into an int and popcnt ===whatever the length needed is... or & / | s to check and insert and abandon early if before == before&char
He is a math teacher who used Python because he doesn’t understand the first thing about optimization. I spent 1 day writing C++ code and it took 20 mins to find all solutions. Spent the next morning and it took 2 minutes. The afternoon and it took 20 seconds.
In Julia, the naïve solution (using Sets and moving window just +1) runs in 7-8ms and the fast solution in 7-8µs on chars, or ~3µs if the input is already bit states. (Didn't bother doing a threaded version.)
I know I’ve seen you do this many times, but I think it would be SUPER cool to see you do the setup and generation of stats/graphs for perf. The more I perf test, the more I see people are often optimizing for the wrong things
Checking if something is in a vector/array is btw not constant time, but O(n). It's just constant here, because n the size of the vector is constant. It's often faster than hash sets, if your n is in the low tens, sometimes even low hundreds.
That's less true for interpreted languages without raw/non object number types though. I've seen cases in python where hash sets of 5 elements were faster than lists for such checks... I guess less interpreter work almost always wins the performance race (especially with GIL)
Imagine writing an algorithm and someone just makes it 1.5 million times better.
It used to happen a lot earlier in 80s and 90s. CPU was a scarce commodity. You wanted to squeeze as much performance out as possible.
Zen of Assembly programming by Michael Abrash and How to solve it by Computer by Dromey were some amazing books I read growing up.
@@KaranChecker technically compute time was not a commodity back then because software was run on hardware owned by the company and thus is not sold on a market, for compute time to be a commodity it has to be sold on the market, eg AWS, Linode etc. But i dont think those kind of things where common back then.
I'm going to need to re-watch this a few times to get my head around those improvements. 😀
The first thing that came to my mind when thinking about this problem was Bloom Filters. They would be pretty fast I guess but not SIMD-level fast.
This is the kind of breakdown I’ve been looking for!! Thank you and your family for this content 😅
show in log scale next time
savage. went from web dev to game dev in 7 steps. blazingly fast indeed, ty for sharing sir!
That was really fascinating!
Also, what language(real programming language) would you recommend to learn Data structures and algorithms in? I primarily use JS/TS for side projects and stuff.
js is perfectly fine for learning learn as long as you dont care about performance
js is fine for algorithms. You might want to look at C for stuff using pointer operations, but Javascript can pretty much do the same thing by wrapping things in arrays or objects
@@ElektrykFlaaj @shadamethyst1258 Thanks for the insight! Will stick with Js until I find a good reason to switch.
Also, I was concerned because people complain about JS lacking some fundamental DS like heap, trees, etc which i thought could be a deal breaker during the interviews?
Common Lisp
If you want to deal with manual memory management, then Rust would probably be the way to go. As far as just learning how data structures work, you're probably best just googling the structures you're interested in. Most languages can represent them, but obviously a systems programming language can do so more efficiently. That being said, a lot of the time JS is just fine speedwise, it's actually a pretty fast language most of the time.
If written slightly better, the innermost loop can have only one simple branch check, which is whether a duplicate is found. And it is a cold check. No need for bounds checking in the inner loop.
Blazingly fast 😂⏩
it really was
This is the best content on TH-cam. I'm about to do your algo course on front-end masters, cant wait.
This was extremely useful!
Please more of this when you can. Lots of Advent of Code days to play with xD
Very clever adaptation of the Boyer-Moore algorithm !
Paused at ~ 5:05 as I heard the next and last one will be bitwise, to ask this
After the array optimization, but before the bitwise optimization, what causes the fact that this following optimization doesn't score between those? (Which we know because you didn't do that in the vec or array optimization, and didn't even list it - and we trust that your rankings are exhaustive and only excluded anything worse than naive algorithm #0 aka "simple"):
Instead of using a 14-length array and checking contains (or your array equivalent), you could have a 29-length 0-initialized array, and a hashfunction that takes a character and gives a corresponding index. Should be something like `ch - 65` (assuming the input string is uppercase), because utf-8 is kind towards the alphabet. Except you do need to remember to handle the last 3 characters differently, as utf-8 is *not* kind towards those.
Then, as should be obvious, you check if the index is non-zero. If it is, there's a duplicate. If not, you make it non-zero.
Hmm, now that I think about it, this same algorithm should be able to also use bitwise manipulation. Not sure if it would be faster, if it is I doubt it is a large improvement, but should nevertheless be doable. Main issue (with getting it more performant) should lie in the fact that you need to operate over 29 bits, aka 4 bytes, aka 32 bits (as you generally don't get to be more choosy than 8 bits at a time). I guess with a language that allows the bitwise operators to be performed on `int`s and the like, and not only `byte`s could achieve some performance... But that is far from all, so says more about the languages performance than the algorithms.
Edit: lol, at 5:20, and you just started talking about the utf-8 stuff I mentioned (albeit lowercase, and forgot exceptions the last 3 letters of the alphabet) :p ..And for some reason seem to assume we have access to 32+ bit types with bitwise operators?
edit2: hmm, using modulo instead of subtraction... Unsure why, and wouldn't module be slower? Surely some good reason will pop up soon. Maybe that is how you will deal with the exceptional letters?
edit3: Huh, that Snake-esque optimization was really clever! Is the window operation even required then? If done this way it should be enough to keep track of 2 indices i_0 & i_1 where i_1 = i_0+13, no? I have trouble imagining that generating all sub-arrays with the windows operation beats incrementing 2 counters each iteration. Maybe it's something language specific? Like, how I am fairly sure windows in Rust is lazy.
The "count_ones" operation is very iffy to me though. I have no idea how that is done as the function is a black box, and my naive guesses are that it literally iterates over every bit and foldLs them together. I wonder... Well, either way it should probably, maybe, hopefully, remain faster thanks to the Snake-optimization, even if it needs to iterate 32 bits every iteration.
Also surprised you opted to not relate it to the well-known Snake-optimization during your explanation, for clarity and brevity.
edit4: Oh, david's returns more to the initial idea (checking each insertion + not playing Snake), but instead noticed a huge optimization in the fact that when you found a duplicate, you should be able to skip to the index where your window loses the first occurrence of the letter (cheap because of the reverse() call, and opting to not play Snake). Surely that could have been adapted to the other algorithms too? How come they scored worse than the naive algorithm #0 in your rankings, and was thus excluded? Starting to believe your rankings weren't exhaustive, as if you intentionally opted to not rank an infinite number of candidates because you decided it's unrealistic...
edit5: Threading... That's honestly kind of a boring reveal, even though I have to admit it is exactly what I expected when I saw the ranking named it "async" :/ If that was all you did, then the comparisons should have had the other algorithms (including "simple") threaded to begin with, as it didn't appear like the ability to thread it seemed in any way limited to the last one.
Also, curious, did you simply thread it on your cpu, or did you translate the algorithm into compute shaders and use the GPU? If not the latter (with the choice of 64 thread it sounded like the former), where does that last algorithm rank when you do?
edit end: Huh, you never explained why module was preferred over simply subtracting. And never did mention the exceptional 3 last letters of the alphabet (and I really doubt module magically was lucky and took care of it in secret)). Feels like those last algorithms will only work so long as your input constrains itself to the first 26 characters, and any occurrences of 27th, 28th or 29th will cause bugs.
edit beyond end [because I started wondering why they didn't just modulo by 26 anyway, since they seemed to already expect the input to only have 26 first characters + only lowercase - so I went to verify something on the utf-8 tables]:
Hmm, think I figured out why module 32. An interesting facet about UTF-8 which I was unawares of is that: after uppercase ended, the number of random characters before lowercase starts, is exactly enough that A to Grave accent ('A' .. 'a'-1) are 32 characters. I figured 32 was taken because of the choice of an u32, but seems it was choosen after a look at the UTF-table, and checking `len = 'a'-1 - 'A'`. You should probably have mentioned that (and in fact, from how you explained things, you appear to have had the same misunderstanding as me).
Though all that said, this only removes the "only lowercase" requirement. The last 3 letters are still buggy, and in fact the last letter gives this `'Ö'%32=214%32 = 22`
Fun fact about modulo. When it is modulo by 32, it is not anything other than a bit mask operation
So it's actually quite optimized
-------
Brief explanation. Think about moduloing by a power of two, all it really is is a simple bit mask :)
@@ThePrimeagen Yeah, makes sense. Another advantage to mod 32 instead of 26, even if UTF-8 hadn't (probably intentionally) had such useful "a and A are both multiple of 32 + alphabet mostly ordered sequentially" properties (ps: iirc 'A' is also a multiple of 32, so the "array"/u32 is actually indexed in alphabetical order: a is at 'index' 0).
Wasn't aware the ALU actually did such optimizations tbh.
Or wait, is this another one of those cases where I have to remember it's Rust and thus has compilation optimization? The topvoted comment and that 'chapter' about how the gains are the same in JS, makes me constantly try to think in a language-independent context.
Since you didn't answer, and I had a wall of text so you likely missed them, what about the other questions?:
- Did you try the async 'algorithm' as compute_shaders on GPU? To what results?
- Did you recognise the Snake(game)-optimization?
- Is u32.count_zeros doing any nifty stuff beyond a foldL or equivalent?
- Were you similarly 'fooled' by the mod 32 to think it was choosen by the "array"s/u32s "length", when it was in reality to fix the lowercase/uppercase issue through mentioned quirk of UTF8 being intelligently designed with this kind of stuff taken into account?
- And finally, the most important question: what about the letters after Z? Any ideas on how to fix those algorithms so they aren't bugged? My best idea atm is a replace of the input, turning them into the characters that mods to the letters location, and then run the algorithm. How much is the performance hurt when compared to the algorithms that aren't dependent on intelligent design of character-encodings?
I use this bit manipulation to store a series of flags as a DWORD to keep track of e.g. user configurations in a database. Much more efficient than storing a zillion of true/false fields. Never thought about using it to compare hashes. Good one 😁
You make my brain hurt... in the most pleasing way possible.
Loved the format of this video, it was explained so nicely and very easy to follow! Ty
Really cool video, Prime. Learned a lot. Ty!
Pedantic alert: the “x times faster” needs to be +1 that of the percentage, after allowing for the 100x factor to covert one to the other.
I’ll let the reader work out why.
Finally, we set a binary 1 (0b1, or just 1) into the right place for that letter using bit shifts, specifically the left-shift ("
Honestly this was such a great video congrats and thanks!
Thanks for the awesome content! Minor quibble: left shift always increases by the base, in binary that’s 2, not 10, it just looks like you multiply by 10 because the only digits in the numbers are 1 and 0
I actually look forward to watching your videos... something rare for me on youtube nowadays
I just saw it for a second in the twitter post, but using c - 'a' for bit-indexing will be much faster than c % 32. And if there's a possibility that the total set in general contains less than 14 distinct characters you could traverse it quickly once and then shorten the window accordingly.
Great stuff, a lot of fun stuff happens at the bit and byte level
I went through this video not spoiling myself the rest and I guessed that it was going to go the SIMD way + the index skipping to the duplicate at some point. I don't see how SIMD operations should result from the reverse iteration though. I think that the reverse iteration just lowers the number of comparisons after we skip idx after finding a duplicate, so iterating in reverse has nothing to do with the SIMD implementation per se, but works in synergy with it.
Appreciate you King 👑
I didn't understand most of it but still loved the video! Thanks!
well howdy! you will some day :)
Wow! That one I'll need to reawatch a bunch of times to get what's happening on those improvements.
Instant like for performance optimization content!!🎉
what if in 2nd last code instead of counting the ones again and again we remove that check and increment or decrement just by checking the bit at which doing XOR will change the value
I think my biggest optimization achievement was rewriting a simulation code we had in our lab so that it exploits parallelism and GPU. Later on, I improved it further so that it uses the CPU/GPU resources more intelligently. Initially, it took about 2 days for the code to finish a simulation. After 2 weeks of throwing ideas around and scratching my head, I was able to make it work in around 30 seconds.
If you went from 80 bit FP to 24 bit GPU FP, then you aren't actually running the same simulation. Not even close. So, yeah, you might be getting a wrong result in 30 seconds. At the very least you are getting a different wrong result. ;-)
The David solution resembles the Boyer-Moore string-search algorithm. Nice video!
I'm not sure, but working with a xor signature accu instead of working with the whole 14 chars window on each iteration seems could be even MUCH faster... something like this:
let xor_table = [...]; // TODO static based on sequences on quiz, index are xor signatures, value the sequence #i
pub fn cutrez(input: &[u8]) -> Option {
let mut xor = 0x0; // 14 chars window xor accumulator/signature
for &e in input[0..13] { // preload xor signature
xor ^= e;
}
let len = input.len();
let mut i = 14;
while i < len {
xor ^= input[i]; // 14 chars window xor signature
let sequencei = xor_table[xor];
if (sequencei != 0) { // sequence possible match
let mut j = 0;
let mut found = true;
while j < 14 { // hard check
if (sequences[sequencei][j] != input[i - 14 + j]) {
found = false;
}
}
if (found) return i - 14;
}
xor ^= input[i - 14]; // clean oldest char from xor signature
}
return None; // not found
}
And I think that this could be optimized even a lot more by threads and by working over input with dwords or qwords instead of byte by byte and taking account of that by loading all offsetted signature variants on xor_table.
thank you for this blazing fast series.
i thought using bitwise AND could be faster than using mod (n & 31 does the same thing as n % 32), since mod is a relatively slow operator. does the compiler optimize this?
These kind of optimizations replacing some instructions with another ones are probably the simplest to implement, so I'd be really suprised if the compiler didn't do it. This particular technique is called peephole optimization (was proposed in a paper from 1965) and is applied in many similar cases.
Contemporary compilers can do much more that so you should avoid these kind of tricks that often hurt code readability and let the compiler do its job. I recommend checking out talks by Matt Godbolt "What Has My Compiler Done for Me Lately".
@@cysia3683 ohh, that makes sense, thanks!