Making an Algorithm Faster

  • Published Jan 31, 2025

Comments • 263

  • @ThePrimeTimeagen
    @ThePrimeTimeagen 4 months ago +2049

    LETS GO! I HAVE MADE IT!!!

    • @mathzygote
      @mathzygote 4 months ago +110

      Hello prim-ye-jun

    • @0deltasierra
      @0deltasierra 4 months ago +6

      hello there Captain Prime

    • @vipulsemwal124
      @vipulsemwal124 4 months ago +3

      my goat

    • @sarcasticdna
      @sarcasticdna 4 months ago +15

      hello little guy

    • @owo4202
      @owo4202 4 months ago +26

      hello little YouTuber!

  • @tommyshelby6277
    @tommyshelby6277 4 months ago +804

    bro primeagened primeagen

    • @omri9325
      @omri9325 2 months ago +3

      pomegranate

    • @SunglassOrang
      @SunglassOrang 2 months ago

      Bob the Builder

  • @landonkryger
    @landonkryger 4 months ago +123

    A similar video by Matt Parker (Stand-up Maths) "Someone improved my code by 40,832,277,770%." I was part of the team that optimized his 1 month solution down to 300 microseconds. We submitted ours kind of late so he wasn't able to cover our big algorithmic changes, but many of the techniques you mention here applied there as well.

    • @samygafsi6429
      @samygafsi6429 a month ago +5

      Thank you for your contribution. That video was one of the best videos on this website

  • @haronxbelghit
    @haronxbelghit 4 months ago +213

    18:00 no coincidence, the range from 97 to 122 falls between (32*3 = 96) and (32*4 = 128), so taking each code mod 32 gives every lowercase letter a distinct remainder, from 1 for 'a' up to 26 for 'z'
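
    A minimal Rust sketch of that mapping (my own illustration, assuming lowercase ASCII input; not code from the video):

    fn bit_index(c: u8) -> u32 {
        // 'a' = 97 .. 'z' = 122 all sit inside one block of 32 (96..128),
        // so the low five bits alone distinguish them: 'a' -> 1, 'z' -> 26.
        (c % 32) as u32 // same as (c & 31); compilers emit an AND here
    }

    fn main() {
        assert_eq!(bit_index(b'a'), 1);
        assert_eq!(bit_index(b'z'), 26);
        // the subtraction form gives 0..25 and doesn't rely on that layout at all
        assert_eq!((b'z' - b'a') as u32, 25);
    }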

    • @NeetCodeIO
      @NeetCodeIO 4 months ago +236

      Damn, it makes you realize how much thought the original programmers put into making things elegant. And then we ended up with JS, web apis, and frontend frameworks...

    • @HtotheG
      @HtotheG 4 months ago +103

      @@ismbks My favorite ASCII hack is that each capital letter A-Z (65-90) is exactly 32 away from its lowercase counterpart (97-122), in such a way that the only difference is the 6th bit (2^(6-1) = 32), making things like case-insensitive comparisons or conversions to upper or lowercase SUPER fast bitwise operations that ignore, unset, or set the 6th bit respectively.
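
      A small sketch of that case-bit trick (my own illustration; only valid for ASCII letters):

      const CASE_BIT: u8 = 0b0010_0000; // 32, the single bit that differs between 'A' and 'a'

      fn to_lower(c: u8) -> u8 { c | CASE_BIT }  // set the bit:   'A' -> 'a'
      fn to_upper(c: u8) -> u8 { c & !CASE_BIT } // clear the bit: 'a' -> 'A'
      fn eq_ignore_case(a: u8, b: u8) -> bool {
          (a | CASE_BIT) == (b | CASE_BIT)       // ignore the bit on both sides
      }

      fn main() {
          assert_eq!(to_lower(b'A'), b'a');
          assert_eq!(to_upper(b'z'), b'Z');
          assert!(eq_ignore_case(b'G', b'g'));
      }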

    • @hellowill
      @hellowill 4 months ago +36

      ​@NeetCodeIO back then programmers were genius. Like you basically needed a phd, or some deep understanding of math.
      The bar has really lowered. Which I guess was necessary to scale.

    • @XtenstialKrysis
      @XtenstialKrysis 4 months ago +3

      Nothing like that lol 😂 ​@@hellowill

    • @jimgraham-sz6cl
      @jimgraham-sz6cl 4 months ago +2

      why do modulo when you can just do `chr - 'a'`? Modulo is much more expensive

  • @sahilverma_dev
    @sahilverma_dev 4 months ago +676

    "from this little youtuber primeOgen"

    • @PP-ss3zf
      @PP-ss3zf 4 months ago +25

      trying to understand if hes just trolling, he said it so seriously

    • @sahilverma_dev
      @sahilverma_dev 4 months ago +2

      @@PP-ss3zf what do you think

    • @PP-ss3zf
      @PP-ss3zf 4 months ago +14

      @@sahilverma_dev my comment.. thats what i think

    • @Iron_spider99
      @Iron_spider99 4 months ago +1

      ​@@PP-ss3zf think some more

    • @yassine-sa
      @yassine-sa 4 months ago +10

      @@PP-ss3zf he's obviously trolling, he knows him

  • @TestTost-j4d
    @TestTost-j4d 4 months ago +53

    To preface, you explained everything very nicely, even the tricky bits :).
    The stack is handled by the CPU under the direction of the OS. There is still overhead when you cross a page boundary. The heap is not handled by the OS but by whatever allocator you use (the allocator mmap()s pages when needed). Allocators usually use "bins/buckets" for various allocation sizes, so it's pretty fast, unless the allocator has to mmap() more memory.
    Anyway, what I'm trying to say is that it's complicated. If your language lets you, you could even use the stack as a dynamic array. Or you could mmap() a big piece of memory and just use it as a dynamic array, since memory doesn't get allocated until you touch it, which gets you the same result as the stack. If the array size is fixed, the compiler could even just reserve a piece of memory from when the program is loaded (like it does for const strings).
    Cache locality is also a bit subtle. CPUs cache memory in "cache lines", which are usually 64 bytes. And yes, if your resize moves the data, then all those cache lines become useless. Then again, the memcpy puts the new data into cache, so it's not ~that~ bad; it just means something else gets thrown out. And there are more levels of cache: L1 is closest to the core but way smaller than L3, of course.
    And yes, the CPU "prefetch"-es data as you read it in. It even figures out the direction, so going 0..n is the same as n..0.
    In short, you always want to use as little memory as possible and keep memory that is accessed at the same time as close together as you can. If you can keep it in registers, like the bitfield solution, then you're golden. And you ~might~ want to align/pad to some power of 2 (especially to 16 bytes for SIMD, which you even had to do on older CPUs).
    PS: Your solution of subtracting 'a' would also be faster than modulo (modulo is a divide, so of course subtract is faster). Btw, bitwise operations usually take 1/3 of a CPU tick; they're the fastest operations there are (except maybe mov between registers).

    • @egor.okhterov
      @egor.okhterov 4 months ago +4

      This ☝️
      It is not about stack or heap.
      These are just simple introductory concepts taught in University, but the majority of the people stop at that and never actually understand how memory works.
      There is MMU, TLB inside of CPU.
      There are virtual pages of 4 KB that are loaded via page faults.
      There are .text, .data and .bss sections, which are neither heap nor stack but are still loaded into the program's address space.

    • @treelibrarian7618
      @treelibrarian7618 4 months ago +5

      ... although modulo (%) 32 is optimized to an AND (&) with 0x1f, so it's basically the same, and other static modulos are normally optimized to a 2x multiply, so not so bad. And bitwise ops aren't 1/3 of a clock, they take a whole one: there are just 3 or 4 ALUs that can do them, so 3 or 4 can be done each tick if nothing else is happening. Also, since Ice Lake the move elimination you refer to has been removed on Intel (sadge...), so mov takes the same time as other ALU ops now... but Zen still does it.

    • @dorianligthart3378
      @dorianligthart3378 4 months ago

      @egor.okhterov I'd assume the devs in the video read the test string at runtime, from stdin or a file (so there would be no .text section containing it), because otherwise the compiler, after linking, could just optimise around the answer, maybe even fill it in 😂 making the timing results invalid.
      What would otherwise be the point of the problem, if not automation? You would write the first solution you came up with and wait a bit, thinking about how a younger you would have wasted time optimising the code hahaha
      But it's nice to finally find/read a more technical YT comment thread.

    • @saharshbhansali250
      @saharshbhansali250 4 months ago

      I really wish I could share YouTube comments.

  • @SanjayB-vy4gx
    @SanjayB-vy4gx 4 months ago +145

    Bro gave 2hrs length content for him

  • @TheOnlyJura
    @TheOnlyJura 4 months ago +30

    "the actual runtime is what matters" - tell that to the average react developer

    • @0x0michael
      @0x0michael 4 months ago +11

      Lol lol lol, they're too bothered about elegant abstraction while their apps keep making people replace their phones every two years

    • @nathanhedglin931
      @nathanhedglin931 2 months ago

      Just use useMemo /s 😂

  • @juanmacias5922
    @juanmacias5922 4 months ago +12

    These random topic videos have been really insightful, great content!

  • @eblocha
    @eblocha 4 months ago +18

    I think you can still get a cache locality boost using an array, because the array’s memory is next to other stack variables. That means the array’s memory is more likely to be in the same cache line as the other stack variables.

    • @vasiliigulevich9202
      @vasiliigulevich9202 4 months ago

      You only need two pointers referencing input, what array?

    • @TanmayPatil37
      @TanmayPatil37 4 months ago

      Pointers are pointing to underlying array which has to be accessed (through the pointers)

  • @howto.3411
    @howto.3411 4 months ago +16

    That is a BRILLIANT video, loved watching it.

    • @NeetCodeIO
      @NeetCodeIO 4 months ago +12

      I'm honestly glad there are people out there that enjoy this stuff as much as I do. Love deep technical concepts.

  • @valentinrafael9201
    @valentinrafael9201 4 months ago +3

    Ignoring the constant *when you are learning Big O* is important so that you don't get distracted. However, when building something, it's only relevant once you are already at the "simplest form", the smallest Big O you can achieve, and then the constant matters.

  • @zangdaarrmortpartout
    @zangdaarrmortpartout 3 months ago +1

    At 14:20 he is reallocating the array every time the window changes, but with a well constructed loop it is possible to reuse the same array and toss out only the indexes we know haven't been updated and still contain values from the previous sub-string. In C#, doing this makes this algo 4x faster.

  • @0x0michael
    @0x0michael 4 months ago +2

    You're right about cache locality not being involved. It's the same thing with strings and small string optimizations

  • @CrapE_DM
    @CrapE_DM 26 days ago +1

    Ascii values weren't technically chosen to be used mathematically like that, but there is a similar idea there. They wanted to be able to check whether a letter was upper or lowercase simply by checking a single bit. Or maybe it was for converting between.

  • @GuRuGeorge03
    @GuRuGeorge03 4 months ago +4

    I am a lead web developer and have never done any leetcode except in university. I recently started leetcode to get into a big name company that pays like 10%-20% more than my current company and videos like this are very eye opening!

    • @marcsh_dev
      @marcsh_dev 4 months ago +1

      That's quite a bit less than I would've thought, though. If you like the folks you work with a lot, don't get a new job for anything under 25% more than what you make now.
      Being on a horrid team is the worst. I'd much rather work for less money than work with folks I dislike.

  • @dekumidoriya-qi2is
    @dekumidoriya-qi2is 4 months ago +422

    this little youtuber 🤣🤣

    • @plaintext7288
      @plaintext7288 4 months ago +12

      he should stream as well

    • @dekumidoriya-qi2is
      @dekumidoriya-qi2is 4 months ago

      @@plaintext7288 fr fr

    • @stoic12343
      @stoic12343 4 months ago +1

      @@plaintext7288 agree

    • @aflous
      @aflous 4 months ago +6

      Primyejen 🤣🤣🤣

    • @sebastianwapniarski2077
      @sebastianwapniarski2077 4 months ago

      That's classic

  • @freya28733
    @freya28733 2 months ago

    25:44 It also instantly gives you the place to jump to: going forward you reach the 2nd J and you'd have to go back to see where the 1st J was in order to continue, whereas going backwards your pointer is already at the second J you meet (which is actually the first J, since we are going backwards), so you instantly know where to start the next window (index of that J + 1)
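
    A rough Rust sketch of that backwards scan (my own illustration, assuming lowercase ASCII input and a window of 14; not the code from the video):

    fn first_marker(s: &[u8]) -> Option<usize> {
        let mut start = 0;
        'outer: while start + 14 <= s.len() {
            let mut seen: u32 = 0;
            // walk the window right-to-left, so the first repeat we hit tells us
            // the earliest position the next candidate window could start at
            for i in (start..start + 14).rev() {
                let bit = 1u32 << (s[i] - b'a');
                if seen & bit != 0 {
                    start = i + 1; // no window starting at or before i can be duplicate-free
                    continue 'outer;
                }
                seen |= bit;
            }
            return Some(start + 14); // index just past the first all-distinct window
        }
        None
    }

    fn main() {
        assert_eq!(first_marker(b"bvwbjplbgvbhsrlpgdmjqwftvncz"), Some(23));
    }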

  • @MrSonny6155
    @MrSonny6155 4 months ago +5

    Feels like Boyer-Moore, but without the pain of preprocessing bad-character/good-suffix tables. Very nice.

    • @akiya9216
      @akiya9216 28 days ago

      yea immediately reminded me of boyer moore too

    • @sidhantmanchanda
      @sidhantmanchanda 22 days ago

      This is what came in my mind, haystack needle

  • @criptych
    @criptych 4 months ago +12

    You can create static arrays in Python with the "array" module. Still not sure if that qualifies it as a "real" language, though.

    • @khuntasaurus88
      @khuntasaurus88 10 days ago

      a real language is a language you code in that lets you create something. Unemployed elitists like to say Python is not a real language because it's a scripting language that gets scripts passed to an interpreter. The problem is, I am a Python developer, I write python code, the resulting applications work as expected and I get paid at the end of the month for the work. That qualifies it as a "real" language/ Do not listen to these unemployed bozos saying "muh real language is X"

    • @criptych
      @criptych 10 days ago

      @@khuntasaurus88 Sorry, forgot sarcasm tags.
      I'm not even sure a hard line can be drawn between which languages are "real" or not. Most have features that make them better suited for particular use cases, but it's a matter of preference as long as you can express what you need done. (Rust, for instance, is a good choice for projects where you need to say "rewrite it in Rust" or "I use Rust, by the way." … )
      I use Python a lot for work, too, primarily ETL jobs where the speed of making updates and adjustments matters more than the speed of the actual process; the heavy lifting is done with NumPy/Pandas or on a database server, and a few seconds (much less milliseconds) don't matter when we're working on a timescale of minutes anyway. That flexibility is something Python is really good at. I'm dabbling with using Go, but it's too rigid in places to make that an easy switch, and it's not helped by having to track down or even rewrite some libraries that were readily-available in Python.

  • @akialter
    @akialter 4 months ago +12

    What I like about your explanation is you don't "assume" the audience knows a thing; you dove into the tiniest detail, like what an AND operation even is. Whereas college professors always have that assumption, like "oh, you guys must already know about the stack, the heap, memory allocation, let me talk about this scheduling algorithm..."

    • @rdubb77
      @rdubb77 4 months ago

      Because most professors suck and only “teach” for a paycheck

  • @rodrigosantoszz
    @rodrigosantoszz 2 months ago

    14:00 I believe the cache locality mattered because he was allocating one vector for each sliding window, so they were ending up at random positions in the heap each time; with stack allocation it's always the same spot

  • @thargor2k
    @thargor2k 4 months ago +7

    Maybe I misunderstood what you wanted to say in the beginning, but CPUs are totally able to add two numbers together.
    Does that in the end boil down to binary operations? Yes. But except in some very esoteric CPUs it doesn't run those as separate binary operations; there is dedicated circuitry to do the addition, in many cases in 1 cycle (e.g. on x86 it's a single uop, as it is on most embedded CPUs)

    • @K9Megahertz
      @K9Megahertz a month ago

      I was a bit confused by this statement as well. CPUs do in fact do addition (the ADD instruction), but the actual ADD instruction itself is executed by the ALU. Adding two numbers together is done by a full adder, which is composed mostly of AND, XOR and OR gates. I'm not really sure what was meant here either.

    • @akiya9216
      @akiya9216 28 days ago

      @@K9Megahertz I mean, AND, OR and XOR are essentially what the bitwise operations are, whereas using an ALU is multiple operations joined together, I guess

  • @acters124
    @acters124 4 months ago +1

    I enjoyed this. I like the mention of the bitmask. With a 32-bit buffer, it is just so much faster to deal with bits as an index. This assumes only lowercase (or only uppercase) Latin characters (26 letters).
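
    A minimal sketch of a u32 used as that kind of letter set (my own illustration, assuming lowercase ASCII):

    struct LetterSet(u32);

    impl LetterSet {
        fn insert(&mut self, c: u8) { self.0 |= 1 << (c - b'a'); }
        fn contains(&self, c: u8) -> bool { self.0 & (1 << (c - b'a')) != 0 }
        fn count(&self) -> u32 { self.0.count_ones() }
    }

    fn main() {
        let mut set = LetterSet(0);
        for &c in b"abcdd" { set.insert(c); }
        assert!(set.contains(b'd'));
        assert_eq!(set.count(), 4); // duplicates collapse into one bit
    }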

  • @MrHaggyy
    @MrHaggyy 4 months ago

    Async or parallel optimization is also really interesting from a data structure perspective. As a chunk of cache has a specific size, we want a structure that uses as much of a chunk as the algorithm can handle.
    This means in a forward sliding-window approach we can assign each thread a starting position and collect the results (see the sketch below). Likewise you can often use multithreading to turn an O(N^2) into an O(N × (N/threads)), which leads to great improvements on specific hardware.
    But it's hardware specific. Currently I'm working with a controller that only has one RAM but SIMD and MIMD. In that case you would either do the backward sliding window on the CPU, or try to fit the whole algorithm in the MIMD and do a forward brute force.
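
    A rough sketch of the "give each thread a range of starting positions" idea (my own illustration using std threads; assumes lowercase ASCII, a window of 14, and threads >= 1):

    fn has_14_distinct(w: &[u8]) -> bool {
        let mut seen = 0u32;
        for &c in w { seen |= 1 << (c - b'a'); }
        seen.count_ones() == 14
    }

    fn first_marker_parallel(s: &[u8], threads: usize) -> Option<usize> {
        let last_start = s.len().checked_sub(14)?;
        let chunk = last_start / threads + 1;
        std::thread::scope(|scope| {
            let handles: Vec<_> = (0..threads)
                .map(|t| {
                    let lo = t * chunk;
                    let hi = ((t + 1) * chunk).min(last_start + 1);
                    // each thread scans only the starts in [lo, hi); the window itself
                    // may read past hi, which is fine on a shared read-only slice
                    scope.spawn(move || (lo..hi).find(|&i| has_14_distinct(&s[i..i + 14])))
                })
                .collect();
            handles
                .into_iter()
                .filter_map(|h| h.join().unwrap())
                .min()
                .map(|i| i + 14) // earliest marker across all chunks
        })
    }

    fn main() {
        assert_eq!(first_marker_parallel(b"bvwbjplbgvbhsrlpgdmjqwftvncz", 4), Some(23));
    }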

  • @AnimeLover-su7jh
    @AnimeLover-su7jh a month ago +1

    And another note: ASCII values are indeed designed to be manipulated. The Art of Assembly Language (chapter one, on phatcode) has a good explanation of many uncommon tricks, usually done by assembly programmers (back in the 70s or so), that exploit the ASCII representation.

  • @porky1118
    @porky1118 4 months ago +1

    20:20 I often store stuff as bitset. It's more comfortable than working with arrays IMO.
    Recently I also turned some struct of boolean flags into a bitset. (Or I rather told some AI to do it for me, since it's pretty repetitive)

  • @maaikevreugdemaker9210
    @maaikevreugdemaker9210 4 months ago +19

    I died at "we talked a bit about memory" 😂

  • @yes5421
    @yes5421 4 months ago +1

    Just started my cs degree but I love every part of this and can’t wait to get to this level

  • @notuxnobux
    @notuxnobux a month ago

    The reason the performance was better when going from vector to array is that he was allocating a new vector on each iteration of the window shift. Allocating on the stack is just subtracting an integer; allocating a vector on the heap is significantly more expensive.

  • @DavidM_603
    @DavidM_603 4 months ago +7

    9:50 another bit of overhead that could've been avoided at that step: reuse and clear one vec, instead of harassing the allocator for a new one every 14 bytes lol

    • @doodocina
      @doodocina a month ago

      just do not use vectors AT ALL in this example. another alternative would be using arena, you don't need to clear it

  • @theairaccumulator7144
    @theairaccumulator7144 4 months ago +2

    CPUs know how to add numbers together even floats though. ALUs and FPUs make it so the difference between a right shift and a multiply by 2 isn't a thing anymore.

  • @nochan99
    @nochan99 a month ago

    The thing with stack locality is that by having data far apart you thrash the cache. Depending on the rest of your program, if not everything the algorithm is working on fits in the cache at the same time, you will see a significant slowdown due to cache misses. If everything is on the stack, then chances are greater that it all fits in the same cache page.

  • @spicybaguette7706
    @spicybaguette7706 4 months ago

    14:00 I think the biggest factor is allocation, since it has to call malloc every single iteration of the loop, which means one call for almost every character. I'm guessing that if you moved the Vec::with_capacity out of the loop and vec.clear()-ed it every time you checked a window, you would get much closer to the performance of the array code
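
    A minimal sketch of that change (my own hypothetical code, not the actual source from the video):

    fn count_distinct_windows(s: &[u8]) -> usize {
        let mut hits = 0;
        let mut seen: Vec<u8> = Vec::with_capacity(14); // one allocation, hoisted out of the loop
        for window in s.windows(14) {
            seen.clear(); // keeps the capacity, so no new malloc per window
            for &c in window {
                if !seen.contains(&c) { seen.push(c); }
            }
            if seen.len() == 14 { hits += 1; }
        }
        hits
    }

    fn main() {
        assert_eq!(count_distinct_windows(b"abcdefghijklmnopqrstuvwxyz"), 13);
    }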

  • @gazorper
    @gazorper 27 days ago

    Going by the assembler instructions that both gcc and clang generate, the modulus method of mapping 'a'..'z' to 0..25 is slower than just subtracting 'a', at least on x86 style CPUs. For reference, this is what I refer to:
    char ch = 'c';
    int av = ch - 'a'; // faster on x86
    int av2 = ch % 32; // slower on x86. a lot slower actually.
    On ARM 32 and 64 the two appear to be equivalent.
    I did look at some other architectures, and it looks like CISC architectures are always slower for mod 32, and RISC is always equivalent. But that's anecdotal.

  • @jacksong8748
    @jacksong8748 4 months ago +3

    25:11 actually you'd see the p's but who's counting? xD
    In any case, the reverse iteration to guarantee taking the maximum step size every time was definitely the coolest optimization in my book. The second the Primeagen pointed it out it was like HOLD UP, that's so freakin clever. very cool stuff. Not often do i see an optimization that makes me "teehee" like that.

  • @yellingintothewind
    @yellingintothewind 4 months ago

    Cache locality does matter even in small arrays vs vectors. Vectors have more space overhead, as they need to track their size, current location, allocated capacity, and so on. They also cannot be packed with whatever other data is in the current stack frame. So it is harder to get the entire working set to fit in the L1 cache, or even if it nominally fits, it's more likely to have parts of it evicted and have to fetch it from the L2 cache when task switching happens in multitasking operating systems. Taken to the extreme, you have programs like CPUBurn which fit the entire program into just the main CPU registers and stress test the CPU by cycling as fast as possible, never reaching out even to CPU cache.
    Your point applies more to cache lines, which is where moving data from system memory to the L3 cache happens to bring in the _next_ data you need. The concept is related, but not what matters here.

  • @kiratornator
    @kiratornator 3 months ago

    the right to left approach is often a good idea when looking for largest sub array

  • @sunopowerdeathmetal
    @sunopowerdeathmetal a month ago

    Take C.
    Make a 256-bit (or smaller) flags array on the stack: 32 bytes. Can it land in one vector register? It depends.
    You can shrink it to 128, 64, ... bits of flags if you restrict the string to, say, just letters instead of all 256 symbols, or something.
    In each 14-character chunk that has no doubles, you delete the flag bit of the previous character and try to add the new bit.
    No going backwards and XORing through the list is necessary.
    If there is a collision (the new bit is already 1, meaning the new character is one of the last 13 symbols), position the window after that symbol. If you can put the 14th bit into an empty flag spot, you have 14 different symbols and can do what you want - add the position to the output array?
    The runtime does not grow with the window length this way.

  • @sideone3581
    @sideone3581 4 months ago

    I am a fan of Prime, but you explained the stuff easily. He is too good and sometimes forgets we don't understand his language

  • @WiseWeeabo
    @WiseWeeabo 4 months ago +3

    I feel like a lot of these optimizations actually imply knowledge of the data and is biased towards a hypothetical success or failure case.. but if you know your data, there's a whole world of N possibilities to optimize for a specific use-case..

    • @MrHaggyy
      @MrHaggyy 4 months ago

      They do imply knowledge, but quite broad knowledge is usually more than enough. For example in bitmasking, the letter "A" is encoded as 0x41 (65) in ASCII, and UTF-8 uses the same value. We don't need to know the specific value; the information that it's one byte per char and that we can check for equal or not is already sufficient. Going into SIMD instructions forces you to know that 32 bit = 4x8 bit, so you can check 4 chars in one go.
      A good optimisation is usually something really, really trivial. You might want to look up how the DFT or the inverse Fourier transform works. That one simple binary trick enabled a shit ton of things, with image compression, nuclear detection and GPS being only a few of the applications.

  • @realfranser
    @realfranser 4 months ago

    You can assign each character to a prime number; you keep multiplying the result by the next prime as long as dividing by it leaves a remainder different from 0 😮.
    Mem = int32
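
    A sketch of that prime trick (my own illustration; u128 is used because the product of 14 letter-primes overflows 32 and 64 bits, and in practice this is slower than a bitmask):

    const PRIMES: [u128; 26] = [
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
        59, 61, 67, 71, 73, 79, 83, 89, 97, 101,
    ];

    fn all_distinct(window: &[u8]) -> bool {
        let mut product: u128 = 1;
        for &c in window {
            let p = PRIMES[(c - b'a') as usize];
            if product % p == 0 {
                return false; // p already divides the product => repeated letter
            }
            product *= p;
        }
        true
    }

    fn main() {
        assert!(all_distinct(b"qwertyuiopasdf"));
        assert!(!all_distinct(b"qwertyuiopasdq"));
    }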

  • @pramodpotdar5416
    @pramodpotdar5416 4 months ago

    Okay, this is one of the best videos I have seen in a while.

  • @Lifeofbhadauria
    @Lifeofbhadauria 4 months ago

    This was a great video 👏👏, enjoyed watching it. I wish YouTube suggested more videos of this type to me 😅

  • @mathzygote
    @mathzygote 4 months ago +5

    ‘Prim-ye-jun’ that made me laugh harder

  • @CaptTerrific
    @CaptTerrific 4 months ago

    17:15 mod functions are far more expensive in terms of clock cycles than subtraction, that feels like it'd matter a lot if we're in the realm of 1,000,000% optimizations

    • @aaaabbbbbcccccc
      @aaaabbbbbcccccc 4 months ago

      %32 will get burned out by the compiler. If it wasn't a power of 2 then yes, it'd be vastly slower than the character - "a" approach

  • @zangdaarrmortpartout
    @zangdaarrmortpartout 3 months ago

    We did it in C#, and while it is difficult to understand just by watching the video, it becomes trivially simple once you start putting it on paper. We weren't able to completely reproduce the flow Prime has in Rust: I see he first adds the last character, then checks, then only removes the left one, which means he enters the loop with potentially 13 bits set. We initialized a first window covering 14 characters and then, as long as 14 bits aren't set, exclude left, include right, then check. I don't think it changes a lot, but I'm curious to make the code even smaller than it currently is.
    The next step is understanding how I can parallelize this; right now I don't see how.
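
    A small Rust sketch of the include-right / exclude-left flow described above (my own illustration, assuming lowercase ASCII and a window of 14; not the code from the video):

    fn first_marker(s: &[u8]) -> Option<usize> {
        if s.len() < 14 { return None; }
        let bit = |c: u8| 1u32 << (c - b'a');
        // seed the parity mask with the first 14 characters
        let mut mask: u32 = s[..14].iter().fold(0, |m, &c| m ^ bit(c));
        if mask.count_ones() == 14 { return Some(14); }
        for right in 14..s.len() {
            mask ^= bit(s[right]);      // include the new right character
            mask ^= bit(s[right - 14]); // exclude the character leaving on the left
            // duplicates cancel pairwise under xor, so 14 set bits means 14 distinct chars
            if mask.count_ones() == 14 {
                return Some(right + 1);
            }
        }
        None
    }

    fn main() {
        assert_eq!(first_marker(b"bvwbjplbgvbhsrlpgdmjqwftvncz"), Some(23));
    }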

  • @Sanmayce
    @Sanmayce 3 months ago

    No one mentions the work of previous coders; giving due credit is a sign of not being a talking galfon. The skipping "optimization" is Boyer-Moore.

  • @porky1118
    @porky1118 4 months ago

    24:20 I started to hate if-let. In this case I'd use let-else, especially because the else case of let-else has to return (or continue/break) anyway.

  • @mulllhausen
    @mulllhausen 3 months ago

    Once you get down to this level just write it in assembly. It's quite fun and simple

  • @kushagrasaxena5202
    @kushagrasaxena5202 4 months ago +29

    25:56 brother you are the "super leetcode monkey"

  • @abdullahsaid4765
    @abdullahsaid4765 4 months ago

    I love your explaining of hard things in simple English. Keep going, you are doing a good thing. 👌

  • @porky1118
    @porky1118 4 months ago

    17:30 I also always subtract 'a' instead of modding sizeof(T).

  • @BigBrainHacks
    @BigBrainHacks 3 months ago

    Thanks for your nice and insightful explanation!

  • @ferdynandkiepski5026
    @ferdynandkiepski5026 4 months ago +7

    The use of modulo is bad practice. Here we are doing modulo of a power of two on unsigned ints, which any sane compiler should optimize into an AND (or do some wizardry to make it work on signed numbers as well); but if these two conditions weren't met and an actual division was performed to find the modulo, there would be a significant runtime cost. As such, your proposed method of subtraction would be faster.

    • @mage3690
      @mage3690 4 months ago +4

      So long as it's a compile-time constant, I'm fairly certain you can mod whatever number you want and it'll come out as a series of shift, sub and imul instructions. Godbolt helpfully told me that `int mod (int a) { return a % 3; }` contains not a single idiv instruction, _at zero optimizations._ At -O3, the function length went from 20 to 11 instructions. Using the 32 bit FNV_prime as the compile-time constant merely changed the constants in Godbolt's output, no other effect (ok, an lea with a multiply in the second operand got changed to an imul, whatever).
      Now, what it _absolutely will not do_ is take an array of anything, deduce that they are all _absolutely_ compile-time constants, and perform the same optimizations for each index of that array. No, for that to work, you have to declare an array of function pointers, which the compiler will refuse to inline under any circumstances (I shouldn't be surprised by that, but I sort of am).
      And apparently that optimization is just barely worth it on some CPUs: `int mod (int a) { return a % 7; }` removes the idiv on the general-case CPU, but generates 14 lines of assembly -- unless you specify `-march=znver3`, in which case the idiv comes right back, as I'm assuming it would for most modern architectures.
      Matter of fact, whatever algorithm they're using seems to get worse the further away you are from a power of 2, where the "nearest" power of 2 is always smaller than the constant. Maybe the guy who came up with the algorithm will generalize it to calculate down from the next higher power of 2 as well, and this piece of advice will become consigned to the dustbin of history. Who knows. Fascinating stuff, either way

    • @PennyEvolus
      @PennyEvolus 4 months ago +1

      Omg my brain was like wha whahuh huùuh ohhhh yeah i get it u sent me on a roller coaster ride bro

    • @PennyEvolus
      @PennyEvolus 4 months ago

      ​@@mage3690can i get that in a tldr pls im dyslexic (please ik us programmers need to read but i long for recognition and programming is the only unique skill i could learn at school to get recognition)

    • @panjak323
      @panjak323 28 days ago

      @@mage3690 Doing this with uint results in less operations. Not sure why you would want to do modulo on signed ints.

    • @mage3690
      @mage3690 27 days ago

      @@panjak323 duly noted -- I was writing `int` because that's just habit for me. Unless I know I want an unsigned number, I just write `int` -- seems it's less than optimal for these things.

  • @urizen959
    @urizen959 4 months ago +26

    0:15 "little youtuber" 😂😂

    • @ppercipio
      @ppercipio 4 months ago +1

      I had to do a double take just to make sure... 😂

    • @kondekinoe9337
      @kondekinoe9337 3 months ago +1

      I double checked. And still didn't believe it first and then I turned the captions on just to be sure.

  • @sanchitwadehra
    @sanchitwadehra 4 months ago +12

    Thank you, brother

  • @lu3tz
    @lu3tz 4 months ago +1

    I stumbled across a super nice run-time optimization video in F#, it's called "F# for Performance-Critical Code, by Matthew Crews" neat stuff in there!

  • @vasiliigulevich9202
    @vasiliigulevich9202 4 months ago

    Fixed window width is crazy. Why not track start and end separately? There would be no nested iterations, no tracking of multiple symbols.

  • @notuxnobux
    @notuxnobux a month ago

    This also shows why C programs are often faster. Not because they're written in C, but because C makes dynamically sized data a pain. People using other languages such as C++ or Rust, where a heap-allocated string/array is easy to make, reach for those dynamic versions very often. In C you instead create an array on the stack with the maximum realistic size of the data and keep an integer with the current size. And instead of a hash map you iterate a static stack-allocated array.
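
    A minimal Rust rendering of that C pattern (my own illustration, not code from the video): a fixed buffer plus a length counter, no heap at all:

    const MAX: usize = 64;

    struct SmallBuf {
        data: [u8; MAX], // lives on the stack (or inline in a parent struct)
        len: usize,      // how many slots are actually in use
    }

    impl SmallBuf {
        fn new() -> Self { SmallBuf { data: [0; MAX], len: 0 } }
        fn push(&mut self, c: u8) {
            self.data[self.len] = c; // panics instead of reallocating when full
            self.len += 1;
        }
        fn contains(&self, c: u8) -> bool {
            // a linear scan over a tiny contiguous buffer beats hashing at this size
            self.data[..self.len].contains(&c)
        }
    }

    fn main() {
        let mut buf = SmallBuf::new();
        for &c in b"abc" { buf.push(c); }
        assert!(buf.contains(b'b') && !buf.contains(b'z'));
    }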

  • @platonvin1022
    @platonvin1022 3 months ago

    reality is that bruteforcing on gpu is going to be faster for any reasonable size

  • @amorfati4752
    @amorfati4752 27 days ago

    Why use (last % 32) when you can use (last & 31)? The result is the same and I think the latter is much faster since you're doing a single bitwise operation instead of using the expensive division operator.
    I'm no expert or anything, by the way. I haven't even solved a hard leetcode problem before

  • @Youshisu
    @Youshisu 3 days ago

    I like your explanations, clear

  • @playerguy2
    @playerguy2 23 days ago

    I love bitsets because they are just the degenerate case of a hash map, where the hash is an identity function of a number and the value is a boolean.
    You can't (easily) beat that. No collisions, no wasted space.

  • @infiniteloop5449
    @infiniteloop5449 4 months ago +4

    This screenshot reminds of Paolo Costa’s Secret Juice ad.

  • @xymaryai8283
    @xymaryai8283 4 months ago

    i'm a noob who's only just installed rustc and read only a few pages of the rust book, i have no clue why people approach this problem with a hash set instead of a table or vector

  • @HyperCodec
    @HyperCodec 2 months ago

    Was not expecting to see python in the thumbnail of a video with a title containing “faster”

  • @KilgoreTroutAsf
    @KilgoreTroutAsf 21 days ago

    "CPUs don't know how to add numbers together"
    Off to a great start.

  • @prajwals8203
    @prajwals8203 4 months ago +1

    Would love to see more reaction vids like this lol

  • @EnergyCourtier
    @EnergyCourtier 4 months ago

    Love these videos. Thanks.

  • @TheRageCommenter
    @TheRageCommenter 2 months ago

    Neetcode: *walks into a room*
    Inefficient algorithm: Why do I hear boss music?

  • @spookimiiki5891
    @spookimiiki5891 4 months ago +3

    the name is primagoon

  • @lah30303
    @lah30303 4 months ago +2

    21:10 what happens in the occurrence where there are 3 duplicates?

    • @GoKotlinJava
      @GoKotlinJava 4 months ago +3

      any extra duplicates will result in less than 14 bits set.
      Two duplicates will result in 12 bits set. (12 unique + 2 duplicates cancelling each other)
      Three duplicates will result in 12 bits set. (11 unique + 3 duplicates giving us 1 unique and 2 cancelled)
      You want exactly 14 bits in final result
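
      A tiny check of that counting (my own illustration; xor parity over lowercase letters):

      fn odd_count_bits(s: &[u8]) -> u32 {
          s.iter().fold(0u32, |m, &c| m ^ (1 << (c - b'a'))).count_ones()
      }

      fn main() {
          assert_eq!(odd_count_bits(b"abcdefghijklmn"), 14); // all 14 distinct
          assert_eq!(odd_count_bits(b"abcdefghijklmm"), 12); // 12 unique + one letter twice
          assert_eq!(odd_count_bits(b"abcdefghijkmmm"), 12); // 11 unique + one letter three times
      }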

  • @1n4f4bl3
    @1n4f4bl3 4 months ago +2

    "This little amoeba youtuber" -> Next Video, "This electron youtuber"

  • @thekwoka4707
    @thekwoka4707 3 months ago

    Feels like the result he has wasn't even optimized. Why do windows when you can just do left and right markers?

  • @zS39SBT4fe5Zp8Q
    @zS39SBT4fe5Zp8Q 4 months ago

    Now do it in CUDA and actually get O(1) by testing all positions in one cycle.

    • @aaaabbbbbcccccc
      @aaaabbbbbcccccc 4 months ago

      And when the number of positions exceeds your cores, what then? It isn't O(1), it's still O(n)

  • @xenopholis47
    @xenopholis47 3 months ago

    Can anybody explain why we use both binary (base 2) and also 8-bit values? I am a noob, asking about the fundamental thing. Any long-form answer will be appreciated.
    Edit: As I went further into the video, I realized they are talking about DSA, of which I have no idea. But still, any easy explanations are welcome. Cheers

  • @dhruvsolanki4473
    @dhruvsolanki4473 3 months ago

    Amazing 🎉❤

  • @af5772
    @af5772 4 months ago

    hey, I had to stop watching the video halfway through so maybe I missed something, but I'm mainly referring to the beginning section of the video where you explain different ways to tackle the problem. When you mention a dynamic array, are you talking about some sort of higher-level data structure that I am not familiar with? Just asking because when I hear dynamic array I'm thinking of a heap-allocated array in C which you manipulate fully on your own with malloc and such. Is mallocing some size any slower than just going for a static (stack) array? Just to be clear: arr[4] vs *arr = malloc(...

    • @NeetCodeIO
      @NeetCodeIO 4 months ago +3

      dynamic array is basically a vector in cpp or a arraylist in java. js and python only use dynamic arrays. i guess another word for it would be a 'resizable array'

  • @sainayan
    @sainayan 4 months ago +2

    Yeah, I just started learning DSA, and I see this. Wow, I'm cooked.

  • @ProfessorThock
    @ProfessorThock 4 months ago +1

    Amazing video

  • @AnimeLover-su7jh
    @AnimeLover-su7jh a month ago

    You got a lot of things about how memory works wrong. For one, the stack and heap are both the same speed (both are pages in RAM, or virtual memory if you prefer); the stack is usually warm in cache because it is used more frequently, and there are techniques to keep the heap warm too. You also got the reason the CPU brings in more memory wrong: it is not an assumption, it has to do with the hardware design (and it also has the side effect of bringing in more memory). And you got the significance of locality wrong; it can matter a lot, even for a small data set (your computation can easily be a tenth of your memory access time).
    Anyway, best of luck. You will find tons of this info (if you want it) in the manuals and optimization guides

  • @kushagrasaxena5202
    @kushagrasaxena5202 4 months ago +6

    Primy-agen

  • @jonphelan707
    @jonphelan707 4 months ago

    In a code where the loops...spin and grow
    BigO whispers...how fast can we go
    With n squared in sight...We’ll optimize all right
    And watch as that CPU blazes with new mojo!

  • @lysendertrades
    @lysendertrades 4 months ago +1

    The name is the prime agen!

  • @bravesirrobin9576
    @bravesirrobin9576 4 months ago

    What are you using for the diagrams and screencasting tools? They look swish

  • @morton4
    @morton4 4 months ago

    guys, what about the case where there are 3 or more of the same character in the window? That bit will still be set to true, am I missing something?

    • @NeetCodeIO
      @NeetCodeIO 4 months ago

      In that case we won't have 14 bits set to true tho.

  • @Vortex-qb2se
    @Vortex-qb2se a month ago +1

    At this point, just ask Gennady Korotkevich to optimize it further.

  • @JuliaC-mz8qy
    @JuliaC-mz8qy 4 months ago +4

    no offense at all neetcode, I love neetcode. But I had a ratatouille moment when he started going into a sliding window explanation. I think im traumatized from my last job search.

  • @Dom-zy1qy
    @Dom-zy1qy 4 months ago

    Bro this thumbnail is devious

  • @EverAfterBreak2
    @EverAfterBreak2 4 months ago

    21:50 what if there’s 3 repeated letters? The bit would be 1 again

    • @ericcoyotl
      @ericcoyotl 4 months ago +3

      Correct, but he stated that the only thing that matters is if there are 14 distinct characters (14 ‘1’s)
      If there’s 3 repeated letters, that bit would be 1. But there wouldn’t be enough other ‘1’ bits to total to 14

    • @EverAfterBreak2
      @EverAfterBreak2 4 months ago

      @@ericcoyotl you’re right, thanks for the explanation

  • @jgtb0pl
    @jgtb0pl a month ago

    18:08 Yes! This was intentional

  • @mahmoudmousa2405
    @mahmoudmousa2405 4 months ago

    27:00 I agree the intuition made so much more sense to me

  • @betadelphini4036
    @betadelphini4036 4 months ago

    but what would be a vector in Python? A List?

    • @NeetCodeIO
      @NeetCodeIO 4 months ago

      yeah, python only has dynamic arrays, not static ones i believe

  • @samarnagar9699
    @samarnagar9699 4 months ago

    I saw what you did there at the start, so should we call you Dr. N now?

  • @zangdaarrmortpartout
    @zangdaarrmortpartout 3 months ago

    Where can we find this problem text ?

  • @janisir4529
    @janisir4529 2 months ago +1

    The best way to make an algorithm faster is making the initial solution extremely bad.

  • @drunkenmaster389
    @drunkenmaster389 4 months ago

    "...by this little youtuber called the pree-mee-a-jen" 🤣

  • @Akhulud
    @Akhulud 4 months ago

    it's more like "making a faster algo" than "making an algo faster"

  • @BennyDeeDev
    @BennyDeeDev 4 months ago +1

    You lost me at bitmask, even though I am also named Benny 😢