Knuth Morris Pratt (KMP) String Search Algorithm - tutorial with failure function in Java

แชร์
ฝัง
  • เผยแพร่เมื่อ 25 ส.ค. 2024
  • This tutorial explains how the Knuth-Morris-Pratt (KMP) pattern matching algorithm works. Animated examples are used to quickly visualize the basic concept. Then the source code of an implementation written in Java is discussed, along with its running time complexity.
    There are two parts to the algorithm. The first one creates a "failure function", which is nothing more than an array that holds the longest prefixes as they occur through out the pattern string. The main algorithm uses this prefix length array to calculate how much forward to shift the pattern string.
    Source code written in Java:
    bitbucket.org/...
    The implementation is based on the pseudo-code taken from Wikipedia:
    en.wikipedia.o...
    FAST PATTERN MATCHING IN STRINGS
    DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT
    pdfs.semantics...
    Written and narrated by Andre Violentyev
    -----------------------------------------------------------------------------------------------------------------------------------------------
    Errata (credit to @Achyut Sapkota):
    My animation at 6:04 should have moved not by 2 but by 3 characters. By the way, at that point we'd also increment the pointer in the text to the next character. The source code (see above) is correct - once we get to index zero of the prefixLen array, which stores -1, we increment both the text and the pattern string pointers:
    p = prefixLen[p];
    if (p 'less than' 0) {
    t++;
    p++;
    }
    In the above example once we shift by 3, we essentially go "out of bounds" for the pattern string. The illustrations show that the first character, 'a', is at index 1 of the prefixLen array (not index 0, which is probably what is causing confusion since most programming languages use 0 based indexes for strings). So the formula works, it's just that, once we detect the "out of bounds" condition, we reset the pattern string pointer to the first character, and increment the text pointer by 1. In other words, we shift the pattern string so that its first character aligns with the next character in the text. This is what insures that each character in the text is examined at most twice.

ความคิดเห็น • 117

  • @maridavies3425
    @maridavies3425 4 ปีที่แล้ว +55

    This is the best explanation of how the prefix array is constructed. Awesome job and thanks!

    • @stablesort
      @stablesort  4 ปีที่แล้ว +1

      Glad it was helpful!

  • @sloppyguy4940
    @sloppyguy4940 3 ปีที่แล้ว +18

    Even a sloppy guy like me would understand this explanation, well done!
    I also like how you comunicate :)

    • @stablesort
      @stablesort  3 ปีที่แล้ว +4

      Thanks for the compliment! Cheers, Sloppy Guy!

  • @mandeepsinghrr
    @mandeepsinghrr 4 ปีที่แล้ว +9

    This channel honestly has the best explanation for all the difficult problems. The nlogn LIS implementation was too complex for me to understand, but then I stumbled upon this channel. The same goes for KMP! Amazing work! Really appreciate how you explain it in a simple manner!

    • @stablesort
      @stablesort  4 ปีที่แล้ว +1

      Thank you for such a wonderful compliment! I am glad to hear that the videos are helpful!

  • @abdalla4425
    @abdalla4425 4 หลายเดือนก่อน

    Wow this is the a way better explanation than the GFG article. Thanks

  • @prasannakumarnaidu5030
    @prasannakumarnaidu5030 3 ปีที่แล้ว +7

    Thank you sir
    After lot of confusion in understanding this algorithm you made it easy to understand . Hope you make many videos like this.

    • @stablesort
      @stablesort  3 ปีที่แล้ว +2

      I am glad it was helpful! But this is just a hobby for me. I don't make any money from it :) Cheers!

  • @xiaoxiaoxiao2043
    @xiaoxiaoxiao2043 2 ปีที่แล้ว +1

    The best explanation of the MIT version of KMP. Good job!!!
    I learned the KMP algorithm through the 《Algorithm》which is public by Princeton University. The MIT version of KMP and Princeton's version is the two main version of the KMP algorithm.

  • @junkaze08
    @junkaze08 3 ปีที่แล้ว +2

    This is far the best explanation and discussion of the KMP, thank you for saving my semester.

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Great! I am glad to hear that it was helpful

  • @ram-oj2ij
    @ram-oj2ij 3 ปีที่แล้ว +2

    I watched other videos and it finally clicked with this video, thanks. All your other videos are produced very well and great.

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      Glad to hear it!

  • @irfanquader3944
    @irfanquader3944 3 ปีที่แล้ว +2

    your explanation method (with the visualization ) and calm way of explaining really helped me understand the algorithm :)

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Glad to hear it and thanks for the compliment!

  • @Japo0po0
    @Japo0po0 3 หลายเดือนก่อน

    Your videos are the best for learning these algorithms. Much appreciated.

  • @abelsimon5308
    @abelsimon5308 4 ปีที่แล้ว +3

    thanks a lot for going through these classics, also your code looks simple and concise!

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Glad you like them! I try to bring something new to the table or explain it in a way that is "less confusing" than typical explanations.

  • @danielsasse7609
    @danielsasse7609 3 ปีที่แล้ว +2

    This was incredibly helpful for understanding the concepts behind the algorithm, Thank you!

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Wonderful! And thanks for the leaving a few good words

  • @lukelyu3264
    @lukelyu3264 4 ปีที่แล้ว +1

    The most efficient one out of those confusing videos

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Just trying to make things a little less confusing =)

  • @juniperspice3762
    @juniperspice3762 2 ปีที่แล้ว

    i am so glad i stumbled onto this channel

  • @AleemKhan42069
    @AleemKhan42069 2 ปีที่แล้ว

    You sir have saved my career thanks

  • @AluSoft
    @AluSoft ปีที่แล้ว

    Hi Stable sort!
    Please keep posting !!! You have the best videos I've ever seen!!!

  • @mohamedmusharaf406
    @mohamedmusharaf406 4 ปีที่แล้ว +2

    Wow Sir. The explanation is really good!! Glad I've found your channel!!

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Glad to hear that and thanks for the compliment!

  • @vinayak186f3
    @vinayak186f3 2 ปีที่แล้ว

    Just 1:44 into the video and got the ans I was long been pondering over , Thank you sir

  • @Prashantkumar-pn6qq
    @Prashantkumar-pn6qq 4 ปีที่แล้ว +1

    Very good and illustrative explanation!

  • @amitkumarprajapati5094
    @amitkumarprajapati5094 3 ปีที่แล้ว +1

    Really appreciated your way of explanation sir

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Thanks! Glad to hear that!

  • @troglodyto
    @troglodyto 4 ปีที่แล้ว +1

    very clear. thank you for your time and energy!

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Glad it was helpful!

  • @maulindusarkar4581
    @maulindusarkar4581 4 ปีที่แล้ว +1

    It's awesome man ! Thanks for this beautiful explanation !

  • @nickchen8259
    @nickchen8259 2 ปีที่แล้ว +1

    Awesome job

  • @frankwu1043
    @frankwu1043 4 ปีที่แล้ว +1

    Excellent work! Please keep it up.

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Thanks for the compliment! Let me know if there are other topics that you'd like to see covered.

    • @frankwu1043
      @frankwu1043 4 ปีที่แล้ว

      @@stablesort Hi Sir, could you cover applications of finding shortest path? I learned Dijkstra's Algorithm, Prim, but I don't know how to apply them into a weighted graph with Java. I'm kind of missed up with content and Java code. Thanks!

  • @Krishnayadav-ln8ru
    @Krishnayadav-ln8ru 3 ปีที่แล้ว +2

    perfect explanation

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Glad it was helpful!

  • @create_space812
    @create_space812 3 ปีที่แล้ว +1

    Best explanation!

  • @ryancodrai487
    @ryancodrai487 3 ปีที่แล้ว +1

    Fantastic video

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      Glad you liked it!

  • @pavanvenkat9182
    @pavanvenkat9182 4 ปีที่แล้ว +2

    Amazing explanation 👌👌

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Thanks for the compliment!

  • @bigsnorts4320
    @bigsnorts4320 2 ปีที่แล้ว

    fantastic presentation. Thank you so much for this.

  • @zaidziad1456
    @zaidziad1456 3 ปีที่แล้ว +1

    Awesome!

  • @soroushhh1586
    @soroushhh1586 3 ปีที่แล้ว +1

    Thanks man 🤘🙏

  • @jacobkurien
    @jacobkurien 3 ปีที่แล้ว +1

    awesome as always. can you do a video on catalan numbers

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      That's a great suggestion - thanks! Catalan numbers keep popping up in leetcode questions as well, so yeah, definitely worth covering!

  • @mfrice
    @mfrice 2 ปีที่แล้ว +12

    Hi, I think there is another error at 5:23 in the animation. The shift should be (8-2)=6 instead of (9-3)=6. Although they happen to have the same results, the last 'c' in the pattern is a mismatch so it should not be counted in the matched length. Consider a modified example, the string is "abdxxxabdxxxabcde" and the pattern is "abdxxxabcde". The fist mismatch is at the 'c' in the pattern, and the lps of 'c' is 0. The shift should be the matched length 8 minus the lps of the second 'b' which is 2, which results in 6, rather than (9 - 0) = 9. Do you feel it make sense?

    • @Kuppu-pk
      @Kuppu-pk 2 ปีที่แล้ว +3

      it perfectly makes sense to me. Similarly, i see at 6:03 we see that the video shows as 3-0 but he does shift only by 2 which doesn't make sense. As per your clarification, the prefix length is 2 and value of last matched b is 0, so, 2-0 = 2 which is the two shift that the presenter makes.

    • @vmani369
      @vmani369 2 ปีที่แล้ว

      Yes it seems so here

    • @jamesbridger9878
      @jamesbridger9878 ปีที่แล้ว

      Thanks for the clarification. I actually got myself confused when looking through my teachers slides and trying to apply the video's formula.

  • @harshith_takkala
    @harshith_takkala 2 ปีที่แล้ว +1

    thanks man!

  • @manojkvn2632
    @manojkvn2632 4 ปีที่แล้ว +1

    great explanation.

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Thanks for the compliment!

  • @karanbhatia6712
    @karanbhatia6712 3 ปีที่แล้ว +1

    Great Video! Thanks :)

  • @tsvetomirgalabov110
    @tsvetomirgalabov110 3 ปีที่แล้ว +1

    awesome JOB thank you

  • @RaceMusgraveASU
    @RaceMusgraveASU 10 หลายเดือนก่อน +2

    Hi, unfortunately this doesn't work for all cases. However if we modify the formula used for shifting very slightly, it will work. shift = (len-1) - prefixLen[len - 1]
    I know many people wont believe me, but I agonized over this for hours.
    lets look at an example
    text: ababcabcabababd
    pattern: ababd
    lps for our new pattern would be: 0 0 1 2 0
    now lets play out formula as laid by the video
    ababcabcabababd
    ababd
    x --> mismatch at 5, 5 - 0 = shift 5
    ababcabcabababd
    ababd
    x --> mismatch at 1, 1 - 0 = shift 1
    ababcabcabababd
    ababd
    x --> mismatch at 3, 3 - 1 = shift 2
    ababcabcabababd
    ababd
    x --> mismatch at 1, 1 - 0 = shift 1
    ababcabcabababd
    ababd
    x --> mismatch at 5, 5 - 0 = shift 5
    ababcabcabababd
    ababd
    as we can see, the matching alignment was entirely overlooked when using the formula as given by the video. but if we modify it as described before, not only do we find the match for the example I just gave, we can find it for any other example as well.
    to modify the formula, either modify it according to what I explained previously, or use the formula he gave but use the last matching letter instead of the first mismatched letter

    • @bulazs01
      @bulazs01 8 หลายเดือนก่อน

      My guy, im learning this rn in algorithm class so i checked out this video. i looked at it and was thinking what would have happened if the y was the mismatch, and started doubting my knowledge, so thanks for reassuring me and giving me an actual working formula. Thanks mate :D

  • @godofkings4366
    @godofkings4366 ปีที่แล้ว

    thank you very much

  • @misoren9645
    @misoren9645 3 ปีที่แล้ว +3

    Thank you for the video, nice explanation.
    Can you please explain "else if(prefixLen > 0) prefixLen = ar[prefixLen]" ?
    I don't see where in the simulation something like that was used, I saw either reset to 0 or match current characters with the prefix. prefixLen++; ar[i] = prefixLen.

    • @stablesort
      @stablesort  3 ปีที่แล้ว +2

      Good question. So that happens when we have a mismatch and the length of the prefix is something other than zero. Consider the output of calcPrefixLen("xxyxxxxx"):
      [-1, 0, 1, 0, 1, 2, 2, 2, 2]
      Note how the last three 'x' characters don't increase the length of the prefix, but they keep holding the length of the prefix found so far. I hope this helps!

  • @TheJakobPolegek
    @TheJakobPolegek 3 ปีที่แล้ว

    Thanks blood.

  • @utkarshnaik1415
    @utkarshnaik1415 3 ปีที่แล้ว +1

    If the mismatch is at 'y' character then the shift will be by size(pattern string) as we have 0 in the pattern length array , this seems wrong and the shift should be by size(pattern string - 3)

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      Thanks for your question. The character 'y' in our example pattern string is found only at the right end of the pattern string. Meaning, there are no other 'y' characters on the left side. If we matched the previous 9 characters, we know for sure that none of those 9 characters are 'y'. Therefore, no need to subtract 3. You could just fast forward by 9. I hope this helps but let me know if that's not clear.

    • @utkarshnaik1415
      @utkarshnaik1415 3 ปีที่แล้ว +1

      @@stablesort how would it work out for following search string for the pattern explained in the video : abcxxxabcxxxabcy
      I believe using pattern length as 0 for y character would cause us to miss the match .

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      @@utkarshnaik1415 As they say "code resolves arguments" - I tried it just now, passing that as input (the code is linked in the description but here it is again: bitbucket.org/StableSort/play/src/master/src/com/stablesort/stringmatch/KnuthMorrisPratt.java ) and the algorithm did find the match at index 6. Try it yourself if you don't believe me =)

    • @FeuersurferLP
      @FeuersurferLP 3 ปีที่แล้ว

      @@stablesort But we aren't looking for another "y". The matched prefix when checking if the text is "y" would be "abc", so we have to check if the symbol that isn't "y" is "x", since then we would have "abcx", which is the beginning of our pattern string and therefore shouldnt be skipped.

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      @@FeuersurferLP Thanks for your comment but I am not sure if I fully understand your question. Could you possibly rephrase it on its own? Oh wait, you may have done just that in another comment. I am going to read over that first =)

  • @ggorkk
    @ggorkk 10 หลายเดือนก่อน

    Hi sir, in the example when the second c is mismatched the first time, we already know that the pattern "abc" is not possible to be in the text at index 9 of the text. Shouldn't this information be used to shift the p variable all the way instead of only 9-3? Does KMP algorithm consider this information as well?

  • @decentralmultiverse6719
    @decentralmultiverse6719 4 ปีที่แล้ว +1

    Good explanation, Could you make a video on lazy propagation in segment tree, Thanks

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Thanks for the suggestion. I'll look into lazy propagation. It'd be a good follow up to my earlier video on segment trees (in case you have not seen it: th-cam.com/video/xztU7lmDLv8/w-d-xo.html )

  • @thirum5940
    @thirum5940 4 ปีที่แล้ว +1

    @Andre Violentyev - Can you please elaborate on "How after a mismatch each character is examined atmost twice and how the time complexity is linear to the length of the text"?

    • @stablesort
      @stablesort  4 ปีที่แล้ว +2

      In the typical bruit force algorithm, whenever a mismatch occurs between the text and the pattern string, the pattern string is shirted by just 1 position to the right. This leads to O(n * m) running time, n and m being the lengths of the text and pattern string. But in KMP algorithm, the pattern string is shifted by more than one position. In fact, it is shifted so much, that each character in the text is examined at most twice. Meaning, some text character text[i] is checked to see if it matches with some pattern character pattern[j]. If no match, the pattern is shifted, then text[i] is examined one more time to see if it matches one other possible pattern character pattern[k]. If there is no match again, then the pattern string is shifted past where text[i] is. So at most there were two examinations done per text character. Therefore, the time complexity ends up being O(2n + m), which reduces to O(n + m), linear time complexity.
      I hope this helps but let me know if you have other questions.

    • @thirum5940
      @thirum5940 4 ปีที่แล้ว

      @@stablesort Even after spending so much time, I was struggling to understand this part. Now it is clear to me. I can't thank you enough. I hope you keep making this kind of videos and hopefully one day on aho-corasick algorithm :)

    • @stablesort
      @stablesort  4 ปีที่แล้ว +1

      @@thirum5940 Glad to hear it. I am adding Aho-Corasick algorithm to my to-do list - thanks for the suggestion 🙂

  • @fenil197
    @fenil197 3 ปีที่แล้ว +1

    I liked and understood very well .I currently researching this KMP Algo .Topic , can please guide me. what is next possible solution this algo.?

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Hi Fenil, I am not sure if I understand your question. What do you mean by "next possible solution"? In case you are looking for an example implementation, here is a link to the source code, written in java: bitbucket.org/StableSort/play/src/master/src/com/stablesort/stringmatch/KnuthMorrisPratt.java

  • @graju2000
    @graju2000 4 ปีที่แล้ว +1

    Please don't stop your service :)

    • @stablesort
      @stablesort  4 ปีที่แล้ว +1

      No worries! I am working on the next episode as we speak =)

  • @20aatif
    @20aatif 4 ปีที่แล้ว +1

    Please come up with lazy propogation and Fenwick tree

    • @stablesort
      @stablesort  4 ปีที่แล้ว

      Well, I do have a Fenwick Tree tutorial: th-cam.com/video/uSFzHCZ4E-8/w-d-xo.html
      But to keep its length to a under 10 minutes, I did not cover lazy propagation. Thanks for the suggestion, I should probably make a followup video and tackle lazy propagation. Cheers!

  • @narayangupta2623
    @narayangupta2623 3 ปีที่แล้ว

    At 6:05 your formula tells to shift by 3 but you shifted by 2(Which is correct). Please clearify the formula.

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      Thanks for posting your question. Some time ago, another viewer, Achyut Sapkota, pointed out this same exact mistake. Let me copy-n-paste my response to:
      ... You are right, my animation at 6:04 should have moved not by 2 but by try 3 characters. My mistake! By the way, at that point we'd also increment the pointer in the text to the next character. I double checked the code ( bitbucket.org/StableSort/play/src/master/src/com/stablesort/stringmatch/KnuthMorrisPratt.java ) and there it looks to be correct - once we get to index zero of the prefixLen array, which stores -1, we increment both the text and the pattern string pointers.
      p = prefixLen[p];
      if (p < 0) {
      t++;
      p++;
      }
      In the above example once we shift by 3, we essentially go "out of bounds" for the pattern string. The illustrations show that the first character, 'a', is at index 1 of the prefixLen array (not index 0, which is probably what is causing confusion since most programming languages use 0 based indexes for strings). So the formula works, it's just that, once we detect the "out of bounds" condition, we reset the pattern string pointer to the first character, and increment the text pointer by 1. In other words, we shift the pattern string so that its first character aligns with the next character in the text. This is what insures that each character in the text is examined at most twice.
      I hope this clarified things a little as opposed to making it even more confusing =) But either way, please let me know.
      Thanks!

  • @FeuersurferLP
    @FeuersurferLP 3 ปีที่แล้ว +1

    Hi! First of all: Great video! Really like your style of explaining.
    But there's one thing I don't quite get.. At 5:28 the prefix that already matched is "ab". Since "c" didn't match the text at pos 11, we look if the character at pattern-position 3 (coincidentally also c) matches the character at text-position 11, right?
    So.. let's say "c" matched but the text at pos 12 didn't match "y" (so text i.E. ---abcxxxabc----). According to the table and your explanation we move back all the way to the beginning. But shouldnt we check if the text at position 12 matches the pattern at position 4 (x), since the prefix that matched is "abc"?
    Actually, I debugged your code and it does exactly what I described above. I chose the input string "---abcxxxabc---" and after failing to match "y" the variable p changed to 3, not to 0. This happens because in the code you only increment p when you got a match, so at the point of comparing "y" with the text the prefix length is still 3. In your animation on the other hand you increment p before looking wheter or not you have a match, which would lead to a prefix length of 0.
    So actually, the prefix when checking wheter text at pos 11 is "c" is not 3, as you showed in your video (5:19), but instead it is 2. (also in your code, since there you didn't move yet and the matched prefix is "ab"!). This also checks out with your formula for the shift. You matched 8 not 9 so far, so "shift by" is = 8 - 2 = 6!
    I would love to hear your opinion on my points. Your code seems to work like i described it, but maby you weren't explaining 1:1 how the code works and i missed a point there.. I am by far an expert on the topic 😅
    Anyways, greetings from germany, keep up the great work!

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Hi Josh, I had a similar conversation with another viewer, @Achyut Sapkota, some time ago. We concluded that the code works, the formula works, the explanation also works but there is a error in the animation at 6:04. The animation at that moment shows that the shift is made by only two characters, even though I actually say out loud that the shift should be done by 3. Sorry about that - I didn't catch that before posting the video. But I did document it in the description of the video after the fact. Please take a look there as I go into detail explaining the actual implementation. Hope this helps!

    • @FeuersurferLP
      @FeuersurferLP 3 ปีที่แล้ว +1

      ​@@stablesort Hi, I saw the conversation you were having with @Achyut Sapkota and that isn't the point I am having an Issue with. My Issue is that according to your explanation, when matching the text "---abcxxxabc----" with "y" of the pattern (and failing) your table would tell you to shift by the whole 10, which is incorrect and not what your code does. Your code checks if after the previously matched prefix "abc" the current letter of the input is a "x", which is the correct behaviour (because "abcx" is a valid prefix). Hope this clarifies what my problem is, for a more detailed explanation please see my comment above.

    • @FeuersurferLP
      @FeuersurferLP 3 ปีที่แล้ว +1

      @@stablesort Please note that I really do enjoy your videos and this isn't any kind of personal attack =)
      But in my opinion your animation is incorrect (and i'm not referring to the 2/3 shift) and doesn't work like your code which really gave me a hard time. So I really want to sort this out, so future viewers won't be having an issue with that part. I think I really made clear where my issue lies and i hope you give it a fair shot of seeing where I'm coming from!

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Hi Josh. No offense taken at all. If anything, I appreciate that you spent time and effort on sorting this out =)
      I think what I was trying to do was to first explain the algorithm in a way that makes the most intuitive sense - shifting the pattern string visually. For that, I had to have a formula that calculates the shift amount. The code, of course, doesn't shift anything - it just adjusts the index to the pattern string, p. As you pointed out, in the animation p looks to be incremented too soon (your example with of what would happen with ---abcxxxabc---- text is spot on). That's the side effect of "shifting" the pattern string. I think if I were to redo the video, I'd explain w/o focusing so much on the shifting.

  • @anonymoussloth6687
    @anonymoussloth6687 3 ปีที่แล้ว

    I don't understand one thing. suppose we have a pattern like this:
    txt: a a a b a (and so on...)
    pat: a a a b b
    lps: 0 1 2 0 0
    In this case, where there is a mismatch at the last a and b, I understand why pat is started at 0. because there is no prefix that is also a suffix at that point in the pattern.
    But why is the txt pointer not shifted back?? How can we be sure that if the pattern is just shifted one place to the right (like in the naive/brute force method) that we won't find a match?
    meaning:
    a a a b a ...
    a a a b b
    I know in this case it is not a match, but how can we be sure that this is the case always?
    My question basically is, in this algo, we have a i pointer that is iterating through the txt and a j pointer that is iterating through the pat. when there is a mismatch, the j pattern is shifted back a certain amount (which i understand why) but the i pointer is not shifted at all. THIS is what I don't understand. j pointer is shifted based on if there is a prefix that is also a suffix and that makes sense. But i dont understand why the i pointer is not shifted REGARDLESS of whether a prefix is found or not.

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Hi Ayush, thanks for posting your question. In your example, there are 4 characters that matched and then there is a mismatch on the 5th character. Now, we know something about the first 4 characters - the fact that they matched contains extra information that Knuth-Morris-Pratt make use of. The fact that the first 'b' character matched, implies that there is no other 'b' characters in the first 3 character sequence. Therefore, we don't need to shift back the text pointer (the i pointer in your example). This is a little tricky but I hope the animation helps visualizing it.

    • @anonymoussloth6687
      @anonymoussloth6687 3 ปีที่แล้ว

      @@stablesort I am not sure what you mean by the first b matching implies there is no other b in the first 3. How can we know that? The algorithm doesn't know if it's the first b or any amount of b.
      If the txt was: a b c b... And the pattern was a b c x then when there is a mismatch between x and b, we don't know if there was a b before. All we know is if there is a prefix that is also a suffix in the pattern before x. And in this case, there isn't because ab is not the same as cx. Basically, I don't understand why the KMP algo doesn't move the I pointer back. If there is a prefix that is also a suffix, that only means that the j pointer should be moved back by that amount. But why is the I pointer not going to the next character in the text like in the naive algo?

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      @@anonymoussloth6687 You are asking the right questions. Let's look at your first example:
      txt: a a a b a (and so on...)
      pat: a a a b b
      lps: 0 1 2 0 0
      I'll try to elaborate a little - the 1st 'b' matched but we have a mismatch on the 2nd 'b'. The pattern string is then slid to the right by 4 and its pointer is reset to 0. This is because the pattern string does not start with 'b' (lps[] for that character is 0). We did match on first 4 characters. The sequence of those characters matters. The brute force algorithm is blind to this idea.
      Let's have a thought experiment: suppose we have a very long text, 1M totally random characters. Then we also have a long pattern string, suppose 1K long. It just so happens that the first 999 characters matched but there is a mismatch on the 1000th character. Now, do you agree that there is no point of sliding the pattern string over to the right by just 1 character (as is done by the brute force algorithm)? There is just no chance that by sliding the pattern by 1 character we would happen to have a full match. Intuitively I hope you can see that we'd probably have to slide the pattern string over by some number close to 999. May be it'd be 995, if the last 4 characters of the pattern string happen to be also the first 4 - the whole prefix idea =). What do you think? Do you buy this argument?

    • @anonymoussloth6687
      @anonymoussloth6687 3 ปีที่แล้ว +1

      @@stablesort I see. Now it's clear. Thank you! And the video was awesome!

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      @@anonymoussloth6687 Cheers!

  • @MatkoFaka
    @MatkoFaka 3 ปีที่แล้ว

    Hi, question for everybody.
    Some examples/code of KMP on the internet use DFA which is basically a table/2Darray. Are those examples wrong or is it just another variant of KMP?

    • @geraldhuang7858
      @geraldhuang7858 ปีที่แล้ว

      The DFA variation is completely analogous to the construction of the prefix array. You encode the DFA by looking at all of the possible prefixes of your pattern and run the string on the automaton. The encoding of the DFA is exactly the same as the prefix table. Or maybe not exactly the same, but an equivalent variant.

  • @asitisj
    @asitisj 3 ปีที่แล้ว

    Noob question : why did you prefer to write a subroutine to function, to create an array?

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      Just to keep the size of each function in check - it's easier to read, IMHO.

  • @achyutsapkota1522
    @achyutsapkota1522 3 ปีที่แล้ว

    I am afraid that, in explanation of algorithm, Shifting by 9-3=6 and 3-0=3 doesn' have consistency. In later case, it seems only two characters are shifted. I apologize in advance, if I have misunderstood

    • @stablesort
      @stablesort  3 ปีที่แล้ว

      Thanks for bringing up this question. You are right, my animation at 6:04 should have moved not by 2 but by try 3 characters. My mistake! By the way, at that point we'd also increment the pointer in the text to the next character. I double checked the code ( bitbucket.org/StableSort/play/src/master/src/com/stablesort/stringmatch/KnuthMorrisPratt.java ) and there it looks to be correct - once we get to index zero of the prefixLen array, which stores -1, we increment both the text and the pattern string pointers.
      p = prefixLen[p];
      if (p < 0) {
      t++;
      p++;
      }
      Thanks for keeping me honest!

    • @achyutsapkota1522
      @achyutsapkota1522 3 ปีที่แล้ว

      @@stablesort Thank you very much for your kind response.I was trying to find a suitable way to explain the algorithm and I thought the shifting formula you mentioned was an easier way to explain KMP algorithm. From an explanation perspective of the algorithm, regardless of the program, I still couldn't understand how shifting by 3 characters in the mentioned section makes the correct representation of the algorithm. I think the formula of shifting ie. "shift by=len-prefixlen[len] " gives incorrect shifting length in the case of "0" in prefixlen[len] .

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      @@achyutsapkota1522 In the above example once we shift by 3, we essentially go "out of bounds" for the pattern string. The illustrations show that the first character, 'a', is at index 1 of the prefixLen array (not index 0, which is probably what is causing confusion since most programming languages use 0 based indexes for strings). So the formula works, it's just that, once we detect the "out of bounds" condition, we reset the pattern string pointer to the first character, and increment the text pointer by 1. In other words, we shift the pattern string so that its first character aligns with the next character in the text. This is what insures that each character in the text is examined at most twice.
      I hope this clarified things a little as opposed to making it even more confusing =) But either way, please let me know.
      Thanks!

    • @achyutsapkota1522
      @achyutsapkota1522 3 ปีที่แล้ว +1

      @@stablesort Thank you very much for your kind explanation. I got it perfectly now. I actually had misunderstood the formula. I had interpreted "shifting by 6" in a different way. Coincidently, that interpretation, though wrong, produced the same shifting scenario. Apologies for taking your time.

    • @stablesort
      @stablesort  3 ปีที่แล้ว +1

      @@achyutsapkota1522 thanks again for posting your question. Hopefully this discussion will help someone else down the road.