RegEx Roman Numerals - Computerphile

แชร์
ฝัง
  • เผยแพร่เมื่อ 26 ก.พ. 2020
  • Working with regular expressions to decode Roman Numerals. Professor Brailsford is on the case.
    EXTRA BITS - Roman Numerals Reminder: • EXTRA BITS - Roman Num...
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

ความคิดเห็น • 269

  • @profdaveb6384
    @profdaveb6384 4 ปีที่แล้ว +151

    Hi everyone,
    The RE that I show under test, with (g)awk , in this video, at approx 10m45s , is indeed /^(I[XV]|V?I{0,3})$/ But as one or two of you have pointed out, when transferring that RE to line-printer paper at an earlier stage, I accidentally write {0-3} rather than {0,3} for the counted repetition . Note carefully that the version with the comma is the correct one.

    • @ajreukgjdi94
      @ajreukgjdi94 4 ปีที่แล้ว +2

      Would there be an issue with covering the empty match by saying I[XV]|VI{0,3}|I{1,3} surely that should fix it?

    • @speedstyle.
      @speedstyle. 4 ปีที่แล้ว +2

      The final regex would not match, for example, XM = 990. Was this intentional, or just a limitation of the StackOverflow you lifted from? 😂
      I would've done
      (M{0,3})([IVXLC][DM]|D?C{0,3})([IVX][LC]|L?X{0,3})(I[VX]|V?I{0,3})
      This matches every possible (single-letter) subtractive sequence using those [IVXLC][DM] [IVX][LC] I[VX] groups. Additionally, the parentheses around the thousands at the start are unneeded, but in a program actually parsing (rather than just matching), they would help by splitting matches into thousands/hundreds/tens/units.

    • @nbrader
      @nbrader 4 ปีที่แล้ว +2

      ​@@ajreukgjdi94 This does indeed prevent the empty string from matching and yet still cover the restricted range, but the important thing is that it doesn't easily generalize to larger number ranges. As soon as you try to combine this with the tens part of a roman numeral you have trouble where you want either the tens or the units to be able to match the empty string but you don't want both to do so at the same time.
      You CAN generalize, but it gets big fast. You can make the correction to the units part as you did in a combination of tens and units parts on one side of a '|' and make the same sort of correction to the tens part in another copy of the combination of tens and units parts on the other side. I hope you can see how this is going to get much larger as you add the hundreds, thousands etc.
      IMO, it's better (if possible) to leave the regex simpler, allowing it to match the empty string. I'm thinking in the case where the regex engine is being used as part of a system powerful enough to check and handle the case of an empty string match itself. Also, who says the empty string isn't a roman numeral?! We need something to represent zero! :)

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว +2

      @@nbrader If you want to allow an empty string as a valid Roman numeral, you can use this regex:
      ^M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$
      And if you don't, you can use this regex:
      ^(?=.)M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$
      That (?=.) at the beginning is a positive lookahead that ensures that the pattern matches at least one subsequent character without 'consuming' that character.

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว +5

      Yes - your fix would work if the problem is confined to non-null strings in the range I to IX. If all else fails the {1,3} construct would deliver a non-null answer. But if this piece of code was used in a wider context you have to allow for similar constructs at the left delivering (say) M or C as a single-character Roman Numbers with no need for a "units" contribution at all . So the puzzle is that the sub-expressions for thousands, hundreds, tens and units must *all* be capable of delivering a null . But you want to impose an overriding condition that not all of them can *simultaneously* deliver a null ! (Nathan Brader makes similar points in his reply below.

  • @LupinoArts
    @LupinoArts 4 ปีที่แล้ว +160

    This "yeah right" on 3:43 reminded me of this joke:
    Linguistics professor: "There are languages where double negative is negative, languages where double negative is positive, and languages where double positive is positive. However, there is no language where double positive is negative."
    Student: "Yeah. Right...."

    • @333peacher4
      @333peacher4 4 ปีที่แล้ว +7

      ok, fine.

    • @KazzArie
      @KazzArie 4 ปีที่แล้ว +3

      Quite the cunning linguist, that professor.

    • @Trisador9
      @Trisador9 4 ปีที่แล้ว +5

      sure, sure.

    • @bananya6020
      @bananya6020 3 ปีที่แล้ว

      oh, i see...

    • @IlyaMalyarenko
      @IlyaMalyarenko 3 ปีที่แล้ว

      Ага, конечно

  • @israelRaizer
    @israelRaizer 4 ปีที่แล้ว +196

    Professor Brailsford is the David Attenborough of computer science

    • @traininggrounds9450
      @traininggrounds9450 4 ปีที่แล้ว +1

      @@Crow_16 "you must HAVE read my mind"

    • @israelRaizer
      @israelRaizer 4 ปีที่แล้ว +1

      @@Crow_16 damn it, I've been exposed

    • @mangiblotarinawabag4964
      @mangiblotarinawabag4964 3 ปีที่แล้ว

      Brailsford is a 100x more intelligent and has a 1000x the charisma than that Attenborough dope.

  • @GuyMahoney
    @GuyMahoney 4 ปีที่แล้ว +25

    IVI would be a better test bed as all the characters are valid but it challenges the conditions alone.

  • @chris24hdez
    @chris24hdez 4 ปีที่แล้ว +106

    Omg tell me he’s doing Audible books

    • @RT-.
      @RT-. 4 ปีที่แล้ว +12

      ASMR

  • @cavalrycome
    @cavalrycome 4 ปีที่แล้ว +89

    8:00 The dash in {0-3} should be a comma: {0,3}

    • @PMA65537
      @PMA65537 4 ปีที่แล้ว +3

      someone beat me to saying this .... probably why Poliakoff of Chemistry got the knighthood and not this channel

    • @curtiswfranks
      @curtiswfranks 4 ปีที่แล้ว +6

      Eh, use whichever notation you want. It does not matter: they all are just symbols and we assign rules to them.

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว +8

      @@curtiswfranks Sure, if you're doing it on paper, but if you want to have a regex that is readable to the regex engines built into various programming languages, there are conventions you need to follow.

    • @curtiswfranks
      @curtiswfranks 4 ปีที่แล้ว +2

      @@cavalrycome True, but he was doing it on paper. The point was communicated just fine.
      I know that "+" meaning addition is just a matter of convention too, but it seems far less variable to me than regex stuff can be. There is nothing stopping the next language developer from using a different notation and getting away with it. Things like that happen all the time in programming languages - the symbology has not yet been well-established.

    • @idogtv
      @idogtv 4 ปีที่แล้ว +2

      He typed it correctly on the actual computer, though.

  • @CliseruGabriel
    @CliseruGabriel 4 ปีที่แล้ว +8

    Professor Yoda has one of the best explanations on this channel with a lot of humor, elegance and simplicity. Grateful to have him around.

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว

      That's *professor* Yoda!

    • @CliseruGabriel
      @CliseruGabriel 4 ปีที่แล้ว

      @@DutchmanDavid correct, i stand corrected.

  • @Henrix1998
    @Henrix1998 4 ปีที่แล้ว +17

    Im disappointed he did not try IIV

  • @MRmagicTT
    @MRmagicTT 4 ปีที่แล้ว +1

    Definitely one of the best N̶u̶m̶b̶e̶r̶p̶h̶i̶l̶e̶ Computerphile videos!

  • @hugovangalen
    @hugovangalen 4 ปีที่แล้ว +75

    That's spelled "RegEx", not ReGex...

  • @Alkoluegenial
    @Alkoluegenial 4 ปีที่แล้ว +7

    "I should've known that one."
    "Yeah, right."

  • @ravewulf
    @ravewulf 4 ปีที่แล้ว +4

    I have "looked ahead" and spotted where the next video will go

  • @CatzHoek
    @CatzHoek 4 ปีที่แล้ว +52

    The ultimate test is: Type in something utterly idiotic like "Hello".
    That would be hilarious if taken out of context. GIF that and reply that to someone when you get a text that just says "Hello" or "Hi",

    • @pablorepetto2759
      @pablorepetto2759 4 ปีที่แล้ว +1

      @MichaelKingsfordGray That is _also_ not the ultimate test.

  • @MrMurc2
    @MrMurc2 4 ปีที่แล้ว +6

    I don't really think this is a flaw because 0 in Roman numerals was actually null, you could leave it blank or write nulla.

    • @IronicHavoc
      @IronicHavoc 2 ปีที่แล้ว +1

      The issue is that it would match non roman numerals because every string begins with/contains an empty string. Generally matching unanchored null strings will give you an absurd amount of meaningless results, since there can be null strings within any given string.

  • @macdjord
    @macdjord 4 ปีที่แล้ว +1

    Ha! Yes! I paused the video to watch the extra bit. Before resuming this one, I decided to try writing the regex myself. My solution is character-for-character identical to the given one, except that I put parens around 'M*' to that you could extract the thousands place with a capture group.

  • @JulianOnions
    @JulianOnions 4 ปีที่แล้ว

    Came out nicely!

  • @apenasmeucanal5984
    @apenasmeucanal5984 4 ปีที่แล้ว

    thanks for your content sean!

  • @jan_harald
    @jan_harald 4 ปีที่แล้ว +19

    I'd thing there's a bad practice in the line matching...
    ^aaa|bbb$ would match aaa at the start of the line, OR bbb at the end
    but with more complex cases, anything with an unlimited range matching, it will also match anything beginning with aaa or ending with bbb, where aaa or bbb is not the only thing on the line
    the or stops line matching
    ^aaa$|^bbb$ or ^(aaa|bbb)$ would be better

    • @cubingkid
      @cubingkid 4 ปีที่แล้ว +5

      jan harald That’s how he wrote it in the awk script, he just didn’t mention it when he was writing it out on paper

    • @MccZerk
      @MccZerk 4 ปีที่แล้ว +6

      He enclosed the entire Regex within the ^ and $ inside brackets in the script.

  • @Shyhalu
    @Shyhalu 4 ปีที่แล้ว +31

    Some day I wish to be as insane as the people who regularly build regex expressions.

    • @yearswriter
      @yearswriter 4 ปีที่แล้ว +6

      They are addictive enougth, I for once need to force me to stop

    • @thirdvect0r
      @thirdvect0r 4 ปีที่แล้ว +3

      @Alexander Radev I'm trying to parse [x]html with regex can anybody help?

    • @BaronSamedi1959
      @BaronSamedi1959 4 ปีที่แล้ว +1

      @THIRDVECTOR Nope, it can't be done in standard regexes. You will need Perl's extended regexes, especially those as implemented in Perl 6 or Raku as it is now known.

    • @dougwitherspoon3990
      @dougwitherspoon3990 4 ปีที่แล้ว

      @@thirdvect0r Just use Python & Beautiful Soup

    • @SimonBuchanNz
      @SimonBuchanNz 4 ปีที่แล้ว +2

      @@thirdvect0r Common joke, but only works for validating. If you're parsing you are pulling structure out, meaning you are already using multiple invocations, going back and forth. In this usage, regex is pretty much perfect.

  • @imamalox
    @imamalox 4 ปีที่แล้ว +1

    I have a roman numeral clock that uses IIII for some reason. It was made in Nottingham as well! Obviously they didn't attend you guys' classes.

  • @apreslavague1557
    @apreslavague1557 4 ปีที่แล้ว

    Very interesting !

  • @johnks6733
    @johnks6733 4 ปีที่แล้ว +2

    How would it respond to LIVE IIV or VIIII

  • @nodroGnotlrahC
    @nodroGnotlrahC 4 ปีที่แล้ว

    Does the regex ^$ match the empty string, or not match anything at all?
    Because if it matches the empty string, then ^I[XV]|V?I{0-3}$ also matches the empty string.

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว

      It does. You can see in his regex on screen that he gets around that using something called a lookahead at the beginning of the regex that ensures that the next character is one of m, d, c, l, x, v, or i.

  • @EebstertheGreat
    @EebstertheGreat 4 ปีที่แล้ว +2

    The modern method of writing Roman numerals (which has been around for over 2000 years) where subtraction is implied if a smaller numeral precedes a larger one is a bit inconvenient. For instance, these days, we usually write 4 as IV (or iv), but originally, it would always have been written IIII. The modern notation is more compact (compare CDXCIX = 499 to CCCCLXXXXVIIII, for instance), but it is much harder to compute with. Say I want to add XXVII to XXII, for instance. Originally, I would just have to put the two together (XVIIXXI) and then arrange them in order from greatest to least (XXXXVIIII) to get 49. But in modern notation, I have to transform my XXXX into XL and my IIII into IV to get XLVIV, then recognize the "bad" sequence VIV and transform it to IX to get XLIX. Or what if I want to multiply VI by IIII? I can just put one copy of VI for each I (VIVIVIVI), arrange them in order (VVVVIIII), and combine pairs of Vs into Xs (XXIIII). But with modern notation, it would be VI times IV, and there is no obvious algorithm--I will use multiplication tables instead. Granted, division is annoying either way, but that's sort of just the nature of division. I would instead use an abacus or counters or something.
    You get the impression today that math must have been very difficult with Roman numerals, but in fact it was pretty easy, at least for simple arithmetic with whole numbers. And to the delight of third graders, you didn't even have to memorize multiplication tables.

    • @KuraIthys
      @KuraIthys 4 ปีที่แล้ว

      That's interesting. I did not know there was an older form of notation.
      The older form does indeed seem a lot simpler, even if the result is a lot lengthier...
      And yeah, division is never simple. XD

    • @EebstertheGreat
      @EebstertheGreat 4 ปีที่แล้ว +1

      @@KuraIthys Even now, clock and watch faces with roman numerals usually use IIII for 4 rather than IV. But they do use IX for 9 rather than VIII, I guess for space reasons.

  • @mikefochtman7164
    @mikefochtman7164 4 ปีที่แล้ว +4

    Now I want to see an awk script that recognizes each of those groupings, but actually calculates the value of the number entered and prints it out. :) That would be some nice 'awk'.

    • @speedstyle.
      @speedstyle. 4 ปีที่แล้ว

      I was about to reply with a script, but I don't think there's a nice way of doing that in pure awk since you can't get individual match groups.

    • @varkokonyi
      @varkokonyi 4 ปีที่แล้ว

      @@speedstyle. When I saw there is one single reply, I was sure it's a scary one-liner

  • @shifty6940
    @shifty6940 4 ปีที่แล้ว +3

    Wouldn’t the positive lookahead "(?=[MDCLXVI])" he has at the beginning of his pattern ensure there are no zero length matches? A positive lookahead asserts that the pattern is matched without consuming the characters.

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว +1

      Yes, it does, but you also need to have the start and end symbols (^ and $) to prevent the regex from matching strings like 'IIII' (four I's in a row). Without the start and end symbols, the regex will successfully match 'III' and stop there before ever getting to the fourth 'I' so the regex tester will return true as if it's a valid roman numeral. Or you could put \b symbols on either side of the regex to match word boundaries instead of using the start and end symbols like so: \b(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})\b

    • @novamc7945
      @novamc7945 2 ปีที่แล้ว

      ​@@cavalrycome I know I'm 2 years later, but even that pattern fails to match. It just goes on to prove that regex cannot be applied everywhere.
      If you take the example text:
      MDCCCC
      It will match it. Even though it's invalid roman numeral. The same case for any roman numeral containing 4 of a letter. Example:
      MDDDDC
      MDXXXXIV

    • @cavalrycome
      @cavalrycome 2 ปีที่แล้ว

      @@novamc7945 You're right, my mistake. The problem was with the word-boundary matching \b symbols because they can both match the same left boundary while still satisfying the positive lookahead. You can correctly exclude the bad examples of roman numerals you mention if you use the start and end symbols (^ and $) though like so: ^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$

    • @novamc7945
      @novamc7945 2 ปีที่แล้ว

      @@cavalrycome No, that still matches MDDDD which shouldn't match, at least i dont think so

    • @cavalrycome
      @cavalrycome 2 ปีที่แล้ว

      @@novamc7945 In the regex tester I'm using, MDDDD doesn't match the version I included in my comment from 2 days ago. MD matches, but anything with more than one D after the M doesn't. I'm using phpliveregex to test it. What are you using?

  • @SergeMatveenko
    @SergeMatveenko 4 ปีที่แล้ว

    Roman numerals tutorial is a great thing for Numberphile btw.

  • @hankcohen3419
    @hankcohen3419 4 ปีที่แล้ว

    Perhaps the fact that the empty string is a match is not really a problem because there is no symbol for zero in Roman Numerals.
    I also have an unrelated question from the history of Unix for Professor Brailsford. Why do we call the command interpreter a shell?

  • @superscatboy
    @superscatboy 4 ปีที่แล้ว

    It's worth noting that the modern idea of what is/isn't a valid Roman numeral was quite unknown to the actual ancient Romans that invented and used them.

  • @MrSigmaSharp
    @MrSigmaSharp 4 ปีที่แล้ว

    This great video is up for around half an hour now and there are already two downvotes on it. I imagine these people have the bell icon clicked just to show up and downvote this.
    Prof. Brailsford is the best

  • @AndersJackson
    @AndersJackson 4 ปีที่แล้ว +4

    3:20
    "I should have known that one"
    "Yeah, right"
    😜

  • @ArturdeSousaRocha
    @ArturdeSousaRocha 4 ปีที่แล้ว +7

    So the professor upgraded from XP to Win10?

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว +5

      Not so! i upgraded from Win7 to Win 10 :-)

  • @kristofersokk1580
    @kristofersokk1580 4 ปีที่แล้ว +1

    At the end, does that complicated regex also allow "MIM" for example?

    • @dealloc
      @dealloc 4 ปีที่แล้ว +3

      No it doesn't match. "MIM" is not a valid roman numeral because "IM" is not subtractive. If you meant 999 it would be CMXCIX which is (100 - 1000) + (10 - 100) + (1 - 10) = 999 - note that all the numbers are subtractive and added together.
      If you look at the regular expression for individual rule, it's easy to see the pattern:
      (C[MD]|D?C{0,3})
      First it matches the subtractive numerals in that group (CM, CD), and then the additive (D, DC, C).

    • @nicktuckwell
      @nicktuckwell 4 ปีที่แล้ว

      ui_wizard is that a recognised convention for Roman numerals? The BBC date stamp their programmes using Roman numerals, and definitely used MIM for 1999.

    • @dealloc
      @dealloc 4 ปีที่แล้ว +1

      @@nicktuckwell Oh, I didn't know that. That's a fun fact! I was curious and looked up to find an explanation and BBC News wrote about this in 1999 whether to use MIM or MCMXCIX. They write that BBC had chosen the longer version.
      The thing with roman numerals is that there are different variants that evolved through history. MIM may just be a variant to allow shortening numerals for large numbers (such as 1999). But universally it is not valid.
      This of course will only work for numerals that are identical, regardless of the direction that it is read (like a palindrome).
      It is all about communication. I'm sure that people who wrote and read numerals could communicate unambiguously with flexible rules.

  • @Sigmath_Bits
    @Sigmath_Bits 4 ปีที่แล้ว +1

    Would /I?[XV]|V?I{1,3}/ not match Roman Numerals 1 to 10 without matching the empty set?

    • @NaviSly
      @NaviSly 4 ปีที่แล้ว

      Yeah but now it matches "VV" too :(
      A solution to not match the empty string is a look-ahead (?=[MDCLXVI]) or the assertion \b.

    • @Sigmath_Bits
      @Sigmath_Bits 4 ปีที่แล้ว

      @@NaviSly I don't see how it matches "VV".
      There are other issues, though, like it won't actually necessarily match "VIII" as it'll first match "V" and ignore the Is. But you could force full match just with /^(?:I?[XV]|V?I{1,3})$/ instead.

    • @NaviSly
      @NaviSly 4 ปีที่แล้ว +1

      @@Sigmath_Bits My bad I failed to put the anchors thus it matched 'VV'... Which is the same issue as "VIII"!

  • @ajreukgjdi94
    @ajreukgjdi94 4 ปีที่แล้ว

    Would there be some form of concatenation for matching a repeated set of chars. So if AB* matches ABBBB... Does "AB"* match ABABABAB....

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว

      To match a sequence of zero or more copies of the string 'AB', the syntax is (AB)*.

  • @MattMangels
    @MattMangels 4 ปีที่แล้ว

    I just wrote my first C++ program which decodes Roman Numerals. So funny that you should post this today.

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว

      Did you use regex::?

    • @MattMangels
      @MattMangels 4 ปีที่แล้ว

      ​@@DutchmanDavid I used regex:: for validation and then I realized the same thing that the Professor did in this video, that it didn't take order into account.

    • @Diggnuts
      @Diggnuts 4 ปีที่แล้ว

      Ha, This week I was working on a museum piece that is going to have a digital tabula with which you can do math with roman numerals. It must be something in the air!
      The conversion functions weren't the problem.. the convoluted neural network for letter recognition definitely was!

  • @horntx
    @horntx 4 ปีที่แล้ว

    I have yet to master the arcane wizardry that is RegEx. Is there a way to recuse them? That seems like it could be a nice scaleable solution.
    RE: ^RE? M*C*X*V*I*$
    where RE is the regex recused with the bit before the M C X V or I
    this would have some holes, like for example MIV would pass this.

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว

      You can use (?R) to recursively embed a copy of the whole pattern inside itself. There are also ways of referencing a part of a pattern multiple times including inside itself.

  • @cmdlp4178
    @cmdlp4178 4 ปีที่แล้ว +3

    11:45 "Here is the AWK script, that I hacked together" :D

  • @leokeuken9425
    @leokeuken9425 4 ปีที่แล้ว

    The one thing not tested is the empty Input for which the start and end of line anchors were put in!

  • @LockeSoriku
    @LockeSoriku 4 ปีที่แล้ว

    Like the markers; easy to read

  • @baracek8797
    @baracek8797 4 ปีที่แล้ว

    You didn't cover what happens when you have nothing (a blank line) on the line for the first case /^(I[XV]|V?I{0,3})$/ what happens if you just hit enter when testing your awk script? I guess it will be covered in the empty matches.

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว

      yes if i just hit as an input it does indeed say “Pattern P0 matches “ without saying *what* it matches. So it has indeed matched the empty string.

  • @guffaw1711
    @guffaw1711 3 ปีที่แล้ว

    I've seen at least half a dozen of regex expressions to match roman numerals. That got me really concerned and anxious. All I wanted to do was match them in the meta data of music files, so that when a song title includes a roman numeral I can make that one upper case. I had quite a few mismatches with the regex I googled first. That's when I found it that there doesn't seem to be the one best approach.
    While I could try I'm not even attempting to replace the ASCII representations of them to Unicode equivalents, even though it's tempting and I generally like replacing ASCII fauximiles with true Unicode characters.. But nah....

  • @NVSINK
    @NVSINK 4 ปีที่แล้ว

    I used to have that syncmaster monitor. RIP.

  • @JasonCliftJones
    @JasonCliftJones 4 ปีที่แล้ว

    3:40 - Only ever heard that as "Columbus sailed the ocean blue..." before. Hmmm...

  • @BooBaddyBig
    @BooBaddyBig 4 ปีที่แล้ว +3

    It's a little known fact that the fall of the Roman Empire coincided with the invention regular expressions. This lead to the dark ages, and REs were only rediscovered after the invention of arabic numerals. ;)

    • @marsgal42
      @marsgal42 4 ปีที่แล้ว

      I always thought it was because their C programs could never execute successfully.

    • @pauldaniels3176
      @pauldaniels3176 4 ปีที่แล้ว

      @@marsgal42 No, they worked - but only 100 times

  • @ZipplyZane
    @ZipplyZane 4 ปีที่แล้ว

    At first I thought the code at the beginning of each line was going to deal with numbers larger MMMM -- also known as (IV). But then I remembered that parentheses are special in regex.
    As of this comment, this video has (LI)CCXXXIX views.

  • @vyvlad
    @vyvlad 4 ปีที่แล้ว

    "overall you must have something is not a context free assertion" - anything you can decide in constant memory with a single scan through the string is a regular assertion. It might be tricky to make a regex, but you could modify the corresponding FSM to store that extra bit and flip it on the first character - requiring at most twice there number of states.

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว

      There seems to be a tradition within the field of regular expressions of using the terms 'context free' and 'context sensitive' with a different meaning to the way those terms are used in formal language theory. At 10:09 in the video, I think that when he says 'context free', he just means regexes without any lookarounds, and by 'context sensitive', he means regexes that include them.
      Secondly, in terms of formal language theory, I believe modern regex engines (of at least the last thirty years or so) are capable of matching not just regular languages, but all context free languages. There seems to be quite a lot of confusion about this, probably because they're still called 'regular' expressions.

    • @ianzen
      @ianzen 4 ปีที่แล้ว

      cavalrycome Can you given an example of regex parsing the arbitrary nested parenthesis problem?

    • @vyvlad
      @vyvlad 4 ปีที่แล้ว

      @@ianzen '/(\((?>[^()]+|(?1))*\))/'

    • @vyvlad
      @vyvlad 4 ปีที่แล้ว

      @@cavalrycome By either definition matching the intersection of a regular language with non empty strings is easier than context sensitive.

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว

      @@ianzen Here's another solution to add to vlad's that will deal with nested parantheses. This version allows for the mixing of round, curly and square brackets: /^(?P(?P[^(){}\[\]]*)(?:\((?P>nbr)(?P>ab)?(?P>nbr)\)|\{(?P>nbr)(?P>ab)?(?P>nbr)\}|\[(?P>nbr)(?P>ab)?(?P>nbr)\]))*(?P>nbr)$/

  • @dasten123
    @dasten123 4 ปีที่แล้ว +8

    3:05 I think this is what you will see if one of your eyeball gets pulled out while it's still being connected

  • @tzisorey
    @tzisorey 4 ปีที่แล้ว

    Everybody stand back! I know Regular Expressions!
    .
    There's an XKCD for everything.

  • @vzr314
    @vzr314 4 ปีที่แล้ว +1

    Q: Would it be more efficient to write a library which convert and compare integers, in terms of CPU cycles?

    • @oresteszoupanos
      @oresteszoupanos 4 ปีที่แล้ว

      How would it convert? It would need to parse the Roman numeral first ;-)

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว

      A: no, it wouldn't. As Orestes said: they need to be parsed first (because the input is text)

    • @moczikgabor
      @moczikgabor 4 ปีที่แล้ว

      @@DutchmanDavid I haven't tried, but I think, it would be possible to compare without converting it to binary representation first. Might not be faster in CPU cycles, but would be an interesting task.

  • @nbrader
    @nbrader 4 ปีที่แล้ว

    I did something similar with 'IX|IV|V?I{0,3}'. 'IX|IV|V?I?I?I?' also works. They're all the same length. I don't think it's standard to use {0-3} instead of {0,3} (at least my regex engine in notepad++ doesn't work that way).

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว +1

      0-3 was a mistake. See the pinned comment at the top of this listing

    • @nbrader
      @nbrader 4 ปีที่แล้ว

      @@profdaveb6384 Thanks for the reply and thanks for the videos.

  • @Bratjuuc
    @Bratjuuc 4 ปีที่แล้ว

    What about converting these numbers into usual ones?

    • @Diggnuts
      @Diggnuts 4 ปีที่แล้ว

      That is not a trivial problem, but it can be done. Additionally when you have roman2dec and dec2roman functions, you automatically have a way to check validity. Just convert to decimal and back again. If the results do not match, the initial numeral was invalid.
      This is my Roman2decimal function in JS:
      function Roman2INT(i){
      i = i.toUpperCase();
      var total = 0,
      cn,
      ncn,
      rnTable = new Array();
      rnTable['I'] = 1; rnTable['V'] = 5;
      rnTable['X'] = 10; rnTable['L'] = 50;
      rnTable['C'] = 100; rnTable['D'] = 500;
      rnTable['M'] = 1000;
      for (parseRN = 0 ;parseRN < i.length; parseRN++){
      cn = i[parseRN];
      if (parseRN + 1 != i.length){
      ncn = i[parseRN + 1];
      if (rnTable[cn] >= rnTable[ncn]){
      total += rnTable[cn] ;
      } else {
      total -= rnTable[cn];
      }
      } else {
      total += rnTable[cn] ;
      }
      }
      checkRN = INT2Roman(total);
      if (i == checkRN){
      return total;
      } else {
      return false;
      }
      }

  • @realshaoran4514
    @realshaoran4514 4 ปีที่แล้ว

    What is (?=[MDCLXVI]) for? I don't get the ?= part. I can't find anything useful in the man page.

    • @binaryagenda
      @binaryagenda 4 ปีที่แล้ว +1

      It's an assertion that the start of the string be followed by something matching [MDCLXVI] , i.e. one of those 7 characters. It prevents the pattern matching the empty string.

    • @realshaoran4514
      @realshaoran4514 4 ปีที่แล้ว

      Ah, ok thanks

  • @JoJoModding
    @JoJoModding 4 ปีที่แล้ว +11

    "There must be something is a context-sensitive assertion"..
    Is it, though? REG is closed under set difference, so you can just subtract epsilon and now your automaton will no longer accept nothing. Now build a regex out of that.

    • @joshuakb2
      @joshuakb2 4 ปีที่แล้ว +3

      I[VX]|V|V?I{1,3}

    • @olfmombach260
      @olfmombach260 4 ปีที่แล้ว +1

      Maybe he meant the ^ and $ assertions? They are not regular.
      But I'm unsure either, I was confused too

    • @jazzpi
      @jazzpi 4 ปีที่แล้ว

      If I'm not completely mistaken, regular languages are the languages accepted by FSMs. It's obviously quite easy to build an FSM that only accepts "something", so I'm not sure what the professor meant here.
      Building the RegEx gets a lot more annoying when you account for the empty word, though.

    • @ib9rt
      @ib9rt 4 ปีที่แล้ว +8

      *JoJoModding* The professor means that although having part of a number missing is fine, having every part missing is not fine. Therefore deciding whether "nothing" is a valid match must be decided in the context of what came before and after, you cannot decide it by itself. To make this concrete, "MMI" is valid, as is "MM" (no "I"), as is "I" (no "MM"), but "" is not a valid match (no "I" nor "MM" nor anything else). In other words "" is a valid match if it comes after "MM" or if it comes before "I", but not by itself.

    • @globalincident694
      @globalincident694 4 ปีที่แล้ว +1

      I guess the thing is, he's going to end up with a really complicated assertion that is completely correct apart from the fact that it accepts the empty string. Then the easiest way to fix that is to put (?=.) at the beginning. Lookahead assertions are context sensitive.

  • @Wecoc1
    @Wecoc1 4 ปีที่แล้ว +1

    But how do you get its value on $1?

    • @cavalrycome
      @cavalrycome 4 ปีที่แล้ว +1

      The value in the variable $1 is the value of the first capture group. Capture groups are parts of a pattern enclosed within round brackets.

  • @infirmux
    @infirmux 4 ปีที่แล้ว

    Fantastic. And this is kindergarten level RegEx :) And RegEx problems :D

  • @clickrick
    @clickrick 4 ปีที่แล้ว +1

    Unix v7? Presumably on a Vax.
    We had v6 on a PDP 11/34 in 256k of RAM. Ah, happy days!

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว +1

      Yes - and before that we ran V7 on a PDP 11/70

  • @hrnekbezucha
    @hrnekbezucha 4 ปีที่แล้ว +1

    Let's not forget about the people who write four as IIII and eight as IIX etc.

    • @Alien426
      @Alien426 4 ปีที่แล้ว +2

      Like Activision who have a game called "Call Of Duty: Black Ops IIII".

  • @mhoover
    @mhoover 4 ปีที่แล้ว +1

    I learned it as Columbus sailed the ocean blue.

  • @kalelsoffspring
    @kalelsoffspring 4 ปีที่แล้ว

    Wait, that final expression would say that something like "CMCD" is valid. But, it's not.... is it?

    • @5hirtandtieler
      @5hirtandtieler 4 ปีที่แล้ว +1

      I was confused as well, but the C’s can be seen as subtracting. So:
      CM = 100 less than 1000 = 900
      +
      CD = 100 less than 500 = 400
      = 1300
      But you’re right, it’s technically invalid since youd actually write it as MCCC

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว

      Yes if you follow the llink in the Info block to the EXTRA BITS for this video it does a revision lesson on Roman Nos and this includes CM CD XL XC IX and IV

    • @kalelsoffspring
      @kalelsoffspring 4 ปีที่แล้ว

      @@profdaveb6384 Sure, CM as 900 is valid. And yes, CD as 400 is valid. But combining them as CMCD for 1300 isn't valid. You'd instead use MCCC. Are you trying to tell me that CMCD is a valid way to represent 1300?

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว +1

      @@kalelsoffspring Yes I agree 1300 is MCCC and *not* CMCD . The regex I'm using in the video only covers I - IX but in the construct I[XV] that is part of the regex this says "do we have an I followed by an X or a V" standing for 9 or 4 respectively. If so you have to make a choice between X or V. You can't have both. As one develops the RE leftwards we see X[LC] . If so you can chose either of XL or XC if they are what you need. But you don't have to use them and you absolutely can't have both of them present. Same goes for CM and CD.

    • @kalelsoffspring
      @kalelsoffspring 4 ปีที่แล้ว

      @@profdaveb6384 actually, I simply mistook the positive lookahead (seen at 12:10) at the beginning to be capturing the text. Ignoring that part, the rest is fine and I don't see it matching CMCD, ignore me haha

  • @t71024
    @t71024 4 ปีที่แล้ว

    OMG the dreaded Dragon Book!
    Luckily, the Romans didn't have access to a RE compiler so they were sometimes caught writing things to be frowned upon such as, IIII or even IC.

  • @morgansearle3912
    @morgansearle3912 4 ปีที่แล้ว

    1944? Must love the Sabaton song 'Uprising'.

  • @philipcoltharp918
    @philipcoltharp918 4 ปีที่แล้ว +1

    Is there a "AND" in regex?

    • @SaisBlade
      @SaisBlade 4 ปีที่แล้ว

      Yes, just a regular string :P
      You can think of the regex 'foo' as 'f' AND THEN 'o' AND THEN 'o'. If any one of them os missing, the string wont match.

    • @philipcoltharp918
      @philipcoltharp918 4 ปีที่แล้ว

      @@SaisBlade there is another way. But I dont quite understand it yet. There's a (?=) syntax in some regex versions, that, I think, forces a range of characters to match two or more regular expressions.

  • @Hendiadyoin1
    @Hendiadyoin1 4 ปีที่แล้ว

    Have I miss seen that or does it not match MIC

    • @Tahgtahv
      @Tahgtahv 4 ปีที่แล้ว +2

      You're looking for MXCIX

    • @thijsvandenhout6279
      @thijsvandenhout6279 4 ปีที่แล้ว +1

      You can only place I before V

    • @The_Jzoli
      @The_Jzoli 4 ปีที่แล้ว +1

      @@thijsvandenhout6279 uuuh, and X

  • @shitabtanzim2037
    @shitabtanzim2037 3 ปีที่แล้ว

    WOW

  • @bradrjones1
    @bradrjones1 4 ปีที่แล้ว

    Have it be known he uses pico and that's totally fine.

  • @trevor2453
    @trevor2453 4 ปีที่แล้ว

    Dr. Brailsford's monitor may be older than me.

  • @2211djolo
    @2211djolo 4 ปีที่แล้ว +1

    And what about "IM" ?

    • @qwertyTRiG
      @qwertyTRiG 4 ปีที่แล้ว +1

      And 1999 as MIM. I'm not sure the expression as written would catch that.
      But would you write 999 as IM or as CMLXXXVIII? I've always seen 49 as XLVIII, not as IL.

  • @frognik79
    @frognik79 4 ปีที่แล้ว +1

    Stabilization please.

  • @SteveGouldinSpain
    @SteveGouldinSpain 4 ปีที่แล้ว +1

    Is it just an ironic coincidence that the prof has lint on his jumper?

  • @nimasamii
    @nimasamii 4 ปีที่แล้ว +1

    I?[XV]|V?I{1,3} should match all roman numbers from 1 - 10 and not match on the empty set.

    • @peterdegelaen
      @peterdegelaen 4 ปีที่แล้ว

      Yes, you are right. But you really do need the ^ in the beginning and the $ at the end. Otherwise your regex would match e.g. IVII which is an invalid Roman numeral.

  • @farrongoth6712
    @farrongoth6712 4 ปีที่แล้ว +10

    Surely the empty match is 0? 😉

    • @Tinker001
      @Tinker001 4 ปีที่แล้ว +1

      Now go discuss the concept of zero with an ancient roman...

  • @siddharthjoshi1656
    @siddharthjoshi1656 4 ปีที่แล้ว

    How do I know if I've got the prerequisites?

  • @DrMedikill
    @DrMedikill 4 ปีที่แล้ว

    This is like my 3rd video wathcing you and I just notcied you use PICO like me. Guess I'm a bit old-fashioned.

  • @liquidsonly
    @liquidsonly 4 ปีที่แล้ว +1

    Every form with a required date, from now on, is going to filled in, in Roman numerals. Let my local council deal with that. The forms they send do not specify Arabic (or indeed Indean) numerals.

  • @BassChopsBri
    @BassChopsBri 4 ปีที่แล้ว +2

    What about the elephant in the room?

  • @ShinmenTakezo1234
    @ShinmenTakezo1234 4 ปีที่แล้ว

    Lol this is recommended to me just totally out of my knowledge but I didn't quit because of his voice.

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว

      This isn't the first regex video, so you've missed a few :)

  • @hj45lp
    @hj45lp 4 ปีที่แล้ว +2

    ❤️ awk ❤️

    • @lxathu
      @lxathu 4 ปีที่แล้ว

      It's my first love on *x but should have received a port of, at least, the \d and \D shorthands from perl long ago. POSIX character classes are annoyingly diffuse when simple things should be dealt with.

  • @Jrakula10
    @Jrakula10 4 ปีที่แล้ว

    isnt 0 nothing at all, so isnt nothing a valid expression?

  • @MCLooyverse
    @MCLooyverse 3 ปีที่แล้ว +1

    In JavaScript (and in any easy number parser I've built), "" (the empty string) is mapped to 0, which I think is acceptable. Certainly, it's the only way I know of to write 0 with Roman numerals.

  • @loscoz
    @loscoz 4 ปีที่แล้ว

    how about iiv does that pass

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว

      Nope. IIV (Roman numbers are always in capital letters) isn't a Roman numeral (you may only put one 'subtractive' number before a higher one - That's why 8 is VIII and never IIX). There are also never 4 repeating characters. That's why clocks that use a Roman IIII to indicate 4 completely wrong.

  • @TheGreatAtario
    @TheGreatAtario 4 ปีที่แล้ว

    Pico? *Pico??*

  • @Endelin
    @Endelin 4 ปีที่แล้ว +2

    I[VX] | V | V?I{1-3} should handle the empty case

    • @GammaFn.
      @GammaFn. 4 ปีที่แล้ว +1

      True, but that doesn't generalize to larger numbers (where you want "X" to match, for example)

    • @renschranger
      @renschranger 4 ปีที่แล้ว

      Yes but only for the group 1 to 9. If you want to use these groups to work together for different decimal places it will not work out this way. `(X[LC] | L | L?X{1,3})(I[VX] | V | V?I{1,3})` does not match `XX`.

    • @PeterAuto1
      @PeterAuto1 4 ปีที่แล้ว

      @@renschranger "(X[LC] | L | L?X{1,3})(I[VX] | V?I{0,3}) | (I[VX] | V | V?I{1,3})" I think this would do

  • @DFPercush
    @DFPercush 4 ปีที่แล้ว +1

    +1 for pico/nano instead of vim :p

    • @jdratlif
      @jdratlif 4 ปีที่แล้ว

      That's something that stood out to me as well. :)

  • @Guergeiro
    @Guergeiro 4 ปีที่แล้ว +2

    Honestly, using Nano as an editor...

    • @HebaruSan
      @HebaruSan 4 ปีที่แล้ว +1

      Nope, Pico

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว

      I'm almost surprised he doesn't use emacs.

  • @greenix21
    @greenix21 4 ปีที่แล้ว

    Neat

  • @giga-chicken
    @giga-chicken 3 ปีที่แล้ว

    He uses the nano editor.

  • @BrechtPallemans
    @BrechtPallemans 4 ปีที่แล้ว +4

    add a regexMatch.length()>0 in your code :p

  • @Diggnuts
    @Diggnuts 4 ปีที่แล้ว

    Odd way to tackle it. It is nice to know if the numeral is valid, but surely at one point you'd have to do a conversion to actually use it? Well, if you have a roman2decimal function and a decimal2roman function, you automatically gain the ability to check if the numeral is valid. All you have to do is convert to decimal and back to roman. If the results differ, the numeral was wrong!

    • @Diggnuts
      @Diggnuts 4 ปีที่แล้ว

      Seeing as we are talking roman numerals, there are only a handful of numbers with the standard convention only being able to go up to 3999.
      Also, both the decoding and encoding are completely linear and won't take up much more cycles than running it through rege, so what you win with per-validating, you actually lose in conversion and seeing as you might as well want the ability to convert the numbers in either direction, it makes sense to validate at the end.
      I have written such a library and it is perfectly functional and efficient.

    • @Diggnuts
      @Diggnuts 4 ปีที่แล้ว

      Just checked. The fastest method is.... A lookup table.. Then the conversion and the least efficient way, assuming you not only want to check for validity but also want a conversion, Regex... I'm not really surprised.

  • @BrunoVanVaerenbergh
    @BrunoVanVaerenbergh 4 ปีที่แล้ว +1

    "yeah right" lol

  • @user-vn7ce5ig1z
    @user-vn7ce5ig1z 4 ปีที่แล้ว +3

    2:32 - "MIIM"? 🤨 Per Latin pronunciation, wouldn't that be pronounced meem? 🤔 This is a nerd-meme?
    3:39 - Hmm, interesting. In North America, we say "Columbus sailed the ocean blue".
    4:11 - You'd essentially have to code the rules for Roman numerals into the regex code. 😕
    8:03 - Shouldn't that be a comma, not a hypen? 😕 I can't find it as a new/obscure notation, so I assume it's a -typo- writeo. 🤔
    16:44 - -Can't you just wrap the whole pattern in "^(…)+$" to force it to match at least one of them?- Nope, because the "{0,3}" parts match the empty-string, so having an empty string would fulfill that and thus fulfill the "(…)+" requirement. The solution is then to have a non-empty-string check before the regex match.

  • @charlescox290
    @charlescox290 4 ปีที่แล้ว +1

    Ok, it bothered me that he didn't put the grouping symbols in his hands written expression, but he included it in the code. I guess he he just forgot it when he have written it. I'm surprised he didn't look at it and realize it looked odd.

  • @JochCool
    @JochCool 4 ปีที่แล้ว +1

    This is my solution before looking at his. I know it's very long but I couldn't think of a better way.
    (?i)\b(?=[IVXLCDM])M{0,3}(C(C{0,2}|D|M)|DC{0,3})?(X(X{0,2}|L|C)|LX{0,3})?(I(I{0,2}|V|X)|VI{0,3})?\b
    Breaking it down:
    (?i) makes it case insensitive
    \b at either end to not match roman numerals in the middle of words
    (?=[IVXLCDM]) there must be at least 1 roman numeral
    M{0,3} the thousands place
    (C(C{0,2}|D|M)|DC{0,3})? the hundreds place. I copypasted this for the tens and ones place. Basically this means: it either starts with a C or a D, followed by more Cs (or you can also have CD and CM).

  • @aitchpea6011
    @aitchpea6011 4 ปีที่แล้ว +1

    Prof. Brailsford uses Nano? Not vim or emacs? Woohoo, validation for my text editor choices!

    • @HugoNikanor
      @HugoNikanor 4 ปีที่แล้ว +1

      Sorry, it's pico.

    • @aitchpea6011
      @aitchpea6011 4 ปีที่แล้ว

      @@HugoNikanor Close enough. It's not vim or emacs, lol

    • @Monk-E
      @Monk-E 4 ปีที่แล้ว

      @@aitchpea6011 nano works fine I like

  • @aggaming9925
    @aggaming9925 2 ปีที่แล้ว

    Hii

  • @donwald3436
    @donwald3436 4 ปีที่แล้ว +3

    {0-3} 😂 Oh well he figured it out.

  • @UsernameThomaSS
    @UsernameThomaSS 4 ปีที่แล้ว

    I'm never typing 'hello' again...

  • @noxabellus
    @noxabellus 4 ปีที่แล้ว +1

    Y'all do surely realize that using regex in a real parser is a terrible idea, right?

    • @DutchmanDavid
      @DutchmanDavid 4 ปีที่แล้ว +1

      Correct. The regex will become large and unreadable, which means people will miss edge cases or bugs, but as an excercise it's OK.

    • @noxabellus
      @noxabellus 4 ปีที่แล้ว +1

      It's also extremely inefficient

    • @joshuaclark9425
      @joshuaclark9425 3 ปีที่แล้ว

      ​@@noxabellus surely that depends on what you are trying to parse.

  • @clickrick
    @clickrick 4 ปีที่แล้ว

    1944 is presumably the year that Prof. Brailsford started working as a professor. Amirite?

    • @profdaveb6384
      @profdaveb6384 4 ปีที่แล้ว +1

      Help!! To be taken on as a Professor one needs (probably) a completed PhD at an age (say) of 24. so that means I'd have had to be born in 1920 and I'm now aged 100? Er, no.

  • @DrDress
    @DrDress 4 ปีที่แล้ว

    4:50 What the h*** happes to his hand!?