11:40 - Here I think that the first and third * aren't supposed to be there, the expression should be (b+Ɛ)(ab)*(a+Ɛ) . I believe that the language he was aiming for was "an alternating sequence of 'a's and 'b's". Also (b+Ɛ)* is the same as just b*, which would make adding the Ɛ redundant.
I think you are mistaken. The * allows any number of 'a's (or none) before the sequence of alternating 'ab's, followed by any number of 'b's (or none). Without the *, you must supply only either zero or one of each.
@@BeheadedKamikazeThat is what we want, though. Either 0 or 1 b, followed by any quantity of ab, then 0 or 1 a. That would allow for an alternating sequence of a and b, starting and ending with either a or b.
I honestly don't understand all the hate for regex. Granted, I'm not a full-time SW developer (I'm an electrical engineer), but as long as you're matching something simple, it's perfectly fine. If it starts to become a mess, because I want to match 50 things, then I'll just switch to writing parser code instead
I use regular expressions all the time for quick work and I tend to avoid them in programming unless they make sense to use. It is much easier to read and understand a de-serialized class/struct containing the data from a value. That being said, I use regex quite often for redundant file modifications ALL the time. for example in notepad++ we can assume the data "Test " Test1 Test2 ....etc If I select "Test (\d)" and then replace it with "$1" it will replace "Test 1" with "1" and "Test 2" with "2", because you can grab the group values from the regex. This is of course extremely simplified, but it is very handy!
Interesting video, but there are several problems: 9:46 "(b+c)*(a(b+c)*a)*(b+c)*" does not include ALL words with an even number of "a"s, such as "abababa". The final "(b+c)*" should be moved inside the preceding parethesis: "(b+c)*(a(b+c)*a(b+c)*)*" 11:37 "(b+Ɛ)*(ab)*(a+Ɛ)*" includes words that do not have alternating "a"s and "b"s, such as "bbaa". The first and last parenthesis should not have the asterisk: "(b+Ɛ)(ab)*(a+Ɛ)"
I have a lot of respect for regular expressions - they're incredibly powerful tools in the right hands. Sadly I rarely use them in my work, but when I've needed them, OMG what a difference it made!
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski, Netscape engineer
This is one of the stupidest quotes I know from the field of IT. Honestly, if someone uses this quote in all seriousness, I think they are either too lazy to learn RegEx or simply not smart enough to do it. A regular expression may look like a horrible mess, I grant that, but so does any code, even well written, to someone who doesn't understand the language. If someone thinks they understand RegEx, but also think it's a problem, they don't in fact understand RegEx.
@@xeuszzz Devil's advocate: "Honestly, if someone doesn't write all of their code in the native assembly for their CPU architecture, they're simply not smart enough to do it."
@@UmaiKayu I find your analogue quite unfair, so let me fix it for you: "Honestly, if someone finds native assembly language a mess, they are either too lazy or not smart enough to learn it." It's still a bad analogue, but now it's just a forgivable reductio ad absurdum instead of some utterly pointless misrepresentation of the actual argument.
@@xeuszzz I was hoping we could agree the appropriate solution to increasing problem complexity is using sufficiently high level language constructs, as "just think real hard" scales poorly.
@@UmaiKayu but writing in assembly makes a task slower, regex makes a task 100x quicker. 99.9% of the time I use regex I'm not writing it into code, I'm using it to find strings, find and replace, extract text from websites, find files on my computer, query etc. etc. - in those cases I am solving a problem with regex and I now have 0 problems
10:55 - The "game" is called Regex Golf. You try to match set A while NOT matching set B in the fewest characters possible. ^(?!(..+)\1+$) for example, exclusively matches strings with length equal to any prime number. For reasons that I *barely* understand let alone explain.
... I think I can explain that (at least part of it). ?! is an assertion that the subsequent pattern must _not_ match, and (..+) is a sequence of any two or more characters, also flagging it for repeat as "group 1", with \1 being a reference to "group 1". Note that when dealing with variable-length keywords (e.g. + meaning "one or more") the regex engine must, in principle, brute-force every possible permutation of lengths to find a match (and whether it should prioritize longer or shorter matches is an iceberg unto itself). So the patterns inside this regex are essentially trying to "factor" the input string into two-or-more groups of two-or-more characters each. If no such combination exists then the input string must have a length that is unfactorable (i.e. a prime number), the negative assertion succeeds, and the expression as a whole is declared a match.
Im pretty sure the RE at 0:31 is binary numbers that are multiples of 3. My reasoning is that the construction 10(1+00)*01 is built specifically to keep parity. Since 1, 2, 4, 8, 16, 32, ... Is +1, -1, +1, -1, +1, -1 from a multiple of 3, 11 is the smallest number where the pariry cancels out and as long as you add an even number of 0's between you're fine. However, you could in the case of 1001, go to 10101 because all 3 bits will have the same parity and cancel out due to the zeros in between. This is the reason it's not just 1(1+00)*1. The ... + 0 + 11 represent both base cases and the final klenne star on top is because if you concatenate two multiples of 3 you gev a multiple of 3. I figured since nobody put an answer in the comments yet I would shoot my shot in case I missed something. Was definitely a fun puzzle!
I love and hate RegEx at the same time. It's so goddamn useful and elegant but so semantically dense it's hard to wrap my mind around it when trying to use it.
I find some aspects of regex to be relatively straightforward, and fairly easy to get right on the first try. But things like negative lookahead and lookbehind, and nested groups boggle my mind to no end.
I remember Thorsten teaching this during my second year of University (im on my third now), this kind of stuff is really interesting and VERY useful if you want to make your own programming language
I have learned to use "|" and not "+" as an "or" operation in RE. I think it is easier to read, but that is just me. I also define "a+" as equal to "a(a*)", so it can be one or more and not just zero and more, like "a*". That is useful.
@9:58 the regular expression ((b+c)*(a(b+c)*a)*(b+c)* does not match all words with an even number of a's. However this should do it (b+c)*(a(b+c)*a(b+c)*)* The expression in the video requires that all the a's except possibly the first and the last appear in pairs.
@@smuecke "abababa" Note that after the 2nd "a" is matched, the original regex can only match another "a" or (b+c) to the end of the string. If you still don't see then try looking at this related expression: (a(b+c)*a)(a(b+c)*a)* It's not the same expression but the requirement that a's, other than the first and the last, must come in consecutive pairs should be more obvious.
Had a test the other day, one of the questions was to write a regex that would allow an odd number of a specific symbol. Can't believe I didn't think of that simple trick. This is amazing as it helps me deepen my understanding of the branch of CS. Thank you real much (plus the German accent is super)
I think a lot of the difficulty of regular expressions in practice is that they get specified using the same character set that the words they are trying to recognize are from, and that there's a short form of (a+b+...+z) but it's hard to read. This video would be a lot harder to follow if Sigma contained ( and ) and + and *, which aren't good for examples but are kind of necessary if you want to use regular expressions to tokenize programs in a compiler.
(Spoiler below ...) I reckon they are the multiples of three in binary. Seems to be that way in practice at least, and I can see why 0 and 11 are in the regex (sort of) but I haven’t explained to myself why the other element captures all the "other" multiples of 3. Would welcome any insight
to everyone wanting to get into regular expression in practice, I can only encourage to write you own little template-language like f.e. handlebars. I did this ~8 years ago to get more into it and since then I am very good in that field
I love RE's and they've been a part of my job for decades. But I'm glad I didn't have to learn them by starting from math's formalism (although I like math pretty much, too) because otherwise our close relation wouldn't have became so strong.
Unless I'm mistaken, the regex that is supposed to match alternating sequences of A and B (ABABAB or BABABA etc) would also match the empty string or just single A or single B - three edge cases that I don't think qualify though it is arguable. In the notation that is used there, I feel ((ba(ba)*(b+Ɛ))+(ab(ab)*(a+Ɛ))) would be a better match for the spoken description.
Look into how RegEx handles greedy vs non greedy matching. What you described here is called greedy behavior. It's too long to get into in a youtube comment.
@@napukapu I know all about greedy and non-greedy, as I use RegExes extensively in my day job. This is not that concept. It is a case of the regular expression and spoken description not being the same. And I'm explicitly using the (rather strange, for people familiar with Perl compatible RegExes) notation used in the video.
I think yours would match babbb, but it's an easy fix. We can change the trailing b* and a* to (b + epsilon) and (a + epsilon) respectively. Did I parse that right?
As others have pointed out, the regex given for the language over alphabet {a,b,c} of words containing an even number of a's is not correct. Other's have posted a few different regexes which work, but I think the following is simpler: (b+c+a(b+c)*a)* This clearly always forces the a's to appear in pairs, so it only includes words with an even number of a's. But it also allows any word on {b,c} to appear before the first a, after the last a and between any pair of a's.
An even number of the letter 'a' in an alphabet of {a, b, c} is "((b+c)*a(b+c)*a(b+c)*)*", isn't it? Even more elegant is the one from @BitJam, which is "(b+c)*(a(b+c)*a(b+c)*)*", don't you agree? The video location is 9:57. Let me know if I am wrong. I am curious as to see the conversion from RegEx to NFA.
You can subdivide regular expressions into the categories of real expressions, rational expressions, imaginary expressions, irregular expressions, and complex expressions. Most of them are in the last two categories.
As a frequent Regex user in coding, I found this hard to follow due to the non-standard operators Prof. Altenkirch uses. That is, he uses `(a+b)` instead of `(a|b)` for the "or" or alternation operator and `(a+ε)` instead of `a?` for the optional operator. Is this coming from somewhere, like a historical or mathematical variant of Regex rather than the more common modern coding styles used by `grep`, Python, or PCRE? Also, as others have pointed out, he should have written `(b+ε)(ab)*(b+ε)` for the alternating example, which would be `b?(ab)+a?` in more standard regex. FWIW, I find regular expressions are almost always more readable than their parser alternatives, but then anything you use all the time becomes second nature.
Am I missing something, or should it be "h(e|a)llo" rather than "h(e+a)llo" - the latter looks for h, followed by one or more e's, followed by allo. Or is he not talking about PCREs?
He's talking about regular expressions as a concept, so he's not bound by the RegEx syntax, RegEx is just a standardised syntax spec for writing regular expressions.
0:34 if you use Bourne shell bracket-expansion you can get an idea of what set this is, but I don't see much of a pattern there, apart from "Well, it starts 10, ... Try this: echo {,10{,{1,00},{1,00}{1,00}}{01,0,11}} which gives 1001 100 1011 10101 1010 10111 100001 10000 100011 101101 10110 101111 1010001 101000 1010011 1000101 100010 1000111 10000001 1000000 10000011 and you can push it out as far as you like by duplicating the expression and leaving out the first comma,..
Here's some tediously boring trivia: In the C language specification, there's no definition for a decimal integer constant with a zero value. At all. In other words, in C it is not possible to express the number zero as a base-10 literal. Maybe you're wondering how that makes any sense. You've seen lines like "int i = 0;", so how is that not a zero? Because in C, "0" is an octal. And if you think about how to match integers as regexp, it makes perfect sense.
I'm pretty sure the main reason most people don't like Regular Expressions is that there are about 12 common syntaxes, and few things specify which they use.
The sequence defined in the Regex is possibly OEIS A182632: "Toothpick sequence on the hexagonal net starting from a node." - 0, 3, 9, 21, 33, 45... Sorry if this was covered somewhere in the video, I didn't see the explanation.
+ does not mean "OR" in regex, that would be the pipe character (|). Plus means "the preceding must occur and may be followed by". So quite confusing to anyobe vaguely familiar with regex.
He's not writing in any specific syntax. If you stop to search, there are many implementations of regular expressions syntax, even one being incompatible with another. You're just used to modern languages implementations which tends to follow some sort of standard, but even tho there are nuances from implementation to implementation.
Someone should write a regex training program that gives you a crossword puzzle and requires you to write a regex that it will check against a dictionary file and give you a list of words that fit. I'd only allow you to choose the word from the list your RE gives you. It should also be able to let you specify exactly which 'regex' you are using. Is this already out there somewhere? I think it would be very useful. No, I can't write it myself.
RegexGolf looks like it's the opposite of what I'm thinking of, @@negativeseven. With RegexGolf you get two lists and have to match everything in one and nothing in the other. What I'm thinking of is where you have a set of letters in specific positions and you want to compare them to everything in a dictionary. More like grep golf...
Usually, I am using POSIX Basic Regular Syntax (BRE) to express RegEx like "[bc]*(a[bc]*a[bc]*)*", which would be the equivalent to "(b+c)*(a(b+c)*a(b+c)*)*". Which is the standard used in this video?
In the places where I've seen + supported, it's like * but 1-to-many instead of 0-to-many, so "(b+c)*" (to pick on just part of your example) would be "(bb*c)*" in BRE. I haven't quite gotten through this video yet, so I'm not ceratain, but that's how I was interpreting the initial challenge... (Will respond again if I find something else as I watch.)
Ahh, no, 6:18 shows he's using it quite differently. I don't know where this syntax comes from, but yes, in seeing it defined that way, I think your equivalency is valid. But yeah, where on earth is this syntax from? The world of pure theory, I guess... because it's not PCRE or anything else I know.
Funny thing, I opened explorer++ on my PC today to find all files of a particular name and type, and one of the options was "include regular expressions." I was like wtf are those? So I look it up and couldn't find anything I could understand with my novice brain. So thanks for this, and being recommended to me in short order xD
This seems like an odd regex syntax to me. In all the examples + is an or, which I've never heard about. Is this actually used anywhere (as in, can you actually write those things into some program to highlight/replace those patterns) or is it regex "pseudocode"/mathematical notation? Genuinely curious, if someone more knowledgeable in theoretical stuff could enlighten me it would be great.
@@halfsourlizard9319 that's what I'm familiar with. They don't really teach finite automata or regex theory here, at least not at the level I was studying. Thanks!
I often use regex, and I am extremely confused by whatever notation he's using. For example, he says that "h(e+a)llo" is "H, either E or A, L L O" but that's not what it would match as a regex expression. That would be "H, 1 or more Es, A L L O." As a regex expression for what he's saying, that would be "h[ea]llo", so I really have no idea what syntax he's using. Repeat this throughout pretty much all the examples throughout the video. Just about any expression he gives, I'm sitting here thinking "That is *absolutely* not how you would write that."
Maybe I'm not used to this form of grammar since I'm used to doing them in programing where + is a token like * and ? that represents at least one to infinity of the token that precedes it, also that the pipe | is used for OR instead of +. That being said couldn't the abab... or bababa... regex just simply be (ba+ab)* I know at least in my code ((ab|ba)*) will return any string of ababa.... or bababa..
There's a slight miss with the RE for an even number of a:s. Every pairing of a:s can only have b:s and c:s between themselves, and not between other pairings. So (using b for every (b+c)*) babaabab is allowed, but not babababab. A corrected RE would be (b+c)*(a(b+c)*a(b+c)*)*, ((b+c)*a(b+c)*a)*(b+c)*, or (a(b+c)*a+b+c)*.
I tend to think the best way to approach a new topic is to start with a more high level and practical view and only later delve into the deeper theory and relationships, but then again I’m not a designer of databases.
Regular expression are not technically Type 3 regular languages descriptors. There is a construct that allows you to search for exact match of a previously matched pattern. That requires having memory and a dfa doesn't have any.
Because most regex engines don't actually implement regular languages and are just misusing the term. PCRE style regex for example are actually significantly more powerful than even CFGs and are potentially turing complete or at least linear bounded automatons.
Could you make a video where you have someone talk about the difference of compile time vs runtime and why would you prefer something over the other for which situations? I've heard the terminology passed around regarding C++ or Rust or Swift, talking about how you can make something runtime calculated or compile time calculated and I don't quite understand the advantages of each.
Compile time computation just means that the compiler precomputes a value while processing your code. It then stores the value in the program. When a program calls the function, the assembly under the hood is simply some mov or return operation as opposed to going through n operations for that function. Basically, the compiler builds a cheat cheat ahead of time so the program can perform a look up instead of the actual computation. Bonus points if the result of your function is some value that gets called so many times it ends up in the cpu cache.
I guess I’m in the minority? I love regexes, but I get they can become difficult to follow after a certain length. However, they make certain pattern lookups so consistent in a succinct form . Makes life easier in combination with interpreted languages where you don’t have the convenience of optimizing out the steps to a non-regex solution. I prefer using regexes as simple filters for an unknown input. It’s very handy for input sanitization.
Maybe it's just me, but I feel like the follow-through from initial presentation to complete explanation isn't really there. I mean, he explained the formal language and all the components of a regular expression, but I don't feel like I ever really understood _how_ all the definitions give rise to the useful tool it is. I've used regex before for string search and in programming, I'm just having a hard time connecting the practical concept with the all the formal math presented in this video.
Total side thing, but could I just put a humble plug in to request that mice get cleaned before filming? (See e.g. 5:40.) It's less that I find it gross as that I find it distracting to the eye. 😬
I'm currently writing a POSIX-compliant parser for creating Linux utilities and the most frustrating aspect of the code base is the regex. I attempt to avoid regex at all costs, because there are usually simpler and more elegant ways to get the job done, but some portions of the code do need regex. Every time I write unit tests for complex regex, I have to prepare myself for a world of pain, suffering, and disappointment.
regulations expressions as in regular languages aren't the same as the default regex package in most languages. there is extensions like the greedy flag and backtrack refs that go beyond regular languages in the Chomsky hierarchy.
I like to think that they had this store room full of boxes of dot matrix printer paper and someone asked, "How do we get rid of all this stuff?" Viola, wildly popular YT channel is born.
Regular expressions... something that looks like your cat just walked across your keyboard, but somehow magically work when you have them just right.
well i normally called it chicken feet patterns...
Put an AI on it. It'll be able to work out if its still a cat
11:40 - Here I think that the first and third * aren't supposed to be there, the expression should be (b+Ɛ)(ab)*(a+Ɛ) . I believe that the language he was aiming for was "an alternating sequence of 'a's and 'b's". Also (b+Ɛ)* is the same as just b*, which would make adding the Ɛ redundant.
Glad someone else posted this. Was thinking this myself
Yeah I'm pretty sure thats correct as klein star always includes the empty word in the resultant language.
I think you are mistaken. The * allows any number of 'a's (or none) before the sequence of alternating 'ab's, followed by any number of 'b's (or none). Without the *, you must supply only either zero or one of each.
"((ab)+a?|(ba)+b?)"
@@BeheadedKamikazeThat is what we want, though. Either 0 or 1 b, followed by any quantity of ab, then 0 or 1 a. That would allow for an alternating sequence of a and b, starting and ending with either a or b.
"Are you from Germany?", "Yes!, suprise to many, I am from Germany". that was a good laugh, good sense of humor.
As if the name and accent didn't give him away.
What accent 😂
I honestly thought he was Scandinavian. 😅
@@Yxcellfound the American
@@earlgrey8611 haha, you got me.
To be fair, though, with a name like "Thorsten" it's no wonder I thought he was Scandinavian.
I honestly don't understand all the hate for regex. Granted, I'm not a full-time SW developer (I'm an electrical engineer), but as long as you're matching something simple, it's perfectly fine. If it starts to become a mess, because I want to match 50 things, then I'll just switch to writing parser code instead
Because it's usually more important that code is readable than writeable.
Regular expressions are not known for clarity and speed.
I use regular expressions all the time for quick work and I tend to avoid them in programming unless they make sense to use. It is much easier to read and understand a de-serialized class/struct containing the data from a value. That being said, I use regex quite often for redundant file modifications ALL the time.
for example in notepad++ we can assume the data "Test " Test1
Test2
....etc
If I select "Test (\d)" and then replace it with "$1" it will replace "Test 1" with "1" and "Test 2" with "2", because you can grab the group values from the regex. This is of course extremely simplified, but it is very handy!
It gets messy really quickly and mistakes are... always going to happen.
Honestly my problem with RE is that there is usually a better solution in python.
Interesting video, but there are several problems:
9:46 "(b+c)*(a(b+c)*a)*(b+c)*" does not include ALL words with an even number of "a"s, such as "abababa". The final "(b+c)*" should be moved inside the preceding parethesis: "(b+c)*(a(b+c)*a(b+c)*)*"
11:37 "(b+Ɛ)*(ab)*(a+Ɛ)*" includes words that do not have alternating "a"s and "b"s, such as "bbaa". The first and last parenthesis should not have the asterisk: "(b+Ɛ)(ab)*(a+Ɛ)"
I have a lot of respect for regular expressions - they're incredibly powerful tools in the right hands. Sadly I rarely use them in my work, but when I've needed them, OMG what a difference it made!
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski, Netscape engineer
This is one of the stupidest quotes I know from the field of IT. Honestly, if someone uses this quote in all seriousness, I think they are either too lazy to learn RegEx or simply not smart enough to do it. A regular expression may look like a horrible mess, I grant that, but so does any code, even well written, to someone who doesn't understand the language. If someone thinks they understand RegEx, but also think it's a problem, they don't in fact understand RegEx.
@@xeuszzz Devil's advocate: "Honestly, if someone doesn't write all of their code in the native assembly for their CPU architecture, they're simply not smart enough to do it."
@@UmaiKayu I find your analogue quite unfair, so let me fix it for you: "Honestly, if someone finds native assembly language a mess, they are either too lazy or not smart enough to learn it." It's still a bad analogue, but now it's just a forgivable reductio ad absurdum instead of some utterly pointless misrepresentation of the actual argument.
@@xeuszzz I was hoping we could agree the appropriate solution to increasing problem complexity is using sufficiently high level language constructs, as "just think real hard" scales poorly.
@@UmaiKayu but writing in assembly makes a task slower, regex makes a task 100x quicker. 99.9% of the time I use regex I'm not writing it into code, I'm using it to find strings, find and replace, extract text from websites, find files on my computer, query etc. etc. - in those cases I am solving a problem with regex and I now have 0 problems
10:55 - The "game" is called Regex Golf. You try to match set A while NOT matching set B in the fewest characters possible.
^(?!(..+)\1+$) for example, exclusively matches strings with length equal to any prime number. For reasons that I *barely* understand let alone explain.
... I think I can explain that (at least part of it). ?! is an assertion that the subsequent pattern must _not_ match, and (..+) is a sequence of any two or more characters, also flagging it for repeat as "group 1", with \1 being a reference to "group 1".
Note that when dealing with variable-length keywords (e.g. + meaning "one or more") the regex engine must, in principle, brute-force every possible permutation of lengths to find a match (and whether it should prioritize longer or shorter matches is an iceberg unto itself). So the patterns inside this regex are essentially trying to "factor" the input string into two-or-more groups of two-or-more characters each. If no such combination exists then the input string must have a length that is unfactorable (i.e. a prime number), the negative assertion succeeds, and the expression as a whole is declared a match.
Im pretty sure the RE at 0:31 is binary numbers that are multiples of 3.
My reasoning is that the construction 10(1+00)*01 is built specifically to keep parity. Since 1, 2, 4, 8, 16, 32, ... Is +1, -1, +1, -1, +1, -1 from a multiple of 3, 11 is the smallest number where the pariry cancels out and as long as you add an even number of 0's between you're fine.
However, you could in the case of 1001, go to 10101 because all 3 bits will have the same parity and cancel out due to the zeros in between. This is the reason it's not just 1(1+00)*1.
The ... + 0 + 11 represent both base cases and the final klenne star on top is because if you concatenate two multiples of 3 you gev a multiple of 3.
I figured since nobody put an answer in the comments yet I would shoot my shot in case I missed something. Was definitely a fun puzzle!
Oh, are those 1's? I was waiting for him to explain what a lambda or caret character meant in a regex.
I love and hate RegEx at the same time. It's so goddamn useful and elegant but so semantically dense it's hard to wrap my mind around it when trying to use it.
just do regexcrosswords and don't stop until you've done every single one. it will take 1 weekend only and you will be able to use it for life
I use tdd to get a grip on what I want from a set random strings. Re are really difficult.
They don't look elegant to my eyes. They're a mess.
I'm also dense, that's why I find regexes fairly understandable.
I find some aspects of regex to be relatively straightforward, and fairly easy to get right on the first try. But things like negative lookahead and lookbehind, and nested groups boggle my mind to no end.
I remember Thorsten teaching this during my second year of University (im on my third now), this kind of stuff is really interesting and VERY useful if you want to make your own programming language
I have learned to use "|" and not "+" as an "or" operation in RE. I think it is easier to read, but that is just me. I also define "a+" as equal to "a(a*)", so it can be one or more and not just zero and more, like "a*". That is useful.
@9:58 the regular expression ((b+c)*(a(b+c)*a)*(b+c)* does not match all words with an even number of a's. However this should do it (b+c)*(a(b+c)*a(b+c)*)* The expression in the video requires that all the a's except possibly the first and the last appear in pairs.
Can you give a counterexample of a string with an even number of a's that the first expression does not match?
@@smuecke "abababa" Note that after the 2nd "a" is matched, the original regex can only match another "a" or (b+c) to the end of the string.
If you still don't see then try looking at this related expression:
(a(b+c)*a)(a(b+c)*a)*
It's not the same expression but the requirement that a's, other than the first and the last, must come in consecutive pairs should be more obvious.
Had a test the other day, one of the questions was to write a regex that would allow an odd number of a specific symbol. Can't believe I didn't think of that simple trick. This is amazing as it helps me deepen my understanding of the branch of CS. Thank you real much (plus the German accent is super)
I think a lot of the difficulty of regular expressions in practice is that they get specified using the same character set that the words they are trying to recognize are from, and that there's a short form of (a+b+...+z) but it's hard to read. This video would be a lot harder to follow if Sigma contained ( and ) and + and *, which aren't good for examples but are kind of necessary if you want to use regular expressions to tokenize programs in a compiler.
The numbers of times I've used regex and the number of times I've googled how to write regex are the same.
Is that right at 11:40? Looks like bbbbababaaaaa would match. I don't think the first and third *s were intended.
Yeah, just commented the same thing, then loaded some more comments and found yours lol.
the syntax of these regexes is so different from what I use in javascript etc.
+ here means OR
+ in JS means 1 or more occurrences
yup, confused me for a sec
I honestly wonder if he forgot the syntax and didn't bother to refresh his memory before making the video
@@fraggletThis is the standard syntax when using regex’s in a theoretical context.
I love regex! It's so useful. I don't understand how Windows folks get by without it. It's built into so many linux/unix tools.
So what is the short English description of the regular expression at 0:30?
(Spoiler below ...)
I reckon they are the multiples of three in binary. Seems to be that way in practice at least, and I can see why 0 and 11 are in the regex (sort of) but I haven’t explained to myself why the other element captures all the "other" multiples of 3. Would welcome any insight
to everyone wanting to get into regular expression in practice, I can only encourage to write you own little template-language like f.e. handlebars. I did this ~8 years ago to get more into it and since then I am very good in that field
function T(t, c, r, _) {
try {
return eval(
// set context
"with(c||{}){_='';" +
// match html-tags
t[r = "replace"](/(?:
|^)\s*(
I love RE's and they've been a part of my job for decades.
But I'm glad I didn't have to learn them by starting from math's formalism (although I like math pretty much, too) because otherwise our close relation wouldn't have became so strong.
Man this series is really taking me back to my compilers class in University.
As a fellow German, I like how he doesn't even try to hide his accent
0:06 made me check my Outlook :D
sorry about that - we did realise and mute it! -Sean
@@Computerphiledont worry he just underwent a deterministic state change
Unless I'm mistaken, the regex that is supposed to match alternating sequences of A and B (ABABAB or BABABA etc) would also match the empty string or just single A or single B - three edge cases that I don't think qualify though it is arguable. In the notation that is used there, I feel ((ba(ba)*(b+Ɛ))+(ab(ab)*(a+Ɛ))) would be a better match for the spoken description.
Look into how RegEx handles greedy vs non greedy matching. What you described here is called greedy behavior. It's too long to get into in a youtube comment.
@@napukapu I know all about greedy and non-greedy, as I use RegExes extensively in my day job. This is not that concept. It is a case of the regular expression and spoken description not being the same. And I'm explicitly using the (rather strange, for people familiar with Perl compatible RegExes) notation used in the video.
I think yours would match babbb, but it's an easy fix. We can change the trailing b* and a* to (b + epsilon) and (a + epsilon) respectively. Did I parse that right?
@@stro5179Yes, correct. It should be that (now fixed in an edit). I was thinking about the ? in perl regexes.
As others have pointed out, the regex given for the language over alphabet {a,b,c} of words containing an even number of a's is not correct. Other's have posted a few different regexes which work, but I think the following is simpler: (b+c+a(b+c)*a)*
This clearly always forces the a's to appear in pairs, so it only includes words with an even number of a's. But it also allows any word on {b,c} to appear before the first a, after the last a and between any pair of a's.
An even number of the letter 'a' in an alphabet of {a, b, c} is "((b+c)*a(b+c)*a(b+c)*)*", isn't it? Even more elegant is the one from @BitJam, which is "(b+c)*(a(b+c)*a(b+c)*)*", don't you agree? The video location is 9:57. Let me know if I am wrong. I am curious as to see the conversion from RegEx to NFA.
You can subdivide regular expressions into the categories of real expressions, rational expressions, imaginary expressions, irregular expressions, and complex expressions. Most of them are in the last two categories.
I was blown away the other day, when someone had made one of the advent of code in regular expresions :D that was really cool.
As a frequent Regex user in coding, I found this hard to follow due to the non-standard operators Prof. Altenkirch uses. That is, he uses `(a+b)` instead of `(a|b)` for the "or" or alternation operator and `(a+ε)` instead of `a?` for the optional operator.
Is this coming from somewhere, like a historical or mathematical variant of Regex rather than the more common modern coding styles used by `grep`, Python, or PCRE?
Also, as others have pointed out, he should have written `(b+ε)(ab)*(b+ε)` for the alternating example, which would be `b?(ab)+a?` in more standard regex.
FWIW, I find regular expressions are almost always more readable than their parser alternatives, but then anything you use all the time becomes second nature.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Amen.
luddite cope
This guy looks like he is the installation wizard.
He reminds me of Brent Spiner's character (Dr. Brackish Okun) who is the lead scientist at Area 51 in Independence Day
I always thought regex is something magical but the more and more I use it, this assumption hasn't changed it is still as magical as regex can be
Am I missing something, or should it be "h(e|a)llo" rather than "h(e+a)llo" - the latter looks for h, followed by one or more e's, followed by allo. Or is he not talking about PCREs?
Also, if that's just an extended regexp (e.g. "grep -E") the parens are unnecessary.
He's talking about regular expressions as a concept, so he's not bound by the RegEx syntax, RegEx is just a standardised syntax spec for writing regular expressions.
@@Eagle3302PL ambiguous syntax is not the same as describing abstract concepts, not even when the ambiguous syntax happens to be pseudocode.
just academics.. not using regular symbols for regular expressions is one of their specialities
h[ea]llo
A regex for the language of words not containing 'ab', over the alphabet {a,b,c}, is the expression (aa*c+b+c)*a*
This can be further simplified to the expression (b+a*c)*a*
spent a long time being confused about the man talking about regex while not writing regex before realising the math is more general
I'm a simple person. I see RegEx and I like the video.
THORNSTEN!!!! This guy is a Genius!!
0:34 if you use Bourne shell bracket-expansion you can get an idea of what set this is, but I don't see much of a pattern there, apart from "Well, it starts 10, ... Try this: echo {,10{,{1,00},{1,00}{1,00}}{01,0,11}} which gives 1001 100 1011 10101 1010 10111 100001 10000 100011 101101 10110 101111 1010001 101000 1010011 1000101 100010 1000111 10000001 1000000 10000011 and you can push it out as far as you like by duplicating the expression and leaving out the first comma,..
Here's some tediously boring trivia:
In the C language specification, there's no definition for a decimal integer constant with a zero value. At all. In other words, in C it is not possible to express the number zero as a base-10 literal.
Maybe you're wondering how that makes any sense. You've seen lines like "int i = 0;", so how is that not a zero?
Because in C, "0" is an octal. And if you think about how to match integers as regexp, it makes perfect sense.
10:30 Exercise: is the RE that does NOT contain sequence 'ab' given by (b+c)*((a*c)*(b+c)*)*a* ? Correct? Thank you for your comments :)
I've checked out your solution and it seems right to me.
After seeing your solution I came up with my own: ((a*c)+b+c)*(a*+epsilon).
Please cover Parsing Expression Grammars/Generators next!
I second this
I'm pretty sure the main reason most people don't like Regular Expressions is that there are about 12 common syntaxes, and few things specify which they use.
Kind of odd that he's not using the usual syntax for regexps. What he writes as h(e+a)llo is usually written h[ae]llo
In a theoretical context, this is the standard.
The sequence defined in the Regex is possibly OEIS A182632: "Toothpick sequence on the hexagonal net starting from a node." - 0, 3, 9, 21, 33, 45... Sorry if this was covered somewhere in the video, I didn't see the explanation.
+ does not mean "OR" in regex, that would be the pipe character (|).
Plus means "the preceding must occur and may be followed by".
So quite confusing to anyobe vaguely familiar with regex.
He's not writing in any specific syntax. If you stop to search, there are many implementations of regular expressions syntax, even one being incompatible with another.
You're just used to modern languages implementations which tends to follow some sort of standard, but even tho there are nuances from implementation to implementation.
@@_garicas Alright. I know about the nuances, this radical deviation from the norm is new to me though.
Someone should write a regex training program that gives you a crossword puzzle and requires you to write a regex that it will check against a dictionary file and give you a list of words that fit.
I'd only allow you to choose the word from the list your RE gives you.
It should also be able to let you specify exactly which 'regex' you are using.
Is this already out there somewhere? I think it would be very useful. No, I can't write it myself.
I think someone wrote a safe cracking game which had to do with RegEx but I might be mistaken.
RegexGolf is somewhat close to that
RegexGolf looks like it's the opposite of what I'm thinking of, @@negativeseven.
With RegexGolf you get two lists and have to match everything in one and nothing in the other.
What I'm thinking of is where you have a set of letters in specific positions and you want to compare them to everything in a dictionary. More like grep golf...
Regex crossword exists
@@negativeseven Is that available outside of the one you can find online with google and the telehack version? I love me some regex golf.
This is the most complicated explanation of regex I've ever heard
I thought + meant 1 or more of the preceeding group and | meant this one or that one
Usually, I am using POSIX Basic Regular Syntax (BRE) to express RegEx like "[bc]*(a[bc]*a[bc]*)*", which would be the equivalent to "(b+c)*(a(b+c)*a(b+c)*)*". Which is the standard used in this video?
In the places where I've seen + supported, it's like * but 1-to-many instead of 0-to-many, so "(b+c)*" (to pick on just part of your example) would be "(bb*c)*" in BRE. I haven't quite gotten through this video yet, so I'm not ceratain, but that's how I was interpreting the initial challenge... (Will respond again if I find something else as I watch.)
Ahh, no, 6:18 shows he's using it quite differently. I don't know where this syntax comes from, but yes, in seeing it defined that way, I think your equivalency is valid.
But yeah, where on earth is this syntax from? The world of pure theory, I guess... because it's not PCRE or anything else I know.
I don't even try to analyse the regular expressions that I encounter in code. If one must be changed, I just start again from scratch.
For me a+ is aa* and the pipe symbol is used for or between expressions.
Funny thing, I opened explorer++ on my PC today to find all files of a particular name and type, and one of the options was "include regular expressions." I was like wtf are those? So I look it up and couldn't find anything I could understand with my novice brain. So thanks for this, and being recommended to me in short order xD
This seems like an odd regex syntax to me. In all the examples + is an or, which I've never heard about. Is this actually used anywhere (as in, can you actually write those things into some program to highlight/replace those patterns) or is it regex "pseudocode"/mathematical notation?
Genuinely curious, if someone more knowledgeable in theoretical stuff could enlighten me it would be great.
It's maths notation; in most implementations a pipe (or an escaped pipe) is used.
@@halfsourlizard9319 that's what I'm familiar with.
They don't really teach finite automata or regex theory here, at least not at the level I was studying.
Thanks!
@@helleye311 And, of course, in most implementation, + is used for >0 repetition, rather than disjunction/alternative.
s/implementation/implementations/
I often use regex, and I am extremely confused by whatever notation he's using. For example, he says that "h(e+a)llo" is "H, either E or A, L L O" but that's not what it would match as a regex expression. That would be "H, 1 or more Es, A L L O." As a regex expression for what he's saying, that would be "h[ea]llo", so I really have no idea what syntax he's using. Repeat this throughout pretty much all the examples throughout the video. Just about any expression he gives, I'm sitting here thinking "That is *absolutely* not how you would write that."
Maybe I'm not used to this form of grammar since I'm used to doing them in programing where + is a token like * and ? that represents at least one to infinity of the token that precedes it, also that the pipe | is used for OR instead of +. That being said couldn't the abab... or bababa... regex just simply be (ba+ab)* I know at least in my code ((ab|ba)*) will return any string of ababa.... or bababa..
I work with regexs daily but this is deep
;)
Okay I just got to say, anyone who looks like this obviously could be trusted to talk about computer science
There's a slight miss with the RE for an even number of a:s.
Every pairing of a:s can only have b:s and c:s between themselves, and not between other pairings.
So (using b for every (b+c)*) babaabab is allowed, but not babababab.
A corrected RE would be (b+c)*(a(b+c)*a(b+c)*)*, ((b+c)*a(b+c)*a)*(b+c)*, or (a(b+c)*a+b+c)*.
I tend to think the best way to approach a new topic is to start with a more high level and practical view and only later delve into the deeper theory and relationships, but then again I’m not a designer of databases.
The example about any even number of a's near 9:40 is wrong though. It shouldn't be
(b+c)*(a(b+c)*a)*(b+c)*
but
(b+c)*(a(b+c)*a(b+c)*)*
imo.
I usually use IrregularExpressions
Regular expression are not technically Type 3 regular languages descriptors. There is a construct that allows you to search for exact match of a previously matched pattern. That requires having memory and a dfa doesn't have any.
Because most regex engines don't actually implement regular languages and are just misusing the term. PCRE style regex for example are actually significantly more powerful than even CFGs and are potentially turing complete or at least linear bounded automatons.
I thought that the plus sign + is cardinality symbol for "at least one". Not "or" which i thought was the vertical bar |.
RegEx is the closest you get to being a grey bearded wizard writing out spells in ancient runes to get the database to import.
Could you make a video where you have someone talk about the difference of compile time vs runtime and why would you prefer something over the other for which situations? I've heard the terminology passed around regarding C++ or Rust or Swift, talking about how you can make something runtime calculated or compile time calculated and I don't quite understand the advantages of each.
Compile time computation just means that the compiler precomputes a value while processing your code. It then stores the value in the program. When a program calls the function, the assembly under the hood is simply some mov or return operation as opposed to going through n operations for that function. Basically, the compiler builds a cheat cheat ahead of time so the program can perform a look up instead of the actual computation. Bonus points if the result of your function is some value that gets called so many times it ends up in the cpu cache.
Honestly guys, this explanation was very confusing. It manages to be structured and extremely unstructured at the same time.
ChatGPT is a godsend for regex
Only if you already know regular expressions, because if you don't you won't be able to spot its mistakes.
That sounds like a nightmare to debug when something goes wrong. "Sorry for the confusion" feedback loop incoming
regex101 ftw
@@cottonfooOr write tests.
@@cottonfoo You just test them with an online regex tester
It is weird the use of + as an or. The regex normally use + for one or more instance
Professor Thorsten Altenkirch looks like Dave Grohl and Taylor Hawkins have morphed into 1 person and quit Foo Fighters for Foo(x) Pythons.
I guess I’m in the minority? I love regexes, but I get they can become difficult to follow after a certain length. However, they make certain pattern lookups so consistent in a succinct form . Makes life easier in combination with interpreted languages where you don’t have the convenience of optimizing out the steps to a non-regex solution.
I prefer using regexes as simple filters for an unknown input. It’s very handy for input sanitization.
doesnt + in regex mean "repeat at least once"?
you should try to get a video on the Vesuvius challenge reading the herculaneum scrolls
I'm pretty good at regular expressions and this made me forget all of it... his use of + instead of | hurt my head.
Maybe it's just me, but I feel like the follow-through from initial presentation to complete explanation isn't really there. I mean, he explained the formal language and all the components of a regular expression, but I don't feel like I ever really understood _how_ all the definitions give rise to the useful tool it is. I've used regex before for string search and in programming, I'm just having a hard time connecting the practical concept with the all the formal math presented in this video.
My favorite regular expression is horror when I try to fix production pattern.
I was always mystified by regular expressions. I still am. 😎
Regular expressions. Walter wallis the perl poet knows all about them
Edit: had to fix spelling error added 3 l's in all
What is an irregular expression?
Total side thing, but could I just put a humble plug in to request that mice get cleaned before filming? (See e.g. 5:40.) It's less that I find it gross as that I find it distracting to the eye. 😬
I'm currently writing a POSIX-compliant parser for creating Linux utilities and the most frustrating aspect of the code base is the regex. I attempt to avoid regex at all costs, because there are usually simpler and more elegant ways to get the job done, but some portions of the code do need regex. Every time I write unit tests for complex regex, I have to prepare myself for a world of pain, suffering, and disappointment.
regulations expressions as in regular languages aren't the same as the default regex package in most languages.
there is extensions like the greedy flag and backtrack refs that go beyond regular languages in the Chomsky hierarchy.
I use a few simple regular expressions on occasion. This made my head hurt. 🤯
I thought + meant one or more of previous character?
"Yes im from Germany, are you surprised?!" 😆
through path of exile i found out abouut regex and i love it, want to become an expert
Both fascinating and ptsd triggering.
Nice video, thanks :)
I love regular expressions. I use them loads at work. They can also be used to generate fractal art.
This seems to be teaching more than just regex its teaching set theory of regex, etc
„Are you from Germany?“ - How do you not recognize that accent.
Regex for no "ab" using the notation from the video: a + (b + c + ac)*
how does this you give you "baa"?
I don't want AI to take my job, but I do want AI to generate RegEx for me.
I've used regex before and thought I understood it, now I'm realising that I never will understand it and never want to use it again.
Regular expressions are in the domain of write once, read never languages.
Psst, use an LLM based helper to write RE for you. Also good for non trivial CSS 😉
Where are you guys even getting dot matrix paper in 2023 lmfao
I like to think that they had this store room full of boxes of dot matrix printer paper and someone asked, "How do we get rid of all this stuff?" Viola, wildly popular YT channel is born.
For me galavanting(\saround){0,1} \sin\sfar\saway\splaces is a regular expression.
0:06 This made me check my emails