As dry as the topic is, this talk was amazing. It was clearly structured, most topics built on the previous ones, and the wording was easy to understand. I watched it at far too late an hour, but understood most of it, although I struggled at the DFA explanation.
If someone asks me about utf-8 and parsing, I will recommend this talk.
Well, the first thing you should always do when someone asks about parsing utf-8 is check that they're not doing so for a bogus reason. Decoding utf-8 into codepoints is something that's rarely needed; instead, you can usually just treat utf-8 strings as byte strings that satisfy a certain grammar (which needs to be validated for untrusted inputs, but otherwise isn't too important).
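A minimal sketch of what "validating the grammar" can look like, without ever decoding to codepoints. The byte ranges follow the Unicode well-formedness table; the function name is mine, not from the talk:

```cpp
#include <cstdint>
#include <cstddef>

// Validate that a byte string satisfies the UTF-8 grammar without
// materializing codepoints.  The special-cased lead bytes reject overlong
// encodings (C0/C1, E0+80..9F, F0+80..8F), UTF-16 surrogates (ED+A0..BF),
// and codepoints above U+10FFFF (F4+90..BF, F5..FF).
bool is_valid_utf8(const uint8_t* s, size_t n) {
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        if (b < 0x80) { i += 1; continue; }          // ASCII fast case
        size_t len; uint8_t lo = 0x80, hi = 0xBF;    // bounds for 2nd byte
        if      (b >= 0xC2 && b <= 0xDF) len = 2;    // C0/C1 would be overlong
        else if (b == 0xE0) { len = 3; lo = 0xA0; }  // reject overlong 3-byte
        else if (b >= 0xE1 && b <= 0xEC) len = 3;
        else if (b == 0xED) { len = 3; hi = 0x9F; }  // reject surrogates
        else if (b >= 0xEE && b <= 0xEF) len = 3;
        else if (b == 0xF0) { len = 4; lo = 0x90; }  // reject overlong 4-byte
        else if (b >= 0xF1 && b <= 0xF3) len = 4;
        else if (b == 0xF4) { len = 4; hi = 0x8F; }  // reject > U+10FFFF
        else return false;                           // stray continuation etc.
        if (i + len > n) return false;               // truncated sequence
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (size_t k = 2; k < len; ++k)
            if ((s[i + k] & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}
```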
As far as error handling is concerned, invalid code units cannot simply be dropped. They must be substituted with U+FFFD or the conversion must halt. The security implications are covered here: unicode.org/reports/tr36/#Ill-Formed_Subsequences
Thank you for the great presentation!
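A minimal sketch of the substitution policy described above (names and the exact resynchronization choice are mine): on an ill-formed subsequence, emit U+FFFD and resynchronize at the next plausible lead byte, rather than silently dropping bytes.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Decode UTF-8, replacing each maximal ill-formed subsequence with U+FFFD.
std::vector<uint32_t> decode_with_replacement(const uint8_t* s, size_t n) {
    std::vector<uint32_t> out;
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        if (b < 0x80) { out.push_back(b); ++i; continue; }
        size_t len = (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3
                   : (b & 0xF8) == 0xF0 ? 4 : 0;
        bool ok = (len != 0) && (i + len <= n);
        uint32_t cp = ok ? (b & (0xFF >> (len + 1))) : 0;
        for (size_t k = 1; ok && k < len; ++k) {
            if ((s[i + k] & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        // Overlongs, surrogates, and codepoints > U+10FFFF are ill-formed too.
        static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (ok && (cp < min_cp[len] ||
                   (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF))
            ok = false;
        if (ok) { out.push_back(cp); i += len; }
        else {   // substitute U+FFFD, then skip to the next lead byte
            out.push_back(0xFFFD);
            ++i;
            while (i < n && (s[i] & 0xC0) == 0x80) ++i;
        }
    }
    return out;
}
```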
The _mm_set and _mm_set1 intrinsics compile to memory references. You should use _mm_setzero_si128 to zero registers.
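To illustrate the point above (x86/SSE2 only, so this snippet won't build elsewhere): `_mm_setzero_si128` maps to the dependency-breaking `pxor xmm, xmm` idiom, with no load from memory.

```cpp
#include <immintrin.h>
#include <cstdint>

// Zero an XMM register the cheap way: compiles to `pxor xmm, xmm`,
// which never touches RAM.
__m128i zero_register() {
    return _mm_setzero_si128();
}

// Helper to inspect the register contents from scalar code.
bool is_all_zero(__m128i v) {
    alignas(16) uint8_t bytes[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(bytes), v);
    for (int i = 0; i < 16; ++i)
        if (bytes[i] != 0) return false;
    return true;
}
```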
Skip to somewhere around 14:00 if you already know how UTF-8 and friends work to get straight to talking about the code.
Extremely interesting presentation. IMHO it's C (but really high-quality C) rather than C++, though that doesn't take anything away from the quality of the talk.
Actually, the C version of the code shown would be a lot uglier considering the enums would leak out of struct scopes.
@@bit2shift It wouldn't matter. The enums don't need to be in public headers so they don't need to be scoped anyway. But even if you do scope your enums in C, foo__bar is hardly any different from foo::bar. Sure you would always need to use foo__bar and could never just do bar (actually, you could have a macro to bring unscoped variants into a function scope), but some consider such context sensitivity ugly. Best to forget about such petty "what's ugly and what's not" discourse and just focus on generating fast assembly.
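For the curious, the prefix-scoping convention described above can be sketched like this (all names are made up for illustration; the double underscore stands in for `::`, and a macro brings short names into one function's scope):

```cpp
// C-style "scoped" enum: the type name is the scope prefix.
enum utf8_result { utf8_result__ok, utf8_result__incomplete, utf8_result__invalid };

// Optional: bring unscoped shorthands into a single function's scope.
#define UTF8_RESULT_LOCAL \
    const int ok = utf8_result__ok, invalid = utf8_result__invalid;

int classify(int byte_count_valid) {
    UTF8_RESULT_LOCAL   // now `ok` / `invalid` work, but only in here
    return byte_count_valid ? ok : invalid;
}
```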
@@pskocik yes, let the compiler do what it knows pretty well.
Doesn't matter if the enums are hidden from the public interface or not. The point is that unscoped enums can very easily lead to subtle bugs.
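A minimal example of the kind of subtle bug meant above (enum names invented for illustration): two unrelated unscoped enums compare equal because both implicitly convert to int.

```cpp
// Two unrelated unscoped enums whose enumerators happen to share values.
enum ParseState { Accept = 0, Reject = 1 };
enum IoStatus   { Ok = 0, Error = 1 };

bool confused() {
    // Compiles (usually with only a warning) and evaluates to true --
    // almost certainly not what was intended.  Had these been declared
    // `enum class`, this comparison would be a compile error.
    return Accept == Ok;
}
```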
I would've liked to see some kind of analysis of why his code is so much more efficient -- especially compared to the other DFA implementation. I suspect his performance advantage may evaporate once he handles the cases that the other implementations deal with.
If the code handles the majority of cases and reliably identifies exceptions without losing speed, then the default implementation can be used as a fallback. It's done all the time when letting one (non-CPU) piece of hardware handle the fast path and punting the exceptional cases to the CPU. In this case, if the UTF-8 is well formed, it can be handled fast.
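The fast-path/fallback split can be sketched like this (the word-at-a-time trick is a common idiom, not taken from the talk): count leading ASCII bytes eight at a time, then let a slower, fully-checking decoder handle whatever remains.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Return the length of the all-ASCII prefix of s.  The caller copies
// that prefix through unchanged and punts the rest to a general-purpose,
// well-formedness-checking decoder.
size_t ascii_prefix_len(const uint8_t* s, size_t n) {
    size_t i = 0;
    while (i + 8 <= n) {
        uint64_t w;
        std::memcpy(&w, s + i, 8);            // avoids strict-aliasing issues
        if (w & 0x8080808080808080ull) break; // some high bit set: not ASCII
        i += 8;
    }
    while (i < n && s[i] < 0x80) ++i;         // finish byte by byte
    return i;
}
```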
00:15:00
The conditions of the if-else ladder here confused me, and after some digging I found it just checks the first 1, 3, 4, and 5 bits. It would be more understandable to do this with comparisons against increasing upper bounds instead.
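The two equivalent ways of writing the ladder: masking off the top 1/3/4/5 bits versus comparing against increasing upper bounds. Both classify a lead byte into its expected sequence length (0 meaning a continuation byte or invalid lead):

```cpp
#include <cstdint>

// Mask-based version: tests the fixed leading bit patterns directly.
int seq_len_masks(uint8_t b) {
    if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;
}

// Range-based version: the upper bounds grow monotonically,
// which reads as a simple sequence of range checks.
int seq_len_bounds(uint8_t b) {
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;             // continuation byte
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 0;
}
```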
I actually tested many 'optimized' UTF-8 decoders the other day. Unfortunately, many could not correctly handle overlong or invalid codepoints (even if they claimed they could), accurately report error positions for invalid/corrupted bytes, or beat a naive UTF-8 decoder when decoding mostly ASCII (due to branch mispredicts). I like the presentation of this topic, but unfortunately most of these 'optimized' decoders just aren't practical for production software.
I'm surprised that Microsoft had the best results among the competition.
I'd like to see strlen() results as a baseline, to get a feel for the speed.
"DFAs can recognize simple regular expressions" is kinda backwards: regular expressions were first defined as a way to describe regular languages, which are, by definition, exactly the languages accepted by DFAs. The problem is that Perl implemented regexes via backtracking and then started adding features that are not regular, but are easy to implement once you have a backtracking solver...
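A toy illustration of that equivalence: the regular expression `(ab)*` and this three-state DFA accept exactly the same strings. State 0 is both start and accept; state 2 is the dead state.

```cpp
// Table-driven DFA for the regular language (ab)* over the alphabet {a, b}.
int dfa_accepts_ab_star(const char* s) {
    static const int next[3][2] = {
        /* state 0: */ {1, 2},   // on 'a' -> 1, on 'b' -> dead
        /* state 1: */ {2, 0},   // on 'a' -> dead, on 'b' -> back to 0
        /* state 2: */ {2, 2},   // dead state loops forever
    };
    int state = 0;
    for (; *s; ++s) {
        if (*s != 'a' && *s != 'b') return 0;  // outside the alphabet
        state = next[state][*s == 'b'];
    }
    return state == 0;                         // accept iff in state 0
}
```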
Energy consumption should also be compared.
Energy consumption of SIMD instructions is typically lower than that of equivalent SISD instructions over the same data.